
Agent Runaway Costs: How to Set LLM Budget Limits Before Costs Spiral

Matt Turley · 7 min read

A developer on r/AI_Agents recently described watching their agent rack up $15 in API costs in under 10 minutes before they caught it. The comments were full of developers sharing similar stories. Fifteen dollars might not sound like much, but multiply it across a fleet of agents running in production, 24/7, and you start to see the problem. Agent runaway costs are not a theoretical risk. They are happening right now to teams that ship AI agents without budget guardrails.

This post covers why it happens, how to set LLM agent budget limits in your code, and how to enforce them at the infrastructure layer so a single broken loop never drains your API credits again.

Why Runaway Costs Happen

Most agent architectures follow some version of a loop: observe, think, act, repeat. The agent keeps calling the LLM until it decides it is done. That “until” is where the money goes.

Here are the most common failure modes:

Token loops. The agent gets stuck retrying a failed tool call or rephrasing the same query. Each iteration burns input and output tokens. Without a loop counter or token ceiling, this runs until your API key hits its rate limit, or your wallet hits zero.

No max_tokens set. If you do not cap output tokens per request, the model will generate as much as it wants. A single response can run to 8,000+ tokens if the model decides to be thorough. Across hundreds of requests, that adds up fast.

No circuit breaker. The agent has no concept of a session budget. It does not know or care that the last 30 calls cost $4. It just keeps going.

Parallel agents. If you are running multiple agents concurrently (common in agentic workflows that fan out subtasks), each one independently accumulates cost. A bug in one agent's logic multiplies across every instance.
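The first failure mode, token loops, is also the easiest to detect in code. A minimal sketch: flag the session when the agent issues the exact same action several times in a row. The `Action` shape and the repeat threshold here are illustrative assumptions, not from any particular framework.

```typescript
// Minimal loop detection: trip when the agent repeats the exact same
// action too many consecutive times. Shapes and thresholds are illustrative.
type Action = { tool: string; args: string };

function makeLoopDetector(maxRepeats = 3) {
  let lastKey = "";
  let repeats = 0;
  return (action: Action): boolean => {
    const key = `${action.tool}:${action.args}`;
    repeats = key === lastKey ? repeats + 1 : 1;
    lastKey = key;
    return repeats >= maxRepeats; // true = likely stuck in a loop
  };
}

const isLooping = makeLoopDetector(3);
const stuck: Action = { tool: "search", args: '{"q":"same query"}' };
console.log(isLooping(stuck)); // false
console.log(isLooping(stuck)); // false
console.log(isLooping(stuck)); // true: third identical call trips the detector
```

This catches the degenerate case (identical retries) cheaply; fuzzier loops, such as rephrasings of the same query, need the budget ceilings described below.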

The Hidden Math

Let's look at real numbers. Claude Sonnet 4.5 costs $3 per million input tokens and $15 per million output tokens. That sounds cheap until you do the math on an agentic loop.

A typical agent turn might include 2,000 input tokens (system prompt, conversation history, tool results) and 800 output tokens. That is roughly two cents per turn. But an agent that takes 50 turns on a complex task hits 100,000 input tokens and 40,000 output tokens, costing roughly $0.90 per session. Run 100 of those sessions per hour, and you are looking at $90/hour, or over $2,100/day.

Now imagine one of those sessions gets stuck in a loop and runs 500 turns instead of 50. A single broken session can cost $9 or more. If you are running parallel agents and the bug affects multiple instances, that $15-in-10-minutes story starts to look conservative.
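The arithmetic above is worth encoding once so you can sanity-check your own numbers. A small helper using the Sonnet 4.5 prices quoted earlier ($3/M input, $15/M output); the per-turn token counts are the article's illustrative averages, not measured values.

```typescript
// Back-of-envelope session cost in dollars at $3/M input and $15/M output.
// Per-turn token counts default to the illustrative averages above.
function sessionCostUSD(turns: number, inPerTurn = 2000, outPerTurn = 800): number {
  const inputTokens = turns * inPerTurn;
  const outputTokens = turns * outPerTurn;
  return (inputTokens / 1_000_000) * 3 + (outputTokens / 1_000_000) * 15;
}

console.log(sessionCostUSD(50).toFixed(2));  // "0.90" — a normal 50-turn session
console.log(sessionCostUSD(500).toFixed(2)); // "9.00" — the same session stuck at 500 turns
```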

The point is not that LLM APIs are expensive. The point is that agent runaway costs scale silently and fast when there is no ceiling.

How to Set Budget Limits in Code

The first layer of defense is in your application code. Here are practical patterns to prevent AI agent overspending.

Cap output tokens per request:

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5-20250929",
  max_tokens: 1024, // Hard cap per response
  messages: conversationHistory,
});

Add a turn counter with a hard stop:

const MAX_TURNS = 25;
let turnCount = 0;

while (!agent.isDone() && turnCount < MAX_TURNS) {
  await agent.step();
  turnCount++;
}

if (turnCount >= MAX_TURNS) {
  logger.warn("Agent hit turn limit, terminating session");
}

Track cumulative token usage:

let totalInputTokens = 0;
let totalOutputTokens = 0;
const SESSION_BUDGET_CENTS = 50; // $0.50 per session

function checkBudget(usage: { input_tokens: number; output_tokens: number }) {
  totalInputTokens += usage.input_tokens;
  totalOutputTokens += usage.output_tokens;

  // Prices in cents per million tokens: $3/M input = 300, $15/M output = 1500
  const costCents =
    (totalInputTokens / 1_000_000) * 300 +
    (totalOutputTokens / 1_000_000) * 1500;

  if (costCents > SESSION_BUDGET_CENTS) {
    throw new Error(`Session budget exceeded: ${costCents.toFixed(1)}¢`);
  }
}
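The turn counter and the budget check compose naturally into one guarded loop. A hedged sketch, where the `StepResult` shape and the `step` callback are illustrative stand-ins for whatever your agent framework returns:

```typescript
// Agent loop guarded by both a turn cap and a session budget.
// StepResult is an illustrative shape; real frameworks report usage differently.
interface StepResult {
  done: boolean;
  usage: { input_tokens: number; output_tokens: number };
}

function runGuarded(
  step: () => StepResult,
  maxTurns = 25,
  budgetCents = 50
): "done" | "turn_limit" | "budget_exceeded" {
  let spentCents = 0;
  for (let turn = 0; turn < maxTurns; turn++) {
    const { done, usage } = step();
    // Same pricing as above: 300¢ per million input, 1,500¢ per million output
    spentCents +=
      (usage.input_tokens / 1_000_000) * 300 +
      (usage.output_tokens / 1_000_000) * 1500;
    if (spentCents > budgetCents) return "budget_exceeded";
    if (done) return "done";
  }
  return "turn_limit";
}
```

Whichever limit trips first ends the session, so a token loop is bounded by turns and an unusually verbose session is bounded by spend.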

These patterns work. But they have a weakness: they live in your application code. If the agent crashes and restarts, the counters reset. If a developer forgets to add the check in a new agent, there is no safety net. And if you need to change limits across a fleet, you are deploying code changes to every service.

Budget Limits at the Infrastructure Layer

This is where RelayPlane fits in. Instead of relying on every agent implementation to track its own spend, you enforce LLM agent budget limits at the proxy layer, before requests ever reach the LLM provider.

RelayPlane sits between your agents and the API. Every request passes through it, which means budget enforcement happens in one place, consistently, regardless of which agent sent the request.

Here is what that looks like in practice:

Per-request token caps. RelayPlane enforces max_tokens on every request, even if the calling code forgot to set it. You define the default in your proxy config, not in each agent's source code.

Daily and hourly budget caps. You set a dollar ceiling per day, per hour, or per request. When the budget is hit, RelayPlane returns an error to the agent instead of forwarding the request. The agent gets a clear signal to stop, and your bill stops growing.

Automatic cutoff with alerts. RelayPlane tracks spend against your configured thresholds and alerts you before costs spiral, not after.

Per-request cost tracking. Every request is logged with its token count and cost. You can see exactly which agent, which session, and which prompt consumed what. No more guessing why your monthly bill jumped.

The setup is straightforward:

npm install @relayplane/proxy

Then point your agents at the RelayPlane endpoint instead of calling the LLM provider directly. Budget policies are configured via config files, not scattered across your codebase.
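Pointing an agent at the proxy is typically a one-line change in client setup. A sketch using the official Anthropic TypeScript SDK's `baseURL` option; the endpoint URL below is a placeholder, so substitute the one from your RelayPlane configuration.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Route all agent traffic through the proxy instead of hitting the
// provider API directly. Budget enforcement then happens in one place.
const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
  baseURL: "http://localhost:8787", // placeholder: your RelayPlane proxy endpoint
});
```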

For detailed setup and policy configuration, see the RelayPlane documentation.

The Bottom Line

Agent runaway costs are a solved problem if you put the guardrails in the right place. Application-level checks are a good start. Infrastructure-level enforcement is what makes it reliable.

Set max_tokens on every request. Add turn counters and session budgets in your code. Then put a proxy layer in front of your LLM calls that enforces limits consistently, tracks spend per request, and cuts off agents before they blow through your budget.

If you are running AI agents in production and do not have budget limits enforced at the infrastructure layer, it is a matter of time before you have your own $15-in-10-minutes story.

Start here:

npm install @relayplane/proxy

Docs and configuration: relayplane.com.


Matt Turley is a solo developer building RelayPlane. He writes about AI infrastructure, agent workflows, and building in public.