
Agent Runaway Costs: How to Set LLM Budget Limits Before Costs Spiral

Matt Turley · 7 min read

A developer on r/AI_Agents recently described watching their agent rack up $15 in API costs in under 10 minutes before they caught it. The comments were full of developers sharing similar stories. Fifteen dollars might not sound like much, but multiply it across a fleet of agents running in production, 24/7, and you start to see the problem. Agent runaway costs are not a theoretical risk. They are happening right now to teams that ship AI agents without budget guardrails.

This post covers why it happens, how to set LLM agent budget limits in your code, and how to enforce them at the infrastructure layer so a single broken loop never drains your API credits again.

Why Runaway Costs Happen

Most agent architectures follow some version of a loop: observe, think, act, repeat. The agent keeps calling the LLM until it decides it is done. That “until” is where the money goes.

Here are the most common failure modes:

Token loops. The agent gets stuck retrying a failed tool call or rephrasing the same query. Each iteration burns input and output tokens. Without a loop counter or token ceiling, this runs until your API key hits its rate limit, or your wallet hits zero.

No max_tokens set. If you do not cap output tokens per request, the model will generate as much as it wants. A single response can run to 8,000+ tokens if the model decides to be thorough. Across hundreds of requests, that adds up fast.

No circuit breaker. The agent has no concept of a session budget. It does not know or care that the last 30 calls cost $4. It just keeps going.

Parallel agents. If you are running multiple agents concurrently (common in agentic workflows that fan out subtasks), each one independently accumulates cost. A bug in one agent's logic multiplies across every instance.
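The first failure mode, token loops, is also the easiest to detect in code. A minimal sketch: flag the session when the agent issues the exact same action several times in a row. The `Action` shape and the repeat threshold here are illustrative assumptions, not from any particular framework.

```typescript
// Minimal loop detection: trip when the agent repeats the exact same
// action too many consecutive times. Shapes and thresholds are illustrative.
type Action = { tool: string; args: string };

function makeLoopDetector(maxRepeats = 3) {
  let lastKey = "";
  let repeats = 0;
  return (action: Action): boolean => {
    const key = `${action.tool}:${action.args}`;
    repeats = key === lastKey ? repeats + 1 : 1;
    lastKey = key;
    return repeats >= maxRepeats; // true = likely stuck in a loop
  };
}

const isLooping = makeLoopDetector(3);
const stuck: Action = { tool: "search", args: '{"q":"same query"}' };
console.log(isLooping(stuck)); // false
console.log(isLooping(stuck)); // false
console.log(isLooping(stuck)); // true: third identical call trips the detector
```

This catches the degenerate case (identical retries) cheaply; fuzzier loops, such as rephrasings of the same query, need the budget ceilings described below.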

The Hidden Math

Let's look at real numbers. Claude Sonnet 4.5 costs $3 per million input tokens and $15 per million output tokens. That sounds cheap until you do the math on an agentic loop.

A typical agent turn might include 2,000 input tokens (system prompt, conversation history, tool results) and 800 output tokens. That is roughly two cents per turn. But an agent that takes 50 turns on a complex task hits 100,000 input tokens and 40,000 output tokens, costing roughly $0.90 per session. Run 100 of those sessions per hour, and you are looking at $90/hour, or over $2,100/day.

Now imagine one of those sessions gets stuck in a loop and runs 500 turns instead of 50. A single broken session can cost $9 or more. If you are running parallel agents and the bug affects multiple instances, that $15-in-10-minutes story starts to look conservative.
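The arithmetic above is worth encoding once so you can sanity-check your own numbers. A small helper using the Sonnet 4.5 prices quoted earlier ($3/M input, $15/M output); the per-turn token counts are the article's illustrative averages, not measured values.

```typescript
// Back-of-envelope session cost in dollars at $3/M input and $15/M output.
// Per-turn token counts default to the illustrative averages above.
function sessionCostUSD(turns: number, inPerTurn = 2000, outPerTurn = 800): number {
  const inputTokens = turns * inPerTurn;
  const outputTokens = turns * outPerTurn;
  return (inputTokens / 1_000_000) * 3 + (outputTokens / 1_000_000) * 15;
}

console.log(sessionCostUSD(50).toFixed(2));  // "0.90" — a normal 50-turn session
console.log(sessionCostUSD(500).toFixed(2)); // "9.00" — the same session stuck at 500 turns
```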

The point is not that LLM APIs are expensive. The point is that agent runaway costs scale silently and fast when there is no ceiling.

How to Set Budget Limits in Code

The first layer of defense is in your application code. Here are practical patterns to prevent AI agent overspending.

Cap output tokens per request:

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5-20250929",
  max_tokens: 1024, // Hard cap per response
  messages: conversationHistory,
});

Add a turn counter with a hard stop:

const MAX_TURNS = 25;
let turnCount = 0;

while (!agent.isDone() && turnCount < MAX_TURNS) {
  await agent.step();
  turnCount++;
}

if (turnCount >= MAX_TURNS) {
  logger.warn("Agent hit turn limit, terminating session");
}

Track cumulative token usage:

let totalInputTokens = 0;
let totalOutputTokens = 0;
const SESSION_BUDGET_CENTS = 50; // $0.50 per session

function checkBudget(usage: { input_tokens: number; output_tokens: number }) {
  totalInputTokens += usage.input_tokens;
  totalOutputTokens += usage.output_tokens;

  // Prices in cents per million tokens: $3/M input = 300, $15/M output = 1500
  const costCents =
    (totalInputTokens / 1_000_000) * 300 +
    (totalOutputTokens / 1_000_000) * 1500;

  if (costCents > SESSION_BUDGET_CENTS) {
    throw new Error(`Session budget exceeded: ${costCents.toFixed(1)}¢`);
  }
}
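The turn counter and the budget check compose naturally into one guarded loop. A hedged sketch, where the `StepResult` shape and the `step` callback are illustrative stand-ins for whatever your agent framework returns:

```typescript
// Agent loop guarded by both a turn cap and a session budget.
// StepResult is an illustrative shape; real frameworks report usage differently.
interface StepResult {
  done: boolean;
  usage: { input_tokens: number; output_tokens: number };
}

function runGuarded(
  step: () => StepResult,
  maxTurns = 25,
  budgetCents = 50
): "done" | "turn_limit" | "budget_exceeded" {
  let spentCents = 0;
  for (let turn = 0; turn < maxTurns; turn++) {
    const { done, usage } = step();
    // Same pricing as above: 300¢ per million input, 1,500¢ per million output
    spentCents +=
      (usage.input_tokens / 1_000_000) * 300 +
      (usage.output_tokens / 1_000_000) * 1500;
    if (spentCents > budgetCents) return "budget_exceeded";
    if (done) return "done";
  }
  return "turn_limit";
}
```

Whichever limit trips first ends the session, so a token loop is bounded by turns and an unusually verbose session is bounded by spend.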

These patterns work. But they have a weakness: they live in your application code. If the agent crashes and restarts, the counters reset. If a developer forgets to add the check in a new agent, there is no safety net. And if you need to change limits across a fleet, you are deploying code changes to every service.

Budget Limits at the Infrastructure Layer

This is where RelayPlane fits in. Instead of relying on every agent implementation to track its own spend, you enforce LLM agent budget limits at the proxy layer, before requests ever reach the LLM provider.

RelayPlane sits between your agents and the API. Every request passes through it, which means budget enforcement happens in one place, consistently, regardless of which agent sent the request.

Here is what that looks like in practice:

Per-request token caps. RelayPlane enforces max_tokens on every request, even if the calling code forgot to set it. You define the default in your proxy config, not in each agent's source code.

Daily and hourly budget caps. You set a dollar ceiling per day, per hour, or per request. When the budget is hit, RelayPlane returns an error to the agent instead of forwarding the request. The agent gets a clear signal to stop, and your bill stops growing.

Automatic cutoff with alerts. RelayPlane tracks spend against your configured thresholds and alerts you before costs spiral, not after.

Per-request cost tracking. Every request is logged with its token count and cost. You can see exactly which agent, which session, and which prompt consumed what. No more guessing why your monthly bill jumped.

The setup is straightforward:

npm install @relayplane/proxy

Then point your agents at the RelayPlane endpoint instead of calling the LLM provider directly. Budget policies are configured via config files, not scattered across your codebase.
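Pointing an agent at the proxy is typically a one-line change in client setup. A sketch using the official Anthropic TypeScript SDK's `baseURL` option; the endpoint URL below is a placeholder, so substitute the one from your RelayPlane configuration.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Route all agent traffic through the proxy instead of hitting the
// provider API directly. Budget enforcement then happens in one place.
const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
  baseURL: "http://localhost:8787", // placeholder: your RelayPlane proxy endpoint
});
```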

For detailed setup and policy configuration, see the RelayPlane documentation.

The Bottom Line

Agent runaway costs are a solved problem if you put the guardrails in the right place. Application-level checks are a good start. Infrastructure-level enforcement is what makes it reliable.

Set max_tokens on every request. Add turn counters and session budgets in your code. Then put a proxy layer in front of your LLM calls that enforces limits consistently, tracks spend per request, and cuts off agents before they blow through your budget.

If you are running AI agents in production and do not have budget limits enforced at the infrastructure layer, it is a matter of time before you have your own $15-in-10-minutes story.

Start here:

npm install @relayplane/proxy

Docs and configuration: relayplane.com.


Matt Turley is a solo developer building RelayPlane. He writes about AI infrastructure, agent workflows, and building in public.