Claude API Behind a Proxy: Cost Control, Fallbacks, and Routing for Production Apps
Running Claude in production is different from running it in a demo. Demos don't have rate limits that spike at 9am. Demos don't have users who discover they can paste 50,000 tokens into a text field. Demos don't need to stay online when Anthropic has an incident.
This post covers three production concerns that come up when shipping Claude-based features: rate limits, cost overruns, and provider fallbacks. Each one has a clean solution when you're running traffic through @relayplane/proxy.
Concern 1: Rate Limits That Kill User Requests
Anthropic's rate limits are per API key. If your app sends 50 concurrent requests and you hit the token-per-minute ceiling, users get errors. The naive fix is to tell users to try again. The production fix is to queue, retry with backoff, and make the failure invisible.
With a raw Anthropic client, you're writing this yourself:
// What retry logic looks like without a proxy
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

async function callClaude(messages, retries = 0) {
  try {
    return await anthropic.messages.create({
      model: 'claude-opus-4-6',
      max_tokens: 1024,
      messages,
    });
  } catch (err) {
    // Retry 429s up to three times with linear backoff
    if (err.status === 429 && retries < 3) {
      const delay = (retries + 1) * 2000;
      await new Promise(r => setTimeout(r, delay));
      return callClaude(messages, retries + 1);
    }
    throw err;
  }
}

That handles the simple case. The real problem is that a 429 from Anthropic comes in two flavors: requests-per-minute (RPM) and tokens-per-minute (TPM). The correct retry delay for each is different. Anthropic includes a retry-after header for RPM limits but not always for TPM limits. You also need to handle 529 (overloaded) separately.
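Handling all of that pushes the helper toward something like the sketch below. It assumes the SDK surfaces response headers on the error as err.headers (the exact shape varies by SDK version), and the backoff constants are illustrative:

// Sketch of the fuller version — assumes err.headers exposes response
// headers as a plain object; the exact accessor varies by SDK version.
async function callClaudeRobust(messages, retries = 0) {
  try {
    return await anthropic.messages.create({
      model: 'claude-opus-4-6',
      max_tokens: 1024,
      messages,
    });
  } catch (err) {
    const retryable = (err.status === 429 || err.status === 529) && retries < 3;
    if (!retryable) throw err;
    // Honor retry-after when present (typically RPM limits); otherwise
    // fall back to exponential backoff with jitter (TPM limits, 529s).
    const retryAfter = Number(err.headers?.['retry-after']);
    const delay = retryAfter > 0
      ? retryAfter * 1000
      : 2 ** retries * 1000 + Math.random() * 250;
    await new Promise(r => setTimeout(r, delay));
    return callClaudeRobust(messages, retries + 1);
  }
}

And that still isn't queueing, concurrency limiting, or per-route budgets.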
With @relayplane/proxy, you install it and point your client at localhost:4100. The proxy handles rate limit detection, retry timing, and backoff. Your application code never sees 429s.
npm install -g @relayplane/proxy
relayplane init
relayplane start

import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic({
  baseURL: 'http://localhost:4100',
});

// Exactly the same code. Rate limit handling is in the proxy layer.
const response = await anthropic.messages.create({
  model: 'claude-opus-4-6',
  max_tokens: 1024,
  messages: [{ role: 'user', content: userMessage }],
});

Concern 2: Cost Overruns from Long Contexts
Claude Opus is significantly more expensive per token than Sonnet or Haiku. A user who pastes a 100,000-token document into your app can generate a substantial per-request cost: at Opus-class input pricing (on the order of $15 per million input tokens at the time of writing; check current rates), that single request runs roughly $1.50 before counting output tokens. If you have 1,000 users doing that, you have a problem before end of day.
The typical response is to add token counting before every request and reject anything over a threshold. That works, but token counting isn't free, and you end up with token budget logic scattered across every call site.
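For concreteness, the per-call-site version looks something like this. The 20,000-token budget is an arbitrary illustrative threshold, and countTokens is the token-counting endpoint in recent versions of the Anthropic SDK (verify the method name in your version):

// Per-call-site budget guard — the threshold is illustrative.
const MAX_INPUT_TOKENS = 20_000;

async function guardedCall(messages) {
  const { input_tokens } = await anthropic.messages.countTokens({
    model: 'claude-opus-4-6',
    messages,
  });
  if (input_tokens > MAX_INPUT_TOKENS) {
    throw new Error(`Input of ${input_tokens} tokens exceeds budget`);
  }
  return anthropic.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 1024,
    messages,
  });
}

Multiply that by every call site and the maintenance cost adds up.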
The better move is to get visibility first. Every request through the proxy gets logged to local SQLite with token counts and cost estimates based on maintained pricing tables. No separate accounting layer. Check your current spend at any time:
relayplane stats

Or open http://localhost:4100 for the full dashboard with spend by model, request history, and latency breakdowns.
With that data in front of you, you can make informed decisions: switch long-context requests to Sonnet instead of Opus, add input truncation before requests that consistently blow your budget, or route high-volume low-complexity tasks to a cheaper model. The proxy's routing config handles the model-switching part without touching your application code.
// Query Claude spend this week by model
import Database from 'better-sqlite3';
import { homedir } from 'os';
import { join } from 'path';

const db = new Database(join(homedir(), '.relayplane', 'data.db'));

const claudeSpend = db.prepare(`
  SELECT
    model,
    SUM(tokens_in) as total_tokens_in,
    SUM(tokens_out) as total_tokens_out,
    SUM(cost_usd) as total_cost_usd,
    COUNT(*) as request_count
  FROM runs
  WHERE model LIKE 'claude-%'
    AND created_at >= date('now', '-7 days')
  GROUP BY model
`).all();

Cost tracking works across all 11 supported providers, so the same queries work whether traffic is going to Claude, GPT-4o, or Gemini.
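The result is a plain array of rows, so a summary is a few lines away:

// Print a per-model spend summary to the console
for (const row of claudeSpend) {
  console.log(
    `${row.model}: $${row.total_cost_usd.toFixed(2)} over ${row.request_count} requests`
  );
}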
Concern 3: Staying Online When Anthropic Has an Incident
Anthropic has a status page, and it isn't green 100% of the time. When Claude is a core part of your product, a provider incident means your product is down. The user-facing experience depends entirely on one vendor's reliability.
The fix is automatic fallback routing: if Claude returns a 5xx or times out, route the request to GPT-4o and return that response instead. For many general tasks, the outputs are interchangeable enough that users won't notice the swap.
With @relayplane/proxy, routing is configuration. The proxy supports complexity-based routing (route by inferred prompt complexity) and cascade mode (start cheap, escalate if the model signals uncertainty or refusal):
{
  "routing": {
    "mode": "complexity",
    "complexity": {
      "simple": "claude-haiku-3-5",
      "moderate": "claude-sonnet-4-5",
      "complex": "claude-opus-4-6"
    },
    "cascade": {
      "enabled": true,
      "escalateOn": "rate_limit",
      "maxEscalations": 2
    }
  }
}

In complexity mode, the proxy classifies the incoming prompt (using regex pattern matching on content) and routes to the appropriate model tier. A simple FAQ answer goes to Haiku. A multi-step code analysis goes to Opus. Your application always calls the same endpoint and gets a response back from whichever model handled it.
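To make that concrete, here's a toy classifier in the same spirit — illustrative only, not RelayPlane's actual heuristics:

// Toy complexity classifier — illustrative, not RelayPlane's actual rules.
// Cheap lexical signals decide which model tier a prompt needs.
function classifyComplexity(prompt) {
  if (/\b(refactor|debug|architecture|prove|step[- ]by[- ]step)\b/i.test(prompt)) {
    return 'complex';
  }
  if (prompt.length > 2000 || /```/.test(prompt)) {
    return 'moderate';
  }
  return 'simple';
}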
Cascade mode starts with the cheapest model and escalates automatically on configured triggers (like rate limits). Most requests resolve at the cheaper tier. Escalations get logged so you can review the distribution over time.
When Anthropic returns a 529 (overloaded), the proxy's circuit breaker engages and can route to a configured alternative provider during the cooldown window. Provider cooldowns are handled automatically.
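If you haven't seen the pattern before, a circuit breaker is a few lines of state. This is a conceptual sketch of the idea, not the proxy's implementation — the cooldown constant is illustrative:

// Conceptual circuit breaker — not RelayPlane's implementation.
let trippedUntil = 0;
const COOLDOWN_MS = 60_000; // illustrative cooldown window

async function callWithBreaker(primary, fallback) {
  if (Date.now() < trippedUntil) return fallback(); // breaker open: skip primary
  try {
    return await primary();
  } catch (err) {
    if (err.status === 529) {
      trippedUntil = Date.now() + COOLDOWN_MS; // trip the breaker
      return fallback();
    }
    throw err;
  }
}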
The Proxy as Infrastructure
These three patterns — retry handling, cost controls, and provider fallbacks — show up in every production Claude deployment eventually. The question is whether you build and maintain that logic in your application code or put it in a dedicated layer.
The dedicated layer approach (the proxy) keeps your application code focused on what it's supposed to do. Rate limit handling isn't a business logic problem. Neither is provider failover. Putting that in a proxy and owning a one-line config change per policy is cleaner than spreading retry logic across 20 call sites.
@relayplane/proxy is version 1.8.10, MIT licensed, 4 dependencies, runs locally with SQLite. You can inspect the source, run it in your own infrastructure, and keep your API keys out of any third-party service.
Get Started
npm install -g @relayplane/proxy
relayplane init
relayplane start

Point your Anthropic client at http://localhost:4100. All existing Anthropic SDK code works without changes.
Open http://localhost:4100 in your browser to see your current Claude traffic patterns, or run relayplane stats for a CLI summary. Then configure routing in ~/.relayplane/config.json to match your production requirements.
Full documentation at relayplane.com. If you're running Claude in a product used by real users, the proxy takes about 10 minutes to set up and removes whole categories of incidents from your production backlog.
Matt Turley is a solo developer building RelayPlane. He writes about AI infrastructure, agent workflows, and building in public.