# Agent Cost Benchmarks 2026
Real cost data for AI agent workflows. Coding agents, research agents, and support bots measured across Claude, GPT-4o, and Gemini with actual token counts from production runs.
## How these numbers were collected
Token counts are medians from real RelayPlane proxy runs across coding agents (Claude Code, Cursor, custom LangChain agents), research workflows, and customer support bots. Costs use March 2026 list pricing with no volume discounts. The "routed" column shows what cost-optimized routing achieves by sending simple steps to cheaper models and escalating only when needed. Results will vary based on your prompts and task complexity.
## Cost per task by workflow type
All costs are per task completion (not per API call). Multi-turn workflows include the full conversation context.
| Workflow | Turns | Sonnet 4.6 | GPT-4o | Haiku 4.5 | Gemini Flash | Routed |
|---|---|---|---|---|---|---|
| Coding: Single-file code edit (~4,200 in / ~850 out tokens, median) | 2 | $0.031 | $0.019 | $0.0050 | $0.0010 | $0.0050 (-84%) |
| Coding: Multi-file refactor (~18,500 in / ~3,200 out tokens, median) | 5 | $0.17 | $0.10 | $0.026 | $0.0060 | $0.042 (-75%) |
| Coding: Code review (PR) (~9,800 in / ~1,400 out tokens, median) | 1 | $0.062 | $0.038 | $0.010 | $0.0020 | $0.010 (-84%) |
| Coding: New feature (end-to-end) (~42,000 in / ~8,500 out tokens, median) | 12 | $0.53 | $0.33 | $0.084 | $0.019 | $0.12 (-78%) |
| Coding: Bug investigation (~11,000 in / ~2,100 out tokens, median) | 4 | $0.089 | $0.055 | $0.014 | $0.0030 | $0.016 (-82%) |
| Research: Research summary (web + docs) (~28,000 in / ~2,800 out tokens, median) | 3 | $0.23 | $0.14 | $0.036 | $0.0080 | $0.036 (-84%) |
| Research: Document Q&A (RAG) (~7,500 in / ~600 out tokens, median) | 1 | $0.042 | $0.026 | $0.0070 | $0.0010 | $0.0070 (-83%) |
| Support: Customer support ticket (~1,800 in / ~420 out tokens, median) | 1 | $0.013 | $0.0080 | $0.0020 | $0.0010 | $0.0020 (-85%) |
| Support: Multi-turn support chat (~6,200 in / ~1,500 out tokens, median) | 6 | $0.049 | $0.030 | $0.0080 | $0.0020 | $0.0080 (-84%) |
| Automation: Data extraction (structured) (~5,500 in / ~900 out tokens, median) | 1 | $0.038 | $0.023 | $0.0060 | $0.0010 | $0.0060 (-84%) |
"Routed" uses RelayPlane cost-optimized routing. Simple steps sent to Gemini Flash or Haiku; complex reasoning escalated to Sonnet.
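The per-model columns above follow directly from the median token counts and list prices. A minimal sketch of that arithmetic, using the Document Q&A row as an example (note that the raw list-price estimate lands somewhat below the measured table figure, presumably because production runs include overhead that this single-pass math omits):

```python
# Sketch: per-task cost from median token counts and March 2026 list prices.
# Prices are USD per 1M tokens, taken from the pricing reference below.
PRICES = {
    "claude-sonnet-4-6": (3.00, 15.00),
    "gpt-4o": (2.50, 10.00),
    "claude-haiku-4-5": (0.80, 4.00),
    "gemini-2.0-flash": (0.075, 0.30),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one task at list price, no caching or volume discounts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Document Q&A (RAG): ~7,500 in / ~600 out tokens (median)
print(task_cost("claude-sonnet-4-6", 7_500, 600))  # roughly $0.03 at list price
print(task_cost("gemini-2.0-flash", 7_500, 600))   # under a tenth of a cent
```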
## Model pricing reference
March 2026 list prices per 1M tokens. No volume discounts applied.
| Model | Provider | Input / 1M | Output / 1M | Context | Best for |
|---|---|---|---|---|---|
| claude-sonnet-4-6 | Anthropic | $3.00 | $15.00 | 200K | Complex reasoning, large codebases |
| gpt-4o | OpenAI | $2.50 | $10.00 | 128K | General tasks, vision, broad compatibility |
| claude-haiku-4-5 | Anthropic | $0.800 | $4.00 | 200K | Fast, cheap tasks, high volume |
| gemini-2.0-flash | Google | $0.075 | $0.30 | 1M | Lowest cost, massive context |
| gpt-4o-mini | OpenAI | $0.150 | $0.60 | 128K | Low cost OpenAI-compatible workloads |
| claude-opus-4-6 | Anthropic | $15.00 | $75.00 | 200K | Hardest tasks, maximum capability |
## Monthly cost projections for coding agents
| Profile | Tasks/month | Usage pattern |
|---|---|---|
| Solo developer | ~200 | Mix of quick edits, bug fixes, and occasional features |
| Small team (5 devs) | ~1,000 | Active development, regular PR reviews, daily agent usage |
| Engineering org (50 devs) | ~10,000 | Heavy agent usage, CI pipelines, automated code review |
Estimates based on median task costs above, assuming a typical mix of 40% simple edits, 40% medium tasks, and 20% complex features.
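The projection logic can be sketched as follows. The mapping of the mix buckets to specific table rows (simple = single-file edit, medium = bug investigation, complex = new feature) is an illustrative assumption, not RelayPlane's exact methodology:

```python
# Monthly projection sketch: 40% simple / 40% medium / 20% complex task mix,
# with per-task costs taken from the benchmark table above.
# Bucket-to-row mapping is an illustrative assumption.
COST = {
    "sonnet": {"simple": 0.031, "medium": 0.089, "complex": 0.53},
    "routed": {"simple": 0.0050, "medium": 0.016, "complex": 0.12},
}
MIX = {"simple": 0.40, "medium": 0.40, "complex": 0.20}

def monthly_cost(column: str, tasks_per_month: int) -> float:
    """Blended per-task cost times monthly volume, in USD."""
    per_task = sum(MIX[k] * COST[column][k] for k in MIX)
    return per_task * tasks_per_month

for tasks in (200, 1_000, 10_000):  # solo dev, small team, engineering org
    print(tasks, round(monthly_cost("sonnet", tasks), 2),
          round(monthly_cost("routed", tasks), 2))
```

Under these assumptions a solo developer sending everything to Sonnet pays roughly $31/month versus about $6.50 routed; the spread scales linearly with task volume.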
## What makes agent costs spike

### Context stuffing on every turn
Agents that reload the full codebase into context on every step are the single biggest source of runaway spend. A 10-turn task with 50K tokens of context per turn costs 10x more than one that carries only the relevant diff forward. RelayPlane flags context bloat in real time.
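The 10x figure is straight arithmetic on input tokens. A quick sketch at Sonnet's $3.00/1M input price (the 5K-token diff size is an illustrative assumption):

```python
# Input-token cost of a 10-turn task at Sonnet's $3.00/1M input price:
# reloading full codebase context each turn vs. carrying only the diff.
# The 5K-token diff size is an illustrative assumption.
def input_cost(tokens_per_turn: int, turns: int = 10) -> float:
    return tokens_per_turn * turns * 3.00 / 1_000_000  # USD

print(input_cost(50_000))  # full context: 500K input tokens, $1.50
print(input_cost(5_000))   # diff only:     50K input tokens, $0.15
```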
### Model mismatches: using Sonnet for everything
Routing every request to the most capable model, regardless of task complexity, is easy to implement and expensive to run. A grep or a docstring rewrite does not need Sonnet. Routing those to Haiku or Gemini Flash reduces per-task cost by 80-95% with no quality loss.
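A minimal router sketch illustrates the idea. This is an illustrative heuristic (prompt length, tool use, context size), not RelayPlane's actual classifier; the thresholds are assumptions:

```python
# Illustrative complexity router, not RelayPlane's actual classifier:
# short single-step requests go cheap; long or tool-heavy requests escalate.
CHEAP, CAPABLE = "gemini-2.0-flash", "claude-sonnet-4-6"

def route(prompt: str, uses_tools: bool = False, context_tokens: int = 0) -> str:
    if uses_tools or context_tokens > 20_000:
        return CAPABLE          # multi-step tool use or large context
    if len(prompt) > 2_000:
        return CAPABLE          # long prompts tend to need more reasoning
    return CHEAP

print(route("Rewrite this docstring in imperative mood."))  # cheap model
print(route("Refactor the auth module", uses_tools=True))   # capable model
```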
### Retry loops and runaway agents
An agent that retries a failing tool call 20 times before giving up generates 20 full-context requests. Without loop detection, one stuck agent can generate hundreds of dollars of spend in minutes. RelayPlane detects repeated identical requests and stops them after a configurable threshold.
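Repeated-identical-request detection can be sketched in a few lines. This is an illustrative implementation, not RelayPlane's actual mechanism; the threshold of 5 is an assumed configurable value:

```python
import hashlib
from collections import Counter

# Illustrative loop detector: hash each request body and stop forwarding
# once the same hash has been seen more than MAX_REPEATS times.
MAX_REPEATS = 5  # assumed configurable threshold
_seen = Counter()

def should_forward(request_body: bytes) -> bool:
    digest = hashlib.sha256(request_body).hexdigest()
    _seen[digest] += 1
    return _seen[digest] <= MAX_REPEATS

body = b'{"tool": "run_tests", "args": []}'
results = [should_forward(body) for _ in range(8)]
print(results)  # the first 5 pass, the rest are blocked
```

A real proxy would also expire old hashes so legitimate repeats hours apart are not blocked.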
### No per-agent visibility
When all LLM traffic is billed to one API key, it is impossible to know which agent or feature is responsible for a spike. RelayPlane fingerprints system prompts to attribute every token to its source, so you can see exactly which workflow is burning the budget.
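The fingerprinting idea can be sketched as hashing the system prompt and accumulating usage per hash. This is an illustrative sketch, not RelayPlane's actual attribution pipeline:

```python
import hashlib
from collections import defaultdict

# Illustrative attribution sketch: fingerprint each system prompt and
# accumulate token usage per fingerprint, so spend is attributable even
# when every agent shares a single API key.
usage = defaultdict(int)

def record(system_prompt: str, total_tokens: int) -> str:
    fp = hashlib.sha256(system_prompt.encode()).hexdigest()[:12]
    usage[fp] += total_tokens
    return fp

coder = record("You are a coding agent.", 4_200)
support = record("You are a support bot.", 1_800)
record("You are a coding agent.", 3_100)   # same prompt, same fingerprint
print(usage[coder])  # 7300 tokens attributed to the coding agent
```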
## How RelayPlane routing achieves these savings
RelayPlane sits between your agent and the upstream provider as a localhost proxy on port 4100. Every request is classified by complexity before being forwarded. Short, predictable tasks are routed to Gemini Flash or Claude Haiku. Requests that require nuanced reasoning, large context, or multi-step tool use are escalated to Sonnet or GPT-4o.
The routing decision happens in under 2ms and does not require any changes to your agent code. You point your existing OpenAI-compatible client at localhost:4100 and the proxy handles the rest.
```shell
npm install -g @relayplane/proxy
relayplane start

# Your agent config (no other changes needed)
OPENAI_BASE_URL=http://localhost:4100/v1
```