How to Estimate LLM API Costs Before You Build
I got a $340 bill from Anthropic once.
Not because I was running production traffic. Not because thousands of users were hammering my app. I was building a side project, testing an agent, and I walked away from my laptop for a few days. The agent kept running. The context window kept growing. The bill kept climbing.
$340 for a prototype I hadn't even shown anyone yet.
That's the thing nobody tells you about estimating AI agent costs upfront: you genuinely cannot know until you're running. And by the time you're running, it's already too late to avoid the first surprise.
This article is about fixing that. How to get a real estimate before you build, how to understand the variables that kill your budget, and how to instrument your app so you see costs in real time before they get out of hand.
The Estimation Problem
With most APIs, cost estimation is simple. You pay per request, the requests are predictable, and you can math it out in a spreadsheet before you write a line of code.
LLMs don't work like that.
The cost of a single LLM call depends on a half-dozen variables that interact in non-obvious ways. Your first test might cost $0.002. Your tenth, after you've added tool calls and retry logic and a growing conversation history, might cost $0.40. At scale, that difference wipes out your margin.
The problem compounds with agents. A single user action in an agentic system can trigger 5, 10, or 20 LLM calls. Each one carries context from previous calls. If your agent retries on failure (it should), the cost multiplies again.
You can't just price it per “interaction.” You have to understand what's actually happening at the token level.
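To see how this compounds, here's a back-of-envelope model of one agent interaction. The token counts and Sonnet-class prices are illustrative assumptions, not measurements:

```python
# Rough cost model for one agent interaction with growing context.
# All numbers are illustrative assumptions, not real measurements.
INPUT_PRICE = 3.00 / 1_000_000    # $ per input token (Sonnet-class)
OUTPUT_PRICE = 15.00 / 1_000_000  # $ per output token (Sonnet-class)

def agent_interaction_cost(system_tokens, turn_tokens, output_tokens, n_calls):
    """Each call resends the system prompt plus everything accumulated so far."""
    total = 0.0
    context = system_tokens + turn_tokens
    for _ in range(n_calls):
        total += context * INPUT_PRICE + output_tokens * OUTPUT_PRICE
        # Prior output (and any tool result) lands in the next call's input
        context += output_tokens + turn_tokens
    return total

print(f"1 call:   ${agent_interaction_cost(2000, 500, 300, 1):.4f}")
print(f"10 calls: ${agent_interaction_cost(2000, 500, 300, 10):.4f}")
```

With these assumptions, a 10-call agent loop costs roughly 19x a single call, not 10x, because each call's input includes everything before it.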
The Variables That Matter
Here's what drives your LLM API costs:
Model choice. The gap between a cheap model and an expensive one is enormous. GPT-4o Mini costs roughly 17x less than GPT-4o. Claude Haiku is a fraction of the cost of Claude Opus. Choosing the wrong model for a task is the fastest way to blow your budget.
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude Haiku 3.5 | $0.80 | $4.00 |
| Claude Sonnet 3.7 | $3.00 | $15.00 |
| Claude Opus 4 | $15.00 | $75.00 |
| GPT-4o Mini | $0.15 | $0.60 |
| GPT-4o | $2.50 | $10.00 |
| Gemini 2.0 Flash | $0.10 | $0.40 |
| Gemini 2.5 Pro | $1.25 | $10.00 |
Prices approximate as of early 2026. Always check the provider's current pricing page.
Prompt length. System prompts add up fast. A 2,000-token system prompt sent on every request costs real money at scale. At $3.00 per 1M input tokens, that's $0.006 per request just for the system prompt. At 10,000 requests/day, that's $60/day before your users say a word.
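The arithmetic is easy to check:

```python
# Cost of resending a 2,000-token system prompt on every request
system_tokens = 2_000
input_price_per_token = 3.00 / 1_000_000  # $3.00 per 1M input tokens
requests_per_day = 10_000

per_request = system_tokens * input_price_per_token
per_day = per_request * requests_per_day
print(f"${per_request:.3f}/request, ${per_day:.0f}/day")  # $0.006/request, $60/day
```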
Context size. Conversation history is the silent killer. Each turn appends to the context. A 20-turn conversation might have 40,000+ tokens in context by the end. Most developers don't account for this when estimating AI agent costs.
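A quick sketch of how that 20-turn conversation adds up, assuming roughly 2,000 tokens per turn (user message, reply, and any tool output) at $3.00/M input; both numbers are illustrative:

```python
# Cumulative input cost of a growing conversation history.
# Assumes ~2,000 tokens per turn; illustrative, not measured.
TOKENS_PER_TURN = 2_000
INPUT_PRICE = 3.00 / 1_000_000  # $3.00 per 1M input tokens

def conversation_input_cost(turns):
    # Turn n resends the history of all earlier turns plus the new one
    return sum(n * TOKENS_PER_TURN * INPUT_PRICE for n in range(1, turns + 1))

print(f"Context at turn 20: {20 * TOKENS_PER_TURN:,} tokens")
print(f"Total input spend over 20 turns: ${conversation_input_cost(20):.2f}")
```

Note the quadratic shape: doubling the conversation length roughly quadruples the total input spend.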
Tool calls. Every tool call is another round trip. If your agent calls three tools, it makes at least four LLM requests. Each one costs money. Tool outputs get appended to context, making subsequent calls more expensive too.
Retry loops. Agents fail. They get rate-limited, they hit malformed output, they time out. Good agents retry. Every retry is a full-cost duplicate request. Budget for it.
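You can fold retries into your estimate as an expected-cost multiplier. A simple model, assuming each retry is a full-cost duplicate and failures are independent:

```python
# Expected number of calls when each attempt fails with probability p_fail,
# retrying up to max_retries times (truncated geometric series).
def expected_calls(p_fail, max_retries):
    # 1 initial attempt, plus the expected number of retries
    return sum(p_fail ** k for k in range(max_retries + 1))

base_cost = 0.02  # assumed $/call, illustrative
for p in (0.05, 0.20):
    mult = expected_calls(p, max_retries=3)
    print(f"p_fail={p:.0%}: x{mult:.3f} -> ${base_cost * mult:.4f}/call")
```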
How to Pre-Estimate (Before You're Live)
You don't need to fly blind. There are practical ways to get a rough number before you commit to a model or architecture.
Token calculators. Most providers give you a way to count tokens before you send them. OpenAI publishes tiktoken, a Python package that tokenizes locally; Anthropic exposes a token-counting endpoint through its Python SDK. Count the tokens in your typical system prompt, a sample user message, and a sample response. Multiply by your expected request volume.
```shell
# Rough local estimate: English prose averages ~1.3 tokens per word
python3 -c "
text = open('your_system_prompt.txt').read()
print(f'Tokens: ~{len(text.split()) * 1.3:.0f}')  # heuristic, not a real tokenizer
"
```

For an exact count, use the provider's tokenizer, or send a single real request, log the usage fields from the response, and use that as your per-request baseline.
Dry-run mode. Build a test harness that fires a representative sample of your real use cases. Log every request and response. Don't optimize yet. Just measure. A 50-request sample will tell you more than any spreadsheet estimate.
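The harness itself can be tiny. Here's a sketch of the summarizing step, with hard-coded usage records standing in for real responses (Anthropic responses expose these as `usage.input_tokens` / `usage.output_tokens`; OpenAI as `usage.prompt_tokens` / `usage.completion_tokens`):

```python
# Summarize a dry-run sample into a per-request cost baseline.
# The usage records below are stand-ins; a real harness appends one
# record per API response. Prices assume a Sonnet-class model.
INPUT_PRICE = 3.00 / 1_000_000
OUTPUT_PRICE = 15.00 / 1_000_000

usage_log = [
    {"input": 2_450, "output": 310},
    {"input": 2_480, "output": 520},
    {"input": 3_100, "output": 180},
]

def summarize(log):
    costs = [r["input"] * INPUT_PRICE + r["output"] * OUTPUT_PRICE for r in log]
    return {
        "requests": len(costs),
        "mean_cost": sum(costs) / len(costs),
        "max_cost": max(costs),
    }

stats = summarize(usage_log)
print(f"{stats['requests']} requests, "
      f"mean ${stats['mean_cost']:.4f}, max ${stats['max_cost']:.4f}")
```

The mean gives you a baseline; the max tells you which request shapes to watch when you scale.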
Prompt compression. Before you scale, look at your system prompt critically. Can you say the same thing in half the tokens? A tight 500-token system prompt is almost always better than a sprawling 3,000-token one anyway. This is free cost reduction.
The Only Reliable Approach: Instrument Before You Scale
Here's the uncomfortable truth about LLM API cost estimation: pre-build estimates are educated guesses. The real number only shows up when real users hit your real code with real inputs.
The safe path is to instrument your app before you scale, not after. You need per-request cost data, in real time, so you can catch runaway costs before they become a $340 lesson.
This is what I built RelayPlane for.
RelayPlane is a local proxy that sits between your app and the LLM provider. It tracks cost per request, logs token usage, and shows you what's actually happening at the API level. Because it runs on your machine, the extra hop is a localhost round trip, so the performance penalty is negligible.
It supports Anthropic, OpenAI, OpenRouter, and Ollama. If you're already using one of those, you can be instrumented in about five minutes.
5-Minute Setup
Install the proxy globally:
```shell
npm install -g @relayplane/proxy
```

Start it:

```shell
relayplane start
```

Then point your app at the local proxy instead of the provider directly. For Anthropic:

```shell
export ANTHROPIC_BASE_URL=http://localhost:4100
```

For OpenAI:

```shell
export OPENAI_BASE_URL=http://localhost:4100
```

That's it. Your existing code doesn't change. The proxy intercepts the requests, forwards them to the real provider, and records cost and token data for every single call.
Start it with --verbose to see cost-per-request in real time:
```shell
relayplane start --verbose
# [2026-03-10 04:12:33] POST /v1/messages → claude-sonnet-4-6
#   1,847 in / 312 out · $0.0102
# [2026-03-10 04:12:41] POST /v1/messages → claude-haiku-4-5
#   892 in / 156 out · $0.0013
```

Or check your aggregate stats any time:

```shell
relayplane stats
```

When you can see cost-per-request in real time, the surprise bill disappears. You know immediately if your retry loop ran 20 times instead of 3. You know which user actions are expensive before they're expensive at scale.
Putting It Together
The workflow that works:
- Estimate before you build. Count tokens in your prompts. Pick a model appropriate to the task, not the most powerful one available. Budget for context growth and retries.
- Measure before you optimize. Dry-run with real inputs. Log the usage fields from every response. Build a real baseline.
- Instrument before you scale. Get per-request cost visibility from day one. RelayPlane handles this with a two-line setup.
- Optimize once you have data. Now you can make real decisions: switch models for specific task types, compress prompts, cap context windows, add smarter retry logic.
The $340 bill happened because I was flying blind. I had no visibility into what the agent was doing, how many tokens it was burning, or what it was costing per cycle. Once I had that data in front of me, the behavior became obvious and fixable in about ten minutes.
Don't estimate in a spreadsheet and hope for the best. Instrument early, see the real numbers, and make decisions based on data.
That's the job.
RelayPlane is an open npm package. Install it with npm install -g @relayplane/proxy. It runs locally, keeps your API keys on your machine, and supports Anthropic, OpenAI, OpenRouter, and Ollama.