Your AI Agent Bill Tripled This Quarter. Here's What's Actually Burning Tokens.
Your AI agent bill went vertical this quarter. Here's exactly what's burning tokens, the three biggest leaks, and the 4-knob system that cuts cost 60-80%.
A single chatbot reply used to cost about $0.04. The same task wrapped in an agent — one that searches, reasons, retries, and calls tools — now runs about $1.20. That's 30x, from one EY benchmark published this month.[1]
If your AI line item went vertical in Q2, you didn't get more expensive AI. You got an agent. And nobody warned you what an agent actually does to your bill.
Two days ago, Gartner said the quiet part out loud: by 2028, the average AI coding agent will cost more per developer than the developer's salary.[2] Goldman Sachs projects total token consumption goes from 5 quadrillion to 120 quadrillion per month between now and 2030 — 24x — and explicitly blames agents for it.[3] Uber and Microsoft have already burned through their entire 2026 AI budgets in months, not years.[4]
Here's what nobody is telling operators running $1M–$20M businesses: the bill isn't going to come down. The math of agents won't let it. But most of the budget is being lit on fire by three specific things you can fix this week, with no new vendor and no new platform.
Why agents burn 5-30x more tokens than chat
A chat call is one round trip. You ask, it answers. Maybe 1,500 tokens in, 500 out, billed once.
An agent is a loop. It plans, calls a tool, reads the result, plans again, calls another tool, retries when something fails, summarizes, then writes the answer. Every loop iteration re-sends the entire growing context. Gartner pegs agentic workflows at 5-30x more tokens per task than chat — and that's the conservative number.[5] Galileo's benchmark of agents on SWE-Bench and GAIA found high-performing agents using 10-50x more tokens, mostly from iterative reasoning loops they can't shut off.[6]
LeanOps ran a real production audit on 1,127 agent runs across Claude and OpenAI models. The result is the most useful number I've seen all year: 62% of the bill was re-sent context.[7] Not new work. Not new thinking. Just the agent re-pasting its own previous messages into the next call because it had no other way to remember them.
That's the first leak. We'll get to fixing it in a minute.
The three things actually burning your money
I've watched a few operator stacks now and the pattern doesn't move. Almost every blown-out token bill has the same three causes.
1. Wrong model for the wrong job
This is the biggest one. Anthropic's current pricing is brutal once you do the math: Claude Opus 4.7 is $5 input / $25 output per million tokens. Sonnet 4.6 is $3 / $15. Haiku 4.5 is $1 / $5.[8] Same family, but Opus is 5x Haiku on input and 5x on output. You can argue Opus is worth the premium for hard reasoning — and sometimes it is — but most agent steps are not hard reasoning. Most steps are: read this email, decide if it's a lead, write a Slack message. Haiku eats that for breakfast at 20% of the cost.
The "tier-1 model for everything" pattern is what blows the budget. LeanOps measured a 20x spread between the 10th and 90th percentile developer using the same coding tool — same prompts, same tasks — driven almost entirely by which default model they picked.[7] That's not a productivity gap. That's a routing gap.
2. Re-sent context (your agent has goldfish memory)
When an agent runs a 10-step task, it usually re-sends every previous step to every subsequent call. If step 1 had 2,000 tokens of context, step 10 has 20,000+. You're paying for the agent to re-read its own diary on every step.
Prompt caching fixes most of this — Anthropic charges roughly 10% of the input rate for cached tokens — and most agents I see have it turned off.[8] OpenAI auto-caches identical prefixes for 50% off. If you're running an agent and you haven't audited what's cacheable in your system prompt and tool definitions, that's the single highest-ROI thing you can do this week. Not next sprint. This week.
3. Runaway loops nobody monitors
Agents fail silently expensive. A retry loop that goes 47 turns deep because a flaky API returned a malformed JSON three times costs the same as the agent doing real work — except it produces nothing.
EY's number — $1.20 per "agentic interaction" — is the average.[1] The tail is much worse. I've seen single failed runs cost $40 because no one set a hard token ceiling or a retry cap. Multiply that by 200 attempts a day and you have a $24K monthly invoice for an agent that completed maybe 12% of its work.
That 12% number isn't pessimism, by the way. a16z's agent infrastructure survey this year found only 12% of enterprise agent pilots make it to production.[9] The other 88% die — and they die expensive.
What I'd actually do — the 4-knob system
For a $1M–$20M business running 1-3 agents in production, here's the system. Four knobs. No new platforms. You can ship this in a week.
Knob 1 — Tiered routing. Stop sending every agent step to your most expensive model. Use Haiku 4.5 (or GPT-4o-mini, or Gemini Flash) as the default. Promote to Sonnet only when the step requires reasoning over 3+ documents. Promote to Opus only for the final synthesis on high-stakes outputs. This single change typically cuts cost 60-80%. TechCrunch this month profiled Factory and several other startups now selling exactly this as a "model router" service.[4] You don't need their product. You need a 30-line function with an if/elif.
Knob 2 — Caching audit. Take every system prompt, tool schema, and few-shot example in your agent. Anything that doesn't change between runs goes in a cacheable prefix. Anthropic's prompt cache cuts that portion to ~10% of the input rate. Most teams I've seen recover 40-60% of input spend the day they turn this on.
Knob 3 — Hard ceilings. Set three limits in code: max tokens per run, max tool calls per run, max cost per run in dollars. When the agent hits any of them, it stops, logs, and pings a human. No exceptions. This is the single highest-impact governance change you can make. Bigeye's tracking guide has the pattern.[10]
Knob 4 — Token-per-outcome dashboards. Stop tracking "how many tokens did we use." Start tracking "tokens per successful outcome." If a successful lead qualification costs you 8,000 tokens today and 14,000 next month, your agent is degrading and you need to know before the bill arrives. Andrew Macdonald, Uber's COO, said it cleanly last month: token usage doesn't correlate with useful features.[11] Track the outcome, not the consumption.
Why this matters more than people think
The cheap-AI era is over. Inference is now 85% of enterprise AI budgets, according to AnalyticsWeek's 2026 report.[12] Token prices won't fall fast enough to save anyone — Gartner thinks LLM inference cost-per-token drops 90% by 2030, but volume grows 24x in the same window.[3] You can do the math: total spend goes up, not down.
This is the moment small operators have a real edge over enterprises. A $5M Shopify brand can rebuild its agent stack with cost discipline in a week. JPMorgan can't. By the time the org chart has had three meetings about "AI FinOps," the operator next door has already migrated 80% of their agent traffic to Haiku and shipped caching.
The companies that win the next two years won't be the ones using the most expensive model. They'll be the ones whose agents complete work for $0.12 while their competitors pay $1.20 for the same outcome.
That gap compounds fast.
If you want this built
I build these systems for operators — the routing layer, the cache audit, the ceilings, the dashboards. Same approach I'm describing here, your data, your stack. If your AI line item just tripled and nobody on your team can tell you why in 10 seconds, that's exactly what the audit call is for. 30 minutes, no pitch, you leave with a written breakdown of what your token bill should actually look like.
Book it at zerocam.studio.
-
Agentic AI Enterprise Token Cost↩
Single chatbot interaction ~$0.04 vs orchestrated agentic interaction ~$1.20 — 30x cost increase.
-
Gartner Predicts AI Coding Costs Will Surpass Average Developer's Salary by 2028↩
By 2028 the average AI coding agent will cost more per developer than the developer's salary.
-
Uber, Microsoft, and Others Burning Through AI Budgets↩
Goldman Sachs projects token consumption multiplies 24x to 120 quadrillion tokens/month between 2026 and 2030.
-
The token bill comes due: Inside the industry scramble to manage AI's runaway costs↩
Industry scramble to manage AI token costs; Factory and others shipping model router products.
-
LLM inference costs to fall 90% by 2030 (Gartner)↩
Agentic models can consume between 5 and 30 times more tokens per task than a standard chatbot.
-
AI Agents Burn 50x More Tokens Than Chats↩
Production audit of 1,127 agent runs found re-sent context accounts for 62% of the token bill; 20x cost spread between p10 and p90 developers.
-
Anthropic Claude API Pricing In 2026: Every Model, Token Rate, And Cost Lever↩
Claude Opus 4.7 at $5/$25, Sonnet 4.6 at $3/$15, Haiku 4.5 at $1/$5 per million tokens; prompt caching reduces input cost to ~10% of base rate.
-
AI Agent Productivity Statistics 2026: 100+ ROI Data Points↩
Forrester puts eval-and-integration costs at 28-44% of total agent program cost in mature deployments.
-
How to track AI agent costs and token usage↩
Gartner estimates agentic models require 5-30x more tokens per task than chatbots due to multi-step planning and context resending.
-
AI costs begin to bite as agents may increase token demand by 24 times↩
Uber COO Andrew Macdonald says token usage doesn't directly correlate with useful consumer features.
-
AI Inference Cost Crisis 2026: Why Your AI Bill Is Exploding↩
AI inference now represents 85% of enterprise AI budgets per AnalyticsWeek 2026 Inference Economics report.
Ready to build your own AI system?
Book a Free Audit Call →Keep Reading
46% Of Customers Hate Your AI Support Bot. Here's What To Build Instead.
46% of customers say AI support rarely works. Cursor's bot invented a refund policy and tanked subscriptions. Here's the 4-part build that fixes it.
Claude Can Build n8n Flows Now. Should You Still Pay For n8n?
Claude can now write n8n workflows directly via MCP. The takes are wrong: n8n isn't the IDE, it's the runtime — and the math says it's about to get bigger, not smaller.
Why Your AI Agent Pilot Won't Survive Production
74% of enterprises have rolled back a live AI agent after launch. The model isn't why. Here's the operations layer most vendors don't sell — and the 5-step playbook I'd actually run.