AI · Automation · Engineering

Token Rationing: Guardrails for AI Spend

By Lazar MilicevicJune 26, 20269 min read

Dark server room with illuminated racks symbolizing token rationing and AI spend guardrails

Last quarter a friend running engineering at a Series B showed me their Anthropic invoice. Six figures, monthly, and climbing. The kicker: about 40% of the spend was developers asking Claude to rename variables, write one-off bash one-liners, and explain stack traces they could have grepped in five seconds. The tokenmaxxing era was short. Token rationing is here, and most teams are doing it badly — either no limits at all, or a blunt monthly cap that breaks production the day someone ships a big batch job.

I've been putting these guardrails into client systems and my own BizFlowAI ContentStudio for the last year. Here's the architecture that actually works.

Why naive budget caps fail

A single org-wide spend cap is the wrong primitive. It conflates three completely different workloads — interactive developer use, automated agent workflows, and customer-facing inference — and when the cap trips, the most expensive thing (your production agents serving paying customers) dies first because automated systems retry hardest.

The right model is a hierarchical budget with per-workload routing. Three tiers, three different policies:

Tier	Examples	Policy
Interactive (devs)	IDE chat, ad-hoc CLI	Per-user daily soft cap, route cheap by default
Internal automation	Cron jobs, batch pipelines	Per-job budget, cached aggressively
Production inference	Customer-facing agents	Reserved capacity, never throttled by other tiers

Production must be isolated. Everything else fights over what's left. I learned this the hard way watching an internal "summarize all our Notion docs" script eat the budget the day a customer demo was scheduled.

The four-layer routing rule

Before a single token hits Anthropic or OpenAI, requests pass through four gates. This is the single highest-ROI thing you can build, and most teams skip it.

1. Cache layer. Exact-match and semantic cache. Exact-match is just a hash of the prompt + model + parameters → response, with a TTL. Semantic cache uses embeddings + pgvector with a similarity threshold around 0.92 for "essentially the same question." For developer chat traffic specifically, I see 25-40% cache hit rates because devs ask the same things repeatedly ("how do I configure CORS in Next.js middleware" gets asked weekly).

2. Model routing. Default to the cheapest model that can do the job. A small classifier (or a rules engine if you want zero overhead) decides between Haiku-class, Sonnet-class, and Opus-class. For 80% of dev questions, Haiku is fine. The router should be opt-in upgradable — devs can force a bigger model with a flag — but the default is cheap.

3. Local LLM offramp. Anything that doesn't need frontier reasoning — code formatting, simple rewrites, classification, extraction with a fixed schema — goes to a local Ollama instance. I run Qwen 2.5 Coder and Llama 3.x for these. Zero per-token cost, and for structured extraction with a constrained grammar the local model is often better than a frontier model because it doesn't hallucinate fields.

4. Quota check. Only after the first three gates fail to handle it cheaply do we check the per-user / per-job budget and emit a real API call.

def route(request, user, job_ctx):
    if hit := exact_cache.get(request.hash()):
        return hit
    if hit := semantic_cache.get(request.embedding(), threshold=0.92):
        return hit
    if local_model.can_handle(request):  # classifier
        return local_model.complete(request)
    model = pick_model(request)  # haiku/sonnet/opus
    if not budget.allow(user, job_ctx, est_cost(model, request)):
        raise BudgetExceeded(user, job_ctx)
    return frontier.complete(request, model=model)

That's the whole router. About 200 lines of real code, deployed as a sidecar or a thin proxy. Every team I've helped saves 50-70% on the first month after this lands, before any prompt optimization.

Per-user budgets that don't feel like punishment

Developers will route around any guardrail that makes their day harder. Token budgets feel like surveillance unless you do two things right.

First, publish the budget and the consumption in real time. A Slack bot that says /tokens and shows me my daily usage, my team's usage, and how I compare. Not to shame anyone — to make the cost visible. When devs can see "that one prompt cost me $4," they self-regulate. I've watched a team cut their per-engineer daily spend from $35 to $9 in two weeks with no policy change, just a dashboard.

Second, make the cap a soft cap, not a hard one. Default daily budget is, say, $20 of frontier-model spend per engineer. Hitting it doesn't block — it downgrades you to Haiku + cache only for the rest of the day. Need Opus for a hard problem? Type /budget extend reason="debugging prod incident" and you get it logged. The audit trail is what matters; the friction is just enough to make people think.

The mistake I see is hard daily caps that block engineers entirely at 3pm. That's how you teach them to use personal Anthropic accounts on the side, and now you have a security problem on top of a cost problem.

Caching that actually compounds

Prompt caching from Anthropic is a separate, powerful lever — different from the response cache above. It reduces input token cost by ~90% when you reuse a long context across requests. The catch: most teams enable it and never measure whether they're actually hitting the cache.

Things that broke prompt caching for clients I've worked with:

Putting a timestamp or session ID at the top of the system prompt. Cache invalidated every request.
Dynamically reordering tool definitions. Order matters for the cache key.
Building system prompts by concatenating strings in non-deterministic dict iteration order (Python 3.6+ preserves insertion order, but if you're merging dicts from configs, watch out).
Cache TTL is 5 minutes by default. Long-running agents that pause for human review blow past it; use the 1-hour cache tier for those.

Verify your cache is working by checking the cache_read_input_tokens field in the response. If it's near zero and you think you're caching, you're not. Log it, alert on regressions. I put a Grafana panel showing cache-hit ratio per service. When the ratio drops 10% week-over-week, somebody changed a prompt and didn't realize.

Per-job budgets for autonomous agents

Interactive dev use is a paper cut. Autonomous agents are where you lose a mortgage payment in an afternoon. An agent that retries on failure, with a tool loop that doesn't terminate cleanly, can burn $2,000 in tokens overnight. I've seen it more than once.

Every agent run gets a hard token budget passed in at invocation time and checked between tool calls:

class TokenBudget:
    def __init__(self, max_tokens, max_cost_usd):
        self.max_tokens = max_tokens
        self.max_cost_usd = max_cost_usd
        self.used_tokens = 0
        self.used_cost = 0.0

    def charge(self, usage, model):
        self.used_tokens += usage.total_tokens
        self.used_cost += cost_of(usage, model)
        if self.used_cost > self.max_cost_usd:
            raise BudgetExhausted(self.used_cost, self.max_cost_usd)

Defaults I use for the ContentStudio pipeline:

Research agent: $0.50 per article
Outline + draft: $1.20 per article
Edit + SEO pass: $0.40 per article
Total ceiling: $3.00 per published piece (with $0.50 of headroom)

If an agent hits the ceiling, it doesn't retry — it writes the partial result, the trace, and the reason to a dead-letter queue and exits. A human looks at it the next morning. Better to ship 95% of the queue and review 5% than to silently burn the budget on a poison input.

The other lever: circuit breakers on tool loops. If an agent calls the same tool with similar arguments more than N times in a window, kill the run. Infinite loops are the #1 cause of agent budget incidents I've debugged.

Observability: what you actually have to log

You can't ration what you can't measure. Minimum logging per request:

Request ID, parent trace ID, user ID, team ID, job ID
Model, input tokens, output tokens, cache-read tokens, cache-write tokens
Computed USD cost (do the math at log time, prices change)
Latency, tool calls made (if agent), final outcome
Prompt template version (so you can diff cost when prompts change)

Store this in a real data warehouse, not just CloudWatch. I dump it to Postgres or BigQuery and build a daily rollup. The single most valuable view is cost per outcome — cost per closed support ticket, cost per published article, cost per generated lead. Cost per token is the wrong unit; nobody cares about tokens, they care about whether the AI is paying for itself.

When a CFO asks "is this AI spend justified," you need to show: we spent $X this month, we produced Y outcomes, the unit cost is dropping/stable/rising. If you can't answer that in 30 seconds, you don't have a token problem — you have a measurement problem, and the token problem is downstream of it.

What I'd do this quarter

If you're staring at an Anthropic bill that's growing faster than your usage justifies, here's the order of operations:

Week 1: Stand up a proxy in front of the API. Log everything. Don't change behavior yet — just measure. You'll be shocked at the top 10 most expensive prompts.
Week 2: Add exact-match response cache and verify prompt caching is actually hitting. This alone is typically 30-50% reduction.
Week 3: Ship model routing with a sane default (Haiku or equivalent for most dev use). Add a Slack /tokens bot.
Week 4: Introduce per-user soft caps with downgrade-not-block behavior. Set per-job hard budgets on every autonomous agent.
Ongoing: Move structured-extraction and classification workloads to a local model. Track cost per outcome, not cost per token.

What I would not do: announce a company-wide AI budget freeze, or build a complicated approval workflow for every prompt. Both kill the productivity gains that justified the spend in the first place. The point of guardrails isn't to use less AI. It's to use AI on the things that actually return more than they cost, and not on the things that don't.

The teams I see winning here treat token spend the way good ops teams treat AWS spend — as an engineering problem with dashboards, budgets, alerts and post-mortems, not as a procurement problem with quarterly committee meetings.

If you're trying to put real guardrails around AI spend without breaking your developers' workflow — or you're scaling agents into production and need the budget math to hold up — I help engineering teams design this kind of architecture. Get in touch at lazar-milicevic.com/#contact, or read more on the blog where I write up the systems I'm building.

Frequently asked questions

How do I stop AI API costs from spiraling out of control without breaking production?

Don't use a single org-wide spend cap — it conflates interactive dev use, internal automation, and production inference, and the cap always trips on your customers first because automated systems retry hardest. Instead, build a hierarchical budget with three isolated tiers: per-user daily soft caps for developers, per-job budgets for batch automation, and reserved capacity for production inference that never gets throttled by the other tiers. Production must be walled off completely so an internal 'summarize all our Notion docs' script can't eat the budget right before a customer demo. This separation is the single most important architectural decision in AI cost control.

What's the most effective architecture for routing LLM requests to minimize cost?

I use a four-gate router that every request passes through before any frontier API call. Gate one is an exact-match plus semantic cache (pgvector with ~0.92 similarity threshold) which hits 25-40% on developer traffic. Gate two routes to the cheapest model that can handle the job — Haiku-class for ~80% of dev questions, with opt-in upgrades. Gate three sends anything not needing frontier reasoning (formatting, classification, schema-constrained extraction) to a local Ollama instance running Qwen 2.5 Coder or Llama 3. Only after those three fail do you check the per-user budget and call the frontier API. This is roughly 200 lines of code and typically cuts spend 50-70% in the first month.

What cache hit rate can I realistically expect on developer AI traffic?

For interactive developer chat traffic, I consistently see 25-40% cache hit rates when using a combined exact-match plus semantic cache. The reason is simple: developers ask the same questions repeatedly — 'how do I configure CORS in Next.js middleware' gets asked weekly across a team. Exact-match cache is just a hash of prompt plus model plus parameters mapped to the response with a TTL. Semantic cache uses embeddings with a similarity threshold around 0.92 to catch rephrased versions of the same question. This single layer alone often pays for the entire routing infrastructure.

Should I use hard or soft budget caps for developers using AI tools?

Always soft caps. Hard daily caps that block engineers at 3pm teach them to use personal Anthropic accounts on the side, giving you a security problem on top of a cost problem. A soft cap downgrades the user to Haiku plus cache-only for the rest of the day when they hit, say, $20 of frontier spend — with an override command like '/budget extend reason=...' that's logged for audit. Pair this with real-time visibility (a Slack '/tokens' command showing personal and team usage) and developers self-regulate. I've seen per-engineer daily spend drop from $35 to $9 in two weeks just from making cost visible, with no policy change.

Why isn't Anthropic prompt caching saving me money even though I enabled it?

Enabling prompt caching and actually hitting it are two different things, and most teams never measure the hit rate. The common breakers I see in client systems are: putting a timestamp or session ID at the top of the system prompt (invalidates every request), dynamically reordering tool definitions (order is part of the cache key), and building system prompts by concatenating from dicts whose iteration order isn't stable across config merges. The default cache TTL is also only 5 minutes, so low-frequency workloads miss constantly. Audit your actual cached_input_tokens metric — if it's not ~90% of your repeated context, something upstream is mutating the prompt.

Lazar Milicevic

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.

Work with me →

Building something hard with AI or automation? I am open to talk.

Get in touch

← All posts