Cutting Agent Token Use 99% With Tool Routing

By Lazar Milicevic · July 3, 2026 · 10 min read
[Image: Network cables routed through a switch, illustrating tool routing to cut agent token use by 99%]

Last quarter I audited an agent that had accumulated 147 tools. Every request stuffed all 147 JSON schemas into the system prompt. Average input size before the user even said hello: 38,000 tokens. The team wondered why their Claude bill had tripled and why the agent kept calling send_slack_message when the user asked for a Jira ticket.

Alibaba's new SkillWeaver paper (via VentureBeat) hits this exact nerve. They build an execution graph up front and only load the skills the graph needs. It is the same problem I have been solving in production for a while now, and the fix is almost always some flavor of the same idea: stop pretending the LLM should see every tool on every turn.

The real cost of "just give it all the tools"

The default pattern in most agent tutorials is to register every tool with the model and let it pick. This works fine at 5 tools. It falls apart somewhere between 30 and 60. Here is what actually goes wrong, in the order it usually shows up:

  1. Token cost grows linearly with tool count. Every tool schema, description, and parameter list ships on every call. With 100+ tools you are paying five figures per month just to remind the model what exists.
  2. Selection accuracy drops. Anthropic and others have shown that models degrade at tool selection as the tool count grows. Long tail tools get ignored. Similar tools get confused (update_ticket vs edit_ticket vs patch_ticket).
  3. Latency creeps up. More input tokens mean a longer time to first token. On a multi-step agent this compounds: 8 steps at +400ms each adds 3.2 seconds, which is a user-visible problem.
  4. Debugging becomes archaeology. When the agent picks the wrong tool, was it the description, the ordering, the name collision, or the context poisoning from tool 94? Good luck.

For the 4-system automation ecosystem I run (the one saving me 73+ hours a month), I hit this wall at around tool 40. That is when I stopped registering tools statically and started routing them.

What SkillWeaver actually does

If you have not read the paper, the short version: SkillWeaver treats the task as a graph. It plans the workflow first, decides which skills each node needs, then only exposes those skills to the executor at each step. The routing is done by a lighter model working on task semantics, not by loading every skill into the reasoner.

This is a specific instance of a broader pattern I would call two-tier tool selection: a cheap, fast router decides what could possibly be relevant, and the expensive reasoner decides what to actually do from a short list.

The claim of 99% token reduction is not marketing. If you go from 100 tool schemas at 300 tokens each (30,000 tokens) down to 3 tool schemas (900 tokens), that is 97% right there. Add the shorter system prompt and you clear 99% comfortably. I have measured similar numbers on my own systems.
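
The arithmetic in that paragraph can be checked in a few lines. This is only a sketch of the schema-token portion: the function name is made up for illustration, and it assumes every schema costs the same number of tokens, which real tool sets will not match exactly.

```python
def token_reduction(tools_before: int, tools_after: int, tokens_per_schema: int) -> float:
    """Fractional drop in schema tokens when the prompt carries fewer tools."""
    before = tools_before * tokens_per_schema
    after = tools_after * tokens_per_schema
    return 1 - after / before


# Figures from the text: 100 schemas down to 3, at 300 tokens each
# (30,000 tokens before, 900 after).
reduction = token_reduction(100, 3, 300)
print(f"{reduction:.0%}")
```

Because the per-schema cost cancels out, the ratio depends only on the tool counts, so the same function works for any average schema size.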

The routing setup I actually run

Before SkillWeaver had a name, I had built roughly the same thing for BizFlowAI ContentStudio, because publishing content across sites requires calling a lot of different services (WordPress, Ghost, image generation, SEO validators, internal analytics, CMS-specific quirks). Here is the architecture, stripped to essentials:

User request
   |
   v
[Intent classifier] --> lightweight LLM or embeddings
   |
   v
[Tool retriever]  --> pgvector search over tool descriptions
   |                  (top-k = 5 to 8)
   v
[Executor agent]  --> Claude / GPT with ONLY the retrieved tools
   |
   v
[Result + audit log]

The pgvector store holds one embedding per tool. The document is not the schema. It is a hand-written, example-rich description of when to use this tool. That distinction matters more than anything else in this whole design, so I will come back to it.
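
To make the data flow in the diagram concrete, here is a minimal Python sketch of the four stages wired together. Everything in it is an assumption for illustration: the function name, the callable signatures, and the audit-log shape are hypothetical, and the real classifier, retriever, and executor would be network calls rather than plain functions.

```python
from typing import Callable


def handle_request(
    request: str,
    classify: Callable[[str], str],                  # intent classifier
    retrieve: Callable[[str, str, int], list[str]],  # tool retriever
    execute: Callable[[str, list[str]], str],        # executor agent
    audit_log: list[dict],
    k: int = 6,
) -> str:
    """Run one request through classify -> retrieve -> execute -> log."""
    intent = classify(request)
    tool_ids = retrieve(request, intent, k)
    # The executor only ever sees the retrieved shortlist.
    result = execute(request, tool_ids)
    audit_log.append(
        {"request": request, "intent": intent, "tools": tool_ids, "result": result}
    )
    return result


# Stub stages, standing in for real model and database calls.
log: list[dict] = []
answer = handle_request(
    "open a ticket for the login bug",
    classify=lambda req: "single-step",
    retrieve=lambda req, intent, k: ["create_jira_ticket", "search_jira"][:k],
    execute=lambda req, tools: f"called {tools[0]}",
    audit_log=log,
)
```

Passing the stages in as callables keeps each one swappable, which also makes it easy to test the router without touching the executor.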

Step by step: retrieval-based tool routing

Here is how I would build this today from scratch. Assume you have somewhere between 30 and 500 tools.

1. Rewrite every tool description as a retrieval document

Do not use the JSON schema for retrieval. Write a natural-language document for each tool that includes:

  • What the tool does in one sentence
  • 3 to 5 example user requests that should trigger it
  • What it is NOT for (the anti-examples matter a lot)
  • Key entities it operates on (Jira, Salesforce, S3, etc.)

Example:

Tool: create_jira_ticket

Use this when a user asks to open, file, create, or log a new
issue, bug, task, or story in Jira. Also triggers on "make a
ticket for" or "add to the backlog".

Do NOT use for: updating an existing ticket, adding a comment,
or querying Jira. Use update_jira_ticket, comment_jira, or
search_jira for those.

Operates on: Jira projects, issue types, assignees.

That anti-example line is the single highest-leverage thing you can add. It cuts confusion between similar tools by roughly half in my testing.
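
One way to keep these documents consistent is to store the four parts as structured fields and render the text that gets embedded. The sketch below assumes a hypothetical `ToolDoc` dataclass; the field names and the rendered layout are illustrative, not a format the routing setup requires.

```python
from dataclasses import dataclass, field


@dataclass
class ToolDoc:
    """Hand-written retrieval document for one tool (hypothetical structure)."""
    name: str
    summary: str                                        # one-sentence description
    examples: list[str] = field(default_factory=list)   # requests that should trigger it
    not_for: list[str] = field(default_factory=list)    # anti-examples
    entities: list[str] = field(default_factory=list)   # systems it operates on

    def to_text(self) -> str:
        """Render the natural-language text that is embedded and indexed."""
        lines = [f"Tool: {self.name}", self.summary]
        if self.examples:
            lines.append("Example requests: " + "; ".join(self.examples))
        if self.not_for:
            lines.append("Do NOT use for: " + "; ".join(self.not_for))
        if self.entities:
            lines.append("Operates on: " + ", ".join(self.entities))
        return "\n".join(lines)


doc = ToolDoc(
    name="create_jira_ticket",
    summary="Create a new issue, bug, task, or story in Jira.",
    examples=["make a ticket for the login bug", "add this to the backlog"],
    not_for=["updating an existing ticket (use update_jira_ticket)"],
    entities=["Jira projects", "issue types", "assignees"],
)
text = doc.to_text()
```

Keeping `not_for` as its own field makes a missing anti-example easy to spot in review.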

2. Embed and store

I use text-embedding-3-small for cost, pgvector for storage, and a hybrid search with BM25 for the exact-name cases (users often say "run the S3 sync" and you want a lexical hit, not just a semantic one). I combine the two result lists with Reciprocal Rank Fusion (RRF):

def rrf_merge(semantic_hits, lexical_hits, k=60):
    """Merge two ranked lists of tool IDs with Reciprocal Rank Fusion."""
    scores = {}
    # Ranks are 1-based, matching the usual RRF formula 1 / (k + rank).
    for hits in (semantic_hits, lexical_hits):
        for rank, tool_id in enumerate(hits, start=1):
            scores[tool_id] = scores.get(tool_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

RRF is boring and it works. I have tried learned rerankers on this and the gain over RRF plus a good retrieval doc was not worth the operational cost.

3. Pick top-k, not top-1

Route to a shortlist, not a single tool. My default is k=6. The LLM still does the final selection from that shortlist, which is what it is good at. This preserves the model's ability to handle ambiguous requests and multi-tool workflows without you having to solve NLU perfectly at the router layer.
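
As a rough sketch of the shortlist step, the function below turns a ranked list of tool IDs into the schemas that actually go into the executor's prompt. The `registry` mapping and the function name are assumptions for illustration; the only idea taken from the text is "top-k, default 6".

```python
def shortlist_schemas(ranked_tool_ids, registry, k=6):
    """Return schemas for the first k ranked tools that exist in the registry."""
    shortlist = []
    for tool_id in ranked_tool_ids:
        if len(shortlist) >= k:
            break
        schema = registry.get(tool_id)
        if schema is not None:  # skip IDs that are indexed but no longer registered
            shortlist.append(schema)
    return shortlist


# Hypothetical registry: tool ID -> schema dict sent to the model.
registry = {
    "create_jira_ticket": {"name": "create_jira_ticket"},
    "update_jira_ticket": {"name": "update_jira_ticket"},
    "search_jira": {"name": "search_jira"},
}
ranked = ["create_jira_ticket", "retired_tool", "search_jira", "update_jira_ticket"]
shortlist = shortlist_schemas(ranked, registry, k=2)
```

Skipping unknown IDs instead of raising keeps a stale index from breaking the request, at the cost of hiding the staleness unless it is logged elsewhere.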

4. Handle the multi-step case

For a workflow that needs several tools, you have two choices:

  • Plan first, then retrieve per step (SkillWeaver's approach). A planner LLM writes the steps, each step triggers its own retrieval. Best accuracy, more latency.
  • Retrieve once with a wider net. Take the user's full request, retrieve top-15 tools, hand them all to the executor. Faster, less accurate on long chains.

I mix these based on task complexity. If the intent classifier tags the request as "single-step", I skip the planner. If it tags "multi-step" or "workflow", I plan first. This saves roughly 40% of planner calls in my traffic mix.
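
The branching described above could look something like this. It is a sketch under assumptions: `retrieval_calls` and `plan_steps` are hypothetical names, the intent tags are the ones quoted in the text, and the planner is passed in as a plain function rather than an LLM call.

```python
def retrieval_calls(intent_tag, request, plan_steps, k=6):
    """Decide which retrieval queries to run for a request.

    Returns a list of (query, k) pairs: one per planned step for multi-step
    work, or a single pair for the whole request otherwise.
    """
    if intent_tag in {"multi-step", "workflow"}:
        # Plan first, then retrieve separately for each step.
        return [(step, k) for step in plan_steps(request)]
    # Single-step: skip the planner and retrieve once on the raw request.
    return [(request, k)]


# Stub planner standing in for the planner LLM.
def stub_planner(request):
    return ["pull the weekly report", "email it to the team"]


single = retrieval_calls("single-step", "open a ticket", stub_planner)
multi = retrieval_calls("workflow", "pull the report and email it", stub_planner)
```

The planner is only invoked on the multi-step branch, which is where the saved planner calls come from.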

5. Keep an escape hatch

Sometimes the router misses. Every executor gets one meta-tool called request_more_tools that takes a natural-language description of what the agent thinks it needs. If it fires, we log it (this is training data for the retriever) and do a second retrieval round with the agent's own words. About 2% of requests use this. Without it, you eventually hit a task where the router is wrong and the agent has no recovery path.
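
Here is a minimal sketch of what that meta-tool and its handler might look like. The tool definition uses a name / description / JSON Schema layout; the exact keys a given provider expects may differ, and the handler name, its arguments, and the log shape are all assumptions for illustration.

```python
# Hypothetical tool definition; adapt the keys to your provider's format.
REQUEST_MORE_TOOLS = {
    "name": "request_more_tools",
    "description": (
        "Call this when none of the available tools fit the task. "
        "Describe in plain language the capability you need."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "need": {"type": "string", "description": "What the missing tool should do."}
        },
        "required": ["need"],
    },
}


def handle_request_more_tools(need, retrieve, already_offered, miss_log, k=6):
    """Log the router miss, then run a second retrieval on the agent's own words."""
    miss_log.append({"need": need, "already_offered": list(already_offered)})
    # Over-fetch so that dropping already-offered tools still leaves up to k.
    candidates = retrieve(need, k + len(already_offered))
    fresh = [tool_id for tool_id in candidates if tool_id not in already_offered]
    return fresh[:k]


# Stub retriever standing in for the real hybrid search.
def stub_retrieve(query, n):
    return ["create_jira_ticket", "send_slack_message", "search_jira"][:n]


misses = []
extra = handle_request_more_tools(
    "post a message to Slack", stub_retrieve, ["create_jira_ticket"], misses, k=2
)
```

Filtering out tools the agent has already seen avoids handing back the same shortlist that just failed.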

The numbers, on my own traffic

For one of my production agents, before and after moving to retrieval-based routing:

| Metric | Static tool list | Retrieval routing | Change |
|---|---|---|---|
| Tools registered | 89 | 89 | same |
| Avg tools in prompt | 89 | 6.2 | -93% |
| Avg input tokens | 24,800 | 3,100 | -87% |
| Tool selection accuracy (eval set) | 71% | 88% | +17pp |
| P50 latency to first token | 1.9s | 0.8s | -58% |
| Monthly LLM cost | baseline | ~11% of baseline | -89% |

The token savings land in the same territory as the SkillWeaver claim, though my 87% is short of the 99% headline. The accuracy gain surprised me the first time I measured it. The model is not distracted by 83 irrelevant tools, so it picks better from 6 relevant ones. That is the underrated benefit.

Where this approach breaks

I am not going to pretend retrieval routing is free. Real failure modes I have hit:

  • Cold tools starve. A tool nobody has ever used has no query history to tune against. Its retrieval doc is the only signal. Invest in it or the tool never gets picked.
  • Description drift. Engineers update the tool's code, forget to update the retrieval doc. Now the doc lies. I run a weekly job that flags tools where the doc has not been touched in 60+ days.
  • Router as single point of failure. If your embedding model is down or slow, your whole agent stalls at step zero. Have a fallback: for cases with high semantic ambiguity or router timeout, fall back to a hand-curated shortlist based on the intent class.
  • Multi-tenant tool sets. If different customers see different tools, you need per-tenant filtering in the vector query. Easy to forget, painful to debug. Add a tenant_id filter on the ANN search and test it.
  • Similar-name tools still collide. Retrieval will happily return both send_email and send_email_v2. Deprecate aggressively or the model will pick the wrong one 30% of the time.
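
For the multi-tenant case, the filter belongs in the same query as the vector search. The sketch below builds such a query for pgvector, whose `<=>` operator orders by cosine distance. The table name `tool_docs`, the columns `tool_id`, `tenant_id`, and `embedding`, and the psycopg-style `%s` placeholders are all assumptions; adjust them to your own schema and driver.

```python
def build_tool_query(query_embedding, tenant_id, k=6):
    """Build a tenant-filtered nearest-neighbour query for pgvector.

    Returns (sql, params) ready to pass to a DB-API cursor's execute().
    """
    # pgvector accepts a '[x,y,z]' text literal cast to the vector type.
    vector_literal = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    sql = (
        "SELECT tool_id FROM tool_docs "
        "WHERE tenant_id = %s "              # per-tenant filter: easy to forget
        "ORDER BY embedding <=> %s::vector "  # cosine distance, nearest first
        "LIMIT %s"
    )
    return sql, (tenant_id, vector_literal, k)


sql, params = build_tool_query([0.1, 0.2, 0.3], tenant_id="acme", k=6)
```

Returning the SQL and parameters separately makes the tenant filter testable without a database: a unit test can assert that `tenant_id` is always in the parameter tuple.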

Where SkillWeaver goes further

The part of SkillWeaver I like most is not the token savings. It is the execution graph as a first-class artifact. Once you have a graph, you can:

  • Cache subgraphs across similar requests
  • Parallelize independent nodes (huge latency win)
  • Replay a failed workflow from the last successful node instead of restarting
  • Show the user what the agent plans to do before it acts (which is what enterprises actually want for governance)

I have been moving my own systems in this direction. The graph does not need to be fancy. A simple JSON DAG with {node_id, tool, inputs, depends_on} gets you 80% of the benefit. The Morgan Stanley lesson applies here: less autonomy, more value. A visible plan the human can approve beats a black-box agent every time in an enterprise context.
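
As an illustration of how little is needed, the sketch below takes a plan in the {node_id, tool, inputs, depends_on} shape and groups nodes into batches that can run in parallel, using the standard library's `graphlib`. The tool names and inputs in the sample plan are invented for the example.

```python
from graphlib import TopologicalSorter


def parallel_batches(nodes):
    """Group DAG nodes into batches; nodes within a batch have no unmet dependencies."""
    sorter = TopologicalSorter({n["node_id"]: n["depends_on"] for n in nodes})
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical publishing plan.
plan = [
    {"node_id": "draft", "tool": "generate_post", "inputs": {"topic": "tool routing"}, "depends_on": []},
    {"node_id": "image", "tool": "generate_image", "inputs": {"prompt": "network switch"}, "depends_on": []},
    {"node_id": "seo", "tool": "validate_seo", "inputs": {"source": "draft"}, "depends_on": ["draft"]},
    {"node_id": "publish", "tool": "publish_wordpress", "inputs": {"post": "draft"}, "depends_on": ["seo", "image"]},
]
batches = parallel_batches(plan)
print(batches)  # [['draft', 'image'], ['seo'], ['publish']]
```

The same batch list doubles as the plan to show a human before execution, and as the checkpoint structure for replaying from the last successful node.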

What I'd do if I were starting this today

  1. Do not build a fancy planner yet. Start with retrieval and a top-k shortlist. This gets you 80% of the token savings for 20% of the effort.
  2. Write retrieval docs by hand for your top 20 tools. Do not auto-generate them from schemas. The quality of these docs is the whole ballgame.
  3. Measure tool selection accuracy with an eval set of 100+ real requests before you start optimizing. You need to know what you are moving.
  4. Add the request_more_tools escape hatch on day one. It is your safety net and your training data source.
  5. Only add the execution graph layer once you have workflows longer than 4 steps. Below that, the overhead is not worth it.
  6. Log everything. The router's decisions are your gold mine for improving descriptions, catching drift, and deprecating dead tools.
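
For the eval set in step 3, a small harness is enough to get started. The sketch below measures router recall at k, meaning how often the expected tool makes it into the shortlist; this is a narrower metric than end-to-end tool selection accuracy, which also depends on the executor's final pick. The function name and the eval-set shape are assumptions.

```python
def evaluate_router(eval_set, route, k=6):
    """Score a router on (request, expected_tool) pairs.

    Returns (recall_at_k, misses), where misses keeps the failing cases
    so they can be used to improve retrieval docs.
    """
    hits = 0
    misses = []
    for request, expected_tool in eval_set:
        shortlist = route(request)[:k]
        if expected_tool in shortlist:
            hits += 1
        else:
            misses.append((request, expected_tool, shortlist))
    recall = hits / len(eval_set) if eval_set else 0.0
    return recall, misses


# Stub router standing in for the real retrieval pipeline.
def stub_route(request):
    return ["create_jira_ticket", "search_jira"] if "ticket" in request else ["send_email"]


eval_set = [
    ("file a ticket for the login bug", "create_jira_ticket"),
    ("email the report to finance", "send_email"),
    ("post this in the team channel", "send_slack_message"),
]
recall, misses = evaluate_router(eval_set, stub_route)
```

The misses list is the useful output: each entry shows what the router returned instead, which points directly at the retrieval doc that needs work.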

The 99% number in the headline is real, but it is not what makes this pattern important. The important part is that your agent gets smarter when it sees fewer tools, not just cheaper. That is the counter-intuitive lesson worth internalizing.

If you are dealing with an agent that has grown past 30 tools and the bill or the accuracy is hurting, this is a solved problem now. Happy to talk through the specifics if you want a second set of eyes: lazar-milicevic.com/#contact. More on production agent architecture on the blog if you want to keep reading.

Frequently asked questions

Why does giving an LLM agent access to too many tools break it?

When you register every tool statically with an agent, four things break in roughly this order. First, token costs grow linearly with tool count because every tool schema ships on every call, which can easily hit 30,000+ input tokens before the user says hello. Second, tool selection accuracy drops as the count grows, especially for long-tail and similarly named tools like `update_ticket` vs `edit_ticket`. Third, latency compounds across multi-step workflows because larger inputs slow time to first token. Fourth, debugging becomes archaeology because you cannot tell if a wrong choice came from description, ordering, name collision, or context poisoning. In my experience the wall hits somewhere between 30 and 60 tools.

How can I reduce agent token usage by up to 99% with many tools?

The fix is two-tier tool selection: a cheap, fast router narrows the tool universe down to a shortlist, and the expensive reasoner picks the actual tool from that shortlist. Concretely, if you go from 100 tool schemas at ~300 tokens each (30,000 tokens) to 3 schemas (900 tokens), that is already 97% savings, and a shorter system prompt clears 99% comfortably. I have measured these numbers in production, and Alibaba's SkillWeaver paper describes the same idea using an execution graph that only loads skills each node needs. The key mindset shift is to stop pretending the LLM should see every tool on every turn.

What is the best architecture for routing tools to an LLM agent?

I run a four-stage pipeline: an intent classifier (lightweight LLM or embeddings), a tool retriever using pgvector search over tool descriptions returning top 5 to 8 candidates, an executor agent (Claude or GPT) that only sees the retrieved tools, and an audit log. The critical detail is that the retrieval documents are not JSON schemas but hand-written, example-rich descriptions of when to use each tool. This setup scales cleanly from about 30 to 500 tools and keeps the reasoner's context small enough that selection accuracy stays high. It also makes debugging tractable because you can inspect which tools were retrieved and why.

How should I write tool descriptions for retrieval-based routing?

Write a natural-language document per tool, not a schema, and include four things: a one-sentence description of what it does, 3 to 5 example user requests that should trigger it, explicit anti-examples of what it is NOT for (pointing to the correct sibling tool), and the key entities it operates on like Jira, Salesforce, or S3. The anti-example line is the highest-leverage thing you can add. In my testing it cuts confusion between similar tools like `create_jira_ticket` vs `update_jira_ticket` by roughly half. This distinction between retrieval doc and schema matters more than the choice of LLM or embedding model. Treat these docs as first-class artifacts and version them.

Should I use semantic search or keyword search for tool retrieval?

Use both, combined with Reciprocal Rank Fusion (RRF). Pure semantic search misses cases where users say things like "run the S3 sync" and you want a lexical hit on the exact tool name, while pure BM25 misses paraphrased intents. I use `text-embedding-3-small` for the semantic side, BM25 for lexical, and merge with RRF using k=60, then pass the top 6 tools to the executor LLM for final selection. RRF is boring and it works. I tried learned rerankers on top and the gain was not worth the operational cost. Routing to a shortlist rather than top-1 also preserves the model's ability to handle ambiguous or multi-tool requests.

Lazar Milićević

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
