118K vs 3.26M Tokens: Memory Design for Agents

A team at the National University of Singapore published numbers on MRAgent that stopped me mid-scroll: 118K tokens per query versus 3.26M for LangMem on the same long-horizon reasoning task. That is a 27x gap, and it maps almost perfectly to what I see when I audit production RAG agents that "work in demo but cost a fortune in prod."
The interesting part is not the framework name. It is what the number gap tells you about how most teams are still building agent memory: like a search engine, when it should behave like a working scientist's notebook.
Why agent memory is the bottleneck, not the model
Long-horizon agents fail on token economics before they fail on reasoning. A typical production RAG agent I inspect pulls top-k chunks on every step, re-injects the full conversation, and re-reads the same source documents across turns. On a 20-step task, you re-pay for the same context 20 times. That is how a "simple support agent" ends up billing $4 per resolved ticket and taking 90 seconds per turn.
The static "retrieve-then-reason" pipeline has three concrete failure modes I keep seeing:
- Retrieval returns noise, not signal. Cosine similarity on chunk embeddings often picks chunks that sit close to the query in embedding space but do not answer it. Your agent then reasons over garbage and confidently hallucinates.
- Context accumulates without pruning. Every tool call, every observation, every intermediate thought lives in the window forever. By step 15, the model is paying attention to step 2's tool output that no longer matters.
- Memory is write-only or read-only, never both intelligently. LangMem-style frameworks store everything, then retrieve broadly. There is no consolidation step where the agent decides what is worth remembering as a fact versus what was a dead end.
MRAgent's contribution, from what I've read in the paper and VentureBeat's write-up, is that memory is developed dynamically during the task, not queried statically from a pre-built index. The agent writes distilled summaries as it goes, and reasons over those, not over raw retrieved chunks. That is the same pattern I have been shipping in client work for the last 18 months, just without the paper.
The three memory layers I use in production
Every long-horizon agent I build has three distinct memory layers with different retention, different retrieval, and different token budgets. This is the design that gets me sub-200K token budgets on tasks where a naive LangGraph + LangMem setup burns millions.
Layer 1: Working memory (the scratchpad)
- What it holds: the current step's plan, the immediate tool output, the next action.
- Retention: flushed every 2-3 steps.
- Budget: 4K to 8K tokens, hard cap.
- How it's built: the agent writes a one-paragraph "state summary" at the end of each step. Everything older than that summary gets dropped from the window.
This is the layer teams skip. They think of context as append-only. In practice, the model does not need the raw JSON of a tool call from 12 steps ago; it needs your one-sentence summary of what that call proved or disproved.
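A minimal sketch of that scratchpad, under stated assumptions: `summarize` is a hypothetical hook standing in for a cheap-model call, `rough_token_count` is a crude 4-characters-per-token estimate rather than a real tokenizer, and the class and field names are illustrative, not taken from MRAgent or any framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def rough_token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: assume ~4 characters per token.
    return max(1, len(text) // 4)


@dataclass
class WorkingMemory:
    # Hypothetical hook: a cheap-model call that folds old turns into the summary.
    summarize: Callable[[str, List[str]], str]
    max_tokens: int = 6000      # hard cap, inside the 4K-8K budget
    keep_raw_turns: int = 2     # newest turns kept verbatim
    summary: str = ""
    raw_turns: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Text that would be placed in the context window."""
        parts = ([self.summary] if self.summary else []) + self.raw_turns
        return "\n".join(parts)

    def add_turn(self, turn: str) -> None:
        self.raw_turns.append(turn)
        # Summarize-then-truncate: fold everything except the newest turns.
        cut = len(self.raw_turns) - self.keep_raw_turns
        if cut > 0:
            self.summary = self.summarize(self.summary, self.raw_turns[:cut])
            self.raw_turns = self.raw_turns[cut:]
        # Enforce the hard cap by folding the oldest raw turns first.
        while self.raw_turns and rough_token_count(self.render()) > self.max_tokens:
            self.summary = self.summarize(self.summary, [self.raw_turns.pop(0)])
        # Last resort if the summarizer itself overshoots the cap.
        if rough_token_count(self.render()) > self.max_tokens:
            self.summary = self.summary[: self.max_tokens * 4]
```

The point of the design is that raw tool output never outlives the summary that replaces it, so the window stays bounded no matter how many steps the task takes.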
Layer 2: Episodic memory (the notebook)
- What it holds: distilled facts the agent has learned during this run. "Customer X has plan tier 3." "Endpoint /v2/users returned 429 twice." "The invoice ID for July is INV-4471."
- Retention: whole task duration.
- Budget: 20K to 40K tokens.
- How it's built: a dedicated memorize(fact, tag) tool the agent calls when it decides something is worth keeping. Retrieval is by tag first, embedding second.
The key design choice: the agent, not the framework, decides what to memorize. This is exactly the MRAgent insight. Static extractors ("summarize every 5 turns") pollute the memory with irrelevant summaries. Letting the reasoning step choose what deserves persistence gives you a 5-10x reduction in stored tokens for the same task completion rate.
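One way the notebook could look, as a sketch rather than a reference implementation. The `EpisodicMemory` class and `recall` method are hypothetical names, the store is an in-memory dict instead of a database, and plain word overlap stands in for the embedding fallback so the example runs without a model.

```python
from collections import defaultdict
from typing import Dict, List


class EpisodicMemory:
    """Per-run fact store: the agent writes, retrieval is tag first."""

    def __init__(self) -> None:
        self._by_tag: Dict[str, List[str]] = defaultdict(list)

    def memorize(self, fact: str, tag: str) -> None:
        # Tool the agent calls when it judges a fact worth keeping.
        facts = self._by_tag[tag]
        if fact not in facts:  # skip exact duplicates
            facts.append(fact)

    def recall(self, tags: List[str], query: str = "", limit: int = 5) -> List[str]:
        # 1) Tag lookup: cheap and exact for IDs, decisions, constraints.
        hits = [fact for tag in tags for fact in self._by_tag.get(tag, [])]
        if hits:
            return hits[:limit]
        # 2) Fallback. ASSUMPTION: word overlap stands in for embedding
        #    similarity here; a real system would query a vector index.
        query_words = set(query.lower().split())
        scored = []
        for facts in self._by_tag.values():
            for fact in facts:
                overlap = len(query_words & set(fact.lower().split()))
                if overlap:
                    scored.append((overlap, fact))
        scored.sort(key=lambda pair: -pair[0])
        return [fact for _, fact in scored[:limit]]


mem = EpisodicMemory()
mem.memorize("Customer X has plan tier 3.", tag="customer")
mem.memorize("Endpoint /v2/users returned 429 twice.", tag="api")
print(mem.recall(["customer"]))
```

Nothing enters the store unless the agent calls `memorize`, which is what keeps dead ends out of the notebook.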
Layer 3: Semantic memory (the library)
- What it holds: the RAG corpus. Product docs, past tickets, code repos.
- Retention: persistent across all runs.
- Budget: query-time budget of 6K to 12K tokens.
- How it's built: hybrid search (pgvector + BM25, fused with reciprocal rank fusion), then a rerank pass with a small model like bge-reranker-v2, then a compression step where a cheap LLM extracts only the sentences relevant to the current sub-question.
That compression step is where most RAG pipelines leave 80% of their token savings on the table. Retrieving 8 chunks of 500 tokens gives you 4K tokens. Compressing them down to the 400 tokens that actually matter for this specific query cuts the retrieval share of your per-step cost by 10x, with no measurable quality loss on my eval sets.
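For the fusion stage, reciprocal rank fusion is small enough to sketch in full. Each document scores the sum of 1 / (k + rank) over every ranking it appears in. The document IDs below are made up, and k = 60 is the commonly used default rather than a value from this pipeline.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers for the same query.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF only looks at ranks, it sidesteps the problem that cosine scores and BM25 scores live on incomparable scales.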
Where the 3.26M tokens actually go
I ran the math on a LangMem-style setup I inherited from a client last quarter. Support agent, 22 average steps to resolve a ticket, using GPT-4o. Here is roughly where the tokens went:
| Bucket | Tokens per run | Share |
|---|---|---|
| Full conversation history re-sent each step | 1,850,000 | 57% |
| Retrieved chunks (top-10, no compression) | 780,000 | 24% |
| Tool outputs re-injected verbatim | 410,000 | 12% |
| System prompt + few-shot | 220,000 | 7% |
| Total | ~3.26M | 100% |
After migrating to the three-layer design:
| Bucket | Tokens per run | Share |
|---|---|---|
| Working memory summaries | 32,000 | 27% |
| Episodic facts (retrieved) | 28,000 | 24% |
| Compressed RAG output | 41,000 | 35% |
| System prompt + few-shot | 17,000 | 14% |
| Total | ~118,000 | 100% |
Same task, same success rate (within 2 percentage points on our eval set), 27x cheaper. This is not a benchmark trick. This is what happens when you stop treating the context window as free storage.
The five-step redesign, concretely
If you have a LangGraph or LangChain agent burning tokens right now, this is the order I would fix it in. Do not do these in parallel; each step's savings compound the next one's measurement.
- Instrument first. Log tokens by bucket (history, retrieval, tool output, system) per step. You cannot optimize what you cannot see. LangSmith or a homegrown logger both work. Budget one afternoon.
- Cap the working window. Add a summarize-then-truncate step every 3 turns. Keep the last 2 raw turns plus a rolling summary. Expect a 40-60% token drop immediately.
- Add a memorize tool. Give the agent explicit control over what persists. Retrieval by tag beats retrieval by embedding for structured facts (IDs, decisions, constraints).
- Compress retrieved chunks. Add a small-model post-retrieval step: "Given this question, extract only the sentences in these chunks that answer it." Use gpt-4o-mini or a local Llama 3.1 8B; it costs almost nothing and cuts RAG tokens 5-10x.
- Cache the system prompt. Anthropic's prompt caching and OpenAI's automatic prefix caching both give you a 50-90% discount on the stable prefix. If your system prompt is 4K tokens and you run 20 steps, you are paying for 80K tokens of the same text. Cache it.
The order matters. If you compress retrieval before capping the window, you will not notice the retrieval savings because history dominates. If you add memory tools before instrumenting, you will not know if they helped.
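The instrumentation step can start as a homegrown ledger like the sketch below. The `TokenLedger` name and its methods are hypothetical, and the token counts are assumed to come from whatever usage numbers your model client already reports.

```python
from typing import Dict, List

# The four buckets named in the instrumentation step.
BUCKETS = ("history", "retrieval", "tool_output", "system")


class TokenLedger:
    """Records input tokens per bucket for every agent step."""

    def __init__(self) -> None:
        self.steps: List[Dict[str, int]] = []

    def log_step(self, **tokens: int) -> None:
        unknown = set(tokens) - set(BUCKETS)
        if unknown:
            raise ValueError(f"unknown buckets: {sorted(unknown)}")
        # Buckets not mentioned for this step count as zero.
        self.steps.append({b: tokens.get(b, 0) for b in BUCKETS})

    def totals(self) -> Dict[str, int]:
        return {b: sum(step[b] for step in self.steps) for b in BUCKETS}

    def shares(self) -> Dict[str, float]:
        totals = self.totals()
        grand_total = sum(totals.values())
        if grand_total == 0:
            return {b: 0.0 for b in BUCKETS}
        return {b: totals[b] / grand_total for b in BUCKETS}


# Made-up numbers for two steps of one run.
ledger = TokenLedger()
ledger.log_step(history=1000, retrieval=400, tool_output=200, system=400)
ledger.log_step(history=2000, retrieval=400, system=400)
print(ledger.totals())
print(max(ledger.shares(), key=ledger.shares().get))  # dominant bucket
```

Running this before and after each change is what makes the ordering argument testable: the dominant bucket tells you which fix to apply next.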
The trap of "just use a bigger context window"
Every few weeks a founder tells me the memory problem is solved because Gemini has 2M tokens or Claude has 1M. It is not solved. It is displaced.
Two things happen when you rely on a giant context:
- Attention degrades in the middle. The "lost in the middle" effect is real and measurable on every current frontier model. Fill 800K tokens and the model can easily miss a fact you put at position 400K, no matter how important it was. Papers keep confirming this on needle-in-haystack variants.
- Cost scales linearly, and you re-pay every step. A 500K-token context at $3 per million input tokens is $1.50 per call. On a 20-step agent, that is $30 per task. Nobody's unit economics survive that.
Big context windows are useful for single-shot document analysis. They are the wrong tool for multi-step agents. The right tool is aggressive summarization, structured memory, and paying for less context more times.
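The cost arithmetic above fits in one small function. The function name is illustrative, and the sketch assumes a flat per-token input price with no caching discount and the same context size on every step.

```python
def cost_per_task(context_tokens: int, price_per_million: float, steps: int) -> float:
    """Input cost in dollars when the same context is re-sent on every step."""
    per_call = context_tokens / 1_000_000 * price_per_million
    return per_call * steps


print(cost_per_task(500_000, 3.0, steps=1))   # 1.5  -> one call
print(cost_per_task(500_000, 3.0, steps=20))  # 30.0 -> a 20-step task
```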
What I'd do if I were starting a new agent today
If I were greenfielding an agentic system in July 2026, this is the stack I would default to:
- Orchestration: LangGraph for the state machine, but with all memory logic pulled out into a separate MemoryManager class I own. Never let a framework be the source of truth for what gets remembered.
- Working memory: rolling summary, hard 6K cap, refresh every 3 steps with a cheap model.
- Episodic memory: Postgres table with (run_id, tag, fact, embedding), retrieved by tag primarily. pgvector for the embedding fallback.
- Semantic memory: hybrid search with RRF, bge-reranker-v2-m3, then a compression LLM step. Chunk sizes of 300-500 tokens, not 1000.
- Prompt caching: on by default, system prompt structured as a stable prefix + a tiny dynamic suffix.
- Evals: a golden set of 40-60 tasks I run on every memory change. Track completion rate and token cost per task as a Pareto plot. Any change that improves one at the cost of the other gets a decision, not an assumption.
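The Pareto rule in the evals bullet can be made mechanical. The sketch below is one possible encoding: `EvalResult`, `compare`, and the four verdict strings are hypothetical names, and the numbers in the usage lines are illustrative, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    completion_rate: float  # fraction of golden-set tasks completed, 0..1
    tokens_per_task: float  # mean tokens spent per task


def compare(baseline: EvalResult, candidate: EvalResult) -> str:
    """Classify a memory change against the baseline on both axes."""
    better = (candidate.completion_rate > baseline.completion_rate
              or candidate.tokens_per_task < baseline.tokens_per_task)
    worse = (candidate.completion_rate < baseline.completion_rate
             or candidate.tokens_per_task > baseline.tokens_per_task)
    if better and not worse:
        return "accept"          # Pareto improvement
    if worse and not better:
        return "reject"          # Pareto regression
    if not better and not worse:
        return "no_change"
    return "needs_decision"      # one axis improved, the other got worse


baseline = EvalResult(completion_rate=0.90, tokens_per_task=3_260_000)
print(compare(baseline, EvalResult(0.92, 118_000)))  # accept
print(compare(baseline, EvalResult(0.89, 118_000)))  # needs_decision
```

A "needs_decision" verdict is the case the bullet describes: someone has to decide explicitly whether the token savings are worth the drop in completion rate.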
The MRAgent paper is worth reading in full. What I take from it is not the specific mechanism, it is the framing: memory is a first-class part of your agent's behavior, not a plugin you bolt on. The teams that internalize this ship agents that cost 20x less to run and get better with time instead of worse.
If you are staring at a token bill that does not make sense and want a second pair of eyes on the architecture, I take a small number of consulting engagements each quarter. You can reach me at lazar-milicevic.com/#contact, or read more of my production notes on the blog. I would rather see your traces than talk in the abstract.
Frequently asked questions
Why do long-horizon AI agents burn so many tokens in production?
In my audits of production RAG agents, token cost explodes because teams treat the context window like free storage. The agent re-sends the full conversation history on every step, re-injects raw tool outputs, and re-retrieves the same document chunks across turns, so on a 20-step task you pay for the same context 20 times. That is how a naive LangGraph plus LangMem setup can hit 3.26M tokens per run and cost around $4 per resolved support ticket. The bottleneck is memory design, not the model itself.
What is the difference between static RAG memory and dynamic agent memory like MRAgent?
Static RAG pipelines query a pre-built index on every step and dump raw retrieved chunks into the context, which leads to noisy retrieval, unbounded context growth, and no consolidation of what was learned. Dynamic memory, as in MRAgent from the National University of Singapore, has the agent write distilled summaries during the task and reason over those summaries instead of raw chunks. In benchmarks this drops per-query cost from about 3.26M tokens to roughly 118K, a 27x reduction, with comparable task success. The key shift is letting the agent decide what to remember, not the framework.
How should I structure memory for a long-running LLM agent?
I use three distinct layers with separate retention and token budgets. Working memory (4-8K tokens) holds the current step's plan and gets flushed every 2-3 steps via one-paragraph state summaries. Episodic memory (20-40K tokens) stores agent-selected facts through a memorize(fact, tag) tool for the whole task. Semantic memory (6-12K tokens per query) is the persistent RAG corpus, accessed via hybrid search, reranking, and a compression step. This design keeps total budgets under 200K tokens on tasks where naive setups burn millions.
How can I reduce token cost in a RAG pipeline without losing answer quality?
The biggest lever most teams miss is a compression step after retrieval and reranking. Instead of dumping 8 retrieved chunks of 500 tokens (4K tokens) into the context, I use a cheap LLM to extract only the sentences that answer the current sub-question, typically cutting it to around 400 tokens. Combined with hybrid search (pgvector plus BM25 with reciprocal rank fusion) and a small reranker like bge-reranker-v2, this reduces per-step RAG cost by roughly 10x with no measurable quality loss on my eval sets.
Should the agent or the framework decide what to store in memory?
The agent should decide, not the framework. Static extractors that summarize every N turns pollute memory with irrelevant summaries and inflate storage without improving success rate. When the reasoning step itself calls a memorize(fact, tag) tool to persist only what it judges important, I see a 5-10x reduction in stored tokens for the same task completion rate. This is the core MRAgent insight and it maps directly to how a working scientist takes notes: selectively, not exhaustively.