title: @BO5AMIS:
We cut our AI agent's token costs by ~50%. Here's every technique, and how i...
author: BO5AMIS
content_type: twitter_article
published: 2026-02-08T09:50:53+00:00
source_url: https://x.com/BO5AMIS/status/2020435153034555568
word_count: 1392
We cut our AI agent's token costs by ~50%. Here's every technique, and how it compares to what Cursor and Claude Code do.
We're building a mobile coding platform. The AI agent reads your codebase, plans changes, writes code, runs tests, and commits — autonomously, from your phone.
One problem: a complex task was burning 30-50K tokens on Claude Sonnet 4.5 at $3/M input and $15/M output. On mobile, where latency kills and every token is real money, that's not sustainable.
Here's what we built, why it works, and how it compares to what Cursor and Claude Code are doing.
The core idea: stage isolation
Most AI coding agents, like Cursor, Claude Code, Codex, and Windsurf, run in a single context window. The model reads files, thinks, writes code, runs tests, all in one growing conversation. By step 40, it's dragging around 80K tokens of stale file contents from step 3.
When the context fills up, they react: Cursor triggers summarization, Claude Code has `/compact`, Windsurf relies on its RAG engine. But summarization introduces drift: details get lost or distorted with each compression pass.
We took a different approach. Instead of one context window that gets reactively compressed, we use four isolated stages that are proactively structured:
EXPLORE → PLAN → EXECUTE → RESPOND
Each stage gets its own `streamText()` call with a fresh context window. Nothing carries over except a typed handoff summary, typically 300-500 tokens.
The explore stage might read 15 files and accumulate 20K tokens of raw source code. The plan stage never sees any of that. It receives a structured handoff:
300 tokens instead of 20K. No summarization drift — the handoff schema forces the model to distill its work into a predetermined shape, which is more reliable than free-form LLM-generated summaries.
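A minimal sketch of what a typed handoff could look like. The field names (`summary`, `relevantFiles`, `openQuestions`) and the helpers `parseExploreHandoff` and `buildPlanMessages` are assumptions for illustration, not our actual schema; only `isSimpleTask` is a real flag. The point it shows is that the next stage's context is built from the handoff alone, never from the previous stage's message history.

```typescript
// Hypothetical handoff shape; field names are illustrative assumptions.
interface ExploreHandoff {
  summary: string; // what the explore stage learned, in prose
  relevantFiles: { path: string; reason: string }[];
  openQuestions: string[];
  isSimpleTask: boolean; // the self-gating flag
}

interface Message {
  role: "system" | "user";
  content: string;
}

// Runtime check so a malformed model output fails fast instead of leaking
// into the next stage.
function parseExploreHandoff(raw: unknown): ExploreHandoff {
  const h = raw as Partial<ExploreHandoff> | null;
  if (
    !h ||
    typeof h.summary !== "string" ||
    !Array.isArray(h.relevantFiles) ||
    !Array.isArray(h.openQuestions) ||
    typeof h.isSimpleTask !== "boolean"
  ) {
    throw new Error("explore handoff does not match the expected schema");
  }
  return h as ExploreHandoff;
}

// The PLAN stage starts from a fresh message list built only from the handoff.
function buildPlanMessages(task: string, handoff: ExploreHandoff): Message[] {
  const files = handoff.relevantFiles
    .map((f) => `- ${f.path}: ${f.reason}`)
    .join("\n");
  return [
    { role: "system", content: "You are the PLAN stage. Design the change." },
    {
      role: "user",
      content: `Task: ${task}\n\nFindings: ${handoff.summary}\n\nFiles:\n${files}`,
    },
  ];
}
```

In a real pipeline the returned messages would be passed to that stage's own `streamText()` call; the raw source the explore stage read is simply never included.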
Run exploration on a 6x cheaper model
Not all stages need your best model. Exploration is glorified file reading: the model calls `file.read`, `repo.search`, and `git.log`, then summarizes what it found.
We run EXPLORE on Gemini 3 Flash ($0.50/M input, $3/M output via OpenRouter). The expensive model, Claude Sonnet 4.5 ($3/M input, $15/M output), only activates for PLAN and EXECUTE, where reasoning quality actually matters.
That's a 6x reduction on input costs and 5x on output costs for the most token-heavy stage. Exploration typically accounts for 30-40% of total token volume in a task.
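To make the price gap concrete, here is a small worked calculation using the per-million prices quoted above. The 20K-token input matches the exploration example earlier; the 500-token output is an assumption for illustration, as are the constant and function names.

```typescript
// Prices in USD per million tokens, taken from the figures quoted above.
interface Pricing {
  inputPerM: number;
  outputPerM: number;
}

const GEMINI_3_FLASH: Pricing = { inputPerM: 0.5, outputPerM: 3 };
const CLAUDE_SONNET_4_5: Pricing = { inputPerM: 3, outputPerM: 15 };

function costUsd(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerM + outputTokens * p.outputPerM) / 1e6;
}

// Hypothetical explore stage: 20K input tokens and an assumed 500-token
// handoff as output.
const onSonnet = costUsd(CLAUDE_SONNET_4_5, 20000, 500);
const onFlash = costUsd(GEMINI_3_FLASH, 20000, 500);
const ratio = onSonnet / onFlash;
```

With these assumed volumes the explore stage costs about $0.0675 on Sonnet versus $0.0115 on Flash, a blended ratio of roughly 5.9x, because input tokens dominate.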
For comparison: Cursor's "Auto Mode" routes to cheaper models when credits run low but doesn't structurally assign different models to different phases of reasoning. Claude Code uses Haiku as a sub-agent model for focused subtasks, which is conceptually similar but operates within a single conversation rather than across isolated stages.
Delegated reading with `request_context`
Even after stage isolation, the expensive model in PLAN and EXECUTE still needed file contents. Every `file.read` dumps raw source into the expensive model's context window.
We replaced direct file reads with a meta-tool called `request_context`. The expensive model asks a natural language question:
This spawns a mini agentic loop on Gemini 3 Flash with 8 read-only tools (`file.read`, `repo.search`, `file.list`, `file.find_in_files`, `git.status`, `git.diff`, `git.log`, `git.show`). The cheap model does the actual reads, searches, and cross-referencing, then returns curated code snippets with exact line numbers, not entire files.
The expensive model gets 50 relevant lines instead of 500 lines of raw file content. The mini-agent is capped at 15 steps with a 30-second timeout.
This is the same principle behind Claude Code's sub-agent architecture and Cursor's "files as the primary interface" pattern, but we enforce it structurally. The PLAN stage literally only has 2 tools available: `request_context` and `todo.read`. It cannot read files directly even if it tries.
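The budgeted loop behind `request_context` can be sketched roughly as follows. The `ReaderDeps` hooks (`nextStep`, `runTool`, `now`) and the `Snippet` shape are hypothetical names introduced for illustration, and the loop is written synchronously for brevity; a real version would await model and tool calls. Only the 15-step cap and the 30-second timeout come from the description above.

```typescript
interface Snippet {
  path: string;
  startLine: number;
  endLine: number;
  code: string;
}

// One step of the cheap model: either a tool call to run or a final answer.
type ReaderStep =
  | { kind: "tool"; name: string; args: Record<string, string> }
  | { kind: "answer"; snippets: Snippet[] };

interface ReaderDeps {
  // Hypothetical hook: asks the cheap model what to do next, given the
  // question and the tool results so far.
  nextStep(question: string, toolResults: string[]): ReaderStep;
  // Hypothetical hook: runs one of the read-only tools.
  runTool(name: string, args: Record<string, string>): string;
  now(): number; // milliseconds, injectable for tests
}

const MAX_STEPS = 15; // step cap
const TIMEOUT_MS = 30000; // 30-second timeout

function requestContext(question: string, deps: ReaderDeps): Snippet[] {
  const startedAt = deps.now();
  const toolResults: string[] = [];
  for (let step = 0; step < MAX_STEPS; step++) {
    if (deps.now() - startedAt > TIMEOUT_MS) break;
    const next = deps.nextStep(question, toolResults);
    if (next.kind === "answer") return next.snippets;
    toolResults.push(deps.runTool(next.name, next.args));
  }
  // Budget exhausted: return nothing rather than raw file dumps.
  return [];
}
```

The raw tool results stay inside `toolResults`, which lives and dies with the cheap model's loop; the expensive model only ever sees the returned snippets.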
Aggressive tool filtering per stage
Most agent frameworks give the model every tool at all times. This wastes tokens in two ways: the model reasons about irrelevant tools, and sometimes calls them anyway.
Our tool counts:
PLAN having only 2 tools is the key insight. It forces the expensive model to think and design rather than impulsively reading files. All reads are delegated to the cheap model through `request_context`.
Claude Code addresses this differently with a "Tool Search Tool": instead of loading all tool definitions into context, the model dynamically searches for relevant tools, which Anthropic reports saves significant context space. Our approach is more rigid but requires zero runtime overhead: the tool set is determined by the stage, not by the model's judgment.
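Stage-determined tool sets can be expressed as a static lookup. In this sketch only the PLAN entry (`request_context` and `todo.read`) reflects what is described above; the lists for the other stages, and the names `STAGE_TOOLS` and `toolsForStage`, are illustrative assumptions.

```typescript
type Stage = "EXPLORE" | "PLAN" | "EXECUTE" | "RESPOND";

// Only the PLAN entry comes from the article. The other lists are
// illustrative assumptions, not the platform's real configuration.
const STAGE_TOOLS: Record<Stage, readonly string[]> = {
  EXPLORE: ["file.read", "repo.search", "git.log"],
  PLAN: ["request_context", "todo.read"],
  EXECUTE: ["request_context", "file.edit", "run_command"],
  RESPOND: [],
};

interface ToolDef {
  name: string;
  description: string;
}

// Only the definitions returned here are sent to the model for this stage, so
// tools outside the list cost no tokens and cannot be called.
function toolsForStage(stage: Stage, registry: ToolDef[]): ToolDef[] {
  const allowed = new Set(STAGE_TOOLS[stage]);
  return registry.filter((tool) => allowed.has(tool.name));
}
```

Because the filter runs before the model call, there is nothing for the model to decide: a tool that is not in the stage's list never appears in the prompt.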
Synchronous commands replaced terminal polling
The old pattern for running a shell command:
4 tool calls. 4 round trips. Each one includes the model's reasoning tokens plus the tool result tokens.
New pattern:
One call. Output is bounded to the last 4,000 characters to prevent a runaway `npm install` from dumping 15K characters into context. Default timeout of 120 seconds, max 5 minutes.
This eliminates roughly 70% of tool calls in command-heavy EXECUTE stages. Each avoided call saves both the model's reasoning about "what should I do next" and the tool definition overhead.
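The two guardrails on the synchronous command tool, output bounding and timeout clamping, are simple to sketch. The limits come from the description above; the function names and the truncation marker are assumptions for illustration.

```typescript
const MAX_OUTPUT_CHARS = 4000; // keep the last 4,000 characters
const DEFAULT_TIMEOUT_MS = 120000; // 120 seconds
const MAX_TIMEOUT_MS = 300000; // 5 minutes

// Keep only the tail: for builds and test runs, the errors are usually at
// the end. The marker line is an assumed convention, added on top of the cap.
function boundOutput(output: string, maxChars: number = MAX_OUTPUT_CHARS): string {
  if (output.length <= maxChars) return output;
  const dropped = output.length - maxChars;
  const tail = output.slice(output.length - maxChars);
  return `[... ${dropped} earlier characters truncated ...]\n` + tail;
}

// Use the default when no timeout is requested, and never exceed the maximum.
function resolveTimeoutMs(requestedMs?: number): number {
  if (requestedMs === undefined || requestedMs <= 0) return DEFAULT_TIMEOUT_MS;
  return Math.min(requestedMs, MAX_TIMEOUT_MS);
}
```

Telling the model how much was dropped lets it decide whether to rerun the command with a narrower scope instead of guessing at the missing output.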
Self-gating: skip stages when possible
Not every task needs 4 stages. "Fix the typo on line 12" doesn't need a planning phase.
The explore model sets flags in its handoff:
`isSimpleTask: true` → skip PLAN, create a minimal handoff from exploration findings, go straight to EXECUTE
Empty plan (no files to modify, no files to create, no commands) → skip EXECUTE, go straight to RESPOND
A question like "What framework is this?" runs EXPLORE → RESPOND. Two LLM calls instead of four. A simple rename runs EXPLORE → EXECUTE → RESPOND. Three calls instead of four.
This is similar to how Claude Code uses plan mode only when needed, but again, the decision is structural (flags in typed handoffs) rather than left to the model's discretion.
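The gating rules reduce to two checks on typed data. In this sketch the `Plan` field names and the `stagesToRun` helper are assumptions for illustration; `isSimpleTask` and the empty-plan rule are the ones described above.

```typescript
type Stage = "EXPLORE" | "PLAN" | "EXECUTE" | "RESPOND";

// Hypothetical plan shape; field names are illustrative.
interface Plan {
  filesToModify: string[];
  filesToCreate: string[];
  commands: string[];
}

function isEmptyPlan(plan: Plan): boolean {
  return (
    plan.filesToModify.length === 0 &&
    plan.filesToCreate.length === 0 &&
    plan.commands.length === 0
  );
}

// Returns the stages that run for a task. `plan` is either the output of the
// PLAN stage or, for simple tasks, the minimal handoff built from exploration.
function stagesToRun(isSimpleTask: boolean, plan: Plan): Stage[] {
  const stages: Stage[] = ["EXPLORE"];
  if (!isSimpleTask) stages.push("PLAN");
  if (!isEmptyPlan(plan)) stages.push("EXECUTE");
  stages.push("RESPOND");
  return stages;
}
```

A question produces a simple task with an empty plan, so it runs two stages; a rename produces a simple task with one file to modify, so it runs three.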
Search/replace over unified diffs
We had two file editing tools: `file.apply_diff` (unified diff format) and `file.edit` (search → replace pairs).
Accuracy from the Diff-XYZ benchmark:
`file.edit` (search/replace): 94% exact match
`file.apply_diff` (unified diff): 82% exact match
LLMs are bad at line numbers. Unified diffs require correct line numbers, correct context lines, correct `+`/`-` prefixes. When the model gets any of it wrong, the tool fails, the model retries, and you burn tokens on the retry loop.
Search/replace doesn't need line numbers. The model writes the text it sees and the text it wants instead. The tool tries exact matching first, falls back to whitespace-normalized matching, then fuzzy matching. Only the first occurrence is replaced.
We deleted `file.apply_diff` entirely. Fewer failed edits means fewer retries means fewer wasted tokens.
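A simplified sketch of the matching cascade, assuming a hypothetical `applyEdit` function and `EditResult` type. It implements the exact and whitespace-normalized steps and replaces only the first occurrence; the fuzzy-matching fallback is left out, and the line-based normalization is one possible interpretation rather than our exact implementation.

```typescript
type EditResult =
  | { ok: true; content: string; strategy: "exact" | "whitespace" }
  | { ok: false; reason: string };

// Collapse runs of whitespace and trim, so indentation differences don't matter.
const normalize = (line: string): string => line.trim().replace(/\s+/g, " ");

function applyEdit(content: string, search: string, replace: string): EditResult {
  if (search.length === 0) return { ok: false, reason: "empty search text" };

  // 1. Exact match; only the first occurrence is replaced. Slicing avoids the
  //    special "$" patterns that String.prototype.replace would interpret.
  const at = content.indexOf(search);
  if (at !== -1) {
    const next = content.slice(0, at) + replace + content.slice(at + search.length);
    return { ok: true, content: next, strategy: "exact" };
  }

  // 2. Whitespace-normalized match over whole lines.
  const lines = content.split("\n");
  const wanted = search.split("\n").map(normalize);
  for (let i = 0; i + wanted.length <= lines.length; i++) {
    const candidate = lines.slice(i, i + wanted.length).map(normalize);
    if (candidate.every((line, k) => line === wanted[k])) {
      const next = lines.slice(0, i).concat([replace], lines.slice(i + wanted.length));
      return { ok: true, content: next.join("\n"), strategy: "whitespace" };
    }
  }

  // 3. A fuzzy-matching fallback would go here; omitted in this sketch.
  return { ok: false, reason: "search text not found" };
}
```

Returning the strategy that matched makes it easy to log how often the fallbacks fire, which is a useful signal for how well the model is reproducing the text it saw.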
Temperature 0 everywhere, bounded outputs
Every stage runs at temperature 0. Near-deterministic outputs mean consistent tool call patterns and fewer tokens wasted on poor sampling. For coding agents, you want the model to pick the correct tool call and move on; creativity in generation is rarely what you need.
The RESPOND stage has a hard `maxTokens: 4096` cap. `run_command` returns only the last 4,000 characters. These are small guardrails, but they prevent the occasional command that floods your context.
The compound effect
No single technique here is revolutionary. The power is in how they compose:
How this compares:
Cursor recently reported a 46.9% token reduction from their Dynamic Context Discovery approach, retrieving context on demand instead of pre-loading it. Their technique is reactive and operates within a single context window.
Claude Code uses Haiku sub-agents in isolated context windows that return condensed summaries to the main agent. This is conceptually closest to our approach, but it operates within a single-conversation framework rather than a structured pipeline.
Our pipeline achieves a similar ~40-50% reduction through a fundamentally different mechanism: proactive structural isolation rather than reactive compression. The tradeoff is rigidity: hard stage boundaries can lose nuance that a continuous context preserves. The typed handoff schemas are our mitigation: they force the model to preserve what matters in a predictable shape.
The mental model:
Stop thinking of your AI agent as one smart model in a loop. Think of it as a pipeline of specialists: cheap models doing the reading, expensive models doing only what they're uniquely good at, and typed contracts ensuring nothing leaks between them.
Your most expensive tokens should be spent reasoning about *what to change*, not reading files to understand *what exists*.
Posted: 2026-02-08T09:50:53.000Z