TWITTER_ARTICLE

BO5AMIS says a mobile coding agent cut token costs by roughly 40-50% after…

Brief

BO5AMIS outlines an architecture for reducing AI coding-agent token usage on a mobile development product where complex tasks had been consuming 30-50K tokens on Claude Sonnet 4.5. The core change is a structured pipeline: EXPLORE, PLAN, EXECUTE, and RESPOND each run in separate model calls, with only typed summaries passed between them to avoid stale context and summarization drift. The system further cuts cost by assigning Gemini 3 Flash to exploration and delegated repository reading, while keeping Claude Sonnet 4.5 only for higher-value reasoning and code execution. Additional savings come from rigid per-stage tool filtering, conditional stage skipping for simple tasks, synchronous command execution with bounded output, temperature 0, and replacing fragile unified diffs with search/replace edits. The post compares this approach with Cursor’s 46.9% Dynamic Context Discovery reduction and Claude Code’s sub-agent pattern, arguing that proactive stage isolation can achieve similar savings with more predictable structure but potentially less nuance.

Why it matters

BO5AMIS says a mobile coding agent cut token costs by roughly 40-50% after replacing a single growing conversation with four isolated stages—EXPLORE, PLAN, EXECUTE, RESPOND—where each stage gets a fresh context window and only a typed 300-500 token handoff instead of carrying forward as much as 20K tokens of raw code or 80K tokens of stale context.

Key details

  • The system runs the token-heavy EXPLORE phase on Gemini 3 Flash via OpenRouter at $0.50/M input and $3/M output, while reserving Claude Sonnet 4.5 at $3/M input and $15/M output for PLAN and EXECUTE; the author says exploration accounts for 30-40% of total token volume, making this a 6x input-cost and 5x output-cost reduction for that stage.
  • Direct file reads were replaced with a `request_context` meta-tool that launches a read-only Gemini 3 Flash mini-agent using 8 repo tools and returns curated snippets with exact line numbers, so the expensive model sees about 50 relevant lines instead of 500 lines of full-file content; the helper loop is capped at 15 steps and 30 seconds.
  • The author reduced EXECUTE overhead by switching shell execution from a 4-call polling pattern to a single synchronous command call, bounding output to the last 4,000 characters with a 120-second default timeout and 5-minute maximum; they estimate this removes about 70% of tool calls in command-heavy execution paths.
  • For code edits, the team deleted unified diff editing after citing Diff-XYZ benchmark results of 94% exact match for search/replace (`file.edit`) versus 82% for unified diff (`file.apply_diff`), arguing that fewer failed edits and retries directly lower token burn.

Cleaned source text

title: @BO5AMIS:

We cut our AI agent's token costs by ~50%. Here's every technique, and how i...

author: BO5AMIS

content_type: twitter_article

published: 2026-02-08T09:50:53+00:00

source_url: https://x.com/BO5AMIS/status/2020435153034555568

word_count: 1392

We cut our AI agent's token costs by ~50%. Here's every technique, and how it compares to what Cursor and Claude Code do.

We're building a mobile coding platform. The AI agent reads your codebase, plans changes, writes code, runs tests, and commits — autonomously, from your phone.

One problem: a complex task was burning 30-50K tokens on Claude Sonnet 4.5 at $3/M input and $15/M output. On mobile, where latency kills and every token is real money, that's not sustainable.

Here's what we built, why it works, and how it compares to what Cursor and Claude Code are doing.

The core idea: stage isolation

Most AI coding agents, like Cursor, Claude Code, Codex, and Windsurf, run in a single context window. The model reads files, thinks, writes code, runs tests, all in one growing conversation. By step 40, it's dragging around 80K tokens of stale file contents from step 3.

When the context fills up, they react: Cursor triggers summarization, Claude Code has `/compact`, Windsurf relies on its RAG engine. But summarization introduces drift: details get lost or distorted with each compression pass.

We took a different approach. Instead of one context window that gets reactively compressed, we use four isolated stages that are proactively structured:

EXPLORE → PLAN → EXECUTE → RESPOND

Each stage gets its own `streamText()` call with a fresh context window. Nothing carries over except a typed handoff summary, typically 300-500 tokens.

The explore stage might read 15 files and accumulate 20K tokens of raw source code. The plan stage never sees any of that. It receives a structured handoff:

300 tokens instead of 20K. No summarization drift — the handoff schema forces the model to distill its work into a predetermined shape, which is more reliable than free-form LLM-generated summaries.
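A typed handoff of this kind could be sketched as follows. This is only an illustration: apart from `isSimpleTask`, which the post mentions, every field name here is an assumption, and the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer.

```typescript
// Hypothetical handoff shape. Only `isSimpleTask` is named in the post;
// the other fields are assumptions made for illustration.
interface ExploreHandoff {
  summary: string;         // short prose description of what was found
  relevantFiles: string[]; // paths that later stages should care about
  keyFindings: string[];   // distilled facts, not raw source code
  isSimpleTask: boolean;   // lets the pipeline skip PLAN for trivial tasks
}

// Rough heuristic only: about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function isStringArray(x: unknown): x is string[] {
  return Array.isArray(x) && x.every((item) => typeof item === "string");
}

// Rejects handoffs that are malformed or that exceed the token budget,
// so raw file contents cannot leak into the next stage.
function validateHandoff(value: unknown, maxTokens: number = 500): ExploreHandoff {
  if (typeof value !== "object" || value === null) {
    throw new Error("handoff must be an object");
  }
  const v = value as Record<string, unknown>;
  const summary = v["summary"];
  const relevantFiles = v["relevantFiles"];
  const keyFindings = v["keyFindings"];
  const isSimpleTask = v["isSimpleTask"];
  if (typeof summary !== "string") throw new Error("summary must be a string");
  if (!isStringArray(relevantFiles)) throw new Error("relevantFiles must be string[]");
  if (!isStringArray(keyFindings)) throw new Error("keyFindings must be string[]");
  if (typeof isSimpleTask !== "boolean") throw new Error("isSimpleTask must be a boolean");

  const handoff: ExploreHandoff = { summary, relevantFiles, keyFindings, isSimpleTask };
  if (estimateTokens(JSON.stringify(handoff)) > maxTokens) {
    throw new Error("handoff exceeds token budget");
  }
  return handoff;
}
```

Validating at the stage boundary means the next stage's prompt can be built from the handoff alone, with no access to the previous stage's message history.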

Run exploration on a 6x cheaper model

Not all stages need your best model. Exploration is glorified file reading: the model calls `file.read`, `repo.search`, and `git.log`, then summarizes what it found.

We run EXPLORE on Gemini 3 Flash ($0.50/M input, $3/M output via OpenRouter). The expensive model, Claude Sonnet 4.5 ($3/M input, $15/M output), only activates for PLAN and EXECUTE, where reasoning quality actually matters.

That's a 6x reduction on input costs and 5x on output costs for the most token-heavy stage. Exploration typically accounts for 30-40% of total token volume in a task.
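The 6x and 5x figures follow directly from the quoted prices. The short calculation below reproduces them and applies them to a hypothetical exploration workload; the 20K input / 1K output split is an assumption chosen for illustration, not a number from the post.

```typescript
// Prices in USD per million tokens, as quoted in the post.
interface Price {
  inputPerM: number;
  outputPerM: number;
}

const SONNET: Price = { inputPerM: 3, outputPerM: 15 };
const FLASH: Price = { inputPerM: 0.5, outputPerM: 3 };

function costUSD(inputTokens: number, outputTokens: number, price: Price): number {
  return (inputTokens * price.inputPerM + outputTokens * price.outputPerM) / 1_000_000;
}

const inputRatio = SONNET.inputPerM / FLASH.inputPerM;    // 3 / 0.5 = 6
const outputRatio = SONNET.outputPerM / FLASH.outputPerM; // 15 / 3 = 5

// Hypothetical exploration workload: 20K input tokens, 1K output tokens.
const onSonnet = costUSD(20_000, 1_000, SONNET); // (60,000 + 15,000) / 1e6 = 0.075 USD
const onFlash = costUSD(20_000, 1_000, FLASH);   // (10,000 + 3,000) / 1e6 = 0.013 USD
```

Because exploration is dominated by input tokens, the blended saving for that stage lands close to the 6x input ratio rather than the 5x output ratio.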

For comparison: Cursor's "Auto Mode" routes to cheaper models when credits run low but doesn't structurally assign different models to different phases of reasoning. Claude Code uses Haiku as a sub-agent model for focused subtasks, which is conceptually similar but operates within a single conversation rather than across isolated stages.

Delegated reading with `request_context`

Even after stage isolation, the expensive model in PLAN and EXECUTE still needed file contents. Every `file.read` dumps raw source into the expensive model's context window.

We replaced direct file reads with a meta-tool called `request_context`. The expensive model asks a natural language question:

This spawns a mini agentic loop on Gemini 3 Flash with 8 read-only tools (`file.read`, `repo.search`, `file.list`, `file.find_in_files`, `git.status`, `git.diff`, `git.log`, `git.show`). The cheap model does the actual reads, searches, and cross-referencing, then returns curated code snippets with exact line numbers, not entire files.

The expensive model gets 50 relevant lines instead of 500 lines of raw file content. The mini-agent is capped at 15 steps with a 30-second timeout.
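The step and time caps can be illustrated with a small bounded loop. This is a simplified sketch under stated assumptions: the names `runHelperLoop`, `Snippet`, and `StepResult` are hypothetical, the loop is synchronous for brevity (a real one would await model and tool calls), and the clock is injectable only so the limits can be exercised in tests.

```typescript
// A curated snippet with exact line numbers, as the post describes.
interface Snippet {
  path: string;
  startLine: number;
  endLine: number;
  text: string;
}

// One turn of the helper: either it has finished, or it needs another turn.
type StepResult = { done: true; snippets: Snippet[] } | { done: false };

interface LoopLimits {
  maxSteps: number;
  timeoutMs: number;
}

interface LoopOutcome {
  snippets: Snippet[];
  stoppedBy: "done" | "maxSteps" | "timeout";
  steps: number;
}

// Runs `step` until it reports completion, the step cap is reached,
// or the time budget is spent (checked before each step).
function runHelperLoop(
  step: (stepIndex: number) => StepResult,
  limits: LoopLimits = { maxSteps: 15, timeoutMs: 30_000 },
  now: () => number = Date.now,
): LoopOutcome {
  const start = now();
  for (let i = 0; i < limits.maxSteps; i++) {
    if (now() - start >= limits.timeoutMs) {
      return { snippets: [], stoppedBy: "timeout", steps: i };
    }
    const result = step(i);
    if (result.done) {
      return { snippets: result.snippets, stoppedBy: "done", steps: i + 1 };
    }
  }
  return { snippets: [], stoppedBy: "maxSteps", steps: limits.maxSteps };
}
```

Returning `stoppedBy` lets the caller distinguish a genuine answer from a cut-off search and decide whether to retry with a narrower question.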

This is the same principle behind Claude Code's sub-agent architecture and Cursor's "files as the primary interface" pattern, but we enforce it structurally. The PLAN stage literally only has 2 tools available: `request_context` and `todo.read`. It cannot read files directly even if it tries.

Aggressive tool filtering per stage

Most agent frameworks give the model every tool at all times. This wastes tokens in two ways: the model reasons about irrelevant tools, and sometimes calls them anyway.

Our tool counts:

PLAN having only 2 tools is the key insight. It forces the expensive model to think and design rather than impulsively reading files. All reads are delegated to the cheap model through `request_context`.

Claude Code addresses this differently with a "Tool Search Tool": instead of loading all tool definitions into context, the model dynamically searches for relevant tools, which Anthropic reports saves significant context space. Our approach is more rigid but requires zero runtime overhead: the tool set is determined by the stage, not by the model's judgment.
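Stage-determined tool sets can be expressed as a static allowlist that is applied before each model call. In the sketch below, only the PLAN entry is taken from the post; the EXPLORE entry lists just the three tools the post names for that stage, and the EXECUTE and RESPOND entries are assumptions. The helper name `toolsForStage` is hypothetical.

```typescript
type Stage = "EXPLORE" | "PLAN" | "EXECUTE" | "RESPOND";

const STAGE_TOOLS: Record<Stage, readonly string[]> = {
  // Tools the post names for exploration; the full list may be longer.
  EXPLORE: ["file.read", "repo.search", "git.log"],
  // Stated in the post: PLAN has exactly these two tools.
  PLAN: ["request_context", "todo.read"],
  // Assumption: the post does not list the EXECUTE tool set.
  EXECUTE: ["request_context", "file.edit", "run_command"],
  // Assumption: RESPOND only writes the final answer.
  RESPOND: [],
};

// Returns only the tools the given stage may see. Anything not on the
// allowlist is never sent to the model, so it cannot be called.
function toolsForStage<T>(stage: Stage, registry: Record<string, T>): Record<string, T> {
  const allowed: Record<string, T> = {};
  for (const name of STAGE_TOOLS[stage]) {
    const tool = registry[name];
    if (tool !== undefined) {
      allowed[name] = tool;
    }
  }
  return allowed;
}
```

Because the filter runs in application code, a tool definition that is not on the stage's list costs zero prompt tokens in that stage.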

Synchronous commands replaced terminal polling

The old pattern for running a shell command:

4 tool calls. 4 round trips. Each one includes the model's reasoning tokens plus the tool result tokens.

New pattern:

One call. Output is bounded to the last 4,000 characters to prevent a runaway `npm install` from dumping 15K characters into context. Default timeout of 120 seconds, max 5 minutes.

This eliminates roughly 70% of tool calls in command-heavy EXECUTE stages. Each avoided call saves both the model's reasoning about "what should I do next" and the tool definition overhead.
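The two guardrails on the single-call pattern, a tail-bounded output and a clamped timeout, are simple to express as pure functions. The function and constant names below are hypothetical; the limits are the ones quoted above.

```typescript
const MAX_OUTPUT_CHARS = 4_000;     // keep only the tail of the output
const DEFAULT_TIMEOUT_MS = 120_000; // 120 seconds
const MAX_TIMEOUT_MS = 300_000;     // 5 minutes

// Keeps the last `maxChars` characters, where errors and exit
// summaries usually appear, and drops everything before them.
function boundOutput(output: string, maxChars: number = MAX_OUTPUT_CHARS): string {
  if (output.length <= maxChars) return output;
  return output.slice(output.length - maxChars);
}

// Falls back to the default for missing or invalid values and never
// lets a requested timeout exceed the maximum.
function resolveTimeout(requestedMs?: number): number {
  if (requestedMs === undefined || !isFinite(requestedMs) || requestedMs <= 0) {
    return DEFAULT_TIMEOUT_MS;
  }
  return Math.min(requestedMs, MAX_TIMEOUT_MS);
}
```

Keeping the tail rather than the head is a deliberate choice: for builds and test runs, the failure message is almost always at the end.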

Self-gating: skip stages when possible

Not every task needs 4 stages. "Fix the typo on line 12" doesn't need a planning phase.

The explore model sets flags in its handoff:

`isSimpleTask: true` → skip PLAN, create a minimal handoff from exploration findings, go straight to EXECUTE

Empty plan (no files to modify, no files to create, no commands) → skip EXECUTE, go straight to RESPOND

A question like "What framework is this?" runs EXPLORE → RESPOND. Two LLM calls instead of four. A simple rename runs EXPLORE → EXECUTE → RESPOND. Three calls instead of four.

This is similar to how Claude Code uses plan mode only when needed, but again, the decision is structural (flags in typed handoffs) rather than left to the model's discretion.
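The two skip rules can be captured in one routing function. In this sketch the `PlanHandoff` field names and the function name `stagesToRun` are assumptions; `plan` stands for whichever plan is in effect, either the output of PLAN or the minimal handoff derived from exploration when `isSimpleTask` is set.

```typescript
type Stage = "EXPLORE" | "PLAN" | "EXECUTE" | "RESPOND";

// Hypothetical field names for the three things the post says a plan can contain.
interface PlanHandoff {
  filesToModify: string[];
  filesToCreate: string[];
  commands: string[];
}

function isEmptyPlan(plan: PlanHandoff): boolean {
  return (
    plan.filesToModify.length === 0 &&
    plan.filesToCreate.length === 0 &&
    plan.commands.length === 0
  );
}

// Decides which stages run, based only on typed flags and plan contents.
function stagesToRun(isSimpleTask: boolean, plan: PlanHandoff): Stage[] {
  const stages: Stage[] = ["EXPLORE"];
  if (!isSimpleTask) stages.push("PLAN");        // simple tasks skip planning
  if (!isEmptyPlan(plan)) stages.push("EXECUTE"); // nothing to do: skip execution
  stages.push("RESPOND");
  return stages;
}
```

A question with an empty plan yields EXPLORE → RESPOND, and a simple rename with one file to modify yields EXPLORE → EXECUTE → RESPOND, matching the two examples above.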

Search/replace over unified diffs

We had two file editing tools: `file.apply_diff` (unified diff format) and `file.edit` (search → replace pairs).

Accuracy from the Diff-XYZ benchmark:

`file.edit` (search/replace): 94% exact match

`file.apply_diff` (unified diff): 82% exact match

LLMs are bad at line numbers. Unified diffs require correct line numbers, correct context lines, correct `+`/`-` prefixes. When the model gets any of it wrong, the tool fails, the model retries, and you burn tokens on the retry loop.

Search/replace doesn't need line numbers. The model writes the text it sees and the text it wants instead. The tool tries exact matching first, falls back to whitespace-normalized matching, then fuzzy matching. Only the first occurrence is replaced.

We deleted `file.apply_diff` entirely. Fewer failed edits means fewer retries means fewer wasted tokens.
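A minimal search/replace edit with the first two matching tiers might look like the sketch below. It is not the author's `file.edit` implementation: the function names are hypothetical, the fuzzy-matching tier is omitted, and the whitespace-normalized tier is approximated by treating any run of whitespace in the search text as matching any run of whitespace in the file.

```typescript
type MatchKind = "exact" | "whitespace-normalized";

interface EditResult {
  content: string;
  matched: MatchKind;
}

// Escapes characters that have special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replaces only the first occurrence of `search`. Returns null when
// nothing matches so the caller can report a failed edit.
function applyEdit(content: string, search: string, replacement: string): EditResult | null {
  if (search.length === 0) return null;

  // Tier 1: exact match.
  const at = content.indexOf(search);
  if (at !== -1) {
    return {
      content: content.slice(0, at) + replacement + content.slice(at + search.length),
      matched: "exact",
    };
  }

  // Tier 2: same tokens in the same order, any whitespace between them.
  const tokens = search.trim().split(/\s+/);
  if (tokens.length === 1 && tokens[0] === "") return null; // whitespace-only search
  const pattern = new RegExp(tokens.map(escapeRegExp).join("\\s+"));
  const m = pattern.exec(content);
  if (m === null) return null;
  return {
    content: content.slice(0, m.index) + replacement + content.slice(m.index + m[0].length),
    matched: "whitespace-normalized",
  };
}
```

Building the result with `slice` rather than `String.prototype.replace` avoids accidental interpretation of `$` sequences in the replacement text.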

Temperature 0 everywhere, bounded outputs

Every stage runs at temperature 0. Deterministic outputs mean consistent tool call patterns and no wasted tokens from poor sampling. For coding agents, you want the model to pick the correct tool call and move on — creativity in generation is rarely what you need.

The RESPOND stage has a hard `maxTokens: 4096` cap. `run_command` returns only the last 4,000 characters. These are small guardrails, but they prevent the occasional command that floods your context.

The compound effect

No single technique here is revolutionary. The power is in how they compose:

How this compares:

Cursor recently reported a 46.9% token reduction from their Dynamic Context Discovery approach, retrieving context on demand instead of pre-loading it. Their technique is reactive and operates within a single context window.

Claude Code uses Haiku sub-agents in isolated context windows with condensed summaries back to the main agent, conceptually closest to our approach, but within a single-conversation framework rather than a structured pipeline.

Our pipeline achieves a similar ~40-50% reduction through a fundamentally different mechanism: proactive structural isolation rather than reactive compression. The tradeoff is rigidity: hard stage boundaries can lose nuance that a continuous context preserves. The typed handoff schemas are our mitigation: they force the model to preserve what matters in a predictable shape.

The mental model:

Stop thinking of your AI agent as one smart model in a loop. Think of it as a pipeline of specialists: cheap models doing the reading, expensive models doing only what they're uniquely good at, and typed contracts ensuring nothing leaks between them.

Your most expensive tokens should be spent reasoning about *what to change*, not reading files to understand *what exists*.

Posted: 2026-02-08T09:50:53.000Z

Engagement: 6 likes, 3 retweets, 0 replies