TWITTER_ARTICLE

Brief

Amy Tam frames token spending as the new cloud-compute accountability problem for AI startups: costs may be falling in absolute terms, but usage is scaling faster and becoming visible enough to threaten unit economics. She argues builders face two distinct optimization problems: making each token cheaper to produce and reducing the number of tokens produced at all. vLLM addresses the first with serving-side efficiency techniques such as PagedAttention and continuous batching, while SGLang targets the second through constrained generation, structured outputs, and early stopping. Tam notes that many teams can cut costs by roughly 3x through infrastructure changes, but far bigger inefficiencies often come from over-generation, oversized contexts, and unnecessary reasoning steps. She believes inference economics will improve quickly, citing Groq at 500+ tokens/sec, speculative decoding with 2-3x gains, and distillation trends, and recommends not over-optimizing prematurely as long as products deliver clear user value. The key discipline is observability: understanding which users and features consume tokens, so teams can keep building ambitious LLM products while monitoring where economics might fail first.

Why it matters

Amy Tam argues that AI application founders are now focused more on token costs than model quality, citing examples such as a $40,000 monthly OpenAI bill, single users costing $100, and unit economics failing around 10,000 users.

Key details

  • The post separates LLM cost optimization into two layers: reducing cost per token with vLLM through serving improvements like PagedAttention and continuous batching, and reducing total tokens generated with SGLang through constrained generation, structured outputs, and early stopping.
  • Tam says teams often get an initial 3x cost reduction by switching serving infrastructure or self-hosting, but larger savings may come from eliminating unnecessary generation, such as agents producing 10x more tokens than needed or bloated context windows.
  • The article points to rapid infrastructure improvements already underway, including Groq exceeding 500 tokens per second, speculative decoding delivering 2-3x speedups, and model distillation approaching GPT-4-class quality at much lower compute cost.
  • Tam predicts LLM inference could become 10x cheaper and faster within two years and perhaps 100x within five, so builders should prioritize products with strong user value while adding instrumentation to track which features, users, and environments drive token spend.

Cleaned source text

title: @amytam01: I've been talking to founders building AI applications lately, and the conversat...

author: amytam01

content_type: twitter_article

published: 2026-02-10T22:04:04+00:00

source_url: https://x.com/amytam01/status/2021344443576746186

word_count: 832

I've been talking to founders building AI applications lately, and the conversations keep coming back to costs. Not model quality. Not accuracy. Cost. "Our OpenAI bill hit $40k last month." "We had to add rate limiting because a single user could cost us $100." "Our unit economics break down once we hit 10k users."

These aren't complaints about tokens being expensive. Relative to what you can do with them, they're still incredibly cheap. But nobody seems to know where the spend is actually going.

It’s like what happened with cloud computing in the late 2000s.

Before AWS, you bought physical servers: capital expenditure buried in your infrastructure budget. EC2 made compute a line item: $0.10/hour, billed to your credit card monthly. Costs actually went down, but suddenly everyone cared about efficiency. Why? Because it was measurable and accountable. Same with bandwidth when mobile exploded. The resource got cheaper, but usage grew faster, so optimization mattered more.

I think we’re at that inflection point with tokens right now. For 18 months, most teams treated them like the pre-EC2 era; you called an API, got charged, didn’t look too closely. OpenAI credits, beta pricing, VC funding. But now teams are moving to production scale, self-hosting (very measurable GPU costs), and actually caring about unit economics. It’s happening faster than the previous shifts because the cost feedback loop is tighter; you can burn thousands of dollars in a day if you’re not paying attention.

So I started looking at what founders are actually building to address this.

The Two Optimization Problems

Two projects keep coming up in conversations: vLLM and SGLang. At first glance they look similar, both about “making LLMs faster.” But they’re solving different layers of the same fundamental problem:

1. Cost per token — how efficiently can you generate each token?

2. Number of tokens — how many tokens do you actually need to generate?

vLLM: Makes each token cheaper to generate (better serving, PagedAttention, continuous batching)

SGLang: Generates fewer tokens in the first place (constrained generation, structured outputs, early stopping)

Most teams I talk to only notice the first problem initially. Inference costs are climbing, so they optimize serving. Maybe they switch to vLLM, maybe they self-host, costs drop 3x.

But then the second problem quietly becomes the bigger issue as the product grows more complex. That agent that looks great in demos? It’s generating 10x more tokens than it needs to. Your context windows are bloated with redundant information. You’re paying for reasoning steps that don’t improve the output. You’re generating tokens you never should have created in the first place. The cheapest token is the one you never generate.
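
The two levers multiply, which a back-of-the-envelope calculation makes concrete. The sketch below is illustrative only: the request volume, token counts, and price are made-up assumptions chosen to land on a $40k baseline, and only the 3x and 10x factors come from the text above.

```python
def monthly_cost(requests, tokens_per_request, price_per_million_tokens):
    """Monthly spend = requests x tokens per request x price per token."""
    return requests * tokens_per_request * price_per_million_tokens / 1_000_000


# Hypothetical baseline: 1M requests/month, 4,000 generated tokens each,
# at an assumed $10 per 1M tokens.
REQUESTS = 1_000_000
TOKENS = 4_000
PRICE = 10.0

baseline = monthly_cost(REQUESTS, TOKENS, PRICE)

# Lever 1 (cost per token): serving changes make each token 3x cheaper.
cheaper_serving = monthly_cost(REQUESTS, TOKENS, PRICE / 3)

# Lever 2 (number of tokens): the agent stops generating 10x more than needed.
fewer_tokens = monthly_cost(REQUESTS, TOKENS // 10, PRICE)

# Both levers together compound to a 30x reduction.
both = monthly_cost(REQUESTS, TOKENS // 10, PRICE / 3)

for label, value in [
    ("baseline", baseline),
    ("cheaper serving (3x)", cheaper_serving),
    ("fewer tokens (10x)", fewer_tokens),
    ("both", both),
]:
    print(f"{label:>22}: ${value:,.0f}/month")
```

Under these assumed numbers, cutting tokens saves more than the serving switch does, and the two savings stack rather than compete.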

But What If It Gets Really Fast and Really Cheap?

It will. And that means everything.

Groq is already hitting 500+ tokens/sec. Speculative decoding (using a cheap model to guess ahead, and a smart model to check the work) delivers 2-3x speedups. Model distillation gives GPT-4-class quality at a fraction of the compute.
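
To make the "guess ahead, then check" idea concrete, here is a toy sketch of the greedy variant of speculative decoding. Everything in it is an assumption for illustration: the two "models" are hard-coded lookup functions, the names (`speculative_step`, `generate`, `draft_next`, `target_next`) are hypothetical, and verification is a plain loop. Production systems score all drafted positions in one batched pass of the large model and typically use a probabilistic acceptance rule instead of exact matching.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"
NextToken = Callable[[List[str]], str]  # greedy next-token function


def speculative_step(prefix: List[str], draft_next: NextToken,
                     target_next: NextToken, k: int = 4) -> List[str]:
    """One round: the draft proposes k tokens, the target verifies them."""
    # 1. The cheap model guesses k tokens ahead.
    proposal: List[str] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The smart model checks each guess in order.
    accepted: List[str] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong guess and stop
            return accepted
        accepted.append(tok)

    # 3. Every guess matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(draft_next: NextToken, target_next: NextToken,
             k: int = 4, max_tokens: int = 50) -> Tuple[List[str], int]:
    """Decode until EOS; return the tokens and the number of verify rounds."""
    out: List[str] = []
    rounds = 0
    while len(out) < max_tokens:
        rounds += 1
        step = speculative_step(out, draft_next, target_next, k)
        if EOS in step:
            out += step[:step.index(EOS)]
            break
        out += step
    return out, rounds


# Toy "models": each just reads the next word off a fixed sentence.
TARGET_TEXT = "the cheapest token is the one you never generate".split()
DRAFT_TEXT = "the cheapest word is the one you never generate".split()  # one wrong guess


def target_next(ctx: List[str]) -> str:
    return TARGET_TEXT[len(ctx)] if len(ctx) < len(TARGET_TEXT) else EOS


def draft_next(ctx: List[str]) -> str:
    return DRAFT_TEXT[len(ctx)] if len(ctx) < len(DRAFT_TEXT) else EOS


tokens, rounds = generate(draft_next, target_next, k=4)
print(" ".join(tokens), f"({rounds} verification rounds)")
```

The output is identical to what the target alone would have produced; the draft only changes how many tokens each verification round yields. The toy makes no claim about real-world speedups, which depend on how often the draft agrees with the target.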

I think we're looking at 10x cheaper and faster in the next two years. Maybe 100x in five.

Which means if your unit economics are underwater today but you’re creating real value for users, the bet that it’ll work eventually is actually valid.
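
One rough way to sanity-check that bet is to ask how long the cost curve needs to run before a user becomes profitable. The sketch below assumes a smooth exponential decline that matches the 10x-in-two-years guess above; the per-user dollar figures and the function name `years_to_break_even` are hypothetical, and real price drops arrive in uneven steps.

```python
import math


def years_to_break_even(cost_per_user, revenue_per_user,
                        factor=10.0, over_years=2.0):
    """Years until cost <= revenue if cost falls by `factor` every `over_years`.

    Solves cost_per_user / factor ** (t / over_years) = revenue_per_user for t.
    """
    if cost_per_user <= revenue_per_user:
        return 0.0
    return over_years * math.log(cost_per_user / revenue_per_user) / math.log(factor)


# Implied yearly improvement for each forecast.
per_year_10x_in_2 = 10 ** (1 / 2)   # about 3.16x per year
per_year_100x_in_5 = 100 ** (1 / 5)  # about 2.51x per year

# Hypothetical: $30/month in tokens per user against $20/month in revenue.
mildly_underwater = years_to_break_even(30.0, 20.0)

# Hypothetical: token cost is 10x revenue per user.
deeply_underwater = years_to_break_even(100.0, 10.0)

print(f"{per_year_10x_in_2:.2f}x/yr, {per_year_100x_in_5:.2f}x/yr")
print(f"mildly underwater: {mildly_underwater * 12:.1f} months")
print(f"deeply underwater: {deeply_underwater:.1f} years")
```

Under these assumptions a product that is 50% over budget per user is a few months from break-even, while one that is 10x over needs the full two years of runway.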

This is different from other infrastructure shifts. Cloud compute and bandwidth plateaued; they got cheaper, but not orders of magnitude cheaper. Tokens? We’re still in the exponential part of the curve.

So what does that mean for builders?

Don’t over-optimize for today’s costs. If you’re building something users love and you’re spending $10k/month on tokens but your product is clearly working, you’re probably fine. Costs will come down. Speed will go up. Features that are prohibitively expensive (to build or offer) today will be cheap next year.

But do build visibility. Not so you can optimize every token, but so you understand what breaks first if you’re wrong. Know which features are expensive. Know which users cost you more. Know where the spend is going.
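
A minimal sketch of what that visibility can look like: log one record per LLM call, tagged with user, feature, and environment, then roll the cost up along each dimension. The record fields, the sample data, and the per-token prices below are all assumptions for illustration; substitute whatever your provider or GPU bill actually implies.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    """One LLM call, tagged with who and what triggered it."""
    user_id: str
    feature: str
    env: str  # e.g. "prod" or "dev"
    input_tokens: int
    output_tokens: int


# Hypothetical prices in USD per 1M tokens.
PRICE_IN = 2.50
PRICE_OUT = 10.00


def cost(u: Usage) -> float:
    return (u.input_tokens * PRICE_IN + u.output_tokens * PRICE_OUT) / 1_000_000


def spend_by(records, key):
    """Total cost grouped by one Usage field, most expensive first."""
    totals = defaultdict(float)
    for r in records:
        totals[getattr(r, key)] += cost(r)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up sample log.
records = [
    Usage("u1", "agent", "prod", 200_000, 50_000),
    Usage("u2", "search", "prod", 40_000, 2_000),
    Usage("u1", "agent", "dev", 100_000, 10_000),
    Usage("u3", "summarize", "prod", 80_000, 4_000),
]

for dimension in ("feature", "user_id", "env"):
    print(dimension)
    for name, dollars in spend_by(records, dimension):
        print(f"  {name:<10} ${dollars:.2f}")
```

The same three rollups answer the questions above directly: which features are expensive, which users cost more, and how much of the spend is dev/testing rather than production.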

The teams I’m excited about aren’t micro-optimizing tokens. They’re building products that would be impossible without LLMs, betting the economics will catch up. And having the instrumentation to know if they need to course-correct.

vLLM and SGLang still matter, but not because tokens are expensive. They matter because they let you do more with the same budget, which means you can ship faster, serve more users, and build more ambitious features while the cost curve drops.

The real question isn’t “can we afford this?” It’s “are we building something valuable enough that it’ll be obvious to keep funding it until the economics work?”

If yes, keep building. The infrastructure will catch up.

I’m practicing writing as a way of thinking: If you’re building on LLMs and have a token cost story (good or bad), I’d love to hear it. What am I missing? Where is your token spend actually concentrated? (Which features, which users?) If you’re spending $100+/day on tokens, how much is dev/testing vs production?

Posted: 2026-02-10T22:04:04.000Z

Engagement: 83 likes, 15 retweets, 8 replies