title: @amytam01: I've been talking to founders building AI applications lately, and the conversat...
author: amytam01
content_type: twitter_article
published: 2026-02-10T22:04:04+00:00
source_url: https://x.com/amytam01/status/2021344443576746186
word_count: 832
I've been talking to founders building AI applications lately, and the conversations keep coming back to costs. Not model quality. Not accuracy. Cost. "Our OpenAI bill hit $40k last month." "We had to add rate limiting because a single user could cost us $100." "Our unit economics break down once we hit 10k users."
These aren't complaints about tokens being expensive. Relative to what you can do with them, they're still incredibly cheap. But nobody seems to know where the spend is actually going.
It’s like what happened with cloud computing in the late 2000s.
Before AWS, you bought physical servers: capital expenditure buried in your infrastructure budget. EC2 made compute a line item: $0.10/hour, billed to your credit card monthly. Costs actually went down, but suddenly everyone cared about efficiency. Why? Because it was measurable and accountable. Same with bandwidth when mobile exploded. The resource got cheaper, but usage grew faster, so optimization mattered more.
I think we’re at that inflection point with tokens right now. For 18 months, most teams treated tokens the way companies treated servers before EC2: you called an API, got charged, and didn’t look too closely. OpenAI credits, beta pricing, and VC funding absorbed the bill. But now teams are moving to production scale, self-hosting (very measurable GPU costs), and actually caring about unit economics. It’s happening faster than the previous shifts because the cost feedback loop is tighter: you can burn thousands of dollars in a day if you’re not paying attention.
So I started looking at what founders are actually building to address this.
The Two Optimization Problems
Two projects keep coming up in conversations: vLLM and SGLang. At first glance they look similar, both about “making LLMs faster.” But they’re solving different layers of the same fundamental problem:
1. Cost per token — how efficiently can you generate each token?
2. Number of tokens — how many tokens do you actually need to generate?
vLLM: Makes each token cheaper to generate (better serving, PagedAttention, continuous batching)
SGLang: Generates fewer tokens in the first place (constrained generation, structured outputs, early stopping)
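One way to read the two problems above: monthly spend is their product, so improvements on either side multiply. A minimal sketch with hypothetical workload numbers (the request volume, tokens per request, and price are invented for illustration; only the 3x and 10x factors echo figures used elsewhere in this piece):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Workload:
    requests_per_month: int
    output_tokens_per_request: int
    usd_per_million_tokens: float  # hypothetical blended price

    def monthly_cost(self) -> float:
        """Spend = number of tokens x cost per token."""
        total_tokens = self.requests_per_month * self.output_tokens_per_request
        return total_tokens / 1_000_000 * self.usd_per_million_tokens


# Hypothetical baseline: 1M requests, 2,000 output tokens each, $9 per 1M tokens.
baseline = Workload(1_000_000, 2_000, 9.0)

# Problem 1: cheaper tokens (e.g. better serving gives a 3x lower price).
cheaper_tokens = replace(baseline, usd_per_million_tokens=3.0)

# Problem 2: fewer tokens (e.g. the agent stops generating 10x more than needed).
fewer_tokens = replace(baseline, output_tokens_per_request=200)

# Both at once: the savings multiply (3x * 10x = 30x).
both = replace(cheaper_tokens, output_tokens_per_request=200)

for name, w in [("baseline", baseline), ("cheaper tokens", cheaper_tokens),
                ("fewer tokens", fewer_tokens), ("both", both)]:
    print(f"{name:>15}: ${w.monthly_cost():,.0f}/month")
```

The point of the sketch is only the structure of the formula: a team that fixes serving but ignores token count leaves the larger factor on the table.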
Most teams I talk to only notice the first problem initially. Inference costs are climbing, so they optimize serving. Maybe they switch to vLLM, maybe they self-host, and costs drop 3x.
But then the second problem quietly becomes the bigger issue as the product gets more complex. That agent that looks great in demos? It’s generating 10x more tokens than it needs to. Your context windows are bloated with redundant information. You’re paying for reasoning steps that don’t improve the output. You’re generating tokens you never should have created in the first place. The cheapest token is the one you never generate.
But What If It Gets Really Fast and Really Cheap?
It will. And that means everything.
Groq is already hitting 500+ tokens/sec. Speculative decoding (using a cheap model to guess ahead, and a smart model to check the work) delivers 2-3x speedups. Model distillation gives GPT-4-class quality at a fraction of the compute.
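To make the "guess ahead, then check" idea concrete, here is a toy sketch of the greedy variant of speculative decoding. Everything in it is an assumption for illustration: the two "models" are tiny deterministic functions, not real LLMs, and the verification loop calls the target once per position where a real system would score all drafted positions in a single batched forward pass. Production implementations also use a probabilistic accept/reject rule rather than exact matching.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Return (new_tokens, target_passes) using greedy draft-and-verify."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model guesses up to k tokens ahead.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks the guesses (counted as one pass; see lead-in).
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            accepted.append(correct)  # always keep the target's own token
            if correct != guess:
                break  # first wrong guess: discard the rest of the draft
        else:
            # Every guess matched, so the same pass yields one bonus token.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[len(prompt):], target_passes


# Hypothetical toy models: the target counts upward mod 10;
# the draft agrees except that it guesses wrong right after a 2.
def target_model(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


new_tokens, passes = speculative_decode(target_model, draft_model, [0], 12)
print(new_tokens, "target passes:", passes)
```

Because the target's token is kept at every position, the output is identical to what the target would have produced on its own; the saving comes from needing fewer sequential passes of the expensive model when the draft guesses well.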
I think we're looking at 10x cheaper and faster in the next two years. Maybe 100x in five.
Which means if your unit economics are underwater today but you’re creating real value for users, the bet that it’ll work eventually is actually valid.
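As a rough way to sanity-check that bet, the guesses above can be turned into an annual rate of decline and a break-even horizon. This is a back-of-the-envelope sketch, not a forecast: it assumes a smooth exponential decline, and the per-user cost and revenue figures are hypothetical.

```python
import math


def annual_factor(total_factor: float, years: float) -> float:
    """Yearly improvement factor implied by 'total_factor cheaper in `years` years'."""
    return total_factor ** (1.0 / years)


def years_to_break_even(cost_per_user: float, revenue_per_user: float,
                        yearly_factor: float) -> float:
    """Years until cost falls to revenue, assuming cost shrinks by yearly_factor per year."""
    if cost_per_user <= revenue_per_user:
        return 0.0
    return math.log(cost_per_user / revenue_per_user) / math.log(yearly_factor)


two_year_rate = annual_factor(10, 2)    # "10x in the next two years"
five_year_rate = annual_factor(100, 5)  # "maybe 100x in five"
print(f"implied yearly factor: {two_year_rate:.2f}x vs {five_year_rate:.2f}x")

# Hypothetical product: $30/user/month in tokens against $10/user/month revenue.
print(f"break-even in about {years_to_break_even(30, 10, two_year_rate):.1f} years")
```

Note that the two guesses imply different yearly rates (roughly 3.2x per year for the first, 2.5x per year for the second), so the five-year figure already bakes in some slowdown.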
This is different from other infrastructure shifts. Cloud compute and bandwidth plateaued; they got cheaper, but not orders of magnitude cheaper. Tokens? We’re still in the exponential part of the curve.
So what does that mean for builders?
Don’t over-optimize for today’s costs. If you’re building something users love and you’re spending $10k/month on tokens but your product is clearly working, you’re probably fine. Costs will come down. Speed will go up. Features that are prohibitively expensive (to build or offer) today will be cheap next year.
But do build visibility. Not so you can optimize every token, but so you understand what breaks first if you’re wrong. Know which features are expensive. Know which users cost you more. Know where the spend is going.
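Visibility at this level does not need much machinery. A minimal sketch, assuming you can log token counts per request (most provider responses report them) along with a feature name, user ID, and environment; the record shape, prices, and sample data here are all hypothetical:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Usage:
    feature: str
    user_id: str
    env: str  # e.g. "prod" or "dev"
    input_tokens: int
    output_tokens: int


# Hypothetical prices in USD per million tokens; substitute your own.
PRICE_IN = 2.0
PRICE_OUT = 8.0


def cost_usd(u: Usage) -> float:
    return (u.input_tokens * PRICE_IN + u.output_tokens * PRICE_OUT) / 1_000_000


def spend_by(records: Iterable[Usage], key: str) -> List[Tuple[str, float]]:
    """Total spend grouped by one field of Usage, largest first."""
    totals = defaultdict(float)
    for r in records:
        totals[getattr(r, key)] += cost_usd(r)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


records = [
    Usage("agent", "u1", "prod", 500_000, 250_000),
    Usage("search", "u2", "prod", 100_000, 25_000),
    Usage("agent", "u1", "dev", 250_000, 125_000),
    Usage("search", "u1", "prod", 50_000, 0),
]

for key in ("feature", "user_id", "env"):
    print(key, spend_by(records, key))
```

Three group-bys answer the three questions above: which features are expensive, which users cost more, and how much of the bill is dev/testing rather than production.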
The teams I’m excited about aren’t micro-optimizing tokens. They’re building products that would be impossible without LLMs, betting the economics will catch up, and keeping the instrumentation in place to know if they need to course-correct.
vLLM and SGLang still matter, but not because tokens are expensive. They matter because they let you do more with the same budget, which means you can ship faster, serve more users, and build more ambitious features while the cost curve drops.
The real question isn’t “can we afford this?” It’s “are we building something valuable enough that it’ll be obvious to keep funding it until the economics work?”
If yes, keep building. The infrastructure will catch up.
I’m practicing writing as a way of thinking. If you’re building on LLMs and have a token cost story (good or bad), I’d love to hear it. What am I missing? Where is your token spend actually concentrated? (Which features, which users?) If you’re spending $100+/day on tokens, how much is dev/testing vs production?