title: @derrickcchoi: I work on the...
author: derrickcchoi
content_type: twitter_article
published: 2026-02-06T16:16:19+00:00
source_url: https://x.com/derrickcchoi/status/2019807376136609804
word_count: 1678
I work on the Codex team at @OpenAI and I wanted to see what a long-running AI teammate looks like without needing to baby the agent.
Since I have the luxury of being able to use and test GPT-5.3-Codex internally before launch, I decided to run Codex at “Extra High” reasoning and full access mode and asked it to implement a design tool from scratch. It ended up running for ~25 hours uninterrupted, using ~13M total tokens, and generating ~30k LOC.
Reminder that this was more of an experiment, not an enterprise-grade rollout, and I wouldn’t ship the initial output without further review. But Codex did really well overall on the parts that matter for a long-running agent: it followed the spec, reasoned about what it needed to do, passed its verification steps, and produced a result that exceeded my expectations.
If you're interested in diving deeper yourself, I have the code on GitHub in a public repo (design-desk). Built with React, TypeScript, Vite, and Tailwind.
I asked Codex to generate a simple summary page of the session data as well.
High level summary of session
Session data in the CLI
1. Why long-running agents are becoming more real
This isn’t just “models got smarter.” The real shift is time horizon: agents can now stay coherent longer and complete bigger chunks of work end-to-end. In November 2025, we launched GPT-5.1-Codex-Max, which was the first model we trained to compact across multiple context windows.
METR’s time-horizon evaluations are a good framing: the length of software tasks frontier agents can complete with ~50% and 80% reliability has been climbing fast, with a roughly 7-month doubling time. And OpenAI models currently sit at the top in both categories.
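To make the doubling-time framing concrete, here is a minimal sketch of the arithmetic. The 7-month doubling time is the rough figure quoted above; the 2-hour starting horizon is a made-up illustrative number, not a METR measurement, and the function name is hypothetical.

```python
def projected_horizon(current_hours: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate a task time horizon, assuming steady exponential growth."""
    return current_hours * 2 ** (months_ahead / doubling_months)


# Hypothetical starting point: an agent that reliably handles ~2-hour tasks.
start_hours = 2.0
for months in (0, 7, 14, 21):
    hours = projected_horizon(start_hours, months)
    print(f"{months:>2} months out: ~{hours:.0f}h tasks")
```

Under that assumption the horizon doubles every seven months, so three doublings (21 months) turn a 2-hour task horizon into a 16-hour one. Real progress is unlikely to follow a clean exponential, so treat this as a way to read the trend, not a forecast.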
Our recent GPT-5.3-Codex launch (announcement) pushes this further for agent work in two practical ways:
It’s better at multi-step execution (plan → implement → validate → repair).
It’s easier to steer mid-flight without resetting the whole run (course corrections don’t wipe progress).
I was also inspired by Cursor’s recent blog post, where they built a browser by running GPT-5.2-Codex uninterrupted for a week, which resulted in 3M lines of code.
The Cursor team wrote that OpenAI models are
> much better at extended autonomous work: following instructions, keeping focus, avoiding drift, and implementing things precisely and completely.
2. The harness matters as much as the model
Long-running work is less about one giant prompt and more about the agent loop the model operates inside.
In Codex, the loop looks like: plan → edit → run tools (tests/build/lint) → observe results → repair → document → repeat. That loop:
Gives the agent real feedback (errors, diffs, logs)
Keeps state externalized (repo + worktrees + docs)
Makes long runs steerable: you can correct direction based on outcomes
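The loop described above can be sketched in a few lines. This is a simplified illustration of the idea, not how Codex is actually implemented: every name here (`CheckResult`, `RunLog`, `run_milestone`, the callables passed in) is hypothetical, and the model calls are stubbed out as plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CheckResult:
    name: str          # e.g. "lint", "tests", "build"
    passed: bool
    log: str = ""      # the real feedback the agent gets to read


@dataclass
class RunLog:
    # Stand-in for externalized state such as a status document in the repo.
    notes: List[str] = field(default_factory=list)


def run_milestone(
    milestone: str,
    edit: Callable[[str], None],
    run_checks: Callable[[], List[CheckResult]],
    repair: Callable[[CheckResult], None],
    log: RunLog,
    max_repairs: int = 3,
) -> bool:
    """Edit, run tools, observe, repair, document. Returns True when checks pass."""
    edit(milestone)
    for attempt in range(max_repairs + 1):
        failures = [r for r in run_checks() if not r.passed]
        if not failures:
            log.notes.append(f"{milestone}: done after {attempt} repair(s)")
            return True
        if attempt == max_repairs:
            break
        for failure in failures:
            repair(failure)
    # Out of repair budget: record it so a human can steer the run.
    log.notes.append(f"{milestone}: blocked, needs human steering")
    return False
```

The point of the shape is that progress is gated on tool output rather than on the model's own opinion of its work, and that every outcome lands in a log that outlives any single context window.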
This is also why Codex models feel better on Codex surfaces than a generic chat window: the harness supplies structured context (repo metadata, file tree, diffs, command outputs) and enforces a disciplined “done when” routine.
We recently published an article about the Codex agent loop that has more details.
To top this off, we also launched the Codex app this week, which makes that loop usable day-to-day:
Parallel threads across projects (long work doesn’t block your day job)
Skills (standardize plan/implement/test/report)
Automations (routine work in the background)
Git worktrees (isolate runs, keep diffs reviewable, reduce thrash)
3. What I did (the setup)
I picked a design tool for this “experiment” because it’s an unforgiving test: UI + data model + editing operations + lots of edge cases. You can’t bluff it. If the architecture is wrong, it breaks quickly.
I gave GPT-5.3-Codex a meaty spec and ran it at “Extra High” reasoning. It ran uninterrupted for ~25 hours, stayed coherent, and shipped quality code. The model also ran verification steps (tests, lint, typecheck) for every milestone it completed.
The key idea was durable project memory. I wrote down the spec, plan, and constraints in markdown files that Codex could repeatedly reference. That prevented drift and kept a stable definition of “done.”
File stack (linked to each file on GitHub):
Prompt.md (spec + deliverables)
Plans.md (milestones + validations)
Architecture.md (principles + constraints)
Implement.md (execution instructions referencing the plan)
Documentation.md (status + decisions as it shipped)
Codex doesn't just write code and hope it works. After each milestone, it runs the repository’s verification commands such as build/tests/lint. If something fails, like npm run lint, it stops, fixes the issue, and only then moves on to the next milestone (example below).
Npm run commands that Codex ran
Codex correcting issues after npm run lint
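For anyone who wants to reproduce the "verify before moving on" gate outside of Codex, here is a minimal sketch. Only `npm run lint` is named in this post; the `test` and `build` script names are my assumption about a typical Vite + TypeScript project, and `first_failure` and `run_command` are hypothetical helpers.

```python
import subprocess
from typing import Callable, Optional, Sequence, Tuple

Command = Tuple[str, ...]

# Assumed script names; only `npm run lint` is confirmed by the post.
VERIFY_COMMANDS: Tuple[Command, ...] = (
    ("npm", "run", "lint"),
    ("npm", "run", "test"),
    ("npm", "run", "build"),
)


def run_command(command: Command) -> int:
    """Run one command in the current directory and return its exit code."""
    return subprocess.run(list(command), check=False).returncode


def first_failure(
    commands: Sequence[Command],
    run: Callable[[Command], int] = run_command,
) -> Optional[Command]:
    """Run commands in order and stop at the first non-zero exit code."""
    for command in commands:
        if run(command) != 0:
            return command
    return None
```

A milestone would only count as complete when `first_failure(VERIFY_COMMANDS)` returns `None`. Anything else names the command to fix before continuing. The `run` parameter exists so the gate can be exercised with a fake runner, without invoking npm.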
The output is by no means perfect, and I wouldn’t call it fully production-ready, but it is real enough that “it compiles” isn’t the bar. As I mentioned above, the bar was: does the application adhere to my instructions, and does it actually work?
4. The "memory stack"
(kickoff contract)
Purpose: Freeze the target so the agent doesn’t “build something impressive but wrong.”
Key sections in the file:
Goals + non-goals
Hard constraints (perf, determinism, UX, platform)
Deliverables (what must exist when finished)
“Done when” (checks + demo flow)
As you can see from the screenshot below, this generated the Plans markdown file.
The prompt I gave Codex to kickstart the work
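If you write your own kickoff file, a small check like the one below can catch a missing section before a long run starts. This is a sketch under assumptions: the heading names are taken from the list of key sections above, not from the actual file in the repo, and `missing_sections` is a hypothetical helper.

```python
from typing import List

# Assumed heading names, based on the key sections listed above.
REQUIRED_SECTIONS = (
    "Goals",
    "Non-goals",
    "Hard constraints",
    "Deliverables",
    "Done when",
)


def missing_sections(markdown: str) -> List[str]:
    """Return required sections that have no matching markdown heading."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in markdown.splitlines()
        if line.startswith("#")
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


draft = """# Design tool spec
## Goals
## Non-goals
## Deliverables
"""
print(missing_sections(draft))  # ['Hard constraints', 'Done when']
```

The check is deliberately strict about exact heading text. A spec that is missing its “Done when” section is exactly the kind of gap that lets an agent build something impressive but wrong.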
Plans.md (milestones + validation harness)