OpenAI engineer demonstrates GPT-5.3-Codex running autonomously for 25 hours to…

title: @derrickcchoi: I work on the...

author: derrickcchoi

content_type: twitter_article

published: 2026-02-06T16:16:19+00:00

source_url: https://x.com/derrickcchoi/status/2019807376136609804

word_count: 1678

I work on the Codex team at @OpenAI and I wanted to see what a long-running AI teammate looks like without needing to baby the agent.

Since I have the luxury of being able to use and test GPT-5.3-Codex internally before launch, I ran Codex at “Extra High” reasoning in full access mode and asked it to implement a design tool from scratch. It ended up running for ~25 hours uninterrupted, using ~13M total tokens, and generating ~30k LOC.

Reminder that this was more of an experiment, not an enterprise-grade rollout, and I wouldn’t ship the initial output without further review. But Codex did really well overall on the parts that matter for a long-run agent: it followed the spec, reasoned about what it needed to do, passed verification steps, and the result exceeded my expectations.

If you're interested in diving deeper yourself, I have the code on GitHub in a public repo (design-desk). Built with React, TypeScript, Vite, and Tailwind.

I asked Codex to generate a simple summary page of the session data as well.

High level summary of session

Session data in the CLI

1. Why long-running agents are becoming more real

This isn’t just “models got smarter.” The real shift is time horizon: agents can now stay coherent longer and complete bigger chunks of work end-to-end. In November 2025, we launched GPT-5.1-Codex-Max, which was the first model we trained to compact across multiple context windows.

METR’s time-horizon evaluations are a good framing: the length of software tasks frontier agents can complete with ~50% and 80% reliability has been climbing fast, with a rough ~7 month doubling time. And OpenAI models currently sit at the top in both categories.
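To make the doubling-time framing concrete, here is a minimal sketch of what a fixed doubling period implies. The function name, the 1-hour starting horizon, and the assumption of a clean exponential trend are all illustrative; this is not METR's methodology, and the 7-month default is just the rough figure quoted above.

```typescript
// Illustrative only: assumes a clean exponential trend with a fixed doubling period.
// `projectedHorizon` is a hypothetical helper, not part of any METR tooling.
function projectedHorizon(
  currentHorizonHours: number,
  monthsAhead: number,
  doublingMonths: number = 7 // rough doubling time quoted in the text
): number {
  if (doublingMonths <= 0) {
    throw new RangeError("doublingMonths must be positive");
  }
  // Each elapsed doubling period multiplies the horizon by 2.
  return currentHorizonHours * 2 ** (monthsAhead / doublingMonths);
}

// A hypothetical 1-hour horizon today, projected 7 and 14 months out.
const afterOneDoubling = projectedHorizon(1, 7); // 2 hours
const afterTwoDoublings = projectedHorizon(1, 14); // 4 hours
```

Under these assumptions, a task horizon grows roughly 4x over 14 months, which is why the time horizon can matter more than any single benchmark score.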

Our recent GPT-5.3-Codex launch (announcement) pushes this further for agent work in two practical ways:

It’s better at multi-step execution (plan → implement → validate → repair).

It’s easier to steer mid-flight without resetting the whole run (course corrections don’t wipe progress).

I was also inspired by Cursor’s recent blog post, where they built a browser by running GPT-5.2-Codex uninterrupted for a week, which resulted in 3M lines of code written.

The Cursor team wrote that OpenAI models are

> much better at extended autonomous work: following instructions, keeping focus, avoiding drift, and implementing things precisely and completely.

2. The harness matters as much as the model

Long-running work is less about one giant prompt and more about the agent loop the model operates inside.

In Codex, the loop looks like: plan → edit → run tools (tests/build/lint) → observe results → repair → document → repeat. That loop:

Gives the agent real feedback (errors, diffs, logs)

Keeps state externalized (repo + worktrees + docs)

Makes long runs steerable: you can correct direction based on outcomes

This is also why Codex models feel better on Codex surfaces than a generic chat window: the harness supplies structured context (repo metadata, file tree, diffs, command outputs) and enforces a disciplined “done when” routine.
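The loop described above can be sketched roughly as follows. This is a simplified illustration, not the Codex harness: every name here (`AgentSteps`, `runAgentLoop`, `maxAttempts`) is hypothetical, and the model calls and tool execution are abstracted behind an interface so the control flow is easy to see.

```typescript
// Hypothetical sketch of a plan → edit → run tools → observe → repair → document loop.
// None of these names come from Codex; the real harness is more involved.

type CheckResult = { name: string; ok: boolean; output: string };

interface AgentSteps {
  plan(): string[]; // ordered task list
  edit(task: string, failures: CheckResult[]): void; // first edit, or a repair
  runChecks(): CheckResult[]; // tests / build / lint
  document(task: string, attempts: number): void; // persist status notes
}

interface LoopOutcome {
  completed: string[];
  blockedOn: string | null; // task that exhausted its attempt budget, if any
}

function runAgentLoop(steps: AgentSteps, maxAttempts: number = 3): LoopOutcome {
  const completed: string[] = [];

  for (const task of steps.plan()) {
    let failures: CheckResult[] = [];
    let attempts = 0;
    let passed = false;

    while (attempts < maxAttempts && !passed) {
      steps.edit(task, failures); // edit, informed by the last observed failures
      attempts += 1;
      failures = steps.runChecks().filter((r) => !r.ok); // observe real feedback
      passed = failures.length === 0;
    }

    if (!passed) {
      // Surface the problem instead of drifting on to the next task.
      return { completed, blockedOn: task };
    }

    steps.document(task, attempts); // externalize state before moving on
    completed.push(task);
  }

  return { completed, blockedOn: null };
}
```

The design choice worth noting is that failures from one pass are fed into the next edit, and nothing counts as complete until the checks pass and the result has been documented.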

We recently published an article about the Codex agent loop that has more details.

To top this off, we also launched the Codex app this week, which makes that loop usable day-to-day:

Parallel threads across projects (long work doesn’t block your day job)

Skills (standardize plan/implement/test/report)

Automations (routine work in the background)

Git worktrees (isolate runs, keep diffs reviewable, reduce thrash)

3. What I did (the setup)

I picked a design tool for this “experiment” because it’s an unforgiving test: UI + data model + editing operations + lots of edge cases. You can’t bluff it. If the architecture is wrong, it breaks quickly.

I gave GPT-5.3-Codex a meaty spec and ran it at “Extra High” reasoning. It ended up running uninterrupted for ~25 hours, staying coherent and shipping quality code. The model also ran verification steps (tests, lint, typecheck) for every milestone it completed.

The key idea was durable project memory. I wrote down the spec, plan, and constraints in markdown files that Codex could repeatedly reference. That prevented drift and kept a stable definition of “done.”

File stack (linked to each file on GitHub):

Prompt.md (spec + deliverables)

Plans.md (milestones + validations)

Architecture.md (principles + constraints)

Implement.md (execution instructions referencing the plan)

Documentation.md (status + decisions as it shipped)

Codex doesn't just write code and hope it works. After each milestone, it runs the repository’s verification commands such as build/tests/lint. If something fails, like npm run lint, it stops, fixes the issue, and only then moves on to the next milestone (example below).

Npm run commands that Codex ran

Codex correcting issues after npm run lint
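The milestone gate described above amounts to running checks in order and refusing to advance on the first failure. Here is a minimal sketch under stated assumptions: `verifyMilestone`, `CommandRunner`, and the script names in `assumedChecks` are hypothetical, and the command runner is injected so the example stays self-contained.

```typescript
// Hypothetical milestone gate: run checks in order, stop at the first failure.
// This is an illustration of the idea, not how Codex implements it.

type CommandRunner = (command: string) => { exitCode: number; output: string };

interface GateResult {
  passed: boolean;
  failedCommand: string | null;
  output: string; // output of the failing command, to feed into a repair step
}

function verifyMilestone(commands: string[], run: CommandRunner): GateResult {
  for (const command of commands) {
    const result = run(command);
    if (result.exitCode !== 0) {
      // Stop here: later checks are skipped until this one is fixed.
      return { passed: false, failedCommand: command, output: result.output };
    }
  }
  return { passed: true, failedCommand: null, output: "" };
}

// Assumed script names; substitute whatever the repository's package.json defines.
const assumedChecks = ["npm run build", "npm run test", "npm run lint"];
```

In a Node environment the runner could wrap something like `child_process.spawnSync`. The failing command's output is what the agent would read before attempting a fix and re-running the gate.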

The output is by no means perfect or something I’d call fully production ready, but it is real enough that “it compiles” isn’t the bar. As I mentioned above, the bar was: does the application adhere to my instructions and does it actually work?

4. The "memory stack"

(kickoff contract)

Purpose: Freeze the target so the agent doesn’t “build something impressive but wrong.”

Key sections in the file:

Goals + non-goals

Hard constraints (perf, determinism, UX, platform)

Deliverables (what must exist when finished)

“Done when” (checks + demo flow)

As you can see from the screenshot below, this generated the Plans markdown file.

The prompt I gave Codex to kickstart the work
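One way to keep a kickoff contract honest is a small pre-flight check that the spec contains every section listed above before a long run starts. The sketch below is hypothetical: it assumes each section has its own markdown heading (for example `## Goals`), which may not match how the actual file in the repo is laid out.

```typescript
// Hypothetical pre-flight check for a kickoff spec. The section names mirror the
// list above; the heading format ("## Goals", ...) is an assumption.

const REQUIRED_SECTIONS = [
  "Goals",
  "Non-goals",
  "Hard constraints",
  "Deliverables",
  "Done when",
];

function missingSections(specMarkdown: string): string[] {
  // Collect the text of every markdown heading, lowercased for comparison.
  const headings: string[] = [];
  for (const line of specMarkdown.split(/\r?\n/)) {
    const match = /^#{1,6}\s+(.+?)\s*$/.exec(line);
    if (match !== null && match[1] !== undefined) {
      headings.push(match[1].toLowerCase());
    }
  }

  // A section counts as present if some heading starts with its name.
  return REQUIRED_SECTIONS.filter(
    (section) => !headings.some((h) => h.startsWith(section.toLowerCase()))
  );
}
```

An empty result means the spec names its goals, constraints, deliverables, and "done when" checks. A non-empty one is a cue to tighten the spec before spending hours of agent time on it.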

Plans.md (milestones + validation harness)
