Pace says OpenAI’s GPT-5.4 was stress-tested for several months inside legacy…

title: @jamiecuffe: People don’t realize how good AI "computer use" has actually gotten — until they...

author: jamiecuffe

content_type: twitter_article

published: 2026-03-05T18:43:33+00:00

source_url: https://x.com/jamiecuffe/status/2029628903732482163

word_count: 465

People don’t realize how good AI "computer use" has actually gotten — until they see it tackle the hardest UIs in existence: legacy insurance portals.

At @pacecom, we build AI agents for insurance workflows like submission intake and first notice of loss. Because we operate in this space, we use legacy enterprise software as our ultimate benchmark. If an AI can reliably navigate a 20-year-old, hyper-dense insurance portal without hallucinating a click, it can navigate anything.

For a long time, the technology just wasn't there. But after spending the last few months working closely with OpenAI to stress-test their new GPT-5.4 model inside these environments, it became clear: the paradigm has shifted.

Here is GPT-5.4 executing a complex workflow inside a real insurance environment:

Insurance turns out to be the ultimate stress test for computer-use models. You have to reason through edge cases and maintain near-perfect accuracy across thousands of tasks.

We've identified four major leaps that are finally making computer-use agents viable in production.

1. Click Accuracy

Historically, the biggest failure point was simply clicking the right thing. Enterprise software is incredibly dense. Layouts are cluttered, buttons are tiny, and the systems were designed decades ago. Earlier models would frequently miss targets. GPT-5.4 is vastly better at grounding itself visually, clicking precisely where it needs to — even on a crowded screen.
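One way to put a number on "clicking the right thing" is to check whether each predicted click lands inside the bounding box of the intended element. The sketch below is illustrative only: `Box`, `click_accuracy`, and the sample coordinates are hypothetical names and values, not Pace's or OpenAI's actual evaluation code.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a UI element, in screen pixels (hypothetical)."""
    left: int
    top: int
    width: int
    height: int

    def contains(self, x: int, y: int) -> bool:
        # Right and bottom edges are exclusive, as with pixel ranges.
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def click_accuracy(samples: Iterable[Tuple[Tuple[int, int], Box]]) -> float:
    """Fraction of (click, target box) pairs where the click lands inside the box."""
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(1 for (x, y), box in samples if box.contains(x, y))
    return hits / len(samples)


# A tiny 16x16 button on a dense screen: one click lands, one misses by a pixel.
tiny_button = Box(left=100, top=200, width=16, height=16)
score = click_accuracy([((108, 208), tiny_button), ((116, 208), tiny_button)])
```

With targets this small, a miss of a single pixel counts as a failure, which is why dense legacy layouts are a demanding test of visual grounding.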

2. Long Trajectory Reasoning

Real insurance workflows aren't 5 steps; they are hundreds of steps long. They involve navigating menus, entering structured data, cross-referencing PDFs, and handling exceptions across entirely different systems.

Earlier models lost the plot partway through. GPT-5.4 maintains context across these massive workflows, reasoning through branching steps without drifting off task.

3. Speed and Time to Iteration

When you’re building production agents, speed is a feature. Faster models mean faster evaluation cycles. We can now run thousands of workflow tests rapidly, identify failure points, and shorten the feedback loop. This is how you drive reliability up to enterprise standards.
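As a rough illustration of that feedback loop, the sketch below aggregates the results of many workflow runs into a success rate and a ranked list of the steps where runs most often fail. The `RunResult` record, the `summarize` function, and the step names are assumptions made for this example, not a description of Pace's actual test harness.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class RunResult:
    """Outcome of one agent run (hypothetical record format)."""
    workflow: str
    failed_step: Optional[str]  # None means the run completed successfully


def summarize(results: Iterable[RunResult]) -> Tuple[float, List[Tuple[str, int]]]:
    """Return the overall success rate and failure counts per step, most common first."""
    results = list(results)
    total = len(results)
    failures = Counter(r.failed_step for r in results if r.failed_step is not None)
    passed = total - sum(failures.values())
    success_rate = passed / total if total else 0.0
    return success_rate, failures.most_common()


runs = [
    RunResult("submission_intake", None),
    RunResult("submission_intake", "upload_pdf"),
    RunResult("first_notice_of_loss", "upload_pdf"),
    RunResult("first_notice_of_loss", None),
]
rate, top_failures = summarize(runs)
```

The faster each run completes, the more often a summary like this can be regenerated, so the most common failure step is found and fixed sooner.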

4. Memory

Agents can now store and reuse context across steps instead of recomputing everything from scratch. Desktop UI layouts are particularly worth remembering because the position of UI elements rarely changes. Remembering the spatial layout reduces repeated reasoning and dramatically improves consistency during long, repetitive processes.
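The idea of remembering a stable layout can be sketched as a small cache keyed by screen and element, so the expensive visual search runs only the first time an element is needed. Everything here is hypothetical: `LayoutMemory`, the `find` callback, and the screen and element names are stand-ins for illustration, and nothing below reflects how GPT-5.4 or Pace's agents implement memory internally.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Point = Tuple[int, int]


@dataclass
class LayoutMemory:
    """Caches where UI elements were found, keyed by (screen, element)."""
    positions: Dict[Tuple[str, str], Point] = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def locate(self, screen: str, element: str,
               find: Callable[[str, str], Point]) -> Point:
        """Return a remembered position, calling the costly `find` only on a miss."""
        key = (screen, element)
        if key in self.positions:
            self.hits += 1
            return self.positions[key]
        self.misses += 1
        point = find(screen, element)
        self.positions[key] = point
        return point

    def forget(self, screen: str, element: str) -> None:
        """Drop a remembered position, e.g. after a click there stops working."""
        self.positions.pop((screen, element), None)


def slow_visual_search(screen: str, element: str) -> Point:
    # Stand-in for an expensive model call that locates the element on screen.
    return (412, 96)


memory = LayoutMemory()
first = memory.locate("policy_search", "submit_button", slow_visual_search)
second = memory.locate("policy_search", "submit_button", slow_visual_search)
```

The `forget` method covers the rare case where a layout does change: the stale entry is dropped and the next lookup falls back to a fresh search.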

Bringing This to Insurance

Insurance operations run on some of the most complex, legacy-heavy software environments in the world. Instead of trying to replace those massive systems, our approach at Pace is different: we build AI agents that use the exact same software your human operators use today.

With the launch of GPT-5.4, the ability to navigate these real interfaces reliably is finally here. We’re excited to partner with OpenAI on this launch and keep pushing the boundaries of what AI can do for real operational problems.

Posted: 2026-03-05T18:43:33.000Z

Engagement: 725 likes, 81 retweets, 19 replies
