TWITTER_ARTICLE

Pace says OpenAI’s GPT-5.4 was stress-tested for several months inside legacy…

Brief

Jamie Cuffe argues that AI “computer use” has crossed a practical threshold. In a post published on 2026-03-05, he cites Pace’s work with OpenAI testing GPT-5.4 in real insurance environments. The claim is that legacy insurance portals (dense, decades-old interfaces with tiny buttons, branching workflows, and cross-system exception handling) are the most demanding benchmark available: success requires precision over hundreds of steps, so an agent that can navigate them reliably can navigate anything. Pace highlights four technical gains: more reliable visual grounding for accurate clicks, stronger long-trajectory reasoning to stay on task through extended workflows, faster inference that allows thousands of evaluation runs and shorter iteration cycles, and memory that preserves spatial knowledge of desktop UIs to improve consistency. In practice, Pace is not trying to replace insurers’ existing systems; it is building agents that use the same software as human operators for tasks like submission intake and first notice of loss.

Why it matters

Pace says OpenAI’s GPT-5.4 was stress-tested for several months inside legacy insurance software used for workflows such as submission intake and first notice of loss, with 20-year-old insurance portals serving as the benchmark for AI “computer use.”

Key details

  • The post identifies four advances that made production use plausible: better click accuracy on crowded enterprise screens, longer-horizon reasoning across workflows that can span hundreds of steps, faster model speed that enables thousands of workflow tests, and memory that reuses spatial UI context across steps.
  • Insurance is framed as an unusually hard environment because agents must maintain near-perfect accuracy across thousands of tasks, navigate dense menus, enter structured data, cross-reference PDFs, and handle exceptions across multiple systems.
  • Rather than replacing legacy core systems, Pace’s approach is to deploy AI agents that operate the same desktop interfaces human insurance staff already use, and the company says GPT-5.4 is the first model reliable enough to make that practical.

Cleaned source text

title: @jamiecuffe: People don’t realize how good AI "computer use" has actually gotten — until they...

author: jamiecuffe

content_type: twitter_article

published: 2026-03-05T18:43:33+00:00

source_url: https://x.com/jamiecuffe/status/2029628903732482163

word_count: 465

People don’t realize how good AI "computer use" has actually gotten — until they see it tackle the hardest UIs in existence: legacy insurance portals.

At @pacecom, we build AI agents for insurance workflows like submission intake and first notice of loss. Because we operate in this space, we use legacy enterprise software as our ultimate benchmark. If an AI can reliably navigate a 20-year-old, hyper-dense insurance portal without hallucinating a click, it can navigate anything.

For a long time, the technology just wasn't there. But after spending the last few months working closely with OpenAI to stress-test their new GPT-5.4 model inside these environments, it became clear: the paradigm has shifted.

Here is GPT-5.4 executing a complex workflow inside a real insurance environment: [embedded media not included]

Insurance turns out to be the ultimate stress test for computer-use models. You have to reason through edge cases and maintain near-perfect accuracy across thousands of tasks.

We've identified four major leaps that are finally making computer-use agents viable in production.

1. Click Accuracy

Historically, the biggest failure point was simply clicking the right thing. Enterprise software is incredibly dense. Layouts are cluttered, buttons are tiny, and the systems were designed decades ago. Earlier models would frequently miss targets. GPT-5.4 is vastly better at grounding itself visually, clicking precisely where it needs to — even on a crowded screen.

2. Long Trajectory Reasoning

Real insurance workflows aren't 5 steps; they are hundreds of steps long. They involve navigating menus, entering structured data, cross-referencing PDFs, and handling exceptions across entirely different systems.

Earlier models lost the plot partway through. GPT-5.4 maintains context across these massive workflows, reasoning through branching steps without drifting off task.
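To make the cost of long trajectories concrete, here is a rough back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which is a simplification for illustration and not something the post claims; the function names and the example numbers are hypothetical.

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step of a workflow succeeds.

    Assumes steps succeed independently with the same probability,
    which is an illustrative simplification.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


def required_step_accuracy(target_success: float, steps: int) -> float:
    """Per-step accuracy needed to hit a target end-to-end success rate."""
    if not 0.0 < target_success <= 1.0:
        raise ValueError("target_success must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return target_success ** (1.0 / steps)


if __name__ == "__main__":
    for accuracy in (0.99, 0.999):
        rate = workflow_success_rate(accuracy, 300)
        print(f"per-step {accuracy:.3f} over 300 steps -> {rate:.1%} end to end")
```

Under these assumptions, a 300-step workflow at 99% per-step accuracy completes cleanly only about 5% of the time, while 99.9% per step gets to roughly 74%. That gap is why small gains in click accuracy and context retention matter so much at this length.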

3. Speed and Time to Iteration

When you’re building production agents, speed is a feature. Faster models mean faster evaluation cycles. We can now run thousands of workflow tests rapidly, identify failure points, and shorten the feedback loop. This is how you drive reliability up to enterprise standards.
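As a minimal sketch of what "identify failure points" could look like once thousands of runs are logged, the snippet below tallies which step each failed run stopped on. The `RunResult` record, the step names, and the `summarize` helper are all hypothetical; the post does not describe Pace's actual evaluation tooling.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class RunResult:
    """One evaluation run (hypothetical record shape)."""
    workflow: str
    failed_step: Optional[str]  # None means the run completed successfully


def summarize(results: Iterable[RunResult]) -> Tuple[float, List[Tuple[str, int]]]:
    """Return the overall pass rate and failure counts, most frequent first."""
    runs = list(results)
    failures = Counter(r.failed_step for r in runs if r.failed_step is not None)
    passed = len(runs) - sum(failures.values())
    pass_rate = passed / len(runs) if runs else 0.0
    return pass_rate, failures.most_common()


if __name__ == "__main__":
    sample = [
        RunResult("first_notice_of_loss", None),
        RunResult("first_notice_of_loss", "upload_pdf"),
        RunResult("submission_intake", "upload_pdf"),
        RunResult("submission_intake", "select_policy"),
    ]
    rate, hot_spots = summarize(sample)
    print(f"pass rate: {rate:.0%}")
    for step, count in hot_spots:
        print(f"{step}: {count} failures")
```

Ranking failures by step turns a large batch of runs into a short list of places to fix first, which is the feedback loop the post describes.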

4. Memory

Agents can now store and reuse context across steps instead of recomputing everything from scratch. Desktop UIs are particularly useful to remember because the position of UI elements rarely changes. Remembering the spatial layout reduces repeated reasoning and dramatically improves consistency during long, repetitive processes.
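One way to picture this kind of memory is a small cache of element positions keyed by screen. This is only an illustrative sketch, not how GPT-5.4 or Pace's agents store context: the `LayoutCache` class and the `locate` callback (standing in for an expensive visual-grounding step) are assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

Point = Tuple[int, int]


class LayoutCache:
    """Remembers where UI elements were found on each screen.

    `locate` is a hypothetical stand-in for a slow visual-grounding call.
    """

    def __init__(self, locate: Callable[[str, str], Point]) -> None:
        self._locate = locate
        self._positions: Dict[Tuple[str, str], Point] = {}
        self.misses = 0  # how many times the slow path was needed

    def position(self, screen: str, element: str) -> Point:
        """Return the cached position, locating the element only on a miss."""
        key = (screen, element)
        if key not in self._positions:
            self.misses += 1
            self._positions[key] = self._locate(screen, element)
        return self._positions[key]

    def invalidate(self, screen: str, element: str) -> None:
        """Forget a position, e.g. after a click did not land as expected."""
        self._positions.pop((screen, element), None)


if __name__ == "__main__":
    def slow_locate(screen: str, element: str) -> Point:
        # Placeholder coordinates; a real agent would inspect a screenshot.
        return (120, 48)

    cache = LayoutCache(slow_locate)
    for _ in range(3):
        cache.position("claims_entry", "Submit")
    print(f"lookups: 3, slow locates: {cache.misses}")
```

Because legacy desktop layouts rarely move, most lookups hit the cache. The `invalidate` method covers the rare case where a remembered position turns out to be stale.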

Bringing This to Insurance

Insurance operations run on some of the most complex, legacy-heavy software environments in the world. Instead of trying to replace those massive systems, our approach at Pace is different: we build AI agents that use the exact same software your human operators use today.

With the launch of GPT-5.4, the ability to navigate these real interfaces reliably is finally here. We’re excited to partner with OpenAI on this launch and keep pushing the boundaries of what AI can do for real operational problems.

Posted: 2026-03-05T18:43:33.000Z

Engagement: 725 likes, 81 retweets, 19 replies