Seeing a number of benchmarks showing Opus is the best model for long-running work.
Five tips for running Opus autonomously for hours/days:
- Use auto mode for permissions, so Claude doesn’t ask for approval
- Use dynamic workflows, to have Claude orchestrate hundreds/thousands of agents to get a task done
- Use /goal or /loop, to nudge Claude to keep going until it’s done
- Use Claude Code in the cloud, so you can close your laptop (easiest way is the desktop or mobile app)
- Make sure Claude has a way to self-verify its work end to end: the Claude in Chrome browser extension for web, an iOS/Android simulator MCP for mobile, or a way to start the full web server or service for backend work
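For the backend case in the last tip, here is a rough sketch of what a self-verification script could look like: start the service, poll it until it answers, then always shut it down. Everything in it is an assumption for illustration, not part of Claude Code: the helper names (`free_port`, `wait_until_up`, `verify_service`, `demo`) are made up, and Python's built-in `http.server` stands in for the real service.

```python
import socket
import subprocess
import sys
import time
import urllib.request


def free_port() -> int:
    """Ask the OS for an unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def wait_until_up(url: str, timeout_s: float = 10.0) -> bool:
    """Poll the URL until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=1) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection refused, timeouts, and urllib's URLError.
            pass
        time.sleep(0.1)
    return False


def verify_service(start_cmd: list[str], url: str) -> bool:
    """Start the service, check that it responds, and always stop it."""
    proc = subprocess.Popen(
        start_cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        return wait_until_up(url)
    finally:
        proc.terminate()
        proc.wait(timeout=5)


def demo() -> bool:
    """Hypothetical stand-in: verify Python's built-in static file server."""
    port = free_port()
    cmd = [sys.executable, "-m", "http.server", str(port), "--bind", "127.0.0.1"]
    return verify_service(cmd, f"http://127.0.0.1:{port}/")
```

In a real project you would swap the stand-in command for whatever starts your service and replace the plain HTTP 200 check with requests that exercise actual behavior. The point is a single pass/fail signal the agent can run on its own after every change.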
Rishi Desai (@rishi_desai2)
Can coding agents stay coherent over a 1 billion token budget?
Can they build Slack from scratch?
Rewrite a JAX codebase in PyTorch?
Build a C compiler in Rust?
Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.
— https://nitter.net/rishi_desai2/status/2062930906818769356#m