The most underrated AI team is the harness team.
A model can sound great in chat and still fail inside the loop that matters: repo, tools, data access, permissions, tests, review, rollback.
The useful eval is the one that turns real work into comparable outcomes.
Start simple: give two agents the same task, compare the artifacts, keep a weekly win rate. Then add traces, scenario libraries, regression gates, and accepted-work metrics.
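A rough sketch of that starting point, in Python. Everything named here is hypothetical: the `Comparison` record, the "a"/"b"/"tie" labels, the choice to count a tie as half a win, and the 0.5 gate threshold are assumptions for illustration, not a standard harness API.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Comparison:
    """One head-to-head result: both agents attempted the same task."""
    task_id: str
    day: date
    winner: str  # "a", "b", or "tie", as judged by a reviewer or a test suite


def weekly_win_rate(comparisons, agent="a"):
    """Return {(iso_year, iso_week): win rate for `agent`}.

    Assumption: a tie counts as half a win for each side.
    """
    wins = defaultdict(float)
    totals = defaultdict(int)
    for c in comparisons:
        iso = c.day.isocalendar()
        week = (iso[0], iso[1])  # ISO year and ISO week number
        totals[week] += 1
        if c.winner == agent:
            wins[week] += 1.0
        elif c.winner == "tie":
            wins[week] += 0.5
    return {week: wins[week] / totals[week] for week in totals}


def passes_regression_gate(rates, min_rate=0.5):
    """Check only the most recent week; fail if its win rate is below min_rate."""
    if not rates:
        return False  # no evidence yet, so do not pass the gate
    latest = max(rates)  # tuples sort by (year, week)
    return rates[latest] >= min_rate
```

Feed it one `Comparison` per task per run and the weekly numbers fall out. Traces and scenario libraries can hang off `task_id` later without changing the metric.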
The harness is where model progress becomes usable.
The lab with the better harness learns faster because every failure leaves evidence the next run can use.