@boyuan_chen: The most underrated AI team is the harness team. A model can sound great in chat and still fail ins...

The most underrated AI team is the harness team.

A model can sound great in chat and still fail inside the loop that matters: repo, tools, data access, permissions, tests, review, rollback.

The useful eval is the one that turns real work into comparable outcomes.

Start simple: give two agents the same task, compare the artifacts, keep a weekly win rate. Then add traces, scenario libraries, regression gates, and accepted-work metrics.

This is where model progress becomes usable.

The lab with the better harness learns faster because every failure leaves evidence the next run can use.