An unbelievably simple way of evaling a new model/harness for your org:
- Instead of running one implementer agent AFK, run two on the same task: one on your current setup, one on the new model/harness
- Have an agent or a human reviewer pick the better output
- Tally the results at the end of the week
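
The end-of-week tally can be as small as a counter over the picks. A minimal sketch, assuming each task's winner was logged as a plain label; the labels `"new"` and `"current"`, the `tally_wins` helper, and the sample week are all hypothetical, not something the workflow prescribes:

```python
from collections import Counter

def tally_wins(picks):
    """Return {label: (wins, win_rate)} for a list of per-task winner labels."""
    counts = Counter(picks)
    total = sum(counts.values())
    return {label: (wins, wins / total) for label, wins in counts.items()}

# Hypothetical log: one entry per task, naming whichever output was picked.
week = ["new", "current", "new", "new", "current"]

for label, (wins, rate) in sorted(tally_wins(week).items()):
    print(f"{label}: {wins} wins ({rate:.0%})")
```

With only a handful of tasks the win rate is noisy, so it is probably worth reading it alongside the raw counts rather than on its own.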