@boyuan_chen: Reward scores are starting to look too thin for agentic RL. RefGRPO uses the gap between agent self...

Reward scores are starting to look too thin for agentic RL.

RefGRPO uses the gap between agent self-reflection on environment feedback and the actual outcome as a calibration bonus. UARM makes reward models expose uncertainty alongside a score. Rollout freshness asks the systems version of the same question: is this trajectory still on-policy enough to train on?

The shared pattern is simple: agent traces become useful training data only after the feedback has provenance.

What happened?
Who or what judged it?
How uncertain was the judgment?
Which policy produced the rollout?
Did the next run actually improve?

Without that, "learn from failures" becomes a nice slogan stapled to a brittle pipeline.

A model can learn from failure only when the failure signal survives the training system.