Twitter/X

@boyuan_chen: Tool-use RL has a data exhaustion problem: once the policy masters static tasks, rollouts stop carry...

Tool-use RL has a data exhaustion problem: once the policy masters static tasks, rollouts stop carrying much gradient.

RODS uses reward variance from GRPO rollouts to find the agent's current learning boundary, then synthesizes new multi-turn tool tasks there. Same loop, better data spend.

arxiv.org/abs/2606.19047