Tool-use RL has a data exhaustion problem: once the policy masters static tasks, rollouts stop carrying much gradient.
RODS uses reward variance from GRPO rollouts to find the agent's current learning boundary, then synthesizes new multi-turn tool tasks there. Same loop, better data spend.
arxiv.org/abs/2606.19047