On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral • Paper 2512.04220 • Published about 1 month ago
Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning • Paper 2510.03669 • Published Oct 4, 2025