E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
Abstract
Entropy-aware policy optimization method for reinforcement learning in flow matching models that improves exploration through SDE and ODE sampling strategies.
Recent reinforcement learning methods have enhanced flow matching models for human preference alignment. While stochastic sampling enables exploration over denoising directions, existing methods that optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that high-entropy steps enable more efficient and effective exploration, while low-entropy steps produce nearly indistinguishable roll-outs. To this end, we propose E-GRPO, an entropy-aware Group Relative Policy Optimization method that increases the entropy of SDE sampling steps. Since stochastic differential equation integration suffers from ambiguous reward signals when stochasticity is spread across multiple steps, we merge consecutive low-entropy steps into one high-entropy step for SDE sampling, while applying ODE sampling to the remaining steps. Building on this, we introduce a multi-step group-normalized advantage, which computes group-relative advantages among samples that share the same consolidated SDE denoising step. Experimental results across different reward settings demonstrate the effectiveness of our method.
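As a rough illustration of the multi-step group-normalized advantage described above, the sketch below normalizes each roll-out's reward against the other roll-outs that branched at the same consolidated SDE step. This is not the released implementation; the function name, tensor layout, and the `sde_step_ids` bookkeeping are assumptions made for the example.

```python
import torch

def grouped_advantages(rewards: torch.Tensor,
                       sde_step_ids: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages computed within groups of roll-outs that share
    the same consolidated SDE denoising step (hypothetical sketch).

    rewards:      (N,) scalar reward per roll-out
    sde_step_ids: (N,) index of the consolidated SDE step each roll-out branched at
    """
    advantages = torch.zeros_like(rewards)
    for step in sde_step_ids.unique():
        mask = sde_step_ids == step
        group = rewards[mask]
        # Center and scale each reward against the statistics of its own group,
        # i.e. the roll-outs that branched at the same consolidated SDE step.
        advantages[mask] = (group - group.mean()) / (group.std(unbiased=False) + eps)
    return advantages

# Example: two consolidated SDE steps, three roll-outs branched at each.
rewards = torch.tensor([0.7, 0.9, 0.2, 0.4, 0.5, 0.3])
sde_step_ids = torch.tensor([0, 0, 0, 1, 1, 1])
print(grouped_advantages(rewards, sde_step_ids))
```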
Community
We propose an entropy-aware Group Relative Policy Optimization (E-GRPO) to increase the entropy of SDE sampling steps.
We have integrated a variety of current GRPO-based reinforcement learning methods as well as different image reward models.
Code: https://github.com/shengjun-zhang/VisualGRPO
Model: https://huggingface.co/studyOverflow/E-GRPO
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models (2025)
- TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models (2025)
- Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models (2025)
- ESPO: Entropy Importance Sampling Policy Optimization (2025)
- PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling (2025)
- Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning (2025)
- Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning (2025)