DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue1,2, Jie Wu1‡, Yu Gao1, Fangyuan Kong1, Lingting Zhu2, Mengzhao Chen2, Zhiheng Liu2, Wei Liu1, Qiushan Guo1, Weilin Huang1†, Ping Luo2†
1ByteDance Seed, 2The University of Hong Kong
‡Project lead, †Corresponding authors
Code has been released at https://github.com/XueZeyue/DanceGRPO

Abstract

Recent advances in generative AI have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. While Reinforcement Learning (RL) has emerged as a promising approach for fine-tuning generative models, existing methods like DDPO and DPOK face fundamental limitations, particularly their inability to maintain stable optimization when scaling to large and diverse prompt sets, severely restricting their practical utility. This paper presents DanceGRPO, a framework that addresses these limitations through an innovative adaptation of Group Relative Policy Optimization (GRPO) for visual generation tasks. Our key insight is that GRPO's inherent stability mechanisms uniquely position it to overcome the optimization challenges that plague prior RL-based approaches to visual generation. DanceGRPO establishes several significant advances: First, it demonstrates consistent and stable policy optimization across multiple modern generative paradigms, including both diffusion models and rectified flows. Second, it maintains robust performance when scaling to complex, real-world scenarios encompassing three key tasks and four foundation models. Third, it shows remarkable versatility in optimizing for diverse human preferences as captured by five distinct reward models assessing image/video aesthetics, text-image alignment, video motion quality, and binary feedback. Our comprehensive experiments reveal that DanceGRPO outperforms baseline methods by up to 181% across multiple established benchmarks, including HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis.

Core Contributions

  • Unified and Pioneering. To the best of our knowledge, we are the first to extend GRPO to diffusion models and rectified flows, accomplishing multiple visual generation tasks within a unified RL framework. We achieve seamless integration between GRPO and visual generation tasks by carefully reformulating the SDEs, selecting appropriate timesteps to optimize, initializing noise and noise scales, and employing efficient sampling strategies (a minimal sketch of the resulting group-relative objective follows this list).
  • Generalization and Scalability. To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundation models, and reward models. Unlike prior RL algorithms, which have primarily been validated on text-to-image diffusion models with small-scale datasets, DanceGRPO demonstrates robust performance on large-scale datasets, showcasing both scalability and practical applicability.
  • High Effectiveness. Our experiments demonstrate that DanceGRPO achieves significant performance gains, outperforming baselines by up to 181% across multiple academic benchmarks, including HPS-v2.1, CLIP Score, VideoAlign, and GenEval, on visual generation tasks. Notably, DanceGRPO also enables models to learn denoising trajectories under Best-of-N inference scaling. We also make initial attempts to enable models to capture the distribution of binary (0/1) reward models, demonstrating an ability to learn from sparse, threshold-based feedback.
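At the core of this adaptation is GRPO's group-relative advantage combined with a clipped policy-gradient surrogate applied to the sampled denoising steps. The sketch below is a minimal illustration under our own assumptions: the function names, tensor shapes, and clip range are illustrative rather than taken from the released implementation, and the per-step log-likelihoods are assumed to come from the reformulated sampling SDE.

    import torch

    def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Normalize reward-model scores within a group of G samples from one prompt."""
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    def grpo_step_loss(logp_new: torch.Tensor,
                       logp_old: torch.Tensor,
                       advantages: torch.Tensor,
                       clip_eps: float = 0.2) -> torch.Tensor:
        """Clipped surrogate over the selected denoising timesteps.
        logp_new / logp_old: (G, T) log-likelihoods of the sampled SDE transitions
        under the current and behavior policies; advantages: (G,)."""
        ratio = (logp_new - logp_old).exp()
        adv = advantages[:, None]                      # broadcast over timesteps
        unclipped = ratio * adv
        clipped = ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps) * adv
        return -torch.min(unclipped, clipped).mean()

    # Example: 8 samples per prompt, 5 optimized timesteps (illustrative numbers).
    rewards = torch.rand(8)
    adv = group_relative_advantages(rewards)
    loss = grpo_step_loss(torch.randn(8, 5), torch.randn(8, 5), adv)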

Results on Stable Diffusion

This table presents the performance of three Stable Diffusion variants: (1) the base model, (2) the model trained with the HPS score, and (3) the model optimized with both HPS and CLIP scores. For evaluation, we report HPS-v2.1 and GenEval scores using their official prompts, while CLIP score and Pick-a-Pic metrics are computed on our test set of 1,000 prompts.
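For reference, the CLIP score is the cosine similarity between the CLIP embeddings of a generated image and its prompt. The snippet below is a hedged sketch of that computation with the Hugging Face transformers CLIP API; the specific checkpoint and the omission of the common x100 scaling are assumptions, not necessarily the exact evaluation setup used here.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Assumed checkpoint; the evaluation may use a different CLIP variant.
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    @torch.no_grad()
    def clip_score(image: Image.Image, prompt: str) -> float:
        """Cosine similarity between CLIP image and text embeddings."""
        inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img * txt).sum(dim=-1).item()  # some setups multiply by 100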


Results on FLUX

In this table, we show the results of FLUX, FLUX trained with the HPS score, and FLUX trained with both the HPS and CLIP scores.


Results on HunyuanVideo

In this table, we show the results of HunyuanVideo, trained with the VideoAlign VQ (visual quality) and MQ (motion quality) rewards, evaluated on VideoAlign and VisionReward. "Baseline" denotes the original results of HunyuanVideo. We use the probability version of VisionReward.


Reward Curves on Text-to-Image Generation

We visualize the reward curves of Stable Diffusion, FLUX.1-dev, and HunyuanVideo-T2I on the HPS score, shown from left to right. After incorporating the CLIP score, the HPS score decreases, but the generated images become more natural.


Reward Curves on Video Generation

We visualize the training curves for motion quality and visual quality on HunyuanVideo, and for motion quality on SkyReel-I2V.


Binary Reward & Best-of-N Inference Scaling

(a) Thresholding Binary Reward employs a binary mechanism in which rewards are discretized via a fixed threshold (values exceeding the threshold receive 1, others 0); it is specifically designed to evaluate generative models' ability to learn abrupt reward distributions under threshold-based optimization. (b) By training the model on subsets of 16 samples selected from progressively larger pools (16, 64, and 256 samples per prompt), we evaluate the impact of sample curation on the convergence dynamics of Stable Diffusion.
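Both mechanisms are simple to state in code. The following sketch, with an illustrative threshold and helper names of our own choosing, shows (a) discretizing a continuous reward-model score with a fixed threshold and (b) curating a training subset of the 16 highest-reward samples from a larger pool per prompt.

    import torch

    def binary_reward(scores: torch.Tensor, threshold: float) -> torch.Tensor:
        """(a) Thresholding binary reward: 1 if the score exceeds the threshold, else 0."""
        return (scores > threshold).float()

    def best_of_n_subset(samples: list, scores: torch.Tensor, k: int = 16):
        """(b) Keep the k highest-scoring samples from a pool of 16/64/256 generations."""
        top = torch.topk(scores, k=k).indices
        return [samples[i] for i in top.tolist()], scores[top]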


Human Evaluation

We show human evaluation results for FLUX (T2I), HunyuanVideo (T2V), and SkyReel (I2V). Human artists consistently prefer the outputs refined with RLHF.


Visualization on Training Process

We visualize outputs of FLUX optimized with the HPS score at iterations 0, 60, 120, 180, 240, and 300. The optimized outputs tend to exhibit brighter tones and richer details.


Visualization of Diversity

Visualization of the model's output diversity before and after RLHF. Different seeds tend to generate similar images after RLHF.


Influence of CLIP score

This figure demonstrates the impact of the CLIP score. The prompt is "A photo of cup". We find that the model trained solely with HPS-v2.1 rewards tends to produce unnatural ("oily") outputs, while incorporating CLIP scores helps maintain more natural image characteristics.
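One simple way to incorporate both signals, sketched below as an assumption rather than the paper's exact recipe, is to normalize each reward within its group and aggregate the resulting advantages with equal weight, so that neither reward scale dominates the update.

    import torch

    def combined_advantage(hps: torch.Tensor, clip: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """hps, clip: (G,) reward scores for G samples of one prompt.
        Per-reward group normalization, then an (assumed) equal-weight sum."""
        a_hps = (hps - hps.mean()) / (hps.std() + eps)
        a_clip = (clip - clip.mean()) / (clip.std() + eps)
        return a_hps + a_clip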


More Visualizations on Text-to-Image Generation


More Visualizations on Text-to-Video Generation (Left: original, Right: RLHF)

Prompt: Tobuscus wearing a green shirt, gliding through the sky on magic shoes.

Prompt: realistic scene 4k HD angels in unique armor with gems and gold white on it with huge wings fighting among themselves.

Prompt: the feeling of a mirage in a house, the sunset shining on the lake surface, and the water flowing slowly.

Prompt: woman running down a dimly lit corridor.

More Visualizations on Image-to-Video Generation (Left: original, Right: RLHF)

Prompt: a young black girl walking down a street with alot of huge trees.

Prompt: pencil shavings dancing on a black table.