DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue1,2, Jie Wu1‡, Yu Gao1, Fangyuan Kong1, Lingting Zhu2, Mengzhao Chen2, Zhiheng Liu2, Wei Liu1, Qiushan Guo1, Weilin Huang1†, Ping Luo2†
1ByteDance Seed, 2The University of Hong Kong
‡Project lead, †Corresponding authors
Code has been released at https://github.com/XueZeyue/DanceGRPO

Abstract

Recent advances in generative AI have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. While Reinforcement Learning (RL) has emerged as a promising approach for fine-tuning generative models, existing methods like DDPO and DPOK face fundamental limitations, particularly their inability to maintain stable optimization when scaling to large and diverse prompt sets, severely restricting their practical utility. This paper presents DanceGRPO, a framework that addresses these limitations through an innovative adaptation of Group Relative Policy Optimization (GRPO) for visual generation tasks. Our key insight is that GRPO's inherent stability mechanisms uniquely position it to overcome the optimization challenges that plague prior RL-based approaches to visual generation. DanceGRPO establishes several significant advances: First, it demonstrates consistent and stable policy optimization across multiple modern generative paradigms, including both diffusion models and rectified flows. Second, it maintains robust performance when scaling to complex, real-world scenarios encompassing three key tasks and four foundation models. Third, it shows remarkable versatility in optimizing for diverse human preferences as captured by five distinct reward models assessing image/video aesthetics, text-image alignment, video motion quality, and binary feedback. Our comprehensive experiments reveal that DanceGRPO outperforms baseline methods by up to 181% across multiple established benchmarks, including HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis.

Core Contributions

  • Unified and Pioneering. To the best of our knowledge, we are the first to extend GRPO to diffusion models and rectified flows, accomplishing multiple visual generation tasks within a unified RL framework. We achieve seamless integration between GRPO and visual generation tasks by carefully reformulating the SDEs, selecting appropriate timesteps to optimize, initializing noise and noise scales, and employing efficient sampling strategies (a minimal sketch of the resulting group-relative objective follows this list).
  • Generalization and Scalability. To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundation models, and reward models. Unlike prior RL algorithms, which have primarily been validated on text-to-image diffusion models with small-scale datasets, DanceGRPO demonstrates robust performance on large-scale datasets, showcasing both scalability and practical applicability.
  • High Effectiveness. Our experiments demonstrate that DanceGRPO achieves significant performance gains, outperforming baselines by up to 181% across multiple academic benchmarks, including HPS-v2.1, CLIP Score, VideoAlign, and GenEval, on visual generation tasks. Notably, DanceGRPO also enables models to learn denoising trajectories under Best-of-N inference scaling. We also make initial attempts to enable models to capture the distribution of binary (0/1) reward models, demonstrating an ability to learn from sparse, threshold-based feedback.
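At the core of this adaptation is GRPO's group-relative advantage combined with a clipped policy-gradient surrogate applied to the sampled denoising steps. The sketch below is a minimal illustration under our own assumptions: the function names, tensor shapes, and clip range are illustrative rather than taken from the released implementation, and the per-step log-likelihoods are assumed to come from the reformulated sampling SDE.

    import torch

    def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Normalize reward-model scores within a group of G samples from one prompt."""
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    def grpo_step_loss(logp_new: torch.Tensor,
                       logp_old: torch.Tensor,
                       advantages: torch.Tensor,
                       clip_eps: float = 0.2) -> torch.Tensor:
        """Clipped surrogate over the selected denoising timesteps.
        logp_new / logp_old: (G, T) log-likelihoods of the sampled SDE transitions
        under the current and behavior policies; advantages: (G,)."""
        ratio = (logp_new - logp_old).exp()
        adv = advantages[:, None]                      # broadcast over timesteps
        unclipped = ratio * adv
        clipped = ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps) * adv
        return -torch.min(unclipped, clipped).mean()

    # Example: 8 samples per prompt, 5 optimized timesteps (illustrative numbers).
    rewards = torch.rand(8)
    adv = group_relative_advantages(rewards)
    loss = grpo_step_loss(torch.randn(8, 5), torch.randn(8, 5), adv)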

Results on Stable Diffusion

This table presents the performance of three Stable Diffusion variants: (1) the base model, (2) the model trained with the HPS score, and (3) the model optimized with both HPS and CLIP scores. For evaluation, we report HPS-v2.1 and GenEval scores using their official prompts, while CLIP score and Pick-a-Pic metrics are computed on our test set of 1,000 prompts.
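For reference, the CLIP score is the cosine similarity between the CLIP embeddings of a generated image and its prompt. The snippet below is a hedged sketch of that computation with the Hugging Face transformers CLIP API; the specific checkpoint and the omission of the common x100 scaling are assumptions, not necessarily the exact evaluation setup used here.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Assumed checkpoint; the evaluation may use a different CLIP variant.
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    @torch.no_grad()
    def clip_score(image: Image.Image, prompt: str) -> float:
        """Cosine similarity between CLIP image and text embeddings."""
        inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img * txt).sum(dim=-1).item()  # some setups multiply by 100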


Results on FLUX

In this table, we show the results of FLUX, FLUX trained with the HPS score, and FLUX trained with both the HPS and CLIP scores.


Results on HunyuanVideo

In this table, we show the results of HunyuanVideo, trained with the VideoAlign VQ (visual quality) and MQ (motion quality) rewards, evaluated on VideoAlign and VisionReward. "Baseline" denotes the original results of HunyuanVideo. We use the probability version of VisionReward.


Reward Curves on Text-to-Image Generation

We visualize the reward curves of Stable Diffusion, FLUX.1-dev, and HunyuanVideo-T2I on the HPS score, shown from left to right. After incorporating the CLIP score, the HPS score decreases, but the generated images become more natural.


Reward Curves on Video Generation

We visualize the training curves for motion quality and visual quality on HunyuanVideo, and for motion quality on SkyReel-I2V.


Binary Reward & Best-of-N Inference Scaling

(a) Thresholding Binary Reward employs a binary mechanism in which rewards are discretized via a fixed threshold (values exceeding the threshold receive 1, others 0); it is specifically designed to evaluate generative models' ability to learn abrupt reward distributions under threshold-based optimization. (b) By training the model on subsets of 16 samples selected from progressively larger pools (16, 64, and 256 samples per prompt), we evaluate the impact of sample curation on the convergence dynamics of Stable Diffusion.
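Both mechanisms are simple to state in code. The following sketch, with an illustrative threshold and helper names of our own choosing, shows (a) discretizing a continuous reward-model score with a fixed threshold and (b) curating a training subset of the 16 highest-reward samples from a larger pool per prompt.

    import torch

    def binary_reward(scores: torch.Tensor, threshold: float) -> torch.Tensor:
        """(a) Thresholding binary reward: 1 if the score exceeds the threshold, else 0."""
        return (scores > threshold).float()

    def best_of_n_subset(samples: list, scores: torch.Tensor, k: int = 16):
        """(b) Keep the k highest-scoring samples from a pool of 16/64/256 generations."""
        top = torch.topk(scores, k=k).indices
        return [samples[i] for i in top.tolist()], scores[top]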


Human Evaluation

We show human evaluation results for FLUX (T2I), HunyuanVideo (T2V), and SkyReel (I2V). Human artists consistently prefer the outputs refined with RLHF.


Visualization on Training Process

We visualize outputs of FLUX optimized with the HPS score at iterations 0, 60, 120, 180, 240, and 300. The optimized outputs tend to exhibit brighter tones and richer details.


Visualization of Diversity

Visualization of the model's output diversity before and after RLHF. Different seeds tend to generate similar images after RLHF.


Influence of CLIP score

This figure demonstrates the impact of the CLIP score. The prompt is "A photo of cup". We find that the model trained solely with HPS-v2.1 rewards tends to produce unnatural ("oily") outputs, while incorporating CLIP scores helps maintain more natural image characteristics.
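One simple way to incorporate both signals, sketched below as an assumption rather than the paper's exact recipe, is to normalize each reward within its group and aggregate the resulting advantages with equal weight, so that neither reward scale dominates the update.

    import torch

    def combined_advantage(hps: torch.Tensor, clip: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """hps, clip: (G,) reward scores for G samples of one prompt.
        Per-reward group normalization, then an (assumed) equal-weight sum."""
        a_hps = (hps - hps.mean()) / (hps.std() + eps)
        a_clip = (clip - clip.mean()) / (clip.std() + eps)
        return a_hps + a_clip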


More Visualizations on Text-to-Image Generation


More Visualizations on Text-to-Video Generation (Left: original, Right: RLHF)

Prompt: Tobuscus wearing a green shirt, gliding through the sky on magic shoes.

Prompt: realistic scene 4k HD angels in unique armor with gems and gold white on it with huge wings fighting among themselves.

Prompt: the feeling of a mirage in a house, the sunset shining on the lake surface, and the water flowing slowly.

Prompt: woman running down a dimly lit corridor.

More Visualizations on Image-to-Video Generation (Left: original, Right: RLHF)

Prompt: a young black girl walking down a street with alot of huge trees.

Prompt: pencil shavings dancing on a black table.