MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation

Tongyi Lab, Alibaba Group

Abstract

Generating long-term, coherent, and realistic music-conditioned dance sequences remains a challenging task in human motion synthesis. Existing approaches exhibit critical limitations: motion graph methods rely on fixed template libraries, restricting creative generation; diffusion models, while capable of producing novel motions, often lack temporal coherence and musical alignment. To address these challenges, we propose MotionRAG-Diff, a hybrid framework that integrates Retrieval-Augmented Generation (RAG) with diffusion-based refinement to enable high-quality, musically coherent dance generation for arbitrary long-term music inputs. Our method introduces three core innovations: (1) A cross-modal contrastive learning architecture that aligns heterogeneous music and dance representations in a shared latent space, establishing unsupervised semantic correspondence without paired data; (2) An optimized motion graph system for efficient retrieval and seamless concatenation of motion segments, ensuring realism and temporal coherence across long sequences; (3) A multi-condition diffusion model that jointly conditions on raw music signals and contrastive features to enhance motion quality and global synchronization. Extensive experiments demonstrate that MotionRAG-Diff achieves state-of-the-art performance in motion quality, diversity, and music-motion synchronization accuracy. This work establishes a new paradigm for music-driven dance generation by synergizing retrieval-based template fidelity with diffusion-based creative enhancement.
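To make innovation (1) concrete, the sketch below shows one common way such a cross-modal alignment can be realized: a symmetric InfoNCE objective that pulls temporally co-occurring music and dance clips together in a shared latent space. The encoder architectures, feature dimensions, temperature, and positive-mining strategy are illustrative assumptions for this sketch, not the paper's exact design.

```python
# Minimal sketch of cross-modal contrastive alignment (illustrative, not the paper's exact model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAligner(nn.Module):
    def __init__(self, music_dim=438, motion_dim=147, embed_dim=256, temperature=0.07):
        super().__init__()
        # Hypothetical per-modality encoders; the actual model may use transformers.
        self.music_encoder = nn.Sequential(nn.Linear(music_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))
        self.motion_encoder = nn.Sequential(nn.Linear(motion_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))
        self.temperature = temperature

    def forward(self, music_feats, motion_feats):
        # music_feats: (B, T, music_dim); motion_feats: (B, T, motion_dim).
        # Pool each clip over time and project into the shared latent space.
        z_music = F.normalize(self.music_encoder(music_feats.mean(dim=1)), dim=-1)
        z_motion = F.normalize(self.motion_encoder(motion_feats.mean(dim=1)), dim=-1)
        # Symmetric InfoNCE: co-occurring music/dance clips are positives, other batch items are negatives.
        logits = z_music @ z_motion.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example: a batch of 8 music clips (120 frames each) and 8 dance clips.
aligner = CrossModalAligner()
loss = aligner(torch.randn(8, 120, 438), torch.randn(8, 120, 147))
```

The same shared embedding space then serves two roles downstream: retrieving dance segments for a music query, and providing contrastive conditioning features to the diffusion model.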

Method

The overall framework of our work contains three core components: the contrastive learning model, the motion graph, and the diffusion model. This integrated architecture enables the processing of arbitrarily long music inputs for coherent and high-quality dance motion generation.
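The sketch below outlines the data flow this retrieve-then-refine design implies: the motion graph supplies a stitched template (Motion_mg), and the diffusion model refines it while conditioning on the music and the contrastive features (Motion_diff). The function names, the segment-selection rule, and the noise-then-denoise refinement scheme are illustrative assumptions, not the released implementation.

```python
# Illustrative retrieve-then-refine pipeline (placeholder names, not the released API).
import numpy as np

def retrieve_motion_graph(music_windows, graph_segments, music_embed, motion_embed):
    """For each music window, pick the dance segment whose embedding is closest in the
    shared contrastive space, then concatenate the picks. (The real motion graph
    additionally enforces smooth transition edges between consecutive segments.)"""
    stitched = []
    for win in music_windows:
        q = music_embed(win)                                   # query in the shared latent space
        scores = [q @ motion_embed(seg) for seg in graph_segments]
        stitched.append(graph_segments[int(np.argmax(scores))])
    return np.concatenate(stitched, axis=0)                    # Motion_mg: the retrieved template

def refine_with_diffusion(motion_mg, music, contrastive_feat, denoiser, steps=50):
    """Perturb the retrieved motion with noise and iteratively denoise it, conditioning
    on raw music features and contrastive embeddings (one plausible refinement scheme)."""
    x = motion_mg + np.random.randn(*motion_mg.shape)
    for t in reversed(range(steps)):
        x = denoiser(x, t, music, contrastive_feat)            # one reverse-diffusion step
    return x                                                   # Motion_diff: the refined dance

# Toy usage with random stand-ins for the learned components.
rng = np.random.default_rng(0)
segments = [rng.standard_normal((30, 147)) for _ in range(20)]  # 20 template motion segments
music = [rng.standard_normal(32) for _ in range(4)]             # 4 music windows
embed = lambda x: x.reshape(-1)[:32] / 32.0                     # stand-in shared-space embedding
motion_mg = retrieve_motion_graph(music, segments, embed, embed)
motion_diff = refine_with_diffusion(motion_mg, music, None, lambda x, t, m, c: 0.98 * x)
```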

Results

As shown in the following videos, MotionRAG-Diff generates long dances from a given piece of music (please unmute for music). We present generated motion sequences as Motion_mg and Motion_diff, which correspond to outputs from the motion graph and diffusion model components of our framework, respectively. For detailed explanations of these terms, refer to the overall framework described in Section 3.

Motion_mg



Motion_diff

Comparison with State-of-the-Art Methods

We mainly compare our work with EDGE, Bailando, Bailando++, Lodge, and Lodge++ (please unmute for music).



For additional quantitative results, please refer to Section 4.4.

BibTeX

@article{huang2025motionrag-diff,
  title  = {MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation},
  author = {Mingyang Huang and Peng Zhang and Bang Zhang},
  url    = {https://humanaigc.github.io/MotionRAG-Diff/},
  year   = {2025}
}