SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild

HumanAIGC Group, Tongyi Lab, Alibaba Group

High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos and the model learns to synthesize lip movements from corrupted inputs and target audio. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate, audio-driven lip movements. However, due to input corruption, minor artifacts may arise in the surrounding facial regions and the background. In Stage 2, we develop a mask-free tuning pipeline to address these mask-induced artifacts. Specifically, building on the Stage 1 model, we construct a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video and randomly sampled audio. We then tune the Stage 2 model on this synthetic data, achieving precise lip editing and better background consistency. Extensive experiments show that our method achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation under in-the-wild lip-syncing scenarios.
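
To make the Stage 2 pseudo-paired data construction concrete, the sketch below outlines one plausible reading of the pipeline described above. All names here (build_pseudo_pairs, stage1_model.lip_sync, clip.audio, etc.) are hypothetical placeholders for illustration, not the released API.

import random

def build_pseudo_pairs(source_clips, audio_pool, stage1_model, pairs_per_clip=1):
    """Create pseudo-paired samples for mask-free Stage 2 tuning (a sketch).

    Assumed reading of the pipeline: drive each source clip with randomly
    sampled, unrelated audio using the Stage 1 (mask-based) model; the
    synthetic clip becomes the network input, and the clean, uncorrupted
    source clip becomes the target.
    """
    pairs = []
    for clip in source_clips:
        for _ in range(pairs_per_clip):
            random_audio = random.choice(audio_pool)                # audio unrelated to the clip
            synthetic = stage1_model.lip_sync(clip, random_audio)   # hypothetical Stage 1 inference call
            pairs.append({
                "input_video": synthetic,      # lips no longer match the original audio
                "driving_audio": clip.audio,   # original audio of the source clip (assumed attribute)
                "target_video": clip,          # clean target: original frames, identity, background
            })
    return pairs

Tuning on such pairs exposes the Stage 2 model to uncorrupted surrounding context, which is consistent with the improved background and identity preservation reported above relative to the masked Stage 1 model.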

Results

Our video lip-sync model generates high-fidelity results across diverse styles and languages. Furthermore, it demonstrates superior robustness in challenging scenarios, such as handling occlusions, extreme head poses, and rapid motion.

Realistic Style

Cartoon Style

Video Translation

Occlusion

Comparison

Compared to existing methods, our approach demonstrates superior performance across a variety of challenging scenarios. Specifically, it not only better handles dynamic challenges such as fast motion, extreme head poses, and occlusions, but also shows significant advantages in preserving facial identity and background consistency.

In-the-wild scenarios

BibTeX

@article{zhang2025syncanyone,
      title={SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild},
      author={Xindi Zhang and Dechao Meng and Steven Xiao and Qi Wang and Peng Zhang and Bang Zhang},
      journal={arXiv preprint arXiv:2512.21736},
      year={2025}
}