Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation

HumanAIGC Group, Tongyi Lab, Alibaba Group
Links: arXiv | Code

We propose Knot Forcing, a novel autoregressive transformer for real-time and infinite portrait animation. It consists of three key designs: (1) a chunk-wise generation strategy with global identity preservation via cached KV states of the reference image and local temporal modeling using sliding window attention; (2) a Temporal Knot Module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; and (3) a "running forward" mechanism that dynamically shifts the reference frame's temporal position during inference (updating its RoPE and KV cache) to maintain a forward-looking context that guides long-term coherence. Together, these components enable high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance (16 FPS) with strong visual stability on a single H100 GPU.
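To make the interplay of the three components concrete, the following is a minimal, hypothetical sketch of the chunk-wise generation loop implied by the description above. All names (encode_reference, denoise_chunk), hyperparameters (CHUNK_LEN, KNOT_OVERLAP, WINDOW), and the stubbed-out diffusion step are our own illustrative assumptions, not the authors' actual API; the sketch only shows the control flow of cached reference KV, sliding-window context, knot overlap, and the running-forward RoPE shift.

```python
# Hypothetical sketch of the chunk-wise autoregressive loop (not the authors' code).
import numpy as np

CHUNK_LEN = 16     # frames generated per chunk (assumed)
KNOT_OVERLAP = 4   # frames shared between adjacent chunks (assumed)
WINDOW = 32        # sliding attention window over recent frames (assumed)

def encode_reference(ref_image):
    """Stand-in for caching the reference image's KV states once, up front."""
    return {"kv": ref_image.mean(), "rope_pos": 0}

def denoise_chunk(ref_cache, context, knot, audio_chunk, start_pos):
    """Stub for the video diffusion step.

    A real model would attend to the cached reference KV (global identity),
    the sliding window of recent frames (local temporal context), the knot
    frames (image-to-video conditioning), and the driving audio. Here we
    simply emit placeholder frames so the loop runs end to end."""
    return [np.random.rand(64, 64, 3) for _ in range(CHUNK_LEN)]

def generate(ref_image, audio_stream):
    ref_cache = encode_reference(ref_image)
    history = []   # all frames emitted so far
    pos = 0        # global temporal (RoPE-style) position of the next chunk
    for audio_chunk in audio_stream:
        # "Running forward": shift the reference frame's temporal position so
        # it always sits ahead of the chunk currently being generated.
        ref_cache["rope_pos"] = pos + CHUNK_LEN

        context = history[-WINDOW:]       # local sliding-window context
        knot = context[-KNOT_OVERLAP:]    # overlap frames reused by the new chunk

        chunk = denoise_chunk(ref_cache, context, knot, audio_chunk, pos)
        # The first len(knot) frames of the chunk re-render the knot frames;
        # keep only the genuinely new ones so nothing is emitted twice.
        new_frames = chunk[len(knot):]
        history.extend(new_frames)
        pos += len(new_frames)
    return history

if __name__ == "__main__":
    ref = np.random.rand(64, 64, 3)            # dummy reference portrait
    audio = [np.zeros(640) for _ in range(5)]  # dummy audio chunks
    frames = generate(ref, audio)
    print(f"generated {len(frames)} frames")
```

In this sketch the reference KV cache is built once and reused for every chunk, while only its temporal position is advanced each iteration; the knot frames are regenerated inside each chunk and then dropped from the output, which is one plausible way to realize the overlap described above.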

Results

Our audio-driven model produces high-quality results across diverse reference styles and multiple languages.

Realistic Style

Cartoon Style

Ablation Study

Comparison

BibTeX

@article{xiao2025knotforcing,
      title={Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation},
      author={Steven Xiao and Xindi Zhang and Dechao Meng and Qi Wang and Peng Zhang and Bang Zhang},
      journal={arXiv preprint arXiv:2512.21734},
      year={2025}
}