ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model

Tongyi Lab, Alibaba Group

Illustration of real-time portrait video generation. Given a portrait image and an audio sequence as input, our model generates high-fidelity animation results ranging from talking-head to upper-body interaction, with diverse facial expressions and style control.

Abstract

Real-time interactive video-chat portraits are increasingly recognized as a future trend, driven by the remarkable progress in text and voice chat technologies. However, existing methods focus primarily on real-time generation of head movements and struggle to produce body motions synchronized with these head actions. Moreover, achieving fine-grained control over speaking style and the nuances of facial expressions remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking head to upper-body interaction. Our approach consists of two stages. The first stage employs efficient hierarchical motion diffusion models that take both explicit and implicit motion representations into account based on audio input, generating a diverse range of facial expressions with stylistic control and synchronization between head and body movements. The second stage generates portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further perform face refinement to enhance the overall realism and expressiveness of the portrait video. Additionally, our approach supports efficient and continuous generation of upper-body portrait video at up to 512 × 768 resolution and 30 fps on a 4090 GPU, enabling real-time interactive video chat. Experimental results demonstrate that our approach produces portrait videos with rich expressiveness and natural upper-body movements.

Method


  • An efficient hierarchical motion diffusion model is proposed for audio-to-motion representation, generating face and body control signals hierarchically from input audio while considering both explicit and implicit motion signals for precise facial expressions. Furthermore, fine-grained expression control enables different variations in expression intensity, as well as stylistic expression transfer from reference videos, yielding controllable and personalized expressions.
  • A hybrid control fusion generative model is designed for upper-body image generation. It uses explicit landmarks for direct and editable facial expression generation, while implicit offsets derived from the explicit signals capture facial variations across diverse avatar styles. We also inject explicit hand controls for more accurate and realistic hand textures and movements. Additionally, a face refinement module is employed to enhance facial realism, ensuring highly expressive and lifelike portrait videos.
  • An extensible, real-time generation framework is constructed for interactive video-chat applications. It adapts to various scenarios through flexible sub-module combinations, supporting tasks ranging from head-driven animation to upper-body generation with hand gestures. In addition, we establish an efficient streaming inference pipeline that achieves 30 fps at up to 512 × 768 resolution on a 4090 GPU, ensuring smooth and immersive real-time video chat.
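The hierarchical generation order described above can be sketched as follows. This is a minimal, illustrative toy in NumPy, not the paper's implementation: `denoise` is a stand-in for a diffusion sampler, and the key point is only the conditioning structure, where body motion is sampled conditioned on both the audio features and the already-generated head motion so the two stay synchronized.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(noise, cond, steps=10):
    """Toy iterative denoiser standing in for a diffusion sampler:
    each step nudges the sample toward a conditioning-derived target."""
    x = noise
    target = np.tanh(cond)          # hypothetical conditioning projection
    for _ in range(steps):
        x = x + 0.3 * (target - x)  # move a fraction of the way to the target
    return x

def hierarchical_motion(audio_feats):
    """Illustrative two-level sampling: first head/face motion from audio,
    then body motion conditioned on BOTH audio and the sampled head motion."""
    T, D = audio_feats.shape
    head = denoise(rng.standard_normal((T, D)), audio_feats)
    body_cond = np.concatenate([audio_feats, head], axis=-1)
    body = denoise(rng.standard_normal((T, 2 * D)), body_cond)
    return head, body

audio = rng.standard_normal((16, 8))   # 16 frames of 8-dim audio features
head, body = hierarchical_motion(audio)
print(head.shape, body.shape)          # (16, 8) (16, 16)
```

The design point illustrated is that the body-motion sampler never sees audio alone: feeding it the head trajectory is what keeps head and body movements coherent.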


Results



Audio-driven Upper-body Animation

We can generate highly expressive audio-driven upper-body digital-human videos, supporting scenarios both with and without hand gestures.

Audio-driven Talking Head Animation

We achieve highly accurate lip-sync while generating expressive facial expressions and natural head poses.

Audio-driven Stylized Animation

We can generate audio-driven results for stylized characters, while also supporting the creation of highly expressive singing videos.

Dual-host AI Podcasts Demo

We can also generate dual-host podcasts, enabling AI-driven conversations.

Interactive Demo

Our approach achieves real-time generation at 30 fps on a 4090 GPU, supporting practical interactive video-chat applications.
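For intuition on what the real-time claim requires, here is a minimal sketch of a chunked streaming loop, under assumptions of my own (chunk size, frame budget); it is not the paper's pipeline. At 30 fps, each frame has a budget of roughly 33 ms, so a chunk of frames stays real-time as long as its synthesis time fits within the chunk's cumulative budget.

```python
import time
from collections import deque

FPS = 30
FRAME_BUDGET_S = 1.0 / FPS   # ~33.3 ms available per frame at 30 fps

def stream_frames(audio_chunks, chunk_frames=8):
    """Illustrative streaming loop: each audio chunk is turned into a short
    run of frames, and output remains continuous as long as per-chunk
    synthesis fits within chunk_frames * FRAME_BUDGET_S."""
    out = deque()
    for chunk in audio_chunks:
        start = time.perf_counter()
        # Stand-in for motion generation + rendering of this chunk:
        frames = [f"frame({chunk}:{i})" for i in range(chunk_frames)]
        elapsed = time.perf_counter() - start
        assert elapsed <= chunk_frames * FRAME_BUDGET_S  # toy work easily fits
        out.extend(frames)
    return list(out)

frames = stream_frames(["a0", "a1", "a2"])
print(len(frames))  # 24 frames from 3 chunks of 8
```

Chunked scheduling like this is a common way to trade a small, fixed startup latency (one chunk) for steady throughput in streaming generation.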