EMO2: End-Effector Guided Audio-Driven Avatar Video Generation

Institute for Intelligent Computing, Alibaba Group

Character: AI Marilyn Monroe
Vocal Source: Unholy - Sam Smith, Kim Petras

Character: AI Storm Trooper
Vocal Source: INTRO CINEMATIC - HELLDIVERS™ 2

Character: Dr. Emmett Brown in "Back to the Future"
Vocal Source: Rick and Morty

Abstract

We propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures. Unlike existing methods that focus on generating full-body or half-body poses, we investigate the challenges of audio-driven gesture generation and identify the weak correspondence between audio features and full-body gestures as a key limitation. To address this, we redefine the task as a two-stage process. In the first stage, we generate hand poses directly from audio input, leveraging the stronger correlation between audio signals and hand movements. In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements.
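The data flow of the two-stage process described above can be sketched roughly as follows. This is only an illustrative toy, not the actual implementation: the function names `generate_hand_poses` and `synthesize_frames`, the 21-joint hand layout, and the list-based data types are all assumptions made for this example, and both stages are stand-ins for learned models.

```python
from typing import List, Tuple

Keypoint = Tuple[float, float]
HandPose = List[Keypoint]   # hypothetical: one (x, y) keypoint per hand joint
Frame = List[List[int]]     # hypothetical: a tiny grayscale image


def generate_hand_poses(audio_features: List[List[float]],
                        num_joints: int = 21) -> List[HandPose]:
    """Stage 1 placeholder: produce one hand pose per audio frame.

    A real model would learn this mapping; here every joint is simply
    offset by the mean of that frame's audio features.
    """
    poses = []
    for features in audio_features:
        offset = sum(features) / len(features) if features else 0.0
        poses.append([(float(j), offset) for j in range(num_joints)])
    return poses


def synthesize_frames(reference_image: Frame,
                      hand_poses: List[HandPose]) -> List[Frame]:
    """Stage 2 placeholder: produce one video frame per hand pose.

    A real system would run a diffusion model conditioned on the reference
    image and the hand poses; here the reference image is just copied.
    """
    return [[row[:] for row in reference_image] for _ in hand_poses]


audio_features = [[0.1, 0.3], [0.2, 0.4], [0.0, 0.0]]  # 3 dummy audio frames
reference_image = [[0, 0], [0, 0]]                      # 2x2 dummy image
hand_poses = generate_hand_poses(audio_features)
frames = synthesize_frames(reference_image, hand_poses)
print(len(hand_poses), len(hand_poses[0]), len(frames))
```

The point of the sketch is the interface: the second stage never sees the audio-to-body mapping directly, only the hand poses produced by the first stage plus the reference image.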

Method


The motivation behind our method. Human motion, similar to that of robots, involves planning the "end-effector" (EE), typically the hands, towards the target situation. The rest of the body then cooperates accordingly with the EE, using inverse kinematics principles.
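To make the inverse kinematics idea concrete, the sketch below solves the textbook two-link planar arm: given a target hand position, it recovers shoulder and elbow angles that place the hand there. It is a generic robotics example under simplifying assumptions (a 2D arm, unit-length segments by default, hypothetical function names) and is not taken from our model, which learns body motion rather than solving it analytically.

```python
import math
from typing import Tuple


def solve_two_link_ik(x: float, y: float,
                      upper_len: float = 1.0,
                      lower_len: float = 1.0) -> Tuple[float, float]:
    """Return (shoulder, elbow) angles in radians that reach the hand target (x, y)."""
    dist_sq = x * x + y * y
    # Law of cosines gives the elbow bend needed to span the target distance.
    cos_elbow = (dist_sq - upper_len ** 2 - lower_len ** 2) / (2 * upper_len * lower_len)
    # Clamp so targets outside the reachable range do not raise a math domain error.
    cos_elbow = max(-1.0, min(1.0, cos_elbow))
    elbow = math.acos(cos_elbow)
    # Shoulder angle: direction to the target minus the offset caused by the bent elbow.
    shoulder = math.atan2(y, x) - math.atan2(lower_len * math.sin(elbow),
                                             upper_len + lower_len * math.cos(elbow))
    return shoulder, elbow


def forward_kinematics(shoulder: float, elbow: float,
                       upper_len: float = 1.0,
                       lower_len: float = 1.0) -> Tuple[float, float]:
    """Return the hand position produced by the given joint angles."""
    elbow_x = upper_len * math.cos(shoulder)
    elbow_y = upper_len * math.sin(shoulder)
    hand_x = elbow_x + lower_len * math.cos(shoulder + elbow)
    hand_y = elbow_y + lower_len * math.sin(shoulder + elbow)
    return hand_x, hand_y


# The end-effector target is chosen first; the joint angles follow from it.
target = (1.0, 1.0)
shoulder, elbow = solve_two_link_ik(*target)
hand = forward_kinematics(shoulder, elbow)
print("joint angles:", shoulder, elbow)
print("reached hand position:", hand)
```

Running forward kinematics on the solved angles should reproduce the target, which mirrors the motivation: once the hand trajectory is fixed, the rest of the limb is largely determined.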


Various Generated Videos



Singing

By inputting a single character image and vocal audio, such as singing, our method can generate vocal avatar videos featuring not only expressive facial expressions but also a variety of body poses.

Character: Karina
Vocal Source: Yonezu Kenshi 「LOSER」┃Cover by Raon Lee

Character: AI girl
Vocal Source: Charlie Puth - Attention (Emma Heesters Cover)

Character: Elon Musk
Vocal Source: OneRepublic - Apologize (Live)

Character: Arthur Morgan from Red Dead Redemption
Vocal Source: Black Myth: Wukong - Headless Guy Singing Scene

Character: AI girl
Vocal Source: YUQI - Freak

Speaking

Our method supports voice in multiple languages and brings images to life by intuitively recognizing tonal variations in audio, enabling the creation of dynamic, richly performing avatars.

Character: KA KA
Vocal Source: 서울의 봄

Character: Elon Musk
Vocal Source: Musk's Speech

Character: Elon Musk
Vocal Source: Trevor's Talkshow

Character: Taylor Swift
Vocal Source: Iliza Shlesinger's Talkshow

Hand Dance

Our method can generate complex and smooth hand movements, bringing the avatar to life with a vivid performance.

Character: Karina
Vocal Source: 明明 (Cut)

Character: Jang Won Young
Vocal Source: 想你 (Cut)

Cosplay

One potential application of our method is to enable designated characters to act out relevant scripts in film and game scenarios, with performances that align with their character profiles.

Character: Will Smith
Vocal Source: GTA 5

Character: AI commander
Vocal Source: Red Alert 2

Character: Donald Trump
Vocal Source: The Great Shenyang Street

Character: Donald Trump
Vocal Source: The Boys

Character: Jensen Huang
Vocal Source: House Votes to Ban TikTok & RFK’s Unexpected VP Contender

Character: AI girl
Vocal Source: Genshin

Character: Elon Musk
Vocal Source: The Wolf of Wall Street

Character: AI girl
Vocal Source: Honkai

Character: Albert Einstein
Vocal Source: Rick and Morty

Character: AI girl
Vocal Source: Genshin

Character: AI girl
Vocal Source: "Yes, one; and in this manner." by: Octavia Selena Alexandru

Comparison with Other Methods

Comparison with Vlogger

Comparison with CyberHost

Easter Egg

Check out our lighthearted video, created using our method. This video serves as a demonstration of potential application scenarios for our research. Hope you like it, and it will truly raise me up.