Wan-S2V is an AI video generation model that transforms static images and audio into high-quality videos. Our model excels in film and television scenarios, delivering realistic visual effects, including natural facial expressions, body movements, and professional camera work. It supports both full-body and half-body character generation, and can handle a wide range of professional content creation needs, such as dialogue, singing, and performance, at high quality.
Wan-S2V takes a single image and audio input to generate high-quality, synchronized video content
Prompt: "In the video, a man is walking beside the railway tracks, singing and expressing his emotions while walking. A train slowly passes by beside him."
Prompt: "In the video, a woman is talking to the man in front of her. She looks sad, thoughtful and about to cry."
Prompt: "In the video, a woman is singing. Her expression is very lyrical and intoxicated with music."
Prompt: "The video shows a woman with long hair playing the piano at the seaside. The woman has a long head of silver white hair, and a flame crown is burning on her head. The girls are singing with deep feelings, and their facial expressions are rich. The woman sat sideways in front of the piano, playing attentively."
Prompt: "In the video, Einstein is educating students outside the camera."
Prompt: "In the video, a woman is singing. Her expression is very lyrical and intoxicated with music."
Overview of our pipeline.
The audio injection pipeline.
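For intuition, the snippet below sketches one plausible form of audio injection: video latent tokens cross-attend to features from a pretrained Wav2Vec2 encoder. The `AudioInjectionBlock` module, its dimensions, and the choice of audio encoder are illustrative assumptions, not the exact Wan-S2V implementation.

```python
# Hypothetical sketch of audio-feature injection via cross-attention.
# AudioInjectionBlock and all dimensions are illustrative, not the Wan-S2V API.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model


class AudioInjectionBlock(nn.Module):
    """Cross-attention from video latent tokens (queries) to audio tokens (keys/values)."""

    def __init__(self, latent_dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, latent_dim)  # map audio features to latent width
        self.norm = nn.LayerNorm(latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        audio_ctx = self.audio_proj(audio_tokens)
        attended, _ = self.cross_attn(self.norm(video_tokens), audio_ctx, audio_ctx)
        return video_tokens + attended  # residual injection of audio context


# Extract audio features with a pretrained Wav2Vec2 model (assumed encoder choice).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
waveform = torch.randn(16000 * 4)  # 4 s of 16 kHz audio (dummy signal)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
audio_feats = encoder(**inputs).last_hidden_state  # (1, T_audio, 768)

block = AudioInjectionBlock(latent_dim=1024, audio_dim=768)
video_tokens = torch.randn(1, 2048, 1024)  # flattened spatio-temporal latent tokens
fused = block(video_tokens, audio_feats)   # audio-conditioned latent tokens
```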
Our data collection strategy combines automated screening of large-scale datasets (OpenHumanVid, Koala36M) with manual curation of high-quality samples. We focus on videos featuring human characters engaged in specific activities like speaking, singing, and dancing, creating a comprehensive dataset of millions of human-centric video samples.
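The screening criteria themselves are not spelled out above; as a rough, hypothetical illustration, automated screening over per-clip metadata could look like the sketch below, where the field names (`sync_conf`, `face_conf`, `motion_score`) and thresholds are placeholders rather than the actual filters used for the dataset.

```python
# Hypothetical data-screening sketch. All fields and thresholds are
# illustrative placeholders, not the actual Wan-S2V curation criteria.
from dataclasses import dataclass


@dataclass
class ClipMeta:
    clip_id: str
    duration_s: float    # clip length in seconds
    sync_conf: float     # audio-visual sync confidence (e.g., from a SyncNet-style model)
    face_conf: float     # face / person detection confidence
    motion_score: float  # amount of body motion in the clip


def keep_clip(m: ClipMeta) -> bool:
    """Return True if the clip passes the automated screening stage."""
    return (
        3.0 <= m.duration_s <= 30.0  # discard clips that are too short or too long
        and m.sync_conf >= 0.6       # speech/singing should be in sync with the face
        and m.face_conf >= 0.8       # a clearly visible human subject
        and m.motion_score >= 0.1    # avoid near-static footage
    )


candidates = [
    ClipMeta("a", 12.0, 0.82, 0.95, 0.4),
    ClipMeta("b", 1.5, 0.90, 0.99, 0.3),  # too short -> rejected
]
screened = [m for m in candidates if keep_clip(m)]
print([m.clip_id for m in screened])  # ['a']
```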
Prompt: "In the video, a woman stood on the deck of a sailing boat and sang loudly. The background was the choppy sea and the thundering sky. It was raining heavily in the sky, the ship swayed, the camera swayed, and the waves splashed everywhere, creating a heroic atmosphere. The woman has long dark hair, part of which is wet by rain. Her expression is serious and firm, her eyes are sharp, and she seems to be staring at the distance or thinking."
Prompt: "In the video, a boy is sitting on a running train. His eyes are blurred. He is singing softly and tapping the beat with his hands. It may be a scene from an MV movie. The train was moving, and the view passed quickly."
Prompt: "In the video, there is a man's selfie perspective. He glides in the sky in a parachute. He sings happily and looks engaged. The scenery passes around him."
Our method is capable of generating film-quality videos, enabling the synthesis of film dialogues and the recreation of narrative scenes.
Prompt: "The video shows a group of nuns singing hymns in the church. The sky emits fluctuating golden light and golden powder falls from the sky. Dressed in traditional black robes and white headscarves, they are neatly arranged in a row with their hands folded in front of their chests. Their expressions are solemn and pious, as if they are conducting some kind of religious ceremony or prayer. The nuns' eyes looked up, showing great concentration and awe, as if they were talking to the gods."
Prompt: "In the video, a man is lying on the sofa with his hands folded on his legs. He is talking with his legs cocked. The lamp flickered. The camera slowly circled, as if it were a movie scene."
Prompt: "In the video, a man in a suit is sitting on the sofa. He leans forward and seems to want to dissuade the opposite person. He speaks to the opposite person with a serious expression of concern."
Prompt: "The video shows the scene on the rooftop. A bald man is talking with his hand on another person. His expression is serious and serious, as if he is advising and educating the other person. The wind is very strong, and the lens is slightly shaken and pulled closer. The whole scene looks serious and tense, as if it is a movie scene."
Our model can generate character actions and environmental elements according to textual instructions, producing video content that better fits the intended theme.
Prompt: "In the video, it is raining heavily. It shows a man who is topless and has clear muscle lines, showing good physical fitness and strength. The rain wet his whole body. His arms were open and he was singing happily. His expression was engaged and his hands were slowly extended. The man's head is slightly raised, his eyes are upward, and his mouth is slightly open. His expression was full of surprise and expectation, giving a feeling that something important was about to happen."
Prompt: "In the video, a man is holding an apple and talking, he takes a bite of the apple."
We conduct comprehensive quantitative comparisons with state-of-the-art methods across multiple evaluation metrics.
| Method | FID↓ | FVD↓ | SSIM↑ | PSNR↑ | Sync-C↑ | EFID↓ | HKC↑ | HKV↑ | CSIM↑ |
|---|---|---|---|---|---|---|---|---|---|
| EchoMimicV2 | 33.42 | 217.71 | 0.662 | 18.17 | 4.44 | 1.052 | 0.425 | 0.150 | 0.519 |
| MimicMotion | 25.38 | 248.95 | 0.585 | 17.15 | 2.68 | 0.617 | 0.356 | 0.169 | 0.608 |
| EMO2 | 27.28 | 129.41 | 0.662 | 17.75 | 4.58 | 0.218 | 0.553 | 0.198 | 0.650 |
| FantasyTalking | 22.60 | 178.12 | 0.703 | 19.63 | 3.00 | 0.366 | 0.281 | 0.087 | 0.626 |
| Hunyuan-Avatar | 18.07 | 145.77 | 0.670 | 18.16 | 4.71 | 0.7082 | 0.379 | 0.145 | 0.583 |
| Wan2.2-S2V-14B | 15.66 | 129.57 | 0.734 | 20.49 | 4.51 | 0.283 | 0.435 | 0.142 | 0.677 |
Table 1: Quantitative comparison with state-of-the-art methods. Wan2.2-S2V achieves the best or near-best performance among comparable models on core metrics such as FID (video quality), EFID (expression authenticity), and CSIM (identity consistency).
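As a point of reference for the reconstruction metrics above, frame-level PSNR and SSIM can be computed with scikit-image as in the sketch below. This is illustrative only, not the authors' evaluation code, and metrics such as FID, FVD, and Sync-C additionally require dedicated pretrained networks that are omitted here.

```python
# Illustrative frame-level PSNR/SSIM computation with scikit-image;
# not the authors' evaluation pipeline.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def video_psnr_ssim(pred: np.ndarray, gt: np.ndarray) -> tuple:
    """Average PSNR and SSIM over frames. pred/gt: (T, H, W, 3) uint8 arrays."""
    psnrs, ssims = [], []
    for p, g in zip(pred, gt):
        psnrs.append(peak_signal_noise_ratio(g, p, data_range=255))
        ssims.append(structural_similarity(g, p, channel_axis=-1, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))


# Dummy 16-frame videos stand in for generated and ground-truth clips.
pred = np.random.randint(0, 256, (16, 256, 256, 3), dtype=np.uint8)
gt = np.random.randint(0, 256, (16, 256, 256, 3), dtype=np.uint8)
print(video_psnr_ssim(pred, gt))
```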
@article{wan2025s2v,
  title={Wan-S2V: Audio-Driven Cinematic Video Generation},
  author={Xin Gao and Li Hu and Siqi Hu and Mingyang Huang and Chaonan Ji and Dechao Meng and Jinwei Qi and Penchong Qiao and Zhen Shen and Yafei Song and Ke Sun and Linrui Tian and Guangyuan Wang and Qi Wang and Zhongjian Wang and Jiayu Xiao and Sheng Xu and Bang Zhang and Peng Zhang and Xindi Zhang and Zhe Zhang and Jingren Zhou and Lian Zhuo},
  journal={},
  year={2025}
}