OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication

Accepted by NeurIPS 2025

Zhongjian Wang, Peng Zhang, Jinwei Qi, Guangyuan Wang, Sheng Xu, Bang Zhang, Liefeng Bo

Tongyi Lab, Alibaba Group


We propose OmniTalker, a unified framework to jointly generate speech and talking video from text, which alleviates the pain of redundant computation, error accumulation and audio-visual style mismatch in existing methods.

Method

Framework of the OmniTalker model. (a) The In-Context Embed module uses modality-specific encoders to extract text, audio, and motion embeddings. The audio and motion embeddings are then padded to the target sequence length, which is estimated by a separate duration prediction module. (b) The audio and visual features interact in the audio-visual fusion module. They are then fed into several DiT blocks separately.
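The padding step in (a) can be illustrated with a minimal sketch. It assumes that the reference audio and motion embeddings share one frame rate, that the number of frames to generate is already known (in OmniTalker it would come from the duration prediction module, which is not modeled here), and that placeholder frames are appended after the reference frames. The function name `pad_reference`, the zero pad value, and the mask convention are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Frame = List[float]


def pad_reference(ref: List[Frame], gen_len: int,
                  pad_value: float = 0.0) -> Tuple[List[Frame], List[bool]]:
    """Append gen_len placeholder frames after the reference frames.

    Returns the padded sequence and a mask that is True for frames that
    still have to be generated and False for reference frames.
    """
    if gen_len < 0:
        raise ValueError("gen_len must be non-negative")
    if not ref:
        raise ValueError("reference sequence must contain at least one frame")
    dim = len(ref[0])
    padded = [list(f) for f in ref] + [[pad_value] * dim for _ in range(gen_len)]
    mask = [False] * len(ref) + [True] * gen_len
    return padded, mask


# Toy reference embeddings: 3 frames each, with different feature sizes.
audio_ref = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
motion_ref = [[1.0, 0.0, 0.0, 0.0]] * 3

gen_len = 5  # assumed output of a duration predictor
padded_audio, audio_mask = pad_reference(audio_ref, gen_len)
padded_motion, motion_mask = pad_reference(motion_ref, gen_len)

# Both streams now cover the same number of frames.
print(len(padded_audio), len(padded_motion))
```

Padding both streams to one shared length keeps audio and motion frames aligned index by index, which is what a later fusion step would need.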

Unified Multimodal Framework: OmniTalker integrates text-to-audio and text-to-video generation in a single model, enabling synchronized output through cross-modal fusion.

In-Context Multimodal Style Replication: A reference-guided mechanism captures speech and facial styles for zero-shot replication.

Real-Time Efficiency: By integrating flow matching and maintaining a small model size (0.8B), OmniTalker achieves real-time inference while preserving high-fidelity outputs.
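To make the role of flow matching in inference speed concrete, here is a minimal sketch of flow-matching sampling: a sample is obtained by integrating an ODE dx/dt = v(x, t) from noise at t = 0 to data at t = 1, so the cost is a small, fixed number of network evaluations. The toy velocity field below stands in for the trained DiT blocks and is purely an assumption for illustration; the step count and the names `euler_sample` and `make_toy_velocity` are hypothetical, and this is not the sampler used in the paper.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)  # one "network" evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Velocity of the straight-line path toward a fixed target point.

    A real model predicts the velocity with a neural network; this closed
    form is only a stand-in so the sketch runs without trained weights.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
target = [0.5, -1.0, 2.0, 0.0]

sample = euler_sample(make_toy_velocity(target), noise, num_steps=8)
print(sample)
```

With this toy field the Euler iterates land on the target after the final step. For a learned field, fewer steps trade some fidelity for lower latency.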

Results

Zero-shot In-context Multimodal Generation

Language	Prompt	Same-Language Generation	Cross-lingual Generation
ZH	同学们好。我是小米公司的雷军。今天我正式加入了B站大家庭,成为了一名B站的萌新。	喜欢小米的小朋友们你们好，我是雷军，今天我当了一个小时的中国首富，感谢大家捧场。	How are you? Mi fans. I'm Lei Jun, I'm very happy to be China's richest man in one hour today, thank you very much. Are you OK?
	你生活的那么忙,每天东跑西跑的,会不会想家呢?当然会了,尤其是离开家乡很久的时候,更会有思乡的感觉。	亲爱的观众朋友们，大家好。在这个美好的夜晚，非常荣幸能够以AI的形式再次与你们相遇。每当站在这个舞台上，我都感到无比的幸福和激动，因为能用歌声传达心中的情感，与每一位在场的朋友分享这份真挚的心意，是我最大的愿望。	My dear audience friends, good evening. On this beautiful night, it is such an honor to meet you all again in the form of AI. Every time standing on this stage, I feel incredibly happy and excited, because being able to convey my inner feelings through songs and share this sincere heart with every friend present is my greatest wish.
	那我想讲一讲跟虚假信息有关的问题。什么叫虚假信息呢?顾名思义,就是假消息,就是谣言。	听说现在有人拿AI复刻我的声音和形象，搞的大家分不出来真假，啊，你们这个被抓起来，是要判三年的。	我试试说几句英文，啊，I've heard that nowadays, some people are using AI to replicate my voice and appearance, making it hard for everyone to tell the difference between real and fake. Yet, if they get caught, they could be sentenced to three years in prison.
	过年贴春联吃团圆饭,和家人一起守岁迎新年,都是个很好玩,也很重要的一个事情哦。	生命就像一场电影，有欢笑也有泪水，有高潮也有低谷。重要的是，在每一个时刻都要全力以赴，用心去演绎属于你自己的故事。	Life is like a movie—there are moments of joy and moments of struggle, but the key is to give your best in every scene. No matter how tough things get, never lose faith.
	而中国呢,虽然方向上没有转变,它是一直在降息,但从九月份开始,货币政策,财政政策,都接连的发大招,开始加强刺激。	想象一下，你面前站着的是一个完全由代码构建却仿佛真人般鲜活的2D数字人。它不仅有着细腻入微的表情变化，每一个眼神、每一次微笑都能准确传达出参考人物的情感特质。	It's 小Lin说 here, your friendly neighborhood storyteller, tech enthusiast, and eternal optimist about life's messy, wonderful chaos. Whether you're sipping coffee, scrolling between meetings, or just stumbled here by fate, welcome to this little slice of our digital universe.
EN	Unify. I bring people together. I get along with people. I've always gotten along with people. I'll get along with Democrats, with Republicans, with liberals, with conservatives.	We shall not be swayed easily, no matter how rocky the road ahead may be. We must trust that amidst the difficulties, solutions can be found; amidst disagreements, consensus can be reached. This is not just a choice about policies or politics, it concerns our shared future and the world we wish to leave for the next generation.	让我告诉你，没有人比我更爱中国了。我多次说过，这是真的。中国人民、中国文化、以及那令人惊叹的历史——这一切都太棒了！他们制造的东西，说真的，世界上没有哪个国家能比得上中国的制造能力。
	For our children's children, and for those people out there whose voices have been drowned out by the politics of greed, I thank you all for this amazing award tonight. Let us not take this planet for granted.	We shall not be swayed easily, no matter how rocky the road ahead may be. We must trust that amidst the difficulties, solutions can be found; amidst disagreements, consensus can be reached. This is not just a choice about policies or politics, it concerns our shared future and the world we wish to leave for the next generation.	为了我们的下一代，以及那些被贪婪政治边缘化，无法为自己发声的人们，我真心感谢今晚大家给予我的这份荣誉。我们必须意识到不能把这颗美丽的星球当作理所当然的存在。让我们一起为地球的未来努力，因为这是我们共同的责任。
	And this year, as I consider all that 2017 has in store, I believe those opportunities are greater than ever. For we have made a momentous decision and set ourselves on a new direction.	We shall not be swayed easily, no matter how rocky the road ahead may be. We must trust that amidst the difficulties, solutions can be found; amidst disagreements, consensus can be reached. This is not just a choice about policies or politics, it concerns our shared future and the world we wish to leave for the next generation.	我们不会被轻易地击倒，无论前方的道路，多么崎岖不平。我们要相信，在困难面前，我们可以找到解决之道；在分歧之中，我们可以达成共识。这不仅是一次关于政策或政治的选择，它关乎我们共同的未来，以及我们希望留给下一代的世界。
	We show time and time again that we do not equally value women's participation, contribution, and leadership.	Ah, neural networks... they're not just mathematical curiosities, you see. They are mirrors—imperfect, but fascinating—reflecting the computational principles that life itself has evolved over billions of years.	当前，随着全球燃料和食品价格不断上涨，在气候危机和持续的军事冲突中，女性的收入，以及她们对商业成功和市场复苏的贡献，比以往任何时候都更加重要。然而，我们一次又一次地表明，我们并没有平等地重视女性的参与、贡献和领导力。
	The, the research world realized that deep learning really worked well and could be applicable to a lot of	Ah, neural networks... they're not just mathematical curiosities, you see. They are mirrors—imperfect, but fascinating—reflecting the computational principles that life itself has evolved over billions of years.	中国近年来在人工智能领域的进展显著，尤其在大模型研发、开源框架应用及产业落地方面的表现都很突出，体现了中国团队在参数规模，训练效率，以及多语言支持等方面的持续优化能力。

Emotionally Expressive Generation

Given prompt videos showing different emotions, the model generates results that match each emotion, with expressive facial expressions and natural head poses. Prompt videos are from the RAVDESS dataset.

Emotion	Prompt	Text	Generation
Calm		I was, like, talking to my friend, and she's all, um, excited about her, uh, trip to Europe, and I'm just, like, so jealous, right?
Happy
Sad
Angry		In my father's day, we respected each other as if we were members of the same family. But today, while the world has sped up, the ties between people have grown thinner.
Disgust
Surprised

Long-term Generation

We can generate long-term videos while maintaining consistent tone and talking style.

Interactive Demo

Our method enables real-time generation at 25 FPS, providing practical support for interactive video chat applications. The interface is built upon OpenAvatarChat.
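As a rough guide to what 25 FPS implies for an interactive system, the sketch below works out the per-frame time budget and a real-time factor (processing time divided by the duration of the generated video). The helper names and the timing numbers in the usage lines are hypothetical and are not measurements from the paper.

```python
def frame_budget_ms(fps: float) -> float:
    """Average time available to produce one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def real_time_factor(processing_seconds: float, num_frames: int,
                     fps: float = 25.0) -> float:
    """Processing time divided by generated video duration.

    Values at or below 1.0 mean generation keeps up with playback.
    """
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return processing_seconds / (num_frames / fps)


budget = frame_budget_ms(25.0)
# Hypothetical example: 100 frames (4 s of video) produced in 2 s.
rtf = real_time_factor(processing_seconds=2.0, num_frames=100, fps=25.0)
print(budget, rtf)
```

At 25 FPS the average budget is 40 ms per frame, so audio and video generation together must stay within that average for playback to remain uninterrupted.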