Recent years have witnessed remarkable advances in talking head generation (THG), owing to its potential to shift human-AI interaction from text interfaces to realistic video chats. However, text-driven talking head generation remains underexplored: existing methods predominantly adopt a cascaded pipeline that combines a text-to-speech (TTS) system with an audio-driven talking head model. This conventional pipeline not only adds system complexity and latency but also suffers from asynchronous audiovisual output and stylistic discrepancies between the generated speech and the visual expressions. To address these limitations, we introduce OmniTalker, an end-to-end unified framework that simultaneously generates synchronized speech and talking head video from text and a reference video in real-time, zero-shot scenarios, while preserving both speech style and facial style. The framework employs a dual-branch diffusion transformer architecture: the audio branch synthesizes mel-spectrograms from text, while the visual branch predicts fine-grained head poses and facial dynamics. To bridge the two modalities, we introduce a novel audio-visual fusion module that integrates cross-modal information to ensure temporal synchronization and stylistic coherence between the audio and visual outputs. Furthermore, our in-context reference learning module captures both speech and facial style characteristics from a single reference video without requiring a separate style-extraction module. To the best of our knowledge, OmniTalker is the first unified framework that jointly models speech style and facial style in a zero-shot setting, and it achieves a real-time inference speed of 25 FPS. Extensive experiments demonstrate that our method surpasses existing approaches in generation quality, particularly in style preservation and audio-video synchronization, while maintaining real-time inference efficiency.
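To make the idea of a cross-modal fusion step more concrete, here is a minimal sketch in plain Python. It assumes that fusion is done with bidirectional cross-attention plus a residual connection, which is one common way to let two branches exchange information; the abstract does not specify the actual mechanism, so `cross_attend`, `fuse`, and the toy feature values are hypothetical and are not OmniTalker's implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Single-head scaled dot-product attention from `queries` to `context`."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(d) for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(d)])
    return out

def fuse(audio_tokens, visual_tokens):
    """Hypothetical fusion: each branch adds a residual read from the other branch."""
    from_visual = cross_attend(audio_tokens, visual_tokens)
    from_audio = cross_attend(visual_tokens, audio_tokens)
    fused_audio = [[a + r for a, r in zip(tok, read)]
                   for tok, read in zip(audio_tokens, from_visual)]
    fused_visual = [[v + r for v, r in zip(tok, read)]
                    for tok, read in zip(visual_tokens, from_audio)]
    return fused_audio, fused_visual

# Toy features: 5 time steps, 4 channels per branch (made-up values).
audio = [[0.1 * (t + j) for j in range(4)] for t in range(5)]
visual = [[0.05 * (t - j) for j in range(4)] for t in range(5)]
fused_audio, fused_visual = fuse(audio, visual)
print(len(fused_audio), len(fused_audio[0]))  # 5 4
```

Both branches keep their own sequence length and channel width, so a block like this could in principle sit between layers of the two branches; how the real model aligns audio and video time steps is not stated in the source.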
Language | Prompt | Same Language Generation | Cross-lingual Generation |
---|---|---|---|
ZH | 同学们好。我是小米公司的雷军。今天我正式加入了B站大家庭,成为了一名B站的萌新。 | 喜欢小米的小朋友们你们好,我是雷军,今天我当了一个小时的中国首富,感谢大家捧场。 | How are you? Mi fans. I'm Lei Jun, I'm very happy to be China's richest man in one hour today, thank you very much. Are you OK? |
ZH | 你生活的那么忙,每天东跑西跑的,会不会想家呢?当然会了,尤其是离开家乡很久的时候,更会有思乡的感觉。 | 亲爱的观众朋友们,大家好。在这个美好的夜晚,非常荣幸能够以AI的形式再次与你们相遇。每当站在这个舞台上,我都感到无比的幸福和激动,因为能用歌声传达心中的情感,与每一位在场的朋友分享这份真挚的心意,是我最大的愿望。 | My dear audience friends, good evening. On this beautiful night, it is such an honor to meet you all again in the form of AI. Every time standing on this stage, I feel incredibly happy and excited, because being able to convey my inner feelings through songs and share this sincere heart with every friend present is my greatest wish. |
ZH | 那我想讲一讲跟虚假信息有关的问题。什么叫虚假信息呢?顾名思义,就是假消息,就是谣言。 | 听说现在有人拿AI复刻我的声音和形象,搞的大家分不出来真假,啊,你们这个被抓起来,是要判三年的。 | 我试试说几句英文,啊,I've heard that nowadays, some people are using AI to replicate my voice and appearance, making it hard for everyone to tell the difference between real and fake. Yet, if they get caught, they could be sentenced to three years in prison. |
ZH | 过年贴春联吃团圆饭,和家人一起守岁迎新年,都是个很好玩,也很重要的一个事情哦。 | 生命就像一场电影,有欢笑也有泪水,有高潮也有低谷。重要的是,在每一个时刻都要全力以赴,用心去演绎属于你自己的故事。 | Life is like a movie—there are moments of joy and moments of struggle, but the key is to give your best in every scene. No matter how tough things get, never lose faith. |
ZH | 而中国呢,虽然方向上没有转变,它是一直在降息,但从九月份开始,货币政策,财政政策,都接连的发大招,开始加强刺激。 | 想象一下,你面前站着的是一个完全由代码构建却仿佛真人般鲜活的2D数字人。它不仅有着细腻入微的表情变化,每一个眼神、每一次微笑都能准确传达出参考人物的情感特质。 | It's 小Lin说 here, your friendly neighborhood storyteller, tech enthusiast, and eternal optimist about life's messy, wonderful chaos. Whether you're sipping coffee, scrolling between meetings, or just stumbled here by fate, welcome to this little slice of our digital universe. |
EN | Unify. I bring people together. I get along with people. I've always gotten along with people. I'll get along with Democrats, with Republicans, with liberals, with conservatives. | We shall not be swayed easily, no matter how rocky the road ahead may be. We must trust that amidst the difficulties, solutions can be found; amidst disagreements, consensus can be reached. This is not just a choice about policies or politics, it concerns our shared future and the world we wish to leave for the next generation. | 让我告诉你,没有人比我更爱中国了。我多次说过,这是真的。中国人民、中国文化、以及那令人惊叹的历史——这一切都太棒了!他们制造的东西,说真的,世界上没有哪个国家能比得上中国的制造能力。 |
EN | For our children's children, and for those people out there whose voices have been drowned out by the politics of greed, I thank you all for this amazing award tonight. Let us not take this planet for granted. | We shall not be swayed easily, no matter how rocky the road ahead may be. We must trust that amidst the difficulties, solutions can be found; amidst disagreements, consensus can be reached. This is not just a choice about policies or politics, it concerns our shared future and the world we wish to leave for the next generation. | 为了我们的下一代,以及那些被贪婪政治边缘化,无法为自己发声的人们,我真心感谢今晚大家给予我的这份荣誉。 我们必须意识到不能把这颗美丽的星球当作理所当然的存在。让我们一起为地球的未来努力,因为这是我们共同的责任。 |
EN | And this year, as I consider all that 2017 has in store, I believe those opportunities are greater than ever. For we have made a momentous decision and set ourselves on a new direction. | We shall not be swayed easily, no matter how rocky the road ahead may be. We must trust that amidst the difficulties, solutions can be found; amidst disagreements, consensus can be reached. This is not just a choice about policies or politics, it concerns our shared future and the world we wish to leave for the next generation. | 我们不会被轻易地击倒,无论前方的道路,多么崎岖不平。我们要相信,在困难面前,我们可以找到解决之道;在分歧之中,我们可以达成共识。这不仅是一次关于政策或政治的选择,它关乎我们共同的未来,以及我们希望留给下一代的世界。 |
EN | We show time and time again that we do not equally value women's participation, contribution, and leadership. | Ah, neural networks... they're not just mathematical curiosities, you see. They are mirrors—imperfect, but fascinating—reflecting the computational principles that life itself has evolved over billions of years. | 当前,随着全球燃料和食品价格不断上涨,在气候危机和持续的军事冲突中,女性的收入,以及她们对商业成功和市场复苏的贡献,比以往任何时候都更加重要。然而,我们一次又一次地表明,我们并没有平等地重视女性的参与、贡献和领导力。 |
EN | The, the research world realized that deep learning really worked well and could be applicable to a lot of | Ah, neural networks... they're not just mathematical curiosities, you see. They are mirrors—imperfect, but fascinating—reflecting the computational principles that life itself has evolved over billions of years. | 中国近年来在人工智能领域的进展显著,尤其在大模型研发、开源框架应用及产业落地方面的表现都很突出,体现了中国团队在参数规模,训练效率,以及多语言支持等方面的持续优化能力。 |
Given prompt videos that express different emotions, our method generates results that match the given emotion, with expressive facial expressions and natural head poses. The prompt videos are from the RAVDESS dataset.
Emotion | Prompt | Text | Generation |
---|---|---|---|
Calm | I was, like, talking to my friend, and she's all, um, excited about her, uh, trip to Europe, and I'm just, like, so jealous, right? | ||
Happy | |||
Sad | |||
Angry | In my father's day, we respected each other as if we were members of the same family. But today, while the world has sped up, the ties between people have grown thinner. | ||
Disgust | |||
Surprised |
We can generate long videos while maintaining a consistent tone and talking style.
Our method enables real-time generation at 25 FPS, providing practical support for interactive video chat applications. The interface is built upon OpenAvatarChat.
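As a rough illustration of what 25 FPS implies for an interactive system, the sketch below computes the per-frame time budget and checks whether a generation run keeps up with playback. The helper names and the example timing (3.2 seconds for 100 frames) are hypothetical and are not measurements from OmniTalker.

```python
def frame_budget_ms(fps):
    """Average time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

def is_real_time(seconds_to_generate, frames_generated, fps=25):
    """True if generating the frames took no longer than playing them back at `fps`."""
    playback_seconds = frames_generated / fps
    return seconds_to_generate <= playback_seconds

print(frame_budget_ms(25))     # 40.0
print(is_real_time(3.2, 100))  # True: 100 frames play for 4.0 s at 25 FPS
```

At 25 FPS the model has, on average, 40 ms per output frame, and the matching speech for that span has to be produced within the same budget.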