Animate Anyone

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Institute for Intelligent Computing, Alibaba Group

Abstract

Character Animation aims to generate character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities. However, challenges persist in the realm of image-to-video, especially in character animation, where maintaining temporal consistency with the detailed information of the character remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve the consistency of intricate appearance features from the reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct the character's movements and employ an effective temporal modeling approach to ensure smooth transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on benchmarks for fashion video and human dance synthesis, achieving state-of-the-art results.

Method

The overview of our method. The pose sequence is first encoded by the Pose Guider and fused with multi-frame noise; the Denoising UNet then performs the denoising process for video generation. The computational block of the Denoising UNet consists of Spatial-Attention, Cross-Attention, and Temporal-Attention, as illustrated in the dashed box on the right. The reference image is integrated in two ways. First, detailed features are extracted through ReferenceNet and used for Spatial-Attention. Second, semantic features are extracted through the CLIP image encoder and used for Cross-Attention. Temporal-Attention operates in the temporal dimension. Finally, the VAE decoder decodes the result into a video clip.
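To make the three attention types concrete, the following is a minimal shape-bookkeeping sketch, not the authors' implementation. It assumes the latent video is laid out as (batch, frames, height, width, channels) and shows how each attention layer might group that tensor into sequences. The function name `attention_views`, the `clip_tokens` parameter, and the choice to append ReferenceNet features as extra key/value tokens in Spatial-Attention are all illustrative assumptions; the text above only states that those features are used for Spatial-Attention.

```python
from typing import Dict, Tuple

# (number of independent sequences, tokens per sequence, channels)
Shape = Tuple[int, int, int]


def attention_views(batch: int, frames: int, height: int, width: int,
                    channels: int, clip_tokens: int = 1) -> Dict[str, Dict[str, Shape]]:
    """Return the query and key/value shapes each attention type would see.

    Hypothetical helper for illustration only: it computes shapes and
    performs no attention itself.
    """
    spatial_tokens = height * width
    return {
        # Spatial-Attention: every frame attends over its own pixels.
        # Assumption: reference features of the same spatial size are
        # appended as extra key/value tokens, doubling the key length.
        "spatial": {
            "query": (batch * frames, spatial_tokens, channels),
            "key_value": (batch * frames, 2 * spatial_tokens, channels),
        },
        # Cross-Attention: pixels of each frame attend to the CLIP image
        # embedding (assumed already projected to `channels`).
        "cross": {
            "query": (batch * frames, spatial_tokens, channels),
            "key_value": (batch * frames, clip_tokens, channels),
        },
        # Temporal-Attention: each spatial location attends across frames.
        "temporal": {
            "query": (batch * spatial_tokens, frames, channels),
            "key_value": (batch * spatial_tokens, frames, channels),
        },
    }


# Hypothetical sizes: one clip of 32 frames on an 8x6 latent grid.
views = attention_views(batch=1, frames=32, height=8, width=6, channels=320)
for name, shapes in views.items():
    print(name, shapes)
```

The point of the sketch is the contrast between the groupings: spatial and cross attention treat each frame as its own sequence of pixels, while temporal attention treats each pixel location as a sequence of frames.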

Human

Anime/Cartoon

Fashion Video Synthesis

Inference Acceleration

Animate Anyone video generation workloads are accelerated by DeepGPU (AIACC) from Alibaba Cloud, which delivers a substantial performance uplift compared to the original PyTorch + xformers solution without degrading the quality of the generated videos. This reduces waiting time for end users by around 30% and lowers operating cost, giving a better user experience and a more cost-effective AI solution. The chart below shows performance figures for Animate Anyone inference accelerated by DeepGPU. For a 32-frame video at 832x640 resolution, the generation time for 1 step drops from 2.45s to 1.75s on an A10 GPU with DeepGPU acceleration, a 40% performance gain; on an RTX6000 GPU, it drops from 2.8s to 2.25s, an advantage of nearly 25% over PyTorch.
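The quoted percentages can be checked with a short calculation. This sketch uses only the four timings stated above; the function name `speedup_summary` is hypothetical. It computes the gain two ways. Under the assumption that "performance gain" means throughput (steps per second), the A10 figure matches the quoted 40%, while the reduction in time per step is closer to the "around 30%" waiting-time figure.

```python
def speedup_summary(baseline_s: float, accelerated_s: float) -> dict:
    """Compare two per-step durations (seconds) in two common ways."""
    if baseline_s <= 0 or accelerated_s <= 0:
        raise ValueError("durations must be positive")
    return {
        # Share of the original waiting time that is saved.
        "time_reduction_pct": 100.0 * (baseline_s - accelerated_s) / baseline_s,
        # Extra work completed per unit of time.
        "throughput_gain_pct": 100.0 * (baseline_s / accelerated_s - 1.0),
    }


# Per-step timings quoted in the text: (PyTorch + xformers, DeepGPU).
timings = {"A10": (2.45, 1.75), "RTX6000": (2.8, 2.25)}

for gpu, (before, after) in timings.items():
    s = speedup_summary(before, after)
    print(f"{gpu}: {s['time_reduction_pct']:.1f}% less time per step, "
          f"{s['throughput_gain_pct']:.1f}% higher throughput")
# A10: 28.6% less time per step, 40.0% higher throughput
# RTX6000: 19.6% less time per step, 24.4% higher throughput
```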

BibTeX

@article{hu2023animateanyone,
  title={Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation},
  author={Li Hu and Xin Gao and Peng Zhang and Ke Sun and Bang Zhang and Liefeng Bo},
  journal={arXiv preprint arXiv:2311.17117},
  website={https://humanaigc.github.io/animate-anyone/},
  year={2023}
}


Animating Various Characters

Human

Anime/Cartoon

Humanoid

Comparisons

Fashion Video Synthesis

Fashion Video Synthesis aims to turn fashion photographs into realistic, animated videos using a driving pose sequence. Experiments are conducted on the UBC fashion video dataset, with all methods using the same training data.

Human Dance Generation

Human Dance Generation focuses on animating images in real-world dance scenarios. Experiments are conducted on the TikTok dataset, with all methods using the same training data.

More Applications

Animate Anyone + Outfit Anyone

Outfit Anyone: Ultra-high quality virtual try-on for any clothing and any person.

Image to talking-head video

Image to video (like Gen2) + talking head generation (our internal project based on VividTalk)
