AnimateAnywhere: Rouse the Background
in Human Image Animation

Xiaoyu Liu1 Mingshuai Yao1 Yabo Zhang1 Xianhui Lin Peiran Ren
Xiaoming Li2 Ming Liu1 Wangmeng Zuo1

1 Harbin Institute of Technology, Harbin, China 2 Nanyang Technological University, Singapore    



Comparison with existing paradigms of human image animation. (a) Human animation with static backgrounds (e.g., Animate Anyone, Champ): these methods generate only static backgrounds. (b) Human animation with dynamic backgrounds guided by background motion conditions (e.g., Humanvid, Liu et al.): these methods extract background motion conditions from reference videos to guide the movement of the background. However, obtaining background motion conditions paired with human poses relies heavily on reference videos, which are often unavailable in practice. (c) Ours: human animation with dynamic backgrounds synchronized with human poses. Our paradigm learns background movements directly from human poses.

[Paper]     [Code]

The Framework of AnimateAnywhere

(a) Overview pipeline of AnimateAnywhere. Given a reference background image, a reference human image, and a human pose sequence as inputs, AnimateAnywhere generates photorealistic videos with synchronized motion of both the human and the background. Built upon CogVideoX, we employ a ControlNet to inject the human pose sequence and a ReferenceNet to incorporate the reference human. The background image is concatenated with the noise to inject background appearance, while a Background Motion Learner (BML) predicts background motion from the pose sequence. (b) 3D attention with epipolar constraint. (b)-(1) The \( i \)-th video frame. (b)-(2) Vanilla 3D attention map \( \mathbf{A}_{ij}(u,v) \), representing the model-learned attention weights from pixel \( (u,v) \) in the \( i \)-th frame to all positions in the \( j \)-th frame. (b)-(3) Epipolar mask \( \mathbf{M}_{ij}(u,v) \), defining the geometrically valid region in the \( j \)-th frame. (b)-(4) Attention activation mask \( \mathbf{1} - \mathbf{\Omega}_{ij}(u,v) \), highlighting the attention regions retained without constraint; the adaptive suppression mask \( \mathbf{\Omega}_{ij}(u,v) \) selectively constrains low-confidence attention outside the epipolar region. (b)-(5) Target 3D attention map under our epipolar constraint. Our loss function is equivalent to the difference between \( \mathbf{A}_{ij}(u,v) \) and this target map.
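To make the epipolar constraint concrete, the sketch below (not the authors' released implementation) shows one plausible form of the loss in PyTorch. It assumes the target attention map is obtained by zeroing the weights selected by the adaptive suppression mask \( \mathbf{\Omega}_{ij} \), so the loss reduces to the attention mass that lies outside the epipolar mask \( \mathbf{M}_{ij} \) and falls below a confidence threshold; the threshold `conf_thresh` and the tensor layout are assumptions made for illustration.

```python
import torch

def epipolar_attention_loss(attn, epipolar_mask, conf_thresh=0.5):
    """Hedged sketch of an epipolar-constrained attention loss.

    attn:          attention weights A_ij(u, v), e.g. shape (B, Q_i, K_j),
                   from pixels of frame i to all positions of frame j.
    epipolar_mask: M_ij(u, v), same shape, 1 inside the geometrically
                   valid (epipolar) region, 0 outside.
    conf_thresh:   assumed confidence threshold used to build the
                   adaptive suppression mask Omega.
    """
    # Adaptive suppression mask Omega: constrain only low-confidence
    # attention that falls outside the epipolar region; confident
    # off-epipolar attention (e.g. moving foreground) is left untouched.
    omega = (1.0 - epipolar_mask) * (attn < conf_thresh).float()

    # Target attention map under the epipolar constraint: keep the regions
    # highlighted by the activation mask (1 - Omega), suppress the rest.
    target = attn * (1.0 - omega)

    # Loss = difference between A_ij and the target map, which equals the
    # suppressed attention mass |A * Omega|.
    return (attn - target).abs().mean()
```

In training, a term of this form would be added to the diffusion objective for frame pairs whose epipolar masks can be derived from camera geometry; how those masks are constructed is not shown here.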



Qualitative comparisons on the Humanvid test dataset

Qualitative comparisons on the Humanvid test dataset clearly demonstrate the superiority of our proposed method in generating photorealistic videos. Notably, our approach ensures coherent and harmonious motion between the foreground and background.

Qualitative comparisons on the BL200 test dataset

Qualitative comparisons on the BL200 test dataset show that our approach not only generates realistic background motion but also maintains high-quality, temporally consistent animation of both the human and the background.

More Videos

We provide more results for both horizontal and vertical videos. Our approach demonstrates robust background motion across diverse human poses, including translational movements (vertical, horizontal, forward, backward) and rotation.

Ablation Study

We investigate the effects of the epipolar loss, the foundational framework, and the VGG loss.