AnimateAnywhere: Rouse the Background
in Human Image Animation
Xiaoyu Liu1
Mingshuai Yao1
Yabo Zhang1
Xianhui Lin
Peiran Ren
Xiaoming Li2
Ming Liu1
Wangmeng Zuo1
1 Harbin Institute of Technology, Harbin, China
2 Nanyang Technological University, Singapore
Comparison with existing paradigms of human image animation. (a) Human animation with static backgrounds (e.g., Animate Anyone and Champ): these methods generate only static backgrounds. (b) Human animation with dynamic backgrounds guided by background motion conditions (e.g., Humanvid and Liu et al.): these methods extract background motion conditions from reference videos to guide the movement of the background.
However, extracting background motion conditions paired with human poses relies heavily on reference videos, which are often unavailable in practice. (c) Ours: human animation with dynamic backgrounds synchronized with human poses. Our paradigm learns the background movements directly from human pose sequences.
[Paper]
[Code]
The Framework of AnimateAnywhere
(a) Overview pipeline of AnimateAnywhere. Given a reference background image, a reference human image, and a human pose sequence as inputs, AnimateAnywhere generates photorealistic videos with synchronized motion of both the human and the background. Built upon CogVideoX, we employ a ControlNet to inject the human pose sequence and a ReferenceNet to incorporate the reference human. The background image is concatenated with noise to inject background appearance, while a Background Motion Learner (BML) predicts background motion from the pose sequence. (b) 3D attention with epipolar constraint. (b)-(1) The \( i \)-th video frame. (b)-(2) Vanilla 3D attention map \( \mathbf{A}_{ij}(u,v) \): the model-learned attention weights from pixel \( (u,v) \) in the \( i \)-th frame to all positions in the \( j \)-th frame. (b)-(3) Epipolar mask \( \mathbf{M}_{ij}(u,v) \): the geometrically valid region in the \( j \)-th frame. (b)-(4) Attention activation mask \( \mathbf{1} - \mathbf{\Omega}_{ij}(u,v) \): the retained attention regions left unconstrained; the adaptive suppression mask \( \mathbf{\Omega}_{ij}(u,v) \) selectively constrains low-confidence attention outside the epipolar region. (b)-(5) Target 3D attention map under our epipolar constraint. Our loss function is equivalent to computing the difference between \( \mathbf{A}_{ij}(u,v) \) and this target map.
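To make the epipolar constraint concrete, below is a minimal PyTorch-style sketch of how such a loss could be formed from the 3D attention map \( \mathbf{A}_{ij} \), the epipolar mask \( \mathbf{M}_{ij} \), and the adaptive suppression mask \( \mathbf{\Omega}_{ij} \). The thresholding rule for "low-confidence" weights (the parameter `tau`) and the L1 form of the difference are illustrative assumptions, not the exact formulation used in the paper.

```python
import torch

def epipolar_attention_loss(attn, epipolar_mask, tau=0.05):
    """Sketch of an epipolar-constrained loss on a 3D attention map.

    attn:          (B, Nq, Nk) attention weights A_ij from pixels of frame i
                   (queries) to all positions of frame j (keys).
    epipolar_mask: (B, Nq, Nk) binary mask M_ij; 1 inside the geometrically
                   valid epipolar region of frame j, 0 outside.
    tau:           assumed confidence threshold separating low-confidence
                   off-epipolar attention (suppressed) from high-confidence
                   attention (retained).
    """
    # Adaptive suppression mask Omega_ij: low-confidence attention that
    # falls outside the epipolar region.
    omega = (1.0 - epipolar_mask) * (attn < tau).float()

    # Target attention map: the learned map with suppressed entries zeroed,
    # i.e. A_ij * (1 - Omega_ij).
    target = attn * (1.0 - omega)

    # The loss is the difference between A_ij and the target map, which
    # here reduces to the attention mass placed on suppressed positions.
    return (attn - target.detach()).abs().mean()
```

In practice, \( \mathbf{M}_{ij} \) would be derived from the relative camera geometry between frames \( i \) and \( j \), for example by thresholding each key position's distance to the epipolar line of pixel \( (u,v) \); that preprocessing step is omitted from this sketch.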
Qualitative comparison on Humanvid test dataset
Qualitative comparisons on the Humanvid test dataset clearly demonstrate the superiority of our proposed method in generating photorealistic videos. Notably, our approach ensures coherent and harmonious motion between the foreground and background.
Qualitative comparisons on the BL200 test dataset
Qualitative comparisons on the BL200 test dataset show that our approach not only generates realistic background motion but also maintains high-quality, temporally consistent animation of both the human and the background.
More Videos
We provide more results for both horizontal and vertical videos. Our approach demonstrates robust background motion across diverse human poses, including translational movements (vertical, horizontal, forward, backward) and rotation.
Ablation Study
We investigate the effects of the epipolar loss, the foundation framework, and the VGG loss.