Abstract.
World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas:
maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs.
Existing methods based on explicit 3D reconstruction often sacrifice flexibility in unbounded scenes and struggle to preserve fine-grained structures.
Alternative approaches condition directly on previously generated frames without establishing explicit spatial correspondence, which limits both controllability and consistency.
To address these limitations, we present
UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism.
To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation.
Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos.
Extensive experiments on real-world and synthetic benchmarks demonstrate that
UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.
An overview of UCM.
Given previously generated frames and a specific camera trajectory as input,
UCM encodes the historical frames into clean tokens to condition the denoising of noisy tokens.
For camera control and memory injection, UCM introduces time-aware positional encoding warping to establish spatio-temporal correspondence, together with an efficient dual-stream transformer architecture that processes the two token streams.
After iterative denoising,
UCM yields a high-fidelity, scene-consistent video that adheres to the user-specified trajectory.
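This excerpt does not spell out the warping operation itself, but the idea admits a compact sketch. The PyTorch code below illustrates one plausible instantiation of time-aware positional encoding warping: unproject each historical token to 3D using an assumed per-token depth, reproject it into the target camera, and pair the warped spatial position with the original timestamp. The pinhole-projection formulation and all inputs (depth, K, E_hist, E_tgt) are assumptions for illustration, not UCM's actual procedure.

```python
import torch

def warp_clean_token_positions(depth, K, E_hist, E_tgt, frame_time):
    """Hypothetical sketch: warp a historical frame's token positions into the
    target view while retaining the historical time index.

    depth:      (H, W) assumed per-token depth of the historical frame
    K:          (3, 3) camera intrinsics
    E_hist:     (4, 4) world-to-camera extrinsics of the historical frame
    E_tgt:      (4, 4) world-to-camera extrinsics of the target frame
    frame_time: scalar time index of the historical frame
    Returns (H, W, 3) positions (t, u', v') for indexing the PE table.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3).T  # (3, HW)

    # Unproject pixels into the historical camera's 3D frame, then into world space.
    cam = (torch.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    world = torch.linalg.inv(E_hist) @ torch.cat([cam, torch.ones(1, H * W)], dim=0)

    # Reproject the world points into the target camera.
    tgt = (E_tgt @ world)[:3]
    uv = (K @ tgt) / tgt[2:3].clamp(min=1e-6)  # perspective divide

    # Time-aware warped position: warped space (u', v') plus the original time t.
    t = torch.full((1, H * W), float(frame_time))
    return torch.cat([t, uv[:2]], dim=0).T.reshape(H, W, 3)
```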
The architecture of the UCM DiT block.
Each noisy token attends to all noisy tokens and is guided by the clean tokens via time-aware warped PEs, implemented through KV concatenation.
Each clean token attends only to clean tokens within the same frame, using the original PEs.
This block-sparse attention mask enables camera control and memory guidance with reduced computational cost.
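The attention pattern described in this caption can be written down directly as a boolean mask. The sketch below is a minimal illustration, not the released implementation: it orders tokens as [noisy | clean] and reproduces the two rules, with noisy queries seeing all noisy and all clean keys while clean queries see only clean keys from their own frame. The function and argument names are hypothetical.

```python
import torch

def build_ucm_style_mask(n_noisy, n_clean_frames, tokens_per_frame):
    """Boolean attention mask for the dual-stream block (True = attend).

    Token order: [noisy tokens | clean tokens].
    - Rows for noisy queries are fully True: each noisy token attends to all
      noisy tokens and to every clean token (the KV-concatenation path).
    - Rows for clean queries are block-diagonal: each clean token attends
      only to clean tokens of its own frame.
    """
    n_clean = n_clean_frames * tokens_per_frame
    n = n_noisy + n_clean
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Noisy stream: full self-attention plus guidance from all clean tokens.
    mask[:n_noisy, :] = True

    # Clean stream: per-frame block-diagonal self-attention.
    for f in range(n_clean_frames):
        s = n_noisy + f * tokens_per_frame
        mask[s:s + tokens_per_frame, s:s + tokens_per_frame] = True

    return mask

# Example: 16 noisy tokens and 2 history frames of 8 tokens each.
mask = build_ucm_style_mask(n_noisy=16, n_clean_frames=2, tokens_per_frame=8)
# The mask can be passed to torch.nn.functional.scaled_dot_product_attention as
# attn_mask; clean rows never look at noisy tokens, which is what makes the
# pattern block-sparse and cheaper than dense joint attention over all tokens.
```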