Meta AI, in partnership with Tel Aviv University, has introduced VideoJAM, a new framework designed to enhance motion generation in video models.
This approach aims to address a common challenge in video generation: the tendency of models to prioritize visual appearance over realistic motion and dynamics.
VideoJAM is our new framework for improved motion generation from @AIatMeta.
We show that video generators struggle with motion because the training objective favors appearance over dynamics.
VideoJAM directly addresses this **without any extra data or scaling** 👇🧵
— Hila Chefer (@hila_chefer), 2:57 PM • Feb 4, 2025
Behind the Scenes
VideoJAM encourages video generators to learn a joint appearance-motion representation, improving motion coherence without requiring additional data or model scaling.
The framework introduces two key components: an extended training objective and an "Inner-Guidance" mechanism for inference.
During training, the model is tasked with predicting both generated pixels and their corresponding motion from a single learned representation.
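In practice, this amounts to a two-target objective computed from a single forward pass. The sketch below is a minimal illustration, assuming a model with two prediction heads and an optical-flow-style motion target; the function names, noising schedule, and losses are our own assumptions rather than Meta's implementation.

```python
import torch
import torch.nn.functional as F

def add_noise(x, noise, t):
    # Simple interpolation between data and noise, standing in for the
    # backbone's actual diffusion/flow-matching schedule (an assumption).
    t = t.view(-1, 1, 1, 1, 1)  # broadcast over (batch, channels, frames, H, W)
    return (1.0 - t) * x + t * noise

def joint_appearance_motion_loss(model, video_latents, flow_latents, t, text_emb):
    # Hypothetical extended objective: one shared representation, two targets.
    noise = torch.randn_like(video_latents)
    noisy_video = add_noise(video_latents, noise, t)
    # Assumed two-headed model: returns an appearance prediction and a motion
    # prediction derived from the same internal features.
    pred_video, pred_flow = model(noisy_video, t, text_emb)
    appearance_loss = F.mse_loss(pred_video, video_latents)
    motion_loss = F.mse_loss(pred_flow, flow_latents)
    return appearance_loss + motion_loss
```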
The Inner-Guidance feature steers generation towards coherent motion by using the model's evolving motion prediction as a dynamic guidance signal during inference.
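Conceptually, this resembles classifier-free guidance with the motion prediction acting as an additional conditioning signal. The following sketch shows one way such a denoising step could be combined, assuming a model that can be queried with and without its motion conditioning; the weights and call signature are illustrative, not the framework's actual API.

```python
def inner_guidance_step(model, noisy_video, t, text_emb, w_text=7.5, w_motion=2.0):
    # Illustrative guidance combination: the model's own evolving motion
    # prediction is folded back in as an extra guidance signal, alongside
    # ordinary text guidance. The motion_conditioning flag and the use of
    # None as a null text embedding are assumptions.
    pred_full, motion_pred = model(noisy_video, t, text_emb, motion_conditioning=True)
    pred_no_text, _ = model(noisy_video, t, None, motion_conditioning=True)
    pred_no_motion, _ = model(noisy_video, t, text_emb, motion_conditioning=False)

    guided = (
        pred_full
        + w_text * (pred_full - pred_no_text)      # steer toward the text prompt
        + w_motion * (pred_full - pred_no_motion)  # steer toward coherent motion
    )
    return guided, motion_pred
```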
VideoJAM can be applied to existing video models with minimal adaptations, making it a versatile solution for improving motion generation.
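To give a sense of how light such an adaptation could be, here is a hypothetical wrapper that adds only a motion-input projection and a motion-output head around an otherwise unchanged backbone; the layer shapes and backbone interface are assumptions for illustration.

```python
import torch.nn as nn

class JointMotionAdapter(nn.Module):
    # Hypothetical wrapper: only two small projection layers are added around
    # an existing video backbone (channel counts and interface are assumed).
    def __init__(self, backbone, latent_channels):
        super().__init__()
        self.backbone = backbone
        # Project the noised motion latents into the backbone's input space.
        self.motion_in = nn.Conv3d(latent_channels, latent_channels, kernel_size=1)
        # Decode a motion prediction from the backbone's output features.
        self.motion_out = nn.Conv3d(latent_channels, latent_channels, kernel_size=1)

    def forward(self, noisy_video, noisy_motion, t, text_emb):
        # Fuse the motion signal with the video input, run the backbone as
        # usual, then read both appearance and motion predictions back out.
        fused = noisy_video + self.motion_in(noisy_motion)
        features, pred_video = self.backbone(fused, t, text_emb)  # assumed return signature
        pred_motion = self.motion_out(features)
        return pred_video, pred_motion
```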
The framework has demonstrated state-of-the-art performance in motion coherence, surpassing competitive proprietary models while also enhancing the perceived visual quality of generated videos.
Final Take
VideoJAM's approach to balancing appearance and motion in video generation represents a significant step forward in creating more realistic and coherent video content.
This development could have far-reaching implications for film production, visual effects, and content creation, offering filmmakers and producers more sophisticated tools for generating high-quality video assets.