IMPORTANT
This is the single most critical metric for AI video.
In AI video generation, Temporal Consistency (also known as "coherence") refers to the ability of a generative model to maintain the structural integrity, identity, and physical properties of objects across sequential frames of video.
The Challenge
Generative models (like diffusion models) work by iteratively removing noise from random static. When generating video, the model must ensure that:
- Identity: The character's face doesn't morph into a different person.
- Physics: A ball thrown in frame 1 follows a realistic arc in frame 10.
- Textures: The pattern on a shirt doesn't "boil" or change randomly.
If a model fails at this, the video looks "jittery" or dream-like, which is unacceptable for professional production.
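The "boiling" failure mode can be made concrete with a toy experiment (illustrative only, not from any specific model): if every frame is sampled independently, consecutive frames differ wildly; if each frame is a small update of the previous one, they stay coherent.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_frame_diff(frames):
    """Average absolute pixel change between consecutive frames."""
    return float(np.mean([np.abs(frames[i + 1] - frames[i]).mean()
                          for i in range(len(frames) - 1)]))

# "Boiling": every frame is fresh, independent noise.
boiling = [rng.random((16, 16)) for _ in range(8)]

# Coherent: each frame is the previous frame plus a tiny perturbation.
coherent = [rng.random((16, 16))]
for _ in range(7):
    coherent.append(coherent[-1] + 0.01 * rng.standard_normal((16, 16)))

assert mean_frame_diff(coherent) < mean_frame_diff(boiling)
```

A model with poor temporal consistency behaves like the first video: each frame is plausible on its own, but the sequence jitters.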
Measuring Consistency
We evaluate Temporal Consistency using:
- Warp Error: Warping one frame onto the next using estimated optical flow, then measuring the residual pixel error.
- CLIP Score: Comparing the semantic (embedding) similarity between frames, e.g. the first and last frame.
- Face ID Score: Verifying the character's identity remains constant.
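The warp-error metric above can be sketched in a few lines of numpy. This is a simplified, nearest-neighbour version with a hand-supplied flow field; real pipelines estimate `flow` with an optical-flow model (e.g. RAFT) and use sub-pixel interpolation.

```python
import numpy as np

def warp_error(frame1, frame2, flow):
    """Warp frame1 toward frame2 using a dense flow field (dx, dy per
    pixel), then return the mean absolute error against frame2.
    Illustrative helper, nearest-neighbour sampling only."""
    h, w = frame1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Backward warping: sample frame1 at positions displaced by the flow.
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, w - 1)
    warped = frame1[src_y, src_x]
    return float(np.abs(warped - frame2).mean())

# Frame 2 is frame 1 shifted right by one pixel; a flow of dx = +1
# everywhere should explain the motion perfectly.
frame1 = np.zeros((8, 8)); frame1[:, 3] = 1.0
frame2 = np.zeros((8, 8)); frame2[:, 4] = 1.0
flow = np.zeros((8, 8, 2)); flow[..., 0] = 1.0

print(warp_error(frame1, frame2, flow))  # → 0.0
```

A low warp error means frame-to-frame changes are fully explained by motion; unexplained residue (flicker, boiling textures) shows up as a high score.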
Achieving high temporal consistency (like in Sora 2) requires advanced techniques such as 3D-aware latent spaces or "spatio-temporal attention" blocks.
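The key idea behind spatio-temporal attention can be shown with a minimal numpy sketch (illustrative shapes and projections, not any production architecture): instead of attending only within a single frame, every latent token attends across all frames, letting the model keep identities and textures aligned over time.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(latents, wq, wk, wv):
    """latents: (T, H, W, C) video latents; wq/wk/wv: (C, C) projections.
    Flattens time AND space into one token axis so attention spans frames."""
    T, H, W, C = latents.shape
    tokens = latents.reshape(T * H * W, C)
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(C))  # every token attends to every frame
    return (attn @ v).reshape(T, H, W, C)

rng = np.random.default_rng(0)
T, H, W, C = 4, 2, 2, 8
out = spatio_temporal_attention(rng.standard_normal((T, H, W, C)),
                                *(rng.standard_normal((C, C)) for _ in range(3)))
print(out.shape)  # → (4, 2, 2, 8)
```

Compared with per-frame (spatial-only) attention, the flattened token axis is what gives the model a direct path for enforcing consistency between distant frames, at the cost of attention scaling with T × H × W.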
See how the top models stack up in our Sora 2 vs. Gen-4 Comparison.