Spatiotemporal Diffusion Priors for Extreme Video Compression

We extend the paradigm of diffusion-based image compression to video by utilizing a generative spatiotemporal prior and present the first codec based on a video diffusion model.

December 7, 2025
Picture Coding Symposium (PCS), 2025


Authors

Lucas Relic (ETH Zurich/DisneyResearch|Studios)

André Emmenegger (ETH Zurich/DisneyResearch|Studios)

Roberto Azevedo (DisneyResearch|Studios)

Yang Zhang (DisneyResearch|Studios)

Markus Gross (DisneyResearch|Studios/ETH Zurich)

Christopher Schroers (DisneyResearch|Studios)


Abstract

Diffusion models have recently demonstrated impressive results in image compression, where a strong spatial prior enables fine details to be synthesized rather than transmitted with additional bits. In this work, we propose to extend this paradigm to video compression by utilizing a generative spatiotemporal prior and present the first codec based on a video diffusion model. Our method operates by performing long-context interpolation guided by sparse inter-frame predictions, thus requiring minimal motion information. To this end, we develop a sparse, bidirectional optical flow representation that serves as bitrate-efficient motion conditioning in the diffusion decoding process. The resulting codec can compress videos to extremely low rates (as low as 0.01 bits per pixel) while maintaining realistic textures and motion, and outperforms both neural and traditional baselines on several benchmark datasets. Our method shows state-of-the-art performance in perceptually oriented distortion metrics, and, when considering rate-realism, we achieve an improvement in FID score of up to 73.3 at the same bitrate compared to the leading traditional video codec, VTM. Overall, we present an important first work examining spatiotemporal diffusion priors for video compression.
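To make the idea of sparse, bidirectional flow conditioning concrete, the sketch below shows one plausible way a dense bidirectional flow field could be reduced to a low-rate set of motion vectors and then re-expanded into a per-pixel conditioning map for a diffusion decoder. This is only an illustrative assumption, not the authors' actual implementation: the function names (`sparsify_flow`, `densify_flow`), the block-sampling strategy, and the bilinear densification are all hypothetical choices made for this example.

```python
import torch
import torch.nn.functional as F


def sparsify_flow(flow_fwd: torch.Tensor, flow_bwd: torch.Tensor, stride: int = 16):
    """Reduce dense bidirectional optical flow to a sparse set of vectors.

    flow_fwd, flow_bwd: (2, H, W) dense flow fields, e.g. from the first key
    frame to the last and vice versa. Keeping one vector per stride x stride
    block yields a signal that is far cheaper to entropy code than dense flow.
    (Hypothetical sketch; not the paper's method.)
    """
    assert flow_fwd.shape == flow_bwd.shape and flow_fwd.shape[0] == 2
    _, H, W = flow_fwd.shape
    ys = torch.arange(stride // 2, H, stride)
    xs = torch.arange(stride // 2, W, stride)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")

    sparse_fwd = flow_fwd[:, grid_y, grid_x]   # (2, H//stride, W//stride)
    sparse_bwd = flow_bwd[:, grid_y, grid_x]
    coords = torch.stack([grid_y, grid_x], dim=0)
    return sparse_fwd, sparse_bwd, coords


def densify_flow(sparse_flow: torch.Tensor, size: tuple[int, int]) -> torch.Tensor:
    """Upsample sparse flow back to full resolution so it can be fed to the
    diffusion decoder as a per-pixel motion conditioning map."""
    return F.interpolate(
        sparse_flow.unsqueeze(0), size=size, mode="bilinear", align_corners=False
    ).squeeze(0)


if __name__ == "__main__":
    H, W = 256, 448
    flow_fwd = torch.randn(2, H, W)   # stand-in for an estimated forward flow
    flow_bwd = torch.randn(2, H, W)   # stand-in for an estimated backward flow
    sf, sb, coords = sparsify_flow(flow_fwd, flow_bwd, stride=16)
    cond = densify_flow(sf, (H, W))
    print(sf.shape, cond.shape)       # (2, 16, 28) and (2, 256, 448)
```

In such a setup, only the sparse vectors (here, a 16 x 28 grid instead of a 256 x 448 field) would need to be transmitted, which is consistent with the abstract's claim that minimal motion information suffices when the spatiotemporal prior handles interpolation.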

Copyright Notice