DiVAS: Video and Audio Synchronization with Dynamic Frame Rates
June 17, 2024
CVPR (2024)
Authors
Clara Fernandez-Labrador (DisneyResearch|Studios)
Mertcan Akçay (DisneyResearch|Studios / ETH Zurich)
Eitan Abecassis (Disney Entertainment and ESPN Technology)
Joan Massich (DisneyResearch|Studios)
Christopher Schroers (DisneyResearch|Studios)
Synchronization issues between audio and video are among the most disturbing quality defects in film production and live broadcasting. Even a discrepancy as short as 45 milliseconds can degrade the viewer's experience enough to warrant manual quality checks over entire movies. In this paper, we study the automatic discovery of such issues. Specifically, we focus on the alignment of lip movements with spoken words, targeting realistic production scenarios that can include background noise and music, intricate head poses, excessive makeup, or scenes with multiple individuals where the speaker is unknown. Our model's robustness also extends to various media specifications, including different video frame rates and audio sample rates. To address these challenges, we present a model fully based on Transformers that encodes face crops or full video frames and raw audio using timestamp information, identifies the speaker, and provides highly accurate synchronization predictions much faster than previous methods.
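The abstract only states that frames and raw audio are encoded "using timestamp information"; the mechanics are not spelled out here. One plausible way to make a Transformer robust to dynamic frame rates is to drive the positional encoding with real capture timestamps rather than integer frame indices, so that sequences sampled at 24 fps and 30 fps land on the same temporal axis (at 24 fps a single frame spans about 41.7 ms, so a 45 ms offset is already roughly a full frame of error). The sketch below illustrates this idea in PyTorch; the function name, dimensions, and the sinusoidal form are illustrative assumptions, not the authors' implementation.

```python
import torch


def timestamp_encoding(timestamps: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal positional encoding computed from real timestamps (seconds)
    rather than integer positions, so sequences captured at different frame
    rates map onto a shared temporal axis.

    Hypothetical sketch -- not the DiVAS implementation.

    timestamps: (batch, seq_len) tensor of capture times in seconds.
    Returns:    (batch, seq_len, dim) encoding, dim assumed even.
    """
    half = dim // 2
    # Log-spaced frequencies, as in the standard Transformer encoding.
    freqs = torch.exp(
        -torch.arange(half, dtype=torch.float32)
        * (torch.log(torch.tensor(1e4)) / half)
    )
    # Broadcast: (batch, seq_len, 1) * (half,) -> (batch, seq_len, half)
    angles = timestamps.unsqueeze(-1) * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


# Example: a 24 fps clip and a 30 fps clip share the same time axis,
# so tokens 0.5 s into either clip receive the same encoding.
t_24 = torch.arange(12, dtype=torch.float32).unsqueeze(0) / 24.0
t_30 = torch.arange(15, dtype=torch.float32).unsqueeze(0) / 30.0
print(timestamp_encoding(t_24, 64).shape)  # torch.Size([1, 12, 64])
print(timestamp_encoding(t_30, 64).shape)  # torch.Size([1, 15, 64])
```

Because such an encoding depends only on absolute time, audio tokens sampled at yet another rate could share the same positional scheme, which is one way to align the two modalities without resampling; whether DiVAS does exactly this is not stated in the abstract.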