TempFormer: Temporally Consistent Transformer for Video Denoising

We propose an efficient hybrid Transformer-based model, TempFormer, which combines SpatioTemporal Transformer Blocks (STTB) with 3D convolutional layers.

October 11, 2022
European Conference on Computer Vision (ECCV) (2022)

Authors

Mingyang Song (ETH Zürich)

Yang Zhang (DisneyResearch|Studios)

Tunç O. Aydın (DisneyResearch|Studios)

Abstract

Video denoising is a low-level vision task that aims to restore high-quality videos from noisy content. Vision Transformer (ViT) is a new machine learning architecture that, in the past year, has shown promising performance on both high-level and low-level image tasks, e.g., object detection, classification, and image restoration. In this paper, we propose a modified ViT architecture for video processing tasks, introducing a new training strategy and loss function to enhance temporal consistency without compromising spatial quality. Specifically, we propose an efficient hybrid Transformer-based model, TempFormer, which combines SpatioTemporal Transformer Blocks (STTB) with 3D convolutional layers. The proposed STTB learns the temporal information between neighboring frames implicitly by using the proposed Joint Spatio-Temporal Mixer module for attention calculation and feature aggregation in each ViT block. Moreover, existing methods suffer from temporal inconsistency artifacts that are problematic in practical cases and distracting to viewers. We propose a sliding block strategy with a recurrent architecture, and use a new loss term, Overlap Loss, to alleviate the flickering between adjacent frames. Our method produces state-of-the-art spatio-temporal denoising quality with significantly improved temporal coherency, and requires fewer computational resources to achieve denoising quality comparable to that of competing methods.
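
The abstract names a sliding block strategy and an Overlap Loss but does not define either, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes that consecutive blocks of frames share some frames, and that the loss penalizes disagreement between the two predictions of those shared frames. The function names, the flattened-list frame representation, and the choice of mean absolute difference are all assumptions made for this example; the recurrent part of the architecture is not modeled.

```python
from typing import List, Sequence

# Illustrative only: a frame is a flat list of pixel values.
Frame = List[float]


def sliding_blocks(num_frames: int, block_size: int, stride: int) -> List[List[int]]:
    """Return blocks of frame indices; consecutive blocks share
    block_size - stride frames (assumed scheme, not from the paper)."""
    if not 0 < stride <= block_size:
        raise ValueError("stride must be in (0, block_size]")
    return [
        list(range(start, start + block_size))
        for start in range(0, num_frames - block_size + 1, stride)
    ]


def overlap_loss(prev_out: Sequence[Frame], next_out: Sequence[Frame], overlap: int) -> float:
    """Mean absolute difference between the two predictions of the shared
    frames (the L1 distance is an assumption made for this sketch)."""
    if overlap == 0:
        return 0.0
    tail = prev_out[-overlap:]  # last `overlap` frames of the earlier block
    head = next_out[:overlap]   # first `overlap` frames of the later block
    total, count = 0.0, 0
    for frame_a, frame_b in zip(tail, head):
        for a, b in zip(frame_a, frame_b):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


blocks = sliding_blocks(num_frames=6, block_size=4, stride=2)
# Hypothetical denoised outputs for the two blocks (one pixel per frame).
out_first = [[0.0], [1.0], [2.0], [3.0]]
out_second = [[2.5], [3.0], [4.0], [5.0]]
loss = overlap_loss(out_first, out_second, overlap=2)
```

In this toy setup the two blocks cover frames 0 to 3 and 2 to 5, so frames 2 and 3 are predicted twice. A loss of this kind is zero only when both predictions of the shared frames agree, which is one plausible way to discourage flicker at block boundaries.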

Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author’s copyright. These works may not be reposted without the explicit permission of the copyright holder.
