Versatile Vision Foundation Model for Image and Video Colorization

In this work we show how a latent diffusion model, pre-trained on text-to-image synthesis, can be finetuned for image colorization and provide a flexible solution for a wide variety of scenarios: high-quality direct colorization with diverse results; user-guided colorization through color hints, text prompts, or a reference image; and finally video colorization.

July 28, 2024

SIGGRAPH 2024

Authors

Vukasin Bozic (ETH Zürich)

Abdelaziz Djelouah (DisneyResearch|Studios)

Yang Zhang (DisneyResearch|Studios)

Radu Timofte (University of Würzburg)

Markus Gross (DisneyResearch|Studios / ETH Zürich)

Christopher Schroers (DisneyResearch|Studios)

Abstract

Image and video colorization are among the most common problems in image restoration. Colorization is ill-posed, and a wide variety of methods have been proposed, ranging from more traditional computer vision strategies to the most recent developments with transformer-based or generative neural network models. In this work we show how a latent diffusion model, pre-trained on text-to-image synthesis, can be finetuned for image colorization and provide a flexible solution for a wide variety of scenarios: high-quality direct colorization with diverse results; user-guided colorization through color hints, text prompts, or a reference image; and finally video colorization. Some works have already investigated using diffusion models for colorization; however, the proposed solutions are often more complex and require training a side model to guide the denoising process (à la ControlNet). Not only does this approach increase the number of parameters and the compute time, it also results in suboptimal colorization, as we show. Our evaluation demonstrates that our model is the only approach that offers wide flexibility while either matching or outperforming existing methods specialized in each sub-task. It does so by proposing a group of universal, architecture-agnostic mechanisms that could be applied to any pre-trained diffusion model.

Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author’s copyright. These works may not be reposted without the explicit permission of the copyright holder.
