Versatile Vision Foundation Model for Image and Video Colorization
In this work we show how a latent diffusion model, pre-trained on text-to-image synthesis, can be finetuned for image colorization, providing a flexible solution for a wide variety of scenarios: high-quality direct colorization with diverse results, user-guided colorization through color hints, text prompts, or a reference image, and finally video colorization.
Authors
Vukasin Bozic (ETH Zürich)
Abdelaziz Djelouah (DisneyResearch|Studios)
Yang Zhang (DisneyResearch|Studios)
Radu Timofte (University of Würzburg)
Markus Gross (DisneyResearch|Studios / ETH Zürich)
Christopher Schroers (DisneyResearch|Studios)
Image and video colorization are among the most common problems in image restoration. Colorization is an ill-posed problem, and a wide variety of methods have been proposed, ranging from traditional computer vision strategies to recent developments based on transformer or generative neural network models. In this work we show how a latent diffusion model, pre-trained on text-to-image synthesis, can be finetuned for image colorization, providing a flexible solution for a wide variety of scenarios: high-quality direct colorization with diverse results, user-guided colorization through color hints, text prompts, or a reference image, and finally video colorization. Some works have already investigated diffusion models for colorization, but the proposed solutions are often more complex and require training a side model to guide the denoising process (à la ControlNet). Not only does this approach increase the number of parameters and the compute time, it also results in suboptimal colorization, as we show. Our evaluation demonstrates that our model is the only approach offering such wide flexibility while matching or outperforming existing methods specialized in each sub-task, thanks to a group of universal, architecture-agnostic mechanisms that could be applied to any pre-trained diffusion model.
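To make the contrast with ControlNet-style side models concrete, below is a minimal, hypothetical sketch of how a pre-trained latent diffusion model could be finetuned directly for colorization by concatenating a grayscale latent with the noisy color latent and widening the UNet's input convolution. The base checkpoint name, the channel-concatenation conditioning, and all variable names are illustrative assumptions for this sketch and are not details taken from the paper.

```python
# Illustrative sketch only: finetune a pre-trained latent diffusion UNet for colorization
# by conditioning on the grayscale latent via channel concatenation (no side model).
# Checkpoint, 4-channel latent layout, and conditioning scheme are assumptions.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "runwayml/stable-diffusion-v1-5"  # placeholder base model

vae = AutoencoderKL.from_pretrained(repo, subfolder="vae").to(device).eval()
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet").to(device)
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

# Widen the UNet's first convolution from 4 to 8 input channels so it can take
# [noisy color latent | grayscale latent]; the new channels start at zero so the
# pre-trained behaviour is preserved at initialization.
old = unet.conv_in
new = torch.nn.Conv2d(8, old.out_channels, old.kernel_size, old.stride, old.padding)
with torch.no_grad():
    new.weight.zero_()
    new.weight[:, :4] = old.weight
    new.bias.copy_(old.bias)
unet.conv_in = new.to(device)

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

def training_step(color_images, gray_images, text_embeddings):
    """One finetuning step on (color, grayscale) image pairs in [-1, 1].
    gray_images are the grayscale channel replicated to 3 channels for the VAE;
    text_embeddings are the usual cross-attention conditioning (e.g. CLIP text features)."""
    with torch.no_grad():
        color_lat = vae.encode(color_images).latent_dist.sample() * vae.config.scaling_factor
        gray_lat = vae.encode(gray_images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(color_lat)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (color_lat.shape[0],), device=device)
    noisy = scheduler.add_noise(color_lat, noise, t)
    # The grayscale latent enters as extra input channels, not through a side network.
    pred = unet(torch.cat([noisy, gray_lat], dim=1), t,
                encoder_hidden_states=text_embeddings).sample
    loss = F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because the grayscale conditioning is injected through the existing input convolution, this kind of setup adds almost no parameters or compute compared to training a separate guiding network, which is the trade-off the abstract highlights.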