GANs in computer vision: 2K image and video synthesis, and large-scale class-conditional image generation (part 4)

Computer vision is indeed a promising application field for GANs. Until now, we focused on conditional and unconditional image generation. In the previous post, we covered many topics, such as training with the Wasserstein loss, understanding disentangled representations, modeling global and local structures with different strategies, and progressive incremental training of GANs. Nevertheless, deep learning in computer vision includes a variety of awesome tasks such as object generation and video generation. With progressive GANs, we reached megapixel resolution for the first time. The question is: can we do better?

In this part, we will inspect 2K image and video synthesis, as well as large-scale class-conditional image generation. Our extensive analysis attempts to bridge the gaps that you may have from previous works in the field. We will revisit a bunch of computer vision concepts such as object detection, semantic segmentation, and instance-level semantic segmentation. Basically, we would like to exploit all the available labels and highly accurate networks to maximize visual quality. That includes even the labels that can be generated by state-of-the-art deep learning models. Since 2018, GANs have gained increased attention in the community due to their vast range of cool applications, especially in computer vision. Moreover, one easily grasps that since the introduction of progressive GANs at the end of 2017, NVIDIA started to own GANs in computer vision! We will analyze the following three foundational works:


1. pix2pixHD (High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs)

2. Video-to-Video Synthesis

3. BigGAN (Large Scale GAN Training for High Fidelity Natural Image Synthesis)

pix2pixHD (High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs 2017)

As you can imagine, this work extends the pix2pix model that we discussed in part 2. In essence, the pix2pix method adopts a symmetric U-Net architecture for the generator and a patch-based discriminator. More importantly, the discriminator does not see an image on its own: its input is the conditioning input (the semantic label map) concatenated with the image to be judged, either real or generated. Nevertheless, the image resolution only reaches up to 256 × 256, which is still low. In computer vision, multi-resolution pipelines are a well-established practice, and two-scale approaches dominate because modeling 2 scales is often enough.
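To make the patch-based, conditional discriminator a bit more concrete, here is a minimal sketch in plain Python that only tracks shapes. It assumes a layer stack of five 4 × 4 convolutions with strides 2, 2, 2, 1, 1 and padding 1, which is the configuration usually described as a 70 × 70 patch discriminator; this configuration and all helper names below are our own assumptions for illustration, not something taken from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conv:
    kernel: int
    stride: int
    padding: int


# Assumed layer stack (hypothetical, for illustration only):
# five 4x4 convolutions with strides 2, 2, 2, 1, 1 and padding 1.
PATCH_DISCRIMINATOR = [
    Conv(4, 2, 1),
    Conv(4, 2, 1),
    Conv(4, 2, 1),
    Conv(4, 1, 1),
    Conv(4, 1, 1),
]


def discriminator_input_channels(label_channels: int, image_channels: int) -> int:
    """Channels seen by a conditional discriminator: label map and image stacked."""
    return label_channels + image_channels


def output_grid(size: int, layers) -> int:
    """Side length of the score map for a square input of side `size`."""
    for layer in layers:
        size = (size + 2 * layer.padding - layer.kernel) // layer.stride + 1
    return size


def receptive_field(layers) -> int:
    """Side length of the input patch that influences a single output score."""
    field = 1
    # Walk backwards from the output: each layer widens the patch.
    for layer in reversed(layers):
        field = (field - 1) * layer.stride + layer.kernel
    return field


if __name__ == "__main__":
    print(discriminator_input_channels(label_channels=3, image_channels=3))  # 6
    print(output_grid(256, PATCH_DISCRIMINATOR))  # 30
    print(receptive_field(PATCH_DISCRIMINATOR))   # 70
```

Under these assumptions, a 256 × 256 input yields a 30 × 30 grid of real/fake scores, each one judging a 70 × 70 patch. Every score is local, which hints at why a single-scale patch discriminator may struggle once the resolution grows well beyond 256 × 256.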

But how did they manage to improve pix2pix in a multi-scale fashion?

1. Decomposing the generator

The generator G is decomposed into G1 and G2. G1 acts as the global generator network and G2 acts as the local enhancer network. G1 operates in a lower dimension, specifically half...