GANs in computer vision: semantic image synthesis and learning a generative model from a single image (part 6)
So far, we have covered multiple computer vision tasks such as object generation, video synthesis, and unpaired image-to-image translation. Now we have reached the publications of 2019 in our journey to summarize the most influential works since the beginning of GANs. We focus on intuition and design choices rather than on dryly reported numbers. In the end, what is the value of a reported number in a visual generation task if the results are not appealing?
In this part, we chose two unique publications: image synthesis conditioned on a segmentation map, and unconditional generation learned from a single reference image. We present the multiple perspectives one has to take into account when designing a GAN, and the models we visit in this tutorial tackle their tasks from many of them.
Let us begin!
Contents
GauGAN : Semantic Image Synthesis with Spatially-Adaptive Normalization
SinGAN : Learning a Generative Model from a Single Natural Image
GauGAN (Semantic Image Synthesis with Spatially-Adaptive Normalization 2019)
We have already seen a lot of works that receive a segmentation map as input and output an image. Whenever something works well, the question that always comes to my mind is: can we do better?
Let's expand on this idea a bit more. Suppose that we want to generate an image given both a segmentation map and a reference image. This task, defined as semantic image synthesis, is of great importance. We don't just generate diverse images based on the segmentation map; we further constrain our model to account for the reference image that we want.
This work is the epitome of NVIDIA's GAN research in computer vision. It borrows heavily from the previous works pix2pixHD and StyleGAN. In particular, the authors adopted the multi-scale discriminator of pix2pixHD. Let's take a look at how it works:
Multi-scale pix2pixHD discriminator overview, operating on an image pyramid
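To make the figure concrete, here is a minimal sketch of the idea (my own naming and layer choices, not the exact pix2pixHD implementation): the same patch-based discriminator is instantiated several times, and each copy judges a progressively downsampled version of the image, so the coarse scales capture global structure while the full-resolution scale focuses on fine texture.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """A small PatchGAN-style discriminator; depth and channel widths are assumptions."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, ch = [], in_ch
        for out_ch in (base, base * 2, base * 4):
            layers += [nn.Conv2d(ch, out_ch, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out_ch
        layers += [nn.Conv2d(ch, 1, kernel_size=4, stride=1, padding=1)]  # per-patch real/fake logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


class MultiScaleDiscriminator(nn.Module):
    """Identical discriminators applied to an image pyramid (full, 1/2 and 1/4 resolution)."""
    def __init__(self, num_scales=3):
        super().__init__()
        self.discs = nn.ModuleList(PatchDiscriminator() for _ in range(num_scales))
        self.down = nn.AvgPool2d(kernel_size=3, stride=2, padding=1, count_include_pad=False)

    def forward(self, x):
        outputs = []
        for disc in self.discs:
            outputs.append(disc(x))   # patch map of real/fake scores at this scale
            x = self.down(x)          # coarser view for the next discriminator
        return outputs
```

In pix2pixHD, the outputs (and intermediate features) of all scales feed the adversarial and feature-matching losses; the sketch above only returns the per-scale patch maps.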
Based on this, the authors also inspected the generator of StyleGAN. This generator exploits adaptive instance normalization (AdaIN) to encode the style carried by the latent representation: essentially, it receives noise that modulates the style. Interestingly, they found out that AdaIN discards semantic content. Let's see why:
\[AdaIN(x,y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y)\]
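To see what this does in practice, here is a minimal PyTorch-style sketch of AdaIN (an illustration with my own naming, not the authors' code). Here x carries the content, y carries the style, and the statistics are computed per channel over the entire spatial grid:

```python
import torch

def adain(x, y, eps=1e-5):
    """Illustrative AdaIN: x is the content feature map, y the style feature map,
    both of shape (N, C, H, W). Statistics are per channel, over all spatial positions."""
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    sigma_x = x.std(dim=(2, 3), keepdim=True) + eps
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    sigma_y = y.std(dim=(2, 3), keepdim=True) + eps
    return sigma_y * (x - mu_x) / sigma_x + mu_y

# A spatially uniform input (e.g. a one-label segmentation map) becomes a constant
# after normalization: (x - mu_x) is zero everywhere, so the label value is lost.
flat = torch.full((1, 3, 8, 8), 2.0)   # hypothetical "sky everywhere" input
style = torch.randn(1, 3, 8, 8)
print(adain(flat, style))              # same output no matter which label we fill in
```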
In this equation, the statistics of the feature map are used to normalize its values. Since the mean and standard deviation are computed over the whole spatial grid, every spatial location is normalized and re-modulated by the same amount; as the sketch above shows, a spatially uniform semantic region is therefore washed away completely. Since we want to design a generator that disentangles style from semantics, one way to encode the semantics is in the modulation of the in-layer normalizations. Let's see how we can...