Deepfakes: Face synthesis with GANs and Autoencoders

Recently, fake news has become a major threat to society. False information spreads quickly through social media and can affect decision making, and it is challenging even for modern AI systems to recognize fake content. One of the most recent developments in data manipulation is known as the “Deepfake”, which refers to swapping faces in images or videos. So far, deepfake techniques have mostly been used to swap celebrity faces into funny videos or to make politicians appear to give absurd speeches. However, many industries could benefit from deepfake applications, such as the film industry through advanced video editing.

How do Deepfakes work?

Let’s take a closer look at how Deepfakes work. Deepfakes are usually based on Generative Adversarial Networks (GANs), in which two competing neural networks are trained jointly. GANs have had significant success in many computer vision tasks. They were introduced in 2014, and modern architectures can generate images so realistic that even humans cannot tell whether they are real. Below you can see some images from a successful GAN model called StyleGAN.
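The adversarial training of the two networks can be sketched in a few lines of PyTorch. This is a minimal toy example, not StyleGAN itself: the network sizes, dimensions, and learning rates are illustrative assumptions. The generator maps random noise to fake samples, the discriminator scores samples as real or fake, and each network is optimized against the other.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 32  # illustrative sizes, not from any real model

# Generator: noise vector -> fake sample
generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, data_dim),
)
# Discriminator: sample -> real/fake logit
discriminator = nn.Sequential(
    nn.Linear(data_dim, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1),
)

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real = torch.randn(8, data_dim)  # stand-in for a batch of real data

for step in range(3):
    # Discriminator step: push real samples toward 1, fakes toward 0.
    z = torch.randn(8, latent_dim)
    fake = generator(z).detach()  # don't backprop into G on this step
    d_loss = (bce(discriminator(real), torch.ones(8, 1))
              + bce(discriminator(fake), torch.zeros(8, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make D score fakes as real.
    z = torch.randn(8, latent_dim)
    g_loss = bce(discriminator(generator(z)), torch.ones(8, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(generator(torch.randn(1, latent_dim)).shape)  # torch.Size([1, 32])
```

The key adversarial detail is the `detach()` call: during the discriminator step the generator’s gradients are blocked, so each network only updates its own parameters against the other’s current behavior.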

These people are not real – they were produced by StyleGAN’s generator, which allows control over different aspects of the image.

What are Deepfakes?

According to Wikipedia, Deepfakes are synthetic media in which a person in an existing image or video is replaced with someone else’s likeness. The act of inserting a fake person into an image is not new. However, recent Deepfake methods usually leverage advances in powerful GAN models aimed at facial manipulation.

In general, facial manipulation is usually conducted with Deepfakes and can be grouped into the following categories:

Face synthesis

Face swap

Facial attributes and expression

1. Face synthesis

In this category, the objective is to create non-existent but realistic faces using GANs. The most popular approach is StyleGAN. Briefly, its generator architecture learns, without supervision, a separation of high-level attributes (e.g., pose and identity when trained on human faces) from stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The StyleGAN generator is shown in Figure 2.

The input latent code is mapped through several fully connected layers to an intermediate representation w, which is then fed to each convolutional layer through adaptive instance normalization (AdaIN), where each feature map is normalized separately. Gaussian noise is added after each convolution. The benefit of adding noise directly to the feature maps of each layer is that global aspects such as identity and pose remain unaffected.
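The AdaIN operation described above can be sketched in NumPy. This is a simplified illustration, assuming the style scale and bias per channel have already been produced from w by a learned affine transform (not shown): each feature map is normalized to zero mean and unit variance, then scaled and shifted by the style parameters.

```python
import numpy as np

def adain(x, y_scale, y_bias, eps=1e-5):
    """Adaptive instance normalization (sketch).

    x:       (N, C, H, W) feature maps
    y_scale: (N, C) per-channel style scale (derived from w in StyleGAN)
    y_bias:  (N, C) per-channel style bias
    """
    mu = x.mean(axis=(2, 3), keepdims=True)    # per-sample, per-channel mean
    sigma = x.std(axis=(2, 3), keepdims=True)  # per-sample, per-channel std
    x_norm = (x - mu) / (sigma + eps)          # normalize each feature map separately
    return y_scale[:, :, None, None] * x_norm + y_bias[:, :, None, None]

x = np.random.randn(2, 8, 4, 4)      # toy feature maps
y_s = 1.0 + np.random.randn(2, 8)    # illustrative style scales
y_b = np.random.randn(2, 8)          # illustrative style biases

out = adain(x, y_s, y_b)
print(out.shape)  # (2, 8, 4, 4)
```

Because the normalization statistics are computed per feature map, the style parameters fully determine the scale and shift of each channel at that layer, which is what gives StyleGAN its layer-wise (scale-specific) control over the synthesis.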

The StyleGAN generator...