Self-supervised representation learning on videos

Nowadays, transfer learning from models pretrained on ImageNet is the de facto standard in computer vision. Self-supervised learning dominates natural language processing, but that doesn't mean there are no significant use cases for it in computer vision. There are indeed many clever self-supervised tasks one can devise when dealing with images, such as jigsaw puzzles [6], image colorization, image inpainting, or even unsupervised image synthesis.

But what happens when the time dimension comes into play? How can you approach the video-based tasks that you would like to solve?

So, let’s start from the beginning, one concept at a time. What is self-supervised learning? And how is it different from transfer learning? What is a pretext task?

Self-supervised learning vs transfer learning

Transfer learning is a fair place to start.

Transfer learning enables us to take the weights of a (convolutional) neural network trained for task A and apply them to another task, B, provided that the input domains are somehow related.

The training process for task A is called pretraining in machine-learning terminology. The core idea is that the 'knowledge' acquired in task A can be exploited to enhance generalization in task B. And this is usually the case, because who wants to start from a random weight initialization?
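The weight-transfer idea can be sketched in a few lines. This is a minimal illustration with made-up layer names and shapes, not any particular library's API: we keep the feature-extractor weights learned on task A and re-initialize only the task-specific head for task B.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights learned on task A (e.g. ImageNet classification).
# Layer names and shapes are invented purely for illustration.
pretrained = {
    "conv1": rng.normal(size=(64, 3, 3, 3)),    # feature extractor
    "conv2": rng.normal(size=(128, 64, 3, 3)),  # feature extractor
    "fc":    rng.normal(size=(1000, 128)),      # task-A head: 1000 classes
}

def init_for_task_b(pretrained, num_classes):
    """Keep the feature-extractor weights and re-initialize only the
    head, since task B has a different number of output classes."""
    model = {name: w.copy() for name, w in pretrained.items() if name != "fc"}
    model["fc"] = rng.normal(size=(num_classes, 128)) * 0.01
    return model

model_b = init_for_task_b(pretrained, num_classes=10)
print(model_b["fc"].shape)                                    # (10, 128)
print(np.array_equal(model_b["conv1"], pretrained["conv1"]))  # True
```

The head is reset because its shape and semantics are tied to task A's label space; everything else starts from the pretrained values instead of random noise.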

Obviously, transfer learning is one way to maximize the performance of your model: pretrain on a similar supervised (video) dataset. However, in domains such as video the problem of annotation (supervision) quickly arises. Annotated video data are hard to find, which makes it difficult to transfer weights.

But even before that, let's clarify the inherent differences between self-supervised learning and transfer learning. I created this simple diagram to make things crystal clear:

Self-supervised learning vs transfer learning

As a solution for supervision-deprived domains, self-supervised learning is one way to transfer weights: pretrain your model on labels that are artificially produced from the data/videos themselves.

This can be achieved with a bunch of different (and sometimes tricky) transformations, as we will see. This is exactly what we call self-supervised representation learning.

Note that, for self-supervised learning, you can even use just your own data, without any provided labels.
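To make "labels artificially produced from the data" concrete, here is a sketch of one such transformation on unlabeled video: a temporal-order verification task, where the label records whether we shuffled a sampled clip's frames. The task name and function are my own illustration, not a specific published method.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_order_example(video, clip_len=4):
    """Create one (clip, label) pair for a hypothetical temporal-order
    pretext task: label 1 if the frames keep their natural order,
    0 if we shuffled them. No human annotation is needed -- the label
    comes for free from the transformation we applied ourselves."""
    start = rng.integers(0, len(video) - clip_len + 1)
    clip = video[start:start + clip_len].copy()
    label = int(rng.random() < 0.5)  # coin flip: keep or destroy order
    if label == 0:
        rng.shuffle(clip)            # shuffle frames along the time axis
    return clip, label

# 30 fake frames of an 8x8 grayscale "video" stand in for real footage
video = rng.normal(size=(30, 8, 8))
clip, label = make_order_example(video)
print(clip.shape, label)  # (4, 8, 8) and either 0 or 1
```

A network trained to predict this label is forced to learn something about motion and temporal structure, which is exactly the kind of representation we hope transfers to the downstream task.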

In essence, in self-supervised learning we try to devise a related task A so that transferring its weights to task B gives a better initialization point than random.

Notations, concepts and example tasks

The devised self-supervised task A is usually called a pretext or proxy task, while the desired task B we want to solve is referred to in the literature as a downstream task. I usually refer...