Data preprocessing for deep learning: How to build an efficient big data pipeline (part 1)

In this article, we explore big data processing for machine learning applications. Building an efficient data pipeline is an essential part of developing a deep learning product and should not be taken lightly. As I’m pretty sure you know by now, machine learning is useless without the right data. And by the right data, we mean data from the correct sources and in the right format.

But what is a data pipeline? And when do we characterize it as efficient?

Generally speaking, data preprocessing consists of two steps: data engineering and feature engineering.

Data engineering is the process of converting raw data into prepared data, which can be used by the ML model.

Feature engineering creates the features expected by the model.
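To make the distinction concrete, here is a minimal sketch in plain Python. The records, field names and both function names are hypothetical and chosen only for illustration; a real project would more likely use a dataframe or tensor library.

```python
# Hypothetical raw records, e.g. rows read from a CSV export.
raw_records = [
    {"width": "640", "height": "480", "label": "cat"},
    {"width": "", "height": "300", "label": "cat"},  # missing width
    {"width": "1280", "height": "720", "label": "dog"},
]


def data_engineering(records):
    """Raw data -> prepared data: drop incomplete rows and parse types."""
    prepared = []
    for row in records:
        if not all(row.values()):
            continue  # skip rows that contain an empty field
        prepared.append({
            "width": int(row["width"]),
            "height": int(row["height"]),
            "label": row["label"],
        })
    return prepared


def feature_engineering(prepared):
    """Prepared data -> features: derive the numeric inputs a model expects."""
    labels = sorted({row["label"] for row in prepared})
    label_to_id = {name: i for i, name in enumerate(labels)}
    return [
        {
            "aspect_ratio": row["width"] / row["height"],
            "label_id": label_to_id[row["label"]],
        }
        for row in prepared
    ]


prepared = data_engineering(raw_records)
features = feature_engineering(prepared)
```

The first function only cleans and types the data. The second one decides what the model actually sees, here an aspect ratio and an integer class id.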

When we deal with a small number of data points, building a pipeline is usually straightforward. But that’s almost never the case with deep learning. Here we work with very large datasets (I’m talking about GBs or even TBs in some cases), and manipulating those is definitely not a piece of cake. But dealing with difficult software challenges is what this article series is all about. If you do not know what I’m talking about, here is a brief reminder:

This article is the 5th part of the Deep Learning in Production series. In the series, we start from a simple experimental Jupyter notebook with a neural network that performs image segmentation, and we write our way towards converting it into production-ready, highly optimized code and deploying it to a production environment serving millions of users. If you missed that, you can start from the first article.

Back to data processing. Where were we? Oooh yeah. So how do we build efficient big data pipelines to feed the data into the machine learning model? Let’s start with the fundamentals.


ETL: Extract, Transform, Load
Data Reading
Data Processing

ETL: Extract, Transform, Load

In the wonderful world of databases, there is a notion called ETL. As you can see in the headline, ETL is an acronym for Extract, Transform, Load. These are the 3 building blocks of most data pipelines.

Extraction is the process of pulling the data from multiple homogeneous or heterogeneous sources.

Transformation refers to cleansing and manipulating the data in order to convert it into a proper format.

Loading is the injection of the transformed data into the memory of the processing units that will handle the training (whether these are CPUs, GPUs or even TPUs).
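The three steps above can be sketched as chained Python generators. This is only an illustration under stated assumptions: the two in-memory CSV strings stand in for real sources, the column names are invented, and the load step merely groups rows into batches instead of copying them to a training device.

```python
import csv
import io
from itertools import islice

# Two hypothetical in-memory "sources" standing in for files, databases or APIs.
SOURCE_A = "image_id,pixel_mean\n1,12.5\n2,200.0\n"
SOURCE_B = "image_id,pixel_mean\n3,\n4,127.5\n"


def extract(sources):
    """Extract: lazily yield raw rows from every source."""
    for text in sources:
        yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Transform: drop incomplete rows and scale values to [0, 1]."""
    for row in rows:
        if not row["pixel_mean"]:
            continue  # row 3 has no value and is skipped
        yield {
            "image_id": int(row["image_id"]),
            "pixel_mean": float(row["pixel_mean"]) / 255.0,
        }


def load(rows, batch_size):
    """Load: group rows into batches.

    A real pipeline would hand each batch to the training device here.
    """
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(load(transform(extract([SOURCE_A, SOURCE_B])), batch_size=2))
```

Because every stage is a generator, rows flow through one at a time and nothing is materialized until a batch is requested, which is the property that matters once the dataset no longer fits in memory.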

When we combine these 3 steps, we get the notorious data pipeline. However, there is a caveat here. It’s not enough to build the sequence of necessary steps. It’s equally...