8 Top Books on Data Cleaning and Feature Engineering
Data preparation is the transformation of raw data into a form that is more appropriate for modeling.
It is a challenging topic to discuss as the data differs in form, type, and structure from project to project.
Nevertheless, there are common data preparation tasks across projects. It is a huge field of study and goes by many names, such as “data cleaning,” “data wrangling,” “data preprocessing,” “feature engineering,” and more. Some of these are distinct data preparation tasks, and some of the terms are used to describe the entire data preparation process.
Even though it is a challenging topic to discuss, there are a number of books on the topic.
In this post, you will discover the top books on data cleaning, data preparation, feature engineering, and related topics.
Let’s get started.
The focus here is on data preparation for tabular data, that is, data in the form of a table with rows and columns, as it would look in an Excel spreadsheet.
Data preparation is an important topic for all data types, although specialty methods are required for each, such as image data in computer vision, text data in natural language processing, and sequence data in time series forecasting.
Data preparation is often a chapter in a machine learning textbook, although there are books dedicated to the topic. We will focus on these books.
I have gathered all the books I could find on the topic of data preparation, selected the ones I think are the best, and organized them into three groups.
I will try to give the flavor of each book, including the goal, the table of contents, and where to learn more about it.
Data cleaning refers to identifying and fixing errors in the data prior to modeling, including, but not limited to, outliers, missing values, and much more.
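To make the idea concrete, here is a minimal sketch of two common cleaning steps: filling in a missing value and flagging a suspicious outlier. The toy table, the column names, and the helper functions `impute_median` and `iqr_outliers` are all hypothetical and chosen only for illustration. Median imputation and the 1.5 × IQR rule are just one reasonable choice among many, not a method taken from any of the books below.

```python
from statistics import median, quantiles

# Hypothetical toy table: each dict is a row, None marks a missing value.
rows = [
    {"age": 25, "income": 48000},
    {"age": 31, "income": 52000},
    {"age": None, "income": 50000},
    {"age": 29, "income": 51000},
    {"age": 27, "income": 49000},
    {"age": 30, "income": 950000},  # suspiciously large value
]

def impute_median(rows, column):
    """Return a copy of rows with missing values in column replaced by the median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def iqr_outliers(rows, column, k=1.5):
    """Return indices of rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [r[column] for r in rows]
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the column
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

cleaned = impute_median(rows, "age")           # fill the missing age
outlier_idx = iqr_outliers(cleaned, "income")  # row indices flagged for review
print(cleaned[2], outlier_idx)
```

Note that the sketch only flags the outlier rather than deleting it. Whether to remove, cap, or keep such a value depends on the project and on whether the value is a genuine error.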
The top books on data cleaning include:
Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work, 2012.
Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data, 2012.
Data Cleaning, 2019.
Let’s take a closer look at each in turn.
“Bad Data Handbook”
The book “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work” was edited by Q. Ethan McCallum and was published in 2012.
Bad data is described...