Data systems that learn to be better

Big data has gotten really, really big: By 2025, all the world’s data will add up to an estimated 175 trillion gigabytes . For a visual, if you stored that amount of data on DVDs, it would stack up tall enough to circle the Earth 222 times.

One of the biggest challenges in computing is handling this onslaught of information while still being able to efficiently store and process it. A team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) believe that the answer rests with something called “instance-optimized systems.”

Traditional storage and database systems are designed to work for a wide range of applications because of how long it can take to build them — months or, often, several years. As a result, for any given workload such systems provide performance that is good, but usually not the best. Even worse, they sometimes require administrators to painstakingly tune the system by hand to provide even reasonable performance.

In contrast, the goal of instance-optimized systems is to build systems that optimize and partially re-organize themselves for the data they store and the workload they serve.

“It’s like building a database system for every application from scratch, which is not economically feasible with traditional system designs,” says MIT Professor Tim Kraska.

As a first step toward this vision, Kraska and colleagues developed Tsunami and Bao. Tsunami uses machine learning to automatically re-organize a dataset’s storage layout based on the types of queries that its users make. Tests show that it can run queries up to 10 times faster than state-of-the-art systems. What’s more, its datasets can be organized via a series of "learned indexes" that are up to 100 times smaller than the indexes used in traditional systems.

Kraska has been exploring the topic of learned indexes for several years, going back to his influential work with colleagues at Google in 2017.

Harvard University Professor Stratos Idreos, who was not involved in the Tsunami project, says that a unique advantage of learned indexes is their small size, which, in addition to space savings, brings substantial performance improvements.

“I think this line of work is a paradigm shift that’s going to impact system design long-term,” says Idreos. “I expect approaches based on models will be one of the core components at the heart of a new wave of adaptive systems.”

Bao , meanwhile, focuses on improving the efficiency of query optimization through machine learning. A query optimizer rewrites a high-level declarative query to a query plan, which can actually be executed over the data to compute the result to the query. However, often there exists more than one query plan to answer any query; picking the wrong one can cause a query to take days to compute the answer, rather than seconds.

Traditional query optimizers take years to...

Data systems that learn to be better

Related News