Learning Life’s ABCs: AI Models Read Proteins to Fight COVID-19

Ahmed Elnaggar and Michael Heinzinger are helping computers read proteins as easily as you read this sentence.
The researchers are applying the latest AI models used to understand text to the field of bioinformatics. Their work could accelerate efforts to characterize living organisms like the coronavirus.
By the end of the year, they aim to launch a website where researchers can plug in a string of amino acids that describe a protein. Within seconds, it will provide some details of the protein’s 3D structure, a key to knowing how to treat it with a drug.
Today, researchers typically search databases to get this kind of information. But the databases are growing rapidly as more proteins are sequenced, so a search can take up to 100 times longer than the approach using AI, depending on the size of a protein’s amino acid string.
In cases where a particular protein hasn’t been seen before, a database search won’t provide any useful results — but AI can.
“Twelve of the 14 proteins associated with COVID-19 are similar to well validated proteins, but for the remaining two we have very little data — for such cases, our approach could help a lot,” said Heinzinger, a Ph.D. candidate in computational biology and bioinformatics.
While time-consuming, methods based on database searches have been 7-8 percent more accurate than previous AI methods. But using the latest models and datasets, Elnaggar and Heinzinger cut the accuracy gap in half, paving the way for a shift to using AI.
AI Models, GPUs Drive Biology Insights
“The speed at which these AI algorithms are improving makes me optimistic we can close this accuracy gap, and no field has such fast growth in datasets as computational biology, so combining these two things I think we will reach a new state of the art soon,” said Heinzinger.
“This work couldn’t have been done two years ago,” said Elnaggar, an AI specialist with a Ph.D. in transfer learning. “Without the combination of today’s bioinformatics data, new AI algorithms and the computing power from NVIDIA GPUs, it couldn’t be done,” he said.
Elnaggar and Heinzinger are team members in the Rostlab at the Technical University of Munich, which helped pioneer this field at the intersection of AI and biology. Burkhard Rost, who heads the lab, wrote a seminal paper in 1993 that set the direction.
The Semantics of Reading a Protein
The underlying concept is straightforward. Proteins, the building blocks of life, are made up of strings of amino acids that need to be interpreted sequentially, just like words in a sentence.
So, researchers like Rost started applying emerging work in natural-language processing to understand proteins. But in the 1990s they had very little data on proteins and the AI models were still fairly crude.
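To make the sentence analogy concrete, here is a minimal sketch of how a protein's amino acid string could be turned into a sequence of tokens, the same first step a language model takes with words. It is an illustration only, not the Rostlab's actual pipeline: the vocabulary, the `tokenize` helper, the unknown-residue ID and the example sequence are all assumptions made up for this sketch.

```python
from typing import List

# The 20 standard amino acids, written with their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary for this sketch: ID 0 is reserved for
# unknown or non-standard residues, IDs 1-20 for the standard ones.
UNKNOWN_ID = 0
VOCAB = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def tokenize(sequence: str) -> List[int]:
    """Map each residue letter to an integer ID, preserving order."""
    return [VOCAB.get(residue, UNKNOWN_ID) for residue in sequence.upper()]


if __name__ == "__main__":
    example = "MKTAYIAK"  # made-up sequence, not a real protein
    print(tokenize(example))  # [11, 9, 17, 1, 20, 8, 1, 9]
```

The order of the IDs matches the order of the residues, which is what lets a sequence model read a protein left to right much as it would read a sentence.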
Fast forward to today and a lot has changed.
Sequencing has become relatively fast and...