Can an AI Predict the Language of Viral Mutation?
Viruses lead a rather repetitive existence. They enter a cell, hijack its machinery to turn it into a viral copy machine, and those copies head on to other cells armed with instructions to do the same. So it goes, over and over again. But fairly often, amid all this copy-pasting, things get mixed up. Mutations arise in the copies. Sometimes a mutation garbles an amino acid and a vital protein fails to fold—so into the dustbin of evolutionary history that viral version goes. Sometimes the mutation does nothing at all, because the genetic code is redundant: different sequences can encode the same protein, so the error never shows. But every once in a while, a mutation goes just right. The change doesn’t compromise the virus’s ability to survive and replicate; instead, it produces a helpful difference, like making the virus unrecognizable to a person’s immune defenses. When that allows the virus to evade antibodies generated from past infections or from a vaccine, that mutant variant of the virus is said to have “escaped.”
Scientists are always on the lookout for signs of potential escape. That’s true for SARS-CoV-2, as new strains emerge and scientists investigate what the genetic changes could mean for a long-lasting vaccine. (So far, things are looking okay.) It’s also what confounds researchers studying influenza and HIV, which routinely evade our immune defenses. So, in an effort to see what might be coming, researchers create hypothetical mutants in the lab and test whether they can evade antibodies taken from recent patients or vaccine recipients. But the genetic code offers too many possibilities to test every evolutionary branch the virus might take over time. The lab work simply can’t keep up.
Last winter, Brian Hie, a computational biologist at MIT and a fan of the lyric poetry of John Donne, was thinking about this problem when he alighted upon an analogy: What if we thought of viral sequences the way we think of written language? Every viral sequence has a sort of grammar, he reasoned—a set of rules it needs to follow in order to be that particular virus. When mutations violate that grammar, the virus reaches an evolutionary dead end; in virology terms, it lacks “fitness.” And from the immune system’s perspective, a viral sequence could be said to have a kind of semantics, too. There are some sequences the immune system can interpret—and thus stop the virus with antibodies and other defenses—and some that it can’t. So a viral escape could be seen as a change that preserves the sequence’s grammar but changes its meaning.
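To make the analogy concrete: if a trained model can score a sequence’s grammaticality (how plausible it looks, a stand-in for fitness) and represent its semantics as a vector, then the most worrying mutants are the ones that stay grammatical while drifting far in meaning. Below is a minimal Python sketch of that ranking logic, under the assumption of a trained model object with two hypothetical methods, embed and log_likelihood; the rank-sum combination is just one simple way to weigh the two criteria, not necessarily the one Hie’s team settled on.

```python
import numpy as np

def rank_escape_candidates(wild_type, mutants, model):
    """Rank mutant sequences by escape potential: high grammaticality
    (model likelihood, a proxy for viral fitness) combined with a large
    semantic change (embedding distance, a proxy for looking different
    to the immune system). `model` is a hypothetical trained language
    model exposing embed() and log_likelihood()."""
    wt_vec = model.embed(wild_type)
    grammar = np.array([model.log_likelihood(m) for m in mutants])
    semantics = np.array([np.linalg.norm(model.embed(m) - wt_vec)
                          for m in mutants])
    # Sum the ranks of the two scores, so a candidate must do well on
    # both axes to rise to the top of the list.
    combined = grammar.argsort().argsort() + semantics.argsort().argsort()
    return [mutants[i] for i in combined.argsort()[::-1]]
```

A mutant that is semantically novel but ungrammatical is a dead virus; one that is grammatical but semantically unchanged is still visible to antibodies. Only the combination signals escape.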
The analogy had a simple, almost too simple, elegance. But to Hie, it was also practical. In recent years, AI systems have gotten very good at modeling the principles of grammar and semantics in human language. They do this by training on data sets of billions of words, arranged in sentences and paragraphs, from which the system derives patterns. In this way, without being told any specific rules, the system learns where the commas should go and how to structure a clause. It can also be said to intuit the meaning of certain sequences—words and phrases—based on the many contexts in which they appear throughout the data set. It’s patterns, all the way down. That’s how the most advanced language models, like OpenAI’s GPT-3, can learn to produce perfectly grammatical prose that manages to stay reasonably on topic.
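Real systems use deep neural networks, but the core trick, deriving a notion of grammar purely from counted patterns with no rules supplied, shows up even in a toy model. The sketch below (my own illustration, not code from any of the research described here) learns transition probabilities between adjacent tokens and then scores how plausible a new sequence looks to the model.

```python
from collections import Counter, defaultdict
import math

def train_bigram_model(tokens):
    """Count how often each token follows each other token, then
    normalize the counts into transition probabilities. No grammar
    rules are supplied; the patterns come entirely from the data."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for prev, nxts in counts.items()}

def log_likelihood(model, tokens, floor=1e-8):
    """Score a sequence: a high value means its patterns were seen
    often in training (it looks 'grammatical'); low means it looks
    wrong. Unseen transitions get a small floor probability."""
    return sum(math.log(model.get(prev, {}).get(nxt, floor))
               for prev, nxt in zip(tokens, tokens[1:]))
```

GPT-3 replaces these bigram counts with billions of learned parameters and attends to far longer contexts, but the scoring idea is the same: a sequence is grammatical to the extent that the model has seen its patterns before.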
One advantage of this idea is that it’s generalizable. To a machine learning model, a sequence is a sequence, whether it’s arranged in sonnets or amino acids. According to Jeremy Howard, an AI researcher at the University of San Francisco who specializes in language models, applying such models to biological sequences can be fruitful. With enough data from, say, genetic sequences of viruses known to be infectious, the model will implicitly learn something about how infectious viruses are structured. “That model will have a lot of sophisticated and complex knowledge,” he says.
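In code, that indifference is literal. Reusing the toy model sketched above, the same two functions accept English words or amino acid residues without modification; the protein fragment here is purely illustrative, and corpora this small can’t learn anything real, but nothing in the code knows or cares what a “word” is.

```python
# Tokens can be English words or amino acid residues; the model can't tell.
prose = "so it goes over and over again".split()
protein = list("MFVFLVLLPLVSSQCV")  # an illustrative amino acid fragment

for corpus in (prose, protein):
    model = train_bigram_model(corpus)
    print(log_likelihood(model, corpus))
```

Swap the toy corpus for millions of real viral sequences and the same pipeline, in principle, starts absorbing the grammar of infectious viruses instead of the grammar of English.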