GAZETTE: I think the number you’ve analyzed — 6 million genomes — would surprise most readers, if you’re talking about single genomes. How many are there?
LEMIEUX: What we call a genome is a sequence representative of an individual patient’s virus. We tend to think of one genome as representing one patient’s infection, and that’s a pretty good approximation of what’s in the database. But each patient’s infection involves several million copies of the virus, so the database captures only a tiny fraction of the SARS-CoV-2 replication events that occurred during the pandemic.
GAZETTE: Are there at least small variations of the virus in each patient’s body?
LEMIEUX: There are small variations within any given person, but we don’t need to model them all to understand the pandemic. In fact, many of the viral sequences from different individuals are identical at the consensus level, so there are not 6.5 million unique genomic sequences. That consensus level is what we track, and we coarsen the data further to the lineage level: lineages are essentially groups of genetically similar genomes that we consider together. Then we ask, in different populations: do we see more or less of this group of genomes, the “lineage,” over time? For the purposes of this model, we use 3,000 lineages, each containing a unique constellation of mutations. Individual mutations, however, can occur in more than one lineage, and this is where we get the power to ask which mutations are responsible for a lineage growing or disappearing over time. And because people all over the world are contributing genomes to these databases, we essentially have a real-time view of which lineages are growing in which places. Sometimes a lineage grows in one place due to chance, like a big super-spreading event. But if we find that the same lineage dominates in Massachusetts, New York, and California, that tells us there’s probably something special about that lineage. We can work out what it is by doing the same analysis for mutations. If we see a mutation like N501Y, for example, that is consistently found in lineages that tend to grow, then we think there is something about that mutation that causes the lineage to grow in a population.
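The question of whether a lineage is growing or shrinking relative to the others can be illustrated with a minimal sketch. It assumes a simple two-point logistic growth estimate, which is a deliberate simplification and not the model described in this interview; the function names and the example frequencies are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a frequency p, where 0 < p < 1."""
    return math.log(p / (1.0 - p))

def relative_growth_rate(freq_start, freq_end, days):
    """Per-day change in the log-odds of a lineage's share of sampled genomes.

    Positive means the lineage is gaining ground on everything else in that
    population; negative means it is being displaced.
    Assumption: growth is logistic between the two time points.
    """
    return (logit(freq_end) - logit(freq_start)) / days

# Hypothetical example: a lineage goes from 10% to 50% of genomes in 10 days.
rate = relative_growth_rate(0.10, 0.50, 10)
print(f"relative growth rate per day: {rate:.3f}")
```

Seeing a similar positive rate for the same lineage in several independent regions is the kind of signal that is hard to attribute to a single chance event.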
GAZETTE: Can this model predict future variants that might arise, or does it really work with existing genomes, sorting through the thousands of lineages for those that might spread? Can it really look ahead and say, “Well, that’s likely to mutate here. And that’s going to be a problem”?
LEMIEUX: Sort of both. One thing it does well is estimate the growth rate of the various lineages that are currently circulating. We assign a fitness to each mutation observed in the population; if a mutation has never been observed, we cannot assign a fitness to it. So if a hypothetical strain arises from a combination of mutations that have been observed elsewhere but have never been brought together in the same lineage, we can predict the growth rate of that strain. If we haven’t observed a mutation, the model doesn’t know its effect.
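The idea of scoring a not-yet-seen combination of already-seen mutations can be sketched as follows. This assumes that per-mutation effects simply add up, which is an illustrative simplification; the effect values, the placeholder mutation names other than N501Y, and the function name are all hypothetical and are not results from the study.

```python
# Hypothetical per-mutation fitness effects (made-up numbers, illustration only).
mutation_effects = {
    "N501Y": 0.10,
    "MUT_A": 0.05,   # placeholder name, not a real mutation
    "MUT_B": -0.02,  # placeholder name, not a real mutation
}

def predict_lineage_fitness(mutations, effects):
    """Sum the known effects of a lineage's mutations (additive assumption).

    Returns None when any mutation has never been observed, because no
    effect can be assigned to it.
    """
    unseen = [m for m in mutations if m not in effects]
    if unseen:
        return None
    return sum(effects[m] for m in mutations)

# A combination never seen together, built from mutations seen elsewhere.
print(predict_lineage_fitness(["N501Y", "MUT_A"], mutation_effects))

# A lineage carrying a never-observed mutation cannot be scored.
print(predict_lineage_fitness(["N501Y", "MUT_NEW"], mutation_effects))
```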
GAZETTE: How did the work start?
SABETI: Jacob, then a medical student turned postdoc, and another graduate student turned postdoc, Danny Park, had long been studying methods to detect adaptive variants in microbes, starting with malaria — it was a passion project of the lab. Our early work was to detect natural selection in humans and other mammals, and the challenge is that because generation times are so long, we have to infer historical events. In infectious diseases, what is amazing is that we see natural selection happening before our very eyes. We can track it in real time. That’s the power of this approach.
But when Jacob and others started this malaria work a decade ago, the data was just too sparse. In the midst of Ebola, we started getting higher-density data and published work with Jeremy Luban [at the University of Massachusetts Chan Medical School] identifying variants that had increased in prevalence. But there was still too little data to make statistical inferences of the kind that we can now. With the pandemic, we went very quickly from a situation where we didn’t have enough data to one where we had so much data that people weren’t able to manage it. And it was very heterogeneous data: we didn’t know the data sources, and we didn’t know the quality of the sequences, so the question became how to organize and essentially tame this massive dataset to obtain robust results.
LEMIEUX: Back then, we weren’t used to working with millions of microbial genomes. We were used to dealing with hundreds or thousands. That’s when we started working with the Pyro team at the Broad, who came from Uber AI, where they had built this probabilistic programming language to perform calculations on very large data sets. Fritz Obermeyer was the main person working on this project. He was able to develop a model that identifies which lineages are more transmissible and grow faster in the population, and that represents these lineages by their constituent mutations. The other essential innovation of Fritz’s work is that it runs on modern processing hardware, taking advantage of advances in software engineering and computing power. That made this analysis possible in a way that would not have been possible before.
GAZETTE: How important was an interdisciplinary approach in this research? It looks like you involved a lot of different people.
SABETI: It’s at the interface of what we call ‘variant to function’, and individuals from math, computer science and computational biology have come together with virologists, molecular biologists, infectious disease researchers and clinicians. As you move from the bench to the bedside, you see patterns and become intrigued by them.
GAZETTE: Obviously, the ability to predict which variants will dominate is important. What do you see for the future with this model?
SABETI: The holy grail that the field often turns to is the ability to predict up front which mutations will matter and what their effects will be, essentially how a microbe will adapt. To do that, we’ll need these massive datasets to really interrogate viral and microbial genomes and, once we have seen enough different mutations, to start to understand the patterns and the logic behind them. I think we can get to the point where we start to understand how adaptation is going to happen and how we should account for it in the development of our countermeasures, but that’s going to take a lot of data. Whenever people ask, “Did we generate too much data?” I answer that we are far from it. We should really get to the point where it becomes routine to sequence every microbial genome detected in infections, because there are questions we don’t yet know are possible to ask, simply because we don’t have the data.