There really is such a thing as too much information. Suppose, for example, that you are an astronomer scanning the cosmos for black holes, or a climatologist modeling the next century of global temperature change. After just a few days of recording observations or running simulations on the most sophisticated equipment, you could end up with millions of gigabytes of data. Some of it contains the phenomena you are interested in, but much of it does not. It is too much to analyze, too much even to store.
“We are drowning in data,” says Rafael Hiriart, a computer scientist at the National Radio Astronomy Observatory in New Mexico, which will soon be the site of the next-generation Very Large Array radio telescope. (Its precursor, the original Very Large Array, is what Jodie Foster uses to listen for alien signals in Contact.) When it comes online in a few years, the telescope’s antennas will collect 20 million gigabytes of night sky observations each month. Processing that much data will require a computer capable of performing 100 billion trillion floating point operations per second; only two supercomputers on Earth are that fast.
And it’s not just astronomers who are drowning. “I would say that just about any scientific field would face this,” says Bill Spotz, a program manager in the US Department of Energy’s Advanced Scientific Computing Research program, which manages many supercomputers nationwide, including Summit, the second fastest machine in the world.
From climate modeling to genomics to nuclear physics, increasingly precise sensors and powerful computers are delivering data to scientists at lightning speeds. In 2018, Summit performed the very first exascale calculation on, of all things, a set of poplar genomes, completing in an hour what would take an ordinary laptop computer about 30 years. (One exabyte equals one billion gigabytes, which is enough to store a video call that lasts over 200,000 years. An exascale calculation involves a quintillion floating point operations per second.) Supercomputers in the works, such as Frontier at the Oak Ridge National Laboratory, will go even faster and generate even more data.
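As a rough sanity check on the figures above, the arithmetic can be worked through in a few lines. This is only a back-of-the-envelope sketch: it assumes decimal units (one gigabyte is 10^9 bytes) and a 365.25-day year, and the variable names are illustrative rather than taken from any source.

```python
# Back-of-the-envelope check of the exabyte and speed-up figures.
# Assumptions: decimal units (1 GB = 10**9 bytes) and a 365.25-day year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
GIGABYTE = 10**9
EXABYTE = 10**18

# How many gigabytes fit in one exabyte.
gigabytes_per_exabyte = EXABYTE // GIGABYTE

# Bitrate implied by "one exabyte stores a 200,000-year video call".
call_seconds = 200_000 * SECONDS_PER_YEAR
implied_megabits_per_second = EXABYTE * 8 / call_seconds / 1e6

# Speed-up implied by "one hour instead of about 30 years on a laptop".
laptop_hours = 30 * 365.25 * 24
speedup = laptop_hours / 1.0

print(f"{gigabytes_per_exabyte:,} GB per EB")
print(f"implied call bitrate: {implied_megabits_per_second:.2f} Mbit/s")
print(f"implied speed-up: about {speedup:,.0f}x")
```

The implied bitrate comes out a little above one megabit per second, which is plausible for a compressed video call, and the speed-up is on the order of a few hundred thousand.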
These huge volumes of data and incredible speeds allow scientists to make progress on all kinds of problems, from designing more efficient engines, to finding the link between cancer and genetics, to studying gravity at the center of the galaxy. But the sheer amount of data can also become unwieldy: big data that is too big.
This is why, in January, the Department of Energy convened a (virtual) meeting of hundreds of scientists and data experts to discuss what to do about all this data and the even bigger data deluge to come. The DOE has since put up $13.7 million for research on how to get rid of some of this data without getting rid of the useful stuff. In September, it awarded funds to nine of these data reduction efforts, including research teams from several national laboratories and universities. “We’re trying to tame exabytes of data,” says Spotz.
“This is definitely something we need,” says Jackie Chen, a mechanical engineer at Sandia National Laboratories who uses supercomputers to simulate turbulence-chemistry interactions in internal combustion engines, with the goal of developing more efficient engines that burn carbon-neutral fuels. “We have the power to generate data that gives us unprecedented insight into complex processes, but what do we do with all that data? And how do you extract meaningful scientific information from this data? And how do you reduce it to a form that someone who actually designs practical devices like engines can use?”
Another area that should benefit from better data reduction is bioinformatics. Although it is currently less data intensive than climate science or particle physics, faster and cheaper DNA sequencing means the tide of biological data will continue to rise, says Cenk Sahinalp, a computational biologist at the National Cancer Institute. “The cost of storage becomes an issue, and the cost of analysis is a big, big issue,” he says. Data reduction could help make data-intensive omics problems like these tractable. For example, it could make it more feasible to sequence and analyze the genomes of thousands of individual tumor cells in order to target and destroy specific groups of cells.
But data reduction is especially difficult for scientific data, because it has to be sensitive to the anomalies and outliers that are so often the source of discovery. For example, attempts to explain anomalous observations of black-body radiation, the light emitted by hot, dark objects, ultimately led to quantum mechanics. A data reduction scheme that removed unexpected or rare events and smoothed every curve would be unacceptable. “If you’re trying to answer a question you’ve never answered before, you might not know” what data will be useful, says Spotz. “You don’t want to throw out the interesting part.”
DOE-funded researchers will work on several strategies to address the problem: improving compression algorithms so that scientific teams have more control over what is lost to compression; reducing the number of dimensions represented in a dataset; integrating data reduction into the instruments themselves; and developing better ways to trigger instruments to start recording data only when a phenomenon of interest occurs. All of them will involve machine learning to some extent.
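To make the first of these strategies concrete, here is a minimal sketch of what "control over what is lost" can look like: a lossy compressor that guarantees every reconstructed value stays within a user-chosen error bound. It is a toy built on uniform quantization plus Python's standard zlib, not the method of any DOE-funded team. The function names and the sine-wave test signal are hypothetical, and it assumes the quantized values fit in 32-bit integers.

```python
import math
import struct
import zlib


def compress_with_error_bound(values, max_error):
    """Quantize floats onto a grid of spacing 2*max_error, then deflate.

    Rounding to the nearest grid point moves each value by at most
    max_error (up to floating-point rounding). Assumes the integer
    codes fit in 32 bits.
    """
    step = 2.0 * max_error
    codes = [round(v / step) for v in values]
    raw = struct.pack(f"<{len(codes)}i", *codes)
    return zlib.compress(raw)


def decompress(blob, max_error):
    """Invert compress_with_error_bound, returning approximate floats."""
    step = 2.0 * max_error
    raw = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(raw) // 4}i", raw)
    return [c * step for c in codes]


# Hypothetical smooth signal standing in for instrument output.
signal = [math.sin(i / 50.0) for i in range(10_000)]
blob = compress_with_error_bound(signal, max_error=1e-3)
restored = decompress(blob, max_error=1e-3)

worst = max(abs(a - b) for a, b in zip(signal, restored))
original_bytes = len(signal) * 8  # stored as 64-bit floats
print(f"worst-case error: {worst:.2e}")
print(f"{original_bytes} bytes reduced to {len(blob)} bytes")
```

Loosening `max_error` shrinks the output further and tightening it preserves more detail, which is the kind of explicit trade-off a scientific team would want to set for itself.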
Byung-Jun Yoon, an applied mathematician at Brookhaven National Laboratory, leads one of the data reduction teams. On a Zoom call fittingly plagued by bandwidth issues, he explained that scientists often already reduce data out of necessity, but that “it’s more of a combination of art and science.” In other words, it is imperfect and forces scientists to be less systematic than they would like. “And that doesn’t even take into account the fact that a lot of the generated data is just dumped because it can’t be stored,” he says.
Yoon’s approach is to develop ways to quantify the impact of a data reduction algorithm on signals in a dataset that scientists have precisely defined in advance, for example, a planet transiting a star or a mutation in a particular gene. Quantifying this effect will allow Yoon to tune the algorithm to maintain acceptable resolution in these quantities of interest, while also removing as much irrelevant data as possible. “We want to be more confident about data reduction,” he says. “And that is only possible when we can quantify its impact on the things that really matter to us.”
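The general idea of measuring a reduction scheme against a quantity of interest can be illustrated with a toy example. The sketch below is not Yoon's method. It assumes a synthetic, noise-free light curve with a single 1 percent transit dip, uses simple block averaging as the reduction step, and takes the transit depth as the quantity of interest. All names and parameters are hypothetical.

```python
import statistics


def light_curve(n=1200, dip_start=600, dip_len=24, depth=0.01):
    """Synthetic star brightness: flat at 1.0 with one transit dip."""
    return [1.0 - depth if dip_start <= i < dip_start + dip_len else 1.0
            for i in range(n)]


def block_average(values, factor):
    """Reduce the data by replacing each block of `factor` samples with its mean."""
    blocks = (values[i:i + factor] for i in range(0, len(values), factor))
    return [sum(block) / len(block) for block in blocks]


def transit_depth(values):
    """Quantity of interest: median baseline minus the deepest point."""
    return statistics.median(values) - min(values)


def largest_safe_factor(values, factors, tolerance):
    """Largest reduction factor whose relative depth error stays within tolerance."""
    reference = transit_depth(values)
    best = 1
    for factor in sorted(factors):
        reduced_depth = transit_depth(block_average(values, factor))
        if abs(reduced_depth - reference) <= tolerance * reference:
            best = factor
    return best


curve = light_curve()
best = largest_safe_factor(curve, factors=[2, 4, 8, 16, 32, 64], tolerance=0.05)
print(f"largest safe reduction factor: {best}")  # prints 16 for this curve
```

In this toy case, averaging 16 samples at a time still leaves one block entirely inside the dip, so the measured depth survives. At 32 samples per block the dip is smeared to half its depth, and the reduction is rejected.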
Yoon intends his method to be applicable across all fields of science, but he will start with datasets from cryo-electron microscopy, as well as from particle accelerators and light sources. These are among the largest producers of scientific data and are expected to soon generate exabytes of their own, which will also have to be reduced. If we learn nothing else from our exabytes, at least we can be sure that less is more.