Alexa and Siri, listen up! UVA collaboration teaches machines to really hear us


Current AI Training: Auditory Overload

For decades, but more so in the past 20 years, companies have built complex artificial neural networks into machines to try to mimic the way the human brain recognizes a changing world. These programs not only facilitate basic information seeking and consumerism; they also specialize in predicting the stock market, diagnosing medical conditions, and monitoring national security threats, among many other applications.

“Basically, we’re trying to detect meaningful patterns in the world around us,” Sederberg said. “These patterns will help us make decisions about how to behave and align with our environment, so we can get as many rewards as possible.”

Programmers used the brain as the initial inspiration for the technology, hence the name “neural networks.”

“Early AI researchers took the basic properties of neurons and how they connect to each other and recreated them with computer code,” Sederberg said.

However, for complex problems like teaching machines to “hear” language, programmers have unwittingly taken a route that diverges from how the brain actually works, he said. They have failed to keep pace with developments in neuroscience.

“The way these big companies deal with the problem is to throw computational resources at it,” the professor explained. “So they make the neural networks bigger. A field that drew its original inspiration from the brain has turned into an engineering problem.”

Essentially, programmers feed in a multitude of voices saying different words at different speeds and train large networks through a process called backpropagation. Because the programmers know the answers they want, they keep feeding corrections back through the network. The AI then begins to give appropriate weight to the aspects of the input that produce accurate answers, and the sounds become usable text.
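In code terms, that loop looks roughly like the toy sketch below: a tiny two-layer network trained with backpropagation on made-up data. The array names and sizes are purely illustrative; production speech systems are vastly larger.

    # A toy version of the training loop described above: a tiny two-layer
    # network learns to map "audio feature" vectors to character labels via
    # backpropagation. Purely illustrative; real speech models are far larger.
    import numpy as np

    rng = np.random.default_rng(0)

    X = rng.normal(size=(100, 40))           # 100 sound frames, 40 features each
    labels = rng.integers(0, 26, size=100)   # pretend each frame is one letter
    Y = np.eye(26)[labels]                   # one-hot targets ("the known answers")

    W1 = rng.normal(scale=0.1, size=(40, 64))  # input -> hidden weights
    W2 = rng.normal(scale=0.1, size=(64, 26))  # hidden -> character scores

    for step in range(1000):
        # Forward pass: features -> hidden layer -> character probabilities.
        h = np.tanh(X @ W1)
        scores = h @ W2
        probs = np.exp(scores - scores.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)

        # Backpropagation: compare the output to the desired answers and push
        # the error backward, nudging every weight toward better predictions.
        d_scores = (probs - Y) / len(X)
        d_W2 = h.T @ d_scores
        d_h = (d_scores @ W2.T) * (1 - h**2)
        d_W1 = X.T @ d_h
        W1 -= 0.5 * d_W1
        W2 -= 0.5 * d_W2

Repeated over millions of real recordings, that same nudging process is what lets commercial systems turn sound into text.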

“You do this many millions of times,” Sederberg said.

While the training datasets that serve as inputs have improved, as have computational speeds, the process is still less than ideal as programmers add more layers to detect greater nuance and complexity – what is called “deep” or “convolutional” learning.

More than 7,000 languages are spoken in the world today. Variations arise with accents and dialects, lower- or higher-pitched voices – and of course faster or slower speech. As competitors build better products, a computer has to process all of that information at every step.

This has real consequences for the environment. A 2019 study found that the carbon dioxide emissions from the energy needed to train a single large deep-learning model were equivalent to the lifetime footprint of five cars.

Three years later, datasets and neural networks have grown steadily.

How the Brain Really Hears Speech

The late Howard Eichenbaum of Boston University coined the term “time cells,” the phenomenon on which this new AI research is built. Neuroscientists studying time cells in mice, and later in humans, have demonstrated that there are spikes in neural activity as the brain interprets time-based inputs, such as sound. Residing in the hippocampus and other parts of the brain, these individual neurons capture specific intervals – data points that the brain reviews and interprets in relation to one another. The cells reside alongside so-called “place cells” that help us form mental maps.

Time cells help the brain create a unified understanding of sound, regardless of how quickly the information arrives.

“If I say ‘oooooooc-tooooooo-pussssssss,’ you’ve probably never heard anyone say ‘octopus’ at that speed before, and yet you can understand it, because the way your brain processes that information is what’s called ‘scale invariant,’” Sederberg said. “What that basically means is that if you’ve heard the information and learned to decode it at one scale, then if it arrives a little faster or a little slower, or even much slower, you will still get it.”

The main exception to the rule, he said, is information that arrives extremely fast. That data does not always come through. “You lose information,” he said.

Cognitive researcher Marc Howard’s lab at Boston University continues to build on the discovery of time cells. A Sederberg collaborator for more than 20 years, Howard studies how human beings understand the events in their lives, then converts that understanding into mathematics.

Howard’s equation describing auditory memory involves a timeline built from time cells firing in sequence. Critically, the equation predicts that the timeline blurs, and in a peculiar way, as a sound recedes into the past, because the brain’s memory of an event grows less precise with time.
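For readers who want a flavor of the math, the sketch below follows the scale-invariant memory framework Howard’s lab has published; it may not be the exact equation referred to here. A bank of leaky integrators F(s, t) encodes the incoming sound f(t), and an approximate inversion turns it back into a blurred timeline:

    \frac{\partial F(s,t)}{\partial t} = -s\,F(s,t) + f(t),
    \qquad
    \tilde{f}\big(\overset{*}{\tau}, t\big) = \frac{(-1)^{k}}{k!}\, s^{k+1}\,
        \frac{\partial^{k} F(s,t)}{\partial s^{k}} \Big|_{\, s = k/\overset{*}{\tau}}

Here the estimate of what was heard a time \overset{*}{\tau} ago gets blurred over a window that grows in proportion to \overset{*}{\tau} itself, which is the peculiar, scale-invariant fuzziness Sederberg describes next.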

“So there’s a specific firing pattern that encodes what happened at a specific time in the past, and the information gets fuzzier and fuzzier the further back into the past it goes,” Sederberg said. “The cool thing is that Marc and a postdoc working in Marc’s lab figured out mathematically what that should look like. Then neuroscientists began to find evidence of it in the brain.”

Time adds context to sounds, and it is part of what gives meaning to what we hear. Howard said the math captures this neatly.

“Time cells in the brain seem to obey this equation,” Howard said.

The UVA Voice Decoder

About five years ago, Sederberg and Howard recognized that the field of AI could benefit from such brain-inspired representations. Working with Howard’s lab, and in consultation with Zoran Tiganj and his colleagues at Indiana University, Sederberg’s Computational Memory Lab began building and testing models.

Jacques made the key breakthrough about three years ago, which allowed him to code the resulting proof of concept. The algorithm features a form of compression that can be unpacked as needed, much as a zip file compresses and stores large files on a computer. The machine stores the “memory” of a sound only at a resolution that will be useful later, saving storage space.

“Because the information is logarithmically compressed, the pattern doesn’t completely change when the input is scaled; it just shifts over,” Sederberg said.
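The snippet below is a minimal sketch of that idea, not the lab’s actual code: it samples a signal’s recent past at logarithmically spaced lags, so slowing the signal down shifts the stored pattern along the log-time axis instead of reshaping it.

    # Illustrative only: a "memory" that keeps samples of the recent past at
    # logarithmically spaced delays, so a time-stretched input produces the
    # same pattern merely shifted along the axis (scale invariance).
    import numpy as np

    def log_compressed_memory(signal, n_taps=32, max_lag=1024):
        """Sample the end of `signal` at logarithmically spaced lags."""
        lags = np.geomspace(1, max_lag, n_taps).astype(int)  # 1, 2, 4, ... (small lags may repeat)
        return np.array([signal[-lag] for lag in lags])

    t = np.arange(2048)
    # A toy "sound": a burst of activity about 100 steps in the past, and the
    # same burst played twice as slowly (twice as far back, twice as wide).
    fast = np.exp(-0.5 * ((t - 1948) / 10.0) ** 2)
    slow = np.exp(-0.5 * ((t - 1848) / 20.0) ** 2)

    fast_mem = log_compressed_memory(fast)
    slow_mem = log_compressed_memory(slow)

    # The slowed-down version peaks a fixed number of taps later: the pattern
    # has moved along the logarithmic axis rather than changed shape.
    print(int(np.argmax(fast_mem)), int(np.argmax(slow_mem)))

Stored this way, a sound heard at half speed does not have to be relearned; the system simply reads the same pattern a few taps over.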
