The original version of this story appeared in Quanta Magazine.
A team of computer scientists has created a nimbler, more flexible type of machine learning model. The trick: It must periodically forget what it knows. And while this new approach won’t displace the huge models that undergird the biggest apps, it could reveal more about how these programs understand language.
The new research marks “a significant advance in the field,” said Jea Kwon, an AI engineer at the Institute for Basic Science in South Korea.
Today’s AI language engines are powered mostly by artificial neural networks. Each “neuron” in these networks is a mathematical function that receives signals from other neurons, runs some calculations, and sends signals onward through multiple layers of neurons. The flow of information starts out more or less random, but through training, the neurons tune themselves to the training data and the flow improves. If an AI researcher wants to create a bilingual model, for example, she would train it on a big pile of text from both languages, and the model would adjust its connections so as to relate words in one language to equivalent words in the other.
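For intuition, here is a minimal sketch of a single artificial neuron in Python. The sigmoid nonlinearity, the weights, and the specific numbers are illustrative choices, not details from the study.

```python
import math
import random

def neuron(inputs, weights, bias):
    # One artificial "neuron": a weighted sum of incoming signals,
    # squashed by a sigmoid into an output signal for the next layer.
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# Training starts from random weights, which are only later adjusted
# so that information flows through the network in a useful way.
random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(3)]
bias = random.uniform(-1, 1)
print(neuron([0.5, -0.2, 0.1], weights, bias))
```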
The downside is that this training process demands considerable computing power. Plus, if the model doesn’t do what you want, or if your needs change later, it’s hard to adjust it. “Imagine you have a model that can comprehend 100 languages, but there’s a language you want that’s not covered,” said Mikel Artetxe, a coauthor of the new study and founder of the AI startup Reka. “Starting from scratch could be an option, but far from ideal.”
To get around these limitations, Artetxe and his colleagues tried a different approach. A few years ago, they trained a neural network in one language, then erased what it knew about the building blocks of words, known as tokens. Those tokens are stored in the first layer of the neural network, called the embedding layer; the model’s other layers were left alone. After erasing the first language’s tokens, they retrained the model on a second language, which filled the embedding layer with new tokens from that language.
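The core move is simple enough to sketch in code. The PyTorch snippet below is only an illustration of the idea under assumed sizes and module names, not the authors’ actual setup: the embedding layer alone is re-initialized, while the deeper layers keep what they learned.

```python
import torch.nn as nn

# Toy stand-in for a language model: token embeddings followed by
# deeper transformer layers. Sizes here are illustrative.
vocab_size, dim = 1000, 64
embedding = nn.Embedding(vocab_size, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)

# ... suppose `embedding` and `encoder` have been trained on language A ...

# "Forget" the tokens: re-initialize only the embedding layer, leaving
# the deeper layers (and whatever abstract knowledge they hold) intact.
nn.init.normal_(embedding.weight, mean=0.0, std=0.02)

# Retraining on language B then fills the fresh embedding layer with
# that language's tokens while the rest of the network adapts.
```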
Counterintuitively, the approach worked. Even though the model now contained mismatched information, the retraining let it learn and process the new language. The researchers surmised that while the embedding layer stores information specific to the words of a particular language, the network’s deeper layers store more abstract information about the concepts behind human languages, and that abstract knowledge is what helps the model pick up a second language.
“We live in the same world. We conceptualize the same things with different words” in different languages, said Yihong Chen, the lead author of the recent paper. “That’s why you have this same high-level reasoning in the model. An apple is something sweet and juicy, instead of just a word.”
While this forgetting approach was an effective way to add a new language to an already trained model, the retraining was still demanding—it required a lot of linguistic data and processing power. Chen suggested a tweak: Instead of training, erasing the embedding layer, then retraining, they should periodically reset the embedding layer during the initial round of training. “By doing this, the entire model becomes used to resetting,” Artetxe said. “That means when you want to extend the model to another language, it’s easier, because that’s what you’ve been doing.”
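In a pretraining loop, that tweak might look something like the sketch below. The reset interval and the helper names are hypothetical, chosen only to show where the periodic reset slots in.

```python
import torch.nn as nn

RESET_EVERY = 1000  # hypothetical interval, measured in training steps

def maybe_reset_embeddings(step: int, embedding: nn.Embedding) -> None:
    # Periodically wipe the embedding layer during the initial training
    # run, so the deeper layers get used to working with fresh embeddings.
    if step > 0 and step % RESET_EVERY == 0:
        nn.init.normal_(embedding.weight, mean=0.0, std=0.02)

# In the usual training loop, one extra call per step would suffice:
#   loss = masked_lm_loss(model, batch)   # hypothetical loss helper
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   maybe_reset_embeddings(step, embedding)
```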
The researchers took a commonly used language model called Roberta, trained it using their periodic-forgetting technique, and compared it to the same model’s performance when it was trained with the standard, non-forgetting approach. The forgetting model did slightly worse than the conventional one, receiving a score of 85.1 compared to 86.1 on one common measure of language accuracy. Then they retrained the models on other languages, using much smaller data sets of only 5 million tokens, rather than the 70 billion they used during the first training. The accuracy of the standard model decreased to 53.3 on average, but the forgetting model dropped only to 62.7.
The forgetting model also fared much better if the team imposed computational limits during retraining. When the researchers cut the training length from 125,000 steps to just 5,000, the accuracy of the forgetting model decreased to 57.8, on average, while the standard model plunged to 37.2, which is no better than random guesses.
The team concluded that periodic forgetting seems to make the model better at learning languages generally. “Because [they] keep forgetting and relearning during training, teaching the network something new later becomes easier,” said Evgenii Nikishin, a researcher at Mila, a deep learning research center in Quebec. It suggests that when language models understand a language, they do so on a deeper level than just the meanings of individual words.
The approach is similar to how our own brains work. “Human memory in general is not very good at accurately storing large amounts of detailed information. Instead, humans tend to remember the gist of our experiences, abstracting and extrapolating,” said Benjamin Levy, a neuroscientist at the University of San Francisco. “Enabling AI with more humanlike processes, like adaptive forgetting, is one way to get them to more flexible performance.”
In addition to what it might say about how understanding works, Artetxe hopes more flexible forgetting language models could also help bring the latest AI breakthroughs to more languages. Though AI models are good at handling Spanish and English, two languages with ample training materials, the models are not so good with his native Basque, the local language spoken in northeastern Spain. “Most models from Big Tech companies don’t do it well,” he said. “Adapting existing models to Basque is the way to go.”
Chen also looks forward to a world where more AI flowers bloom. “I’m thinking of a situation where the world doesn’t need one big language model. We have so many,” she said. “If there’s a factory making language models, you need this kind of technology. It has one base model that can quickly adapt to new domains.”
Original story reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation whose mission is to enhance public understanding of science by covering research developments and trends in mathematics and the physical and life sciences.