The original version of this story appeared in Quanta Magazine.
Imagine you had a friend who gave different answers to the same question, depending on how you asked it. “What’s the capital of Peru?” would get one answer, and “Is Lima the capital of Peru?” would get another. You’d probably be a little worried about your friend’s mental faculties, and you’d almost certainly find it hard to trust any answer they gave.
That’s exactly what’s happening with many large language models (LLMs), the ultra-powerful machine learning tools that power ChatGPT and other marvels of artificial intelligence. A generative question, which is open-ended, yields one answer, and a discriminative question, which involves having to choose between options, often yields a different one. “There is a disconnect when the same question is phrased differently,” said Athul Paul Jacob, a doctoral student at the Massachusetts Institute of Technology.
To make language models more consistent and reliable, Jacob and his colleagues devised a game in which the model's two modes of answering are pushed toward agreement. Dubbed the consensus game, this approach uses the tools of game theory, pitting the model's components against each other to improve its accuracy and coherence.
“There’s been minimal research into self-consistency in these models,” said Shayegan Omidshafiei, chief scientific officer at Field AI. He praised the paper as one of the first to tackle the problem, and to do so in a clever, systematic way, by creating a game for the model to play with itself.
“It’s really exciting work,” agreed Ahmad Beirami, a research scientist at Google Research. For years, he noted, language models have generated responses to prompts in the same way. “Introducing the concept of a game into this routine offers a fresh, transformative approach which opens the door to a vast range of new possibilities,” he added.
The new work uses games to improve AI, reversing the traditional approach in which an AI's prowess was measured by its ability to win games. In 1997, IBM's Deep Blue beat chess grandmaster Garry Kasparov, a milestone for artificial intelligence. In 2016, Google DeepMind's AlphaGo defeated Go champion Lee Sedol four games to one. Machines have also surpassed humans at checkers and two-player poker, among other zero-sum games in which one player's victory necessarily means another's defeat.
Athul Paul Jacob helped devise the consensus game, which provides a way for large language models to improve their accuracy and reliability.
Posing a far greater challenge for AI researchers was the game of Diplomacy, a favorite of politicians like John F. Kennedy and Henry Kissinger. Instead of just two opponents, the game features seven players whose motives can be hard to read. To win, a player must negotiate, forging cooperative arrangements that anyone could breach at any time. Diplomacy is so complex that a group from Meta was pleased when, in 2022, its AI program Cicero achieved “human-level play” over the course of 40 games. While it did not vanquish the world champion, Cicero did well enough to place in the top 10 percent against human participants.
During the project, Jacob—a member of the Meta team—was struck by the fact that Cicero relied on a language model to generate its dialog with other players. He sensed untapped potential. The team’s goal, he said, “was to build the best language model we could for the purposes of playing this game.” But what if instead they focused on building the best game they could to improve the performance of large language models?
In 2023, Jacob began to pursue that question at MIT, working with Yikang Shen, Gabriele Farina, and his adviser, Jacob Andreas, on what would become the consensus game. The core idea came from imagining a conversation between two people as a cooperative game, where success occurs when a listener understands what a speaker is trying to convey. In particular, the consensus game is designed to align the language model’s two systems—the generator, which handles generative questions, and the discriminator, which handles discriminative ones.
After several fits and starts, the team developed that basic idea into a full game. First, the generator receives a question, either from a person or from a preexisting list, such as “Where was Barack Obama born?” It then considers candidate answers, say Honolulu, Chicago, and Nairobi, which might come from a person, be pulled from a database, or be generated by the model itself.
Before giving an answer, the generator is also told, based on the flip of a fair coin, whether it should try to answer correctly or incorrectly.
If the coin lands on heads, the generator tries to give the right answer. It then sends the original question, along with its chosen response, to the discriminator. If the discriminator decides that the generator deliberately sent the correct answer, each gets one point, as a kind of reward.
If the coin lands on tails, the generator sends what it believes is a wrong answer. If the discriminator decides it was deliberately given the wrong answer, both players again get a point. The idea is to incentivize agreement. “It’s similar to teaching a dog a trick,” Jacob said. “You reward them when they perform correctly.”
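For readers who want the scoring rule spelled out, here is a minimal Python sketch of one round of play. The `generator.propose` and `discriminator.judge` helpers are hypothetical stand-ins for calls to the model's two components, not the interface used in the actual paper.

```python
import random

def consensus_round(question, candidates, generator, discriminator):
    """One illustrative round of the consensus game (all names hypothetical)."""
    # A coin flip tells the generator whether to aim for a correct answer.
    want_correct = random.random() < 0.5

    # The generator picks one of the candidate answers, given the instruction.
    answer = generator.propose(question, candidates, want_correct)

    # The discriminator independently judges whether the answer looks correct.
    judged_correct = discriminator.judge(question, answer)

    # Both players score a point only when the discriminator's judgment
    # matches the intent the generator was assigned by the coin flip.
    return 1 if judged_correct == want_correct else 0
```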
The generator and the discriminator also start with certain initial “beliefs,” in the form of probability distributions over the possible answers. For example, based on information gleaned from the internet, the generator might believe there is an 80% chance Obama was born in Honolulu, a 10% chance he was born in Chicago, a 5% chance Nairobi, and a 5% chance somewhere else. The discriminator may start with a different distribution. While the two players are rewarded for reaching agreement, they are also penalized for straying too far from their original beliefs. That arrangement pushes the players to fold their knowledge of the world, drawn from the internet, into their answers, which should make the model more accurate. Without it, they might agree on a totally wrong answer like Delhi and still rack up points.
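One natural way to read that penalty for straying is as a divergence term subtracted from the agreement reward. The sketch below uses a KL divergence purely as an illustration; the exact penalty and its weight in the MIT paper may differ.

```python
import math

def kl_divergence(p, q):
    """KL divergence between two distributions over the same answer set."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

# The generator's hypothetical initial beliefs from the Obama example above.
generator_prior = {"Honolulu": 0.80, "Chicago": 0.10, "Nairobi": 0.05, "other": 0.05}

def regularized_payoff(agreement_reward, current_policy, prior, penalty_weight=0.1):
    """Points earned for agreeing, minus a penalty for drifting from initial beliefs.

    The KL form and the weight are illustrative choices, not the paper's exact rule.
    """
    return agreement_reward - penalty_weight * kl_divergence(current_policy, prior)
```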
For each question, the two systems play roughly 1,000 games against each other. Over the course of these many iterations, each side learns about the other's beliefs and modifies its strategies accordingly.
Eventually, the generator and the discriminator begin to agree more as they settle into something called Nash equilibrium. This is arguably the central concept in game theory. It represents a kind of balance in a game, the point at which no player can improve their personal outcome by switching strategies. In rock-paper-scissors, for example, players do best when they choose each of the three options exactly one-third of the time, and they will invariably do worse with any other tactic.
In the consensus game, this can play out in many ways. The discriminator might observe that it gets a point whenever it says “correct” when the generator sends the word “Honolulu” for Obama’s birthplace. Through repeated play, the generator and discriminator learn that they will be rewarded for continuing to do this, and neither has any incentive to do anything else. This consensus represents one of many possible Nash equilibria for this question. The MIT group also relied on a modified form of Nash equilibrium that incorporates the players’ prior beliefs, which helps keep their responses grounded in reality.
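As a rough picture of how repeated play can drift toward such an equilibrium, the update below nudges a player's answer distribution toward answers that earned points while staying anchored to its prior. It is a generic multiplicative-weights-style rule, offered as an illustration rather than the MIT team's actual procedure.

```python
import math

def normalize(weights):
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def update_policy(policy, rewards, prior, lr=0.5, anchor=0.1):
    """Generic multiplicative-weights update, softly anchored to the prior.

    `rewards` maps each answer to the points it earned in the latest rounds.
    This illustrates repeated play converging toward agreement; it is not
    the exact learning rule used in the consensus-game paper.
    """
    new_weights = {
        a: (policy[a] ** (1 - anchor)) * (prior[a] ** anchor)
           * math.exp(lr * rewards.get(a, 0.0))
        for a in policy
    }
    return normalize(new_weights)
```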
The net effect, the researchers observed, is to make the language model playing this game more accurate and more likely to give the same answer, no matter how the question is asked. To test the effects of the consensus game, the team tried out a set of standard questions on various moderate-size language models with 7 billion to 13 billion parameters. These models routinely got a higher percentage of correct responses than models that hadn’t played, even much bigger ones with up to 540 billion parameters. Playing the game also improved a model’s internal consistency.
The upshot: It can pay to play an LLM against itself, and running a thousand such rounds takes just milliseconds on a standard laptop. “A key advantage of this method is that it’s exceptionally efficient computationally, as it doesn’t require any training or changes to the fundamental language model,” Omidshafiei said.
Following this success, Jacob is now investigating other ways to bring game theory into LLM research. Preliminary results suggest that an already strong LLM can improve further by playing a different game, known as the ensemble game, with multiple smaller models. In this setup, the primary LLM has at least one smaller ally model and at least one smaller adversarial model. If the primary LLM is asked, say, to name the U.S. president, it gets a point whenever its answer matches its ally's, and a point whenever its answer differs from its adversary's. These interactions with much smaller models appear to boost an LLM's performance without any additional training or changes to its parameters.
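A rough sketch of that scoring might look like the following, where the primary model earns a point for matching its ally and another for differing from its adversary. The function and payoff values are illustrative assumptions, not details from Jacob's experiments.

```python
def ensemble_points(primary_answer, ally_answer, adversary_answer):
    """Illustrative scoring for one question in the ensemble game.

    The primary LLM earns a point for agreeing with its ally and a point
    for disagreeing with its adversary; the real payoffs may differ.
    """
    points = 0
    if primary_answer == ally_answer:
        points += 1  # reward alignment with the ally model
    if primary_answer != adversary_answer:
        points += 1  # reward divergence from the adversarial model
    return points
```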
Ian Gemp applies game theory to practical scenarios, helping large language models provide assistance in strategic interactions.
And that is just the start. Because a variety of situations can be viewed as games, the tools from game theory can be brought into play in various real-world settings, said Ian Gemp, a research scientist at Google DeepMind. In a February 2024 paper, he and colleagues focused on negotiation scenarios that require more elaborate exchanges than just questions and answers. “The main objective of this project is to make language models more strategic,” he said.
One example he discussed at an academic conference is the paper review process for acceptance by a journal or conference, especially after one’s initial submission received a harsh review. Given that language models assign probabilities to different responses, researchers can construct game trees similar to those designed for poker games, which chart the available choices and their possible consequences. “Once you do this, you can start to compute Nash equilibria and then rank a bunch of rebuttals,” Gemp said. The model essentially tells you: This is what we think you should say back.
With the benefit of game theory’s insights, language models will be able to handle even more sophisticated interactions, rather than being limited to question-and-answer-type problems. “The big payoff going forward has to do with longer conversations,” Andreas said. “The next step is to have an AI interact with a person, not just another language model.”
Jacob views the DeepMind work as complementary to the consensus and ensemble games. “At a high level, both these methods are combining language models and game theory,” he said, even if the goals are somewhat different. While the Gemp group is casting commonplace situations into a game format to help with strategic decision-making, Jacob said, “we’re using what we know about game theory to improve language models in general tasks.”
Right now, these efforts represent “two branches of the same tree,” Jacob said—two different ways to enhance the functioning of language models. “My vision is that in a year or two, these two branches will converge.”
Original story reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation whose mission is to enhance public understanding of science by covering research developments and trends in mathematics and the physical and life sciences.