For some time now, organizations such as OpenAI and Google have been promoting sophisticated “reasoning” abilities as a significant advancement in their latest artificial intelligence models. However, a fresh study conducted by six engineers from Apple reveals that the mathematical “reasoning” exhibited by these advanced large language models can be quite fragile and inconsistent when faced with seemingly minor alterations to standard benchmark problems.
The brittleness highlighted in these new results supports previous research suggesting that the probabilistic pattern matching LLMs rely on lacks the formal understanding of underlying concepts needed for truly reliable mathematical reasoning. Based on these results, the researchers hypothesize that “Current LLMs are not capable of genuine logical reasoning.” Instead, they say, the models attempt to mimic the reasoning steps observed in their training data.
In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models”—currently available as a preprint paper—the six Apple researchers start with GSM8K’s standardized set of more than 8,000 grade-school-level mathematical word problems, which is commonly used as a benchmark for modern LLMs’ reasoning capabilities. They take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values—so a question about Sophie getting 31 building blocks for her nephew in GSM8K becomes a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.
This approach helps avoid any potential “data contamination” that can result from the static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes don’t alter the actual difficulty of the underlying mathematical reasoning, meaning models should, in theory, perform just as well on GSM-Symbolic as they do on GSM8K.
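To make the setup concrete, here is a minimal sketch of this kind of symbolic templating in Python. It is not the Apple team’s actual generator; the template text, the name and relative pools, and the numeric ranges are assumptions chosen purely for illustration.

```python
import random

# A minimal sketch of GSM-Symbolic-style templating, not the Apple team's
# actual generator: the template text, the name/relative pools, and the
# numeric ranges below are illustrative assumptions.

TEMPLATE = ("{name} gets {n} building blocks for {poss} {relative}. "
            "{name} gives away {k} of them. How many blocks are left?")

PEOPLE = [("Sophie", "her"), ("Bill", "his"), ("Aisha", "her")]  # assumed pool
RELATIVES = ["nephew", "brother", "cousin"]                      # assumed pool


def instantiate(rng: random.Random) -> tuple[str, int]:
    """Sample fresh surface details; the reasoning (one subtraction) and
    therefore the difficulty stay exactly the same across variants."""
    name, poss = rng.choice(PEOPLE)
    relative = rng.choice(RELATIVES)
    n = rng.randint(15, 40)        # assumed range for the block count
    k = rng.randint(1, n - 1)      # blocks given away, always fewer than n
    question = TEMPLATE.format(name=name, poss=poss, relative=relative, n=n, k=k)
    return question, n - k         # question text plus its ground-truth answer


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):             # three symbolic variants of the same problem
        question, answer = instantiate(rng)
        print(question, "->", answer)
```

Because only the surface details change, any gap between a model’s GSM8K and GSM-Symbolic scores points to sensitivity to wording rather than to harder math.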
However, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops ranging from 0.3 percent to 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values: gaps of up to 15 percent accuracy between the best and worst runs were common within a single model. Interestingly, changing the numbers tended to result in worse accuracy than changing the names.
This level of variability—both across different GSM-Symbolic runs and compared with GSM8K results—is somewhat surprising because, as the researchers point out, “the overall reasoning steps necessary to solve a question remain unchanged.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing “formal” reasoning but are instead “attempt[ing] to enact a sort of in-distribution pattern-matching, correlating given questions and solution methods with similar instances encountered during training.”
Still, the overall variance shown in the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-high 94.9 percent on GSM-Symbolic. That’s a high success rate on either benchmark, regardless of whether the model is using “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).
The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. In this set of benchmarks, dubbed “GSM-NoOp” (short for “no operation”), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”
Adding these distractions led to what the researchers described as “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits of relying on simple “pattern matching” to “convert statements to operations without truly grasping their meaning,” the researchers write.
In the scenario involving the smaller kiwis, most models would attempt to subtract the smaller fruits from the total count. The researchers believe this response stems from the models’ training datasets including analogous examples that required subtraction operations. This persistent “critical flaw” indicates “deeper issues in [the models’] reasoning processes” that cannot be rectified through fine-tuning or other adjustments.
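As a toy illustration of that failure mode, the snippet below contrasts the correct calculation with the shortcut described above; the question framing and numbers are hypothetical rather than taken from the paper.

```python
# A toy illustration of the GSM-NoOp failure mode: the day-by-day counts and
# the "smaller" detail below are hypothetical, not figures from the paper.

friday, saturday, sunday = 40, 50, 80   # kiwis picked each day (assumed values)
smaller_than_average = 5                # the irrelevant "no-op" statement

# Correct reasoning: the size remark requires no operation, so just sum the days.
correct_total = friday + saturday + sunday               # 170

# The pattern-matched mistake: treating the irrelevant detail as an operand
# and subtracting it, as many tested models reportedly did.
mistaken_total = correct_total - smaller_than_average    # 165

print(f"correct: {correct_total}, pattern-matched: {mistaken_total}")
```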
The findings presented in this new GSM-Symbolic paper are not entirely unprecedented within the domain of AI research. Other recent papers have similarly indicated that LLMs do not truly engage in formal reasoning; rather, they replicate it through probabilistic pattern-matching based on the nearest similar data encountered in their extensive training sets.
Still, the new research highlights just how fragile this kind of mimicry can be when a prompt strays from the patterns in a model’s training data, and it underscores the inherent difficulty of performing complex reasoning without any underlying understanding of logic or the world. As Ars’ Benj Edwards wrote in a July article about AI video generation:
OpenAI’s GPT-4 gained attention in text synthesis because it achieved a scale that allowed it to absorb enough training data to create the illusion of true understanding and modeling of the world. In truth, its success stems from its ability to “know” more than many humans and creatively combine these established concepts. With sufficient data and computational power, the AI industry may eventually reach a state resembling what one might call “the illusion of understanding” in AI video synthesis.
We seem to be encountering a similar “illusion of understanding” with the latest “reasoning” models in AI, revealing how this illusion can fall apart when faced with unforeseen circumstances.
AI authority Gary Marcus argues in his examination of the recent GSM-Symbolic paper that a significant advancement in AI will only emerge when neural networks can incorporate genuine “symbol manipulation,” where knowledge is represented abstractly through variables and operations similar to algebra and traditional programming. Until that milestone is reached, we will continue to witness fragile “reasoning” that can lead AI systems to malfunction in mathematical contexts in ways calculators do not.
This article was first published on Ars Technica.