Researchers at Arizona State University have taken a fresh look at large language models (LLMs) and found that the impressive logical reasoning they’re known for might be more of an illusion. If you’ve ever relied on AI for complex problem-solving, this study might give you pause.
At the heart of the study is the chain-of-thought (CoT) prompting method. Designed to help AI break down tougher problems into manageable steps, CoT works well only when the test data mirrors what the model has seen before. When the input changes—even slightly—the technique seems to lose its grip.
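To make the idea concrete, here is a minimal sketch of the difference between a plain prompt and a CoT prompt. The wording and the example question are invented for illustration; they are not taken from the study or from any particular model’s prompt library.

```python
# Illustrative only: these prompt templates are assumptions, not the study's materials.

def direct_prompt(question: str) -> str:
    """A plain prompt that asks for the answer with no intermediate steps."""
    return f"Question: {question}\nAnswer:"

def cot_prompt(question: str) -> str:
    """A chain-of-thought prompt that nudges the model to spell out its
    reasoning before committing to a final answer."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then give the final answer on its own line."
    )

if __name__ == "__main__":
    q = "A train leaves at 3:40 pm and the trip takes 95 minutes. When does it arrive?"
    print(direct_prompt(q))
    print("---")
    print(cot_prompt(q))
```

The study’s point is that the extra “think step by step” scaffolding helps mainly when the question resembles the training data, not because the model has learned a general reasoning procedure.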
To explore these limits, the researchers created a controlled environment called DataAlchemy. They trained a model from scratch and put it to work on tasks like cyclic letter transformations—simple exercises that, surprisingly, exposed the model’s struggles with new variations. Even minor distractions, such as extra noise tokens, could derail its reasoning.
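As a rough picture of that task family, the sketch below implements a cyclic (ROT-style) letter shift and shows how a slightly longer input or an extra noise token changes what the model has to handle. The exact task format inside DataAlchemy may differ; treat this purely as an assumed illustration.

```python
import string

def cyclic_shift(text: str, k: int) -> str:
    """Shift each letter k positions forward in the alphabet, wrapping around
    (a ROT-style cyclic transformation); other characters pass through unchanged."""
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    k %= 26
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)

# The kind of short, in-distribution input a toy model might be trained on.
print(cyclic_shift("apple", 1))               # -> bqqmf

# Slight variations: a longer input, or an irrelevant noise token tacked on.
print(cyclic_shift("apple banana cherry", 1))
print(cyclic_shift("apple ###", 1))           # the noise characters pass through untouched
```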
The experiment showed that as inputs grew longer or required additional reasoning steps, the model’s performance dropped noticeably. Instead of genuine logical processing, the model often fell back on repeating familiar patterns, producing what the team labelled “fluent nonsense”: answers that sound convincing but are flawed, a risk that matters most in critical applications.
One striking example was a Google Gemini model that initially argued 1776 was a leap year—drawing on simple divisibility rules—only to contradict itself moments later. This inconsistency adds to a growing chorus of research, including studies from Apple, Tsinghua University, and New York University, all suggesting that LLMs mimic reasoning without actually understanding it.
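For context, the Gregorian leap-year rule is simple enough to state in one line, which is what makes the self-contradiction striking. The check below is standard calendar logic rather than code from the cited exchange.

```python
def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except century years,
    which must also be divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

print(is_leap_year(1776))  # True: 1776 = 4 * 444 and is not a century year
```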
These insights remind us that while LLMs can be incredibly useful, they aren’t infallible. It’s essential to recognise their current limitations and invest in more robust, efficient approaches to true AI reasoning. By doing so, we can better harness AI’s power in areas where accuracy and genuine understanding are key.