New Study Questions Apple’s Take on AI Reasoning

July 4, 2025

A Spanish research team from the CSIC-UPM Center for Automation and Robotics has taken another look at Apple’s once-controversial paper, ‘The Illusion of Thinking’. In that paper, originally published in June 2025, Apple argued that even the best large reasoning models (LRMs) stumble on tasks requiring basic symbolic planning. The new study confirms some of Apple’s criticisms while adding fresh insight: it is not just a lack of cognitive ability that causes these errors. Task design, prompt structure, and the models’ inherently stochastic optimisation methods also play key roles.

In one set of experiments using the Towers of Hanoi puzzle, the researchers observed that model performance dropped off sharply at eight disks or more, a result that echoes Apple’s own findings. They also found that the number of tokens a model spent often signalled how solvable it judged a task to be: more tokens suggested greater confidence that a solution was within reach. Even when two language models teamed up, they ended up trapped in cycles of valid yet irrelevant moves, underscoring a gap in strategic planning despite correct rule-following.
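
For context only (this is an illustration, not code from the study): the puzzle’s difficulty grows exponentially, since the optimal solution for n disks takes 2^n - 1 moves, so by eight disks a model must produce 255 correct moves without a single slip. A minimal Python sketch of the optimal move sequence makes that scaling concrete:

    # Illustrative only, not from the study: the optimal Towers of Hanoi
    # solution for n disks takes 2**n - 1 moves, so the plan a model must
    # hold together grows exponentially with the disk count.
    def hanoi(n, src="A", aux="B", dst="C"):
        """Yield the optimal move sequence for n disks from src to dst."""
        if n == 0:
            return
        yield from hanoi(n - 1, src, dst, aux)   # park the top n-1 disks on aux
        yield (src, dst)                         # move the largest disk
        yield from hanoi(n - 1, aux, src, dst)   # stack the n-1 disks on top

    for n in range(1, 11):
        moves = list(hanoi(n))
        assert len(moves) == 2**n - 1            # 8 disks -> 255 moves
        print(f"{n:2d} disks -> {len(moves):4d} moves")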

The study also reexamined Apple’s river crossing benchmark and found that many of the original test cases were simply unsolvable. When the team restricted the benchmark to solvable configurations, the models performed consistently well, even on larger instances. Interestingly, the trickiest tests weren’t always the largest ones but those of moderate difficulty, where a narrow window of valid solutions demanded precise planning.
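
To see how a benchmark can end up containing impossible cases, here is a hedged sketch in Python. The pair-and-agent rule and the names below are assumptions chosen for illustration, not the benchmark’s exact specification: each of n pairs has an actor and an agent, an actor may never share a bank with another pair’s agent unless their own agent is present, and the boat carries at most a given number of passengers. A brute-force search can then decide whether any valid crossing exists before a model is ever graded on it:

    # Hedged sketch: the actor/agent rule is an assumed formulation, not
    # necessarily the one used in the benchmark. A breadth-first search
    # over bank configurations settles whether any valid crossing exists.
    from collections import deque
    from itertools import combinations

    def safe(group):
        agents = {i for kind, i in group if kind == "g"}
        actors = {i for kind, i in group if kind == "a"}
        # actor i is safe if their own agent is present or no other agent is
        return all(i in agents or not (agents - {i}) for i in actors)

    def solvable(n, capacity):
        everyone = frozenset((kind, i) for i in range(n) for kind in ("a", "g"))
        start = (everyone, "L")              # everyone starts on the left bank
        seen, queue = {start}, deque([start])
        while queue:
            left, boat = queue.popleft()
            if not left:                     # left bank empty: everyone crossed
                return True
            bank = left if boat == "L" else everyone - left
            for size in range(1, capacity + 1):
                for crew in combinations(bank, size):
                    crew = frozenset(crew)
                    new_left = left - crew if boat == "L" else left | crew
                    state = (new_left, "R" if boat == "L" else "L")
                    if state not in seen and safe(new_left) and safe(everyone - new_left):
                        seen.add(state)
                        queue.append(state)
        return False                         # no reachable state solves it

    for n in range(2, 7):
        status = "solvable" if solvable(n, 3) else "unsolvable"
        print(f"{n} pairs, boat capacity 3: {status}")

Exhaustively enumerating states like this is only feasible because the safety rule prunes the search aggressively; the point is simply that an instance’s solvability can be checked mechanically before it is used to judge a model.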

Rather than conceding that these large reasoning models lack generalisable reasoning, the Spanish researchers describe them as “stochastic, RL-tuned searchers” that draw on known patterns instead of genuine planning. They argue that the way these models manage token usage reflects a built-in sense of a task’s solvability, ramping up resource use when a solution seems within reach.

If you’ve ever wrestled with the challenge of optimising prompt designs in your own projects, you might find these insights particularly useful. This study adds nuance to our understanding of LRMs, suggesting that with the right tweaks in task setup and search methods, these models might be far more capable than we’ve given them credit for.
