Pfizer researchers are questioning Apple’s recent study, ‘The Illusion of Thinking’, which claimed that large reasoning models (LRMs) hit a hard limit on complex tasks. Rather than accepting that these models are inherently capped, Pfizer points to the restrictive, text-only testing environment as the real culprit: without access to programming tools, the models break down on multi-step puzzles well before any theoretical limit comes into play.
In one striking experiment, o4-mini misjudged a solvable river crossing puzzle as impossible because of memory constraints, a case of ‘learned helplessness’ in which longer solution sequences spiral into compounded errors. Pfizer argues that Apple’s study overlooked this accumulation of small mistakes and wrongly attributed the failures to an intrinsic flaw in the models’ reasoning.
To explore this further, Pfizer re-tested GPT-4o and o4-mini with Python tool access enabled, and the results were revealing: GPT-4o pressed on with its flawed strategy, while o4-mini revised its approach and solved the puzzle. The contrast echoes a familiar distinction from cognitive science, with GPT-4o behaving like a fast, error-prone ‘System 1’ and o4-mini like the slower, deliberate, self-correcting ‘System 2’.
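To make the tool-assisted condition concrete, here is a minimal sketch of the kind of program a model with Python access could write and run for a river crossing puzzle. It assumes the classic missionaries-and-cannibals formulation (three of each, a two-seat boat) purely for illustration; the function name and parameters are not taken from Pfizer’s experiments.

```python
from collections import deque

def solve_river_crossing(total_m=3, total_c=3, boat_cap=2):
    """Breadth-first search over (missionaries, cannibals, boat) states on the
    start bank; returns the shortest sequence of crossings, or None."""
    start = (total_m, total_c, 1)   # everyone and the boat begin on the start bank
    goal = (0, 0, 0)                # everyone (and the boat) on the far bank

    def valid(m, c):
        # Neither bank may have missionaries outnumbered by cannibals.
        if not (0 <= m <= total_m and 0 <= c <= total_c):
            return False
        if m and m < c:
            return False
        far_m, far_c = total_m - m, total_c - c
        if far_m and far_m < far_c:
            return False
        return True

    # All boat loads: at least one person, at most boat_cap.
    moves = [(dm, dc) for dm in range(boat_cap + 1)
                      for dc in range(boat_cap + 1)
                      if 1 <= dm + dc <= boat_cap]

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (m, c, boat), path = queue.popleft()
        if (m, c, boat) == goal:
            return path
        sign = -1 if boat == 1 else 1   # passengers leave the boat's current bank
        for dm, dc in moves:
            # The boat can only carry people who are on its own bank.
            if boat == 1 and (dm > m or dc > c):
                continue
            if boat == 0 and (dm > total_m - m or dc > total_c - c):
                continue
            nm, nc = m + sign * dm, c + sign * dc
            state = (nm, nc, 1 - boat)
            if valid(nm, nc) and state not in seen:
                seen.add(state)
                queue.append((state, path + [(dm, dc, "->" if boat == 1 else "<-")]))
    return None

print(solve_river_crossing())  # shortest plan: 11 crossings for the 3/3 case
```

Once the search is delegated to code like this, the model no longer has to track every intermediate state in its own text, which is exactly the bottleneck Pfizer points to.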
If you’ve ever grappled with technology that hits a wall at the worst possible moment, you’ll appreciate Pfizer’s call for more careful testing. They advocate benchmarks that cover both tool-assisted and tool-free scenarios, along with closer scrutiny of metacognitive abilities, which they see as essential to dependable AI.
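One way to read that recommendation concretely is to score the same tasks twice, once with tools disabled and once enabled, and report both numbers side by side. The sketch below assumes a generic `model_fn` callable and a `Task` container; those names, and `my_model` in the usage comment, are hypothetical and not drawn from Pfizer’s or Apple’s benchmarks.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]   # verifies a candidate answer string

def evaluate(model_fn: Callable[[str, bool], str], tasks: list[Task]) -> dict[str, float]:
    """Score the same tasks under tool-free and tool-assisted conditions."""
    results = {}
    for tools_enabled in (False, True):
        label = "tool-assisted" if tools_enabled else "tool-free"
        solved = sum(task.check(model_fn(task.prompt, tools_enabled)) for task in tasks)
        results[label] = solved / len(tasks)
    return results

# usage sketch: evaluate(lambda prompt, tools: my_model(prompt, tools=tools), tasks)
```

Reporting the two conditions separately makes it visible whether a failure reflects the model’s reasoning or merely the absence of tools.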
By challenging Apple’s conclusions, Pfizer invites us to rethink how we assess AI reasoning. Maybe the true potential of these models isn’t inherently out of reach; it may simply be waiting for better testing conditions to unlock it.