Recent benchmarks are shedding new light on how AI models perform on nuanced reasoning tasks. In the ARC-AGI-2 evaluation, a test that rewards genuine problem-solving over rote recall, Grok 4 managed roughly 16% accuracy, outstripping GPT-5's 9.9%, though at a steeper cost of about $2 to $4 per task compared with GPT-5's $0.73.
For the slightly simpler ARC-AGI-1 test, Grok 4 maintained a modest lead with 68% accuracy compared to GPT-5's 65.7%. That said, GPT-5 is the more budget-friendly option at $0.51 per task versus Grok 4's roughly $1 per task, a balance that could shift if pricing changes from xAI take effect.
The landscape also includes more economical variants. GPT-5 Mini scored 54.3% on ARC-AGI-1 at just $0.12 per task and 4.4% on ARC-AGI-2 at $0.20 per task, while the leaner GPT-5 Nano posted 16.5% on ARC-AGI-1 and 2.5% on ARC-AGI-2, each for only $0.03 per task.
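For readers weighing accuracy against spend, the figures above can be reduced to a rough cost-efficiency ratio. The sketch below is purely illustrative: it uses only the per-task numbers quoted in this article, and it assumes a midpoint of $3 for Grok 4's reported $2 to $4 range on ARC-AGI-2.

```python
# Illustrative cost-efficiency comparison using the per-task figures quoted above.
# Grok 4's ARC-AGI-2 cost is assumed to be the midpoint of the reported $2-$4 range;
# all numbers are approximate and taken from this article, not an official source.

results = {
    # model: (ARC-AGI-1 accuracy %, $/task, ARC-AGI-2 accuracy %, $/task)
    "Grok 4":     (68.0, 1.00, 16.0, 3.00),
    "GPT-5":      (65.7, 0.51,  9.9, 0.73),
    "GPT-5 Mini": (54.3, 0.12,  4.4, 0.20),
    "GPT-5 Nano": (16.5, 0.03,  2.5, 0.03),
}

print(f"{'model':<12}{'AGI-1 %/$':>12}{'AGI-2 %/$':>12}")
for model, (acc1, cost1, acc2, cost2) in results.items():
    # Percentage points of accuracy obtained per dollar spent on a single task
    print(f"{model:<12}{acc1 / cost1:>12.1f}{acc2 / cost2:>12.1f}")
```

On this crude measure the smaller GPT-5 variants deliver far more accuracy per dollar, while Grok 4 buys its higher absolute scores at a much steeper price.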
Meanwhile, early tests of the interactive ARC-AGI-3 benchmark are underway. This evaluation places models in trial-and-error simulations where they must tackle tasks that are straightforward for people but remain a challenge for many AI systems, especially visual puzzles.
Grok 4's impressive numbers highlight its strengths in these specific tests, assuming xAI followed the established protocols. It is worth noting that the ARC Prize was conspicuously absent from OpenAI's latest GPT-5 showcase, a curious omission given its longstanding role in model evaluations.
On another front, the o3-preview model, launched in December 2024, still leads with nearly 80% accuracy on ARC-AGI-1, albeit at higher operational cost. OpenAI has not elaborated on the cost-reduction tweaks made to the subsequent chat-focused versions, but ARC Prize has confirmed that the o3 model released publicly in April shows diminished performance.