Back in 2019, the A.I. researcher François Chollet created a puzzle game called ARC, short for Abstraction and Reasoning Corpus. It is meant to be easy for humans but baffling for machines. The test has become a key tool for researchers who want to track how artificial intelligence is progressing, and for those who push back against the idea that A.I. is about to outsmart us.
Chollet’s puzzles ask the solver to spot a visual pattern from just a few examples and then apply it, transforming grids of colored squares accordingly. At first, the puzzles stumped A.I. systems like ChatGPT, which learn from massive amounts of data. But in December, OpenAI revealed that its latest model, OpenAI o3, had beaten human performance on Chollet’s test. Unlike older models, o3 could work through multiple possibilities before settling on an answer.
That led some to suggest that A.I. was closing in on artificial general intelligence, or A.G.I., the point at which machines match human-level intelligence. But Chollet designed his puzzles precisely to show how far A.I. still has to go before reaching that milestone.
The success of OpenAI’s o3 exposed a familiar flaw in benchmarks like ARC: such tests have been used for decades to gauge A.I. progress, but once machines beat them, they often turn out to be poor indicators of true intelligence. Arvind Narayanan, a computer science professor at Princeton, said that calling ARC a measure of progress toward A.G.I. was a stretch, though he acknowledged that OpenAI’s system was impressive.
Still, the original ARC test remained a meaningful challenge: even OpenAI’s technology needed enormous computing resources to crack the puzzles. Last June, Chollet and Mike Knoop, a co-founder of Zapier, launched the ARC Prize, offering $1 million to anyone who could build an A.I. system that beats human performance on the benchmark, now called “ARC-AGI.” More than 1,400 submissions came in, but no one claimed the prize; every entry fell short of what an ordinary, “smart” human can do.
OpenAI’s o3 system scored 87.5%, but it was ineligible for the prize because of its high computing costs and because it did not meet the contest’s open-source requirements. A more efficient version of o3 scored 75.7% and cost less than $10,000 in computing to run.
“Intelligence is efficiency,” Chollet said, indicating that current models are far from being as efficient as humans. In response, the ARC Prize rolled out a new benchmark, ARC-AGI-2, which features more complex tasks.
The new series is expected to challenge both humans and A.I., and OpenAI’s o3 is unlikely to solve its puzzles. Even with recent improvements, A.I. systems still stumble when they face genuinely new problems, lacking the intuitive knack for learning that humans have. The ARC Prize has since become a nonprofit foundation that will steer continuing efforts toward A.G.I.
The team expects ARC-AGI-2 to stand unsolved for about two years, but it is already working on ARC-AGI-3, which is planned for 2026. That version may involve dynamic, grid-based games, a step closer to real-world challenges, where conditions change constantly, unlike static puzzles.
As A.I. moves forward, the benchmarks will keep evolving. “If it’s no longer possible to create benchmarks that are easy for humans but impossible for A.I.,” Chollet remarked, “then you have A.G.I.”