OpenAI’s New AI Models: Why They’re Hallucinating More

April 24, 2025

OpenAI recently rolled out its latest AI models, o3 and o4-mini, promising cutting-edge capabilities. But here’s the twist: these models are more prone to ‘hallucinations’—producing inaccurate information—than their predecessors. It’s a hiccup that even the most advanced AI systems face today.

In the past, each new AI model from OpenAI has typically shown improvements in reducing such hallucinations. But with o3 and o4-mini, things haven’t quite gone as expected. Internal assessments show that these reasoning models are producing more frequent hallucinations compared to earlier versions like o1, o1-mini, and o3-mini, as well as non-reasoning models such as GPT-4o.

Why the regression? That’s still something of a mystery. OpenAI has noted in its technical documentation that more research is needed to understand why hallucinations worsen as its reasoning models scale up.

Despite this, o3 and o4-mini shine in specific tasks like coding and mathematics. Because they make more claims overall, they end up producing both more accurate statements and more inaccurate ones.

OpenAI uses PersonQA, its in-house benchmark for measuring the accuracy of a model’s knowledge about people, and it turns out o3 hallucinated answers to 33% of the questions. That’s a big jump from the 16% and 14.8% hallucination rates seen in o1 and o3-mini, respectively. The o4-mini model did even worse, with a 48% hallucination rate.

Independent research by Transluce, a nonprofit AI lab, backs these findings. Evidence suggests that o3 often fabricates actions it supposedly took to reach conclusions. For instance, o3 once claimed to have run code on a 2021 MacBook Pro outside of ChatGPT, which it simply can’t do.

Neil Chowdhury, a former OpenAI employee and now a researcher at Transluce, suggested that the reinforcement learning techniques used in the o-series models might be worsening issues typically alleviated by standard post-training processes.

Sarah Schwettmann, co-founder of Transluce, pointed out that the high hallucination rate could reduce o3’s usefulness. Meanwhile, Kian Katanforoosh, a Stanford adjunct professor and CEO of Workera, observed that o3 often provides broken website links in coding workflows.

While hallucinations might spark creative ideas, they pose challenges in fields where accuracy is crucial. Legal firms, for example, can’t afford models that introduce factual errors into contracts.

One potential fix could be giving models web search capabilities. OpenAI’s GPT-4o with web search achieves 90% accuracy on the SimpleQA benchmark, and search could help reduce hallucination rates in reasoning models too, provided users are willing to expose their prompts to a third-party search provider.

The AI field is shifting towards reasoning models, which improve performance on a range of tasks without requiring massive amounts of computing and data during training. However, the shift also brings challenges, as reasoning models appear to hallucinate more.

OpenAI spokesperson Niko Felix emphasized that reducing hallucinations remains a research priority. The quest for solutions continues as the AI industry navigates these complexities.