Meta’s recent unveiling of its AI model, Maverick, has stirred quite the conversation. If you’re keeping up with AI advancements, you might have heard that Maverick placed second on LM Arena, a competitive benchmark where human evaluators compare AI outputs and vote for the one they prefer. But there’s a twist: discrepancies have emerged between the Maverick version that was tested and the one developers can actually use.
As AI researchers pointed out on X, Meta’s own announcement notes that the Maverick tested on LM Arena is an ‘experimental chat version.’ A chart on the official Llama website makes the same disclosure, describing the benchmarked model as ‘Llama 4 Maverick optimized for conversationality.’
Now, LM Arena has never been the most reliable measure of AI performance, and most AI companies don’t customize their models just to ace it, or at least they don’t admit to doing so. Tailoring a model to a benchmark and then releasing a different version leaves developers guessing how the model will actually perform in their own applications.
It’s also misleading: ideally, benchmarks should give a picture of a model’s strengths across a range of tasks. Researchers have already noticed that the Maverick on LM Arena behaves quite differently from the version you can download, with the benchmark version being noticeably more expressive and verbose.
These differences have sparked a lot of online chatter. For instance, Nathan Lambert shared on X, ‘Okay Llama 4 is def a little cooked lol, what is this yap city,’ while Tech Dev Notes remarked on how emotive the model’s responses are.
Both Meta and Chatbot Arena, the organization that maintains LM Arena, have been contacted for comment. There’s clearly more to this story, and it will be interesting to see how it unfolds.