
Meta’s Llama 4 Models Shine in Tests but Face Challenges with Long Contexts

April 8, 2025

Meta’s newest Llama 4 models, Maverick and Scout, have been making waves on standard benchmarks. They perform remarkably well there, but when it comes to handling longer, more complex contexts, they hit a few bumps. Let’s dive into what this means and why it matters.

According to the “Intelligence Index” by Artificial Analysis, Maverick scores a solid 49 points. That puts it ahead of Claude 3.7 Sonnet, though still a bit behind Deepseek’s V3 0324. Scout, meanwhile, holds its own against GPT-4o-mini and even surpasses Claude 3.5 Sonnet and Mistral Small 3.1. Both models perform consistently across reasoning, coding, and math, with no glaring weaknesses.

Deepseek V3 leads the pack with 53 points, but Maverick is hot on its heels. What makes Maverick impressive is its efficiency: it uses half the active parameters of Deepseek V3 yet can handle both text and images, something its competitor can’t do. Maverick and Scout are also easier on the wallet, with token pricing significantly lower than even some budget-friendly models and as little as a tenth of what OpenAI’s GPT-4o costs.

There are some discrepancies, however, around the LMArena benchmark Meta has been promoting. Meta admitted to using an “experimental chat version” of Maverick that was tuned to appeal to human evaluators with well-structured responses. When LMArena’s “Style Control” is activated to separate content quality from presentation style, Llama 4’s ranking drops.

Fiction.live’s tests, which focus on complex narrative comprehension, further highlight Llama 4’s struggles with long-context tasks. Despite Meta’s claims, Maverick shows only slight improvement over Llama 3.3 70B, and Scout doesn’t fare well either. While Gemini 2.5 Pro reaches 90.6% accuracy at 120,000 tokens, Maverick and Scout manage just 28.1% and 15.6%, respectively.

These results raise questions about Meta’s claims regarding extensive text contexts. Even though Scout is marketed as handling up to 10 million tokens, it struggles at just 128,000, as does Maverick. Research is beginning to suggest that larger context windows might not be as beneficial as previously thought, with smaller contexts often proving more effective.

For those of us working with AI models, these insights are crucial. They remind us that while new models can offer exciting capabilities, they also come with limitations that we need to understand and work around. As always, staying informed and adaptable is key to making the most of these technologies.
