In a recent benchmark, Google’s Gemini 2.5 Pro proved its mettle on the Fiction.Live test, demonstrating an impressive ability to follow and reproduce intricate narratives. The test is more demanding than standard retrieval tasks, which amount to finding a needle in a haystack of text: Fiction.Live requires genuine comprehension of long stories, not just locating a single detail.
The evaluation showed that OpenAI’s o3 model keeps pace with Gemini 2.5 Pro up to around 128,000 tokens (roughly 96,000 words). At about 192,000 tokens (around 144,000 words), however, o3 begins to falter, while Gemini 2.5 Pro, even in its June preview (preview-06-05), maintains over 90 percent accuracy at these extended lengths. These tests run well below Google’s claim of a one-million-token window, but they highlight how models trade off scale against precision. For context, OpenAI’s o3 caps out at a 200,000-token window, while Meta’s Llama 4 Maverick, despite touting a capacity of up to ten million tokens, struggles with complex, long-form content.
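The word counts above follow the common rule of thumb of roughly 0.75 English words per token. A minimal sketch of that conversion (the ratio is an assumption, not an exact tokenizer count, and varies by model and language):

```python
# Assumed heuristic: ~0.75 English words per token.
# Real counts depend on the model's tokenizer.
WORDS_PER_TOKEN = 0.75

def tokens_to_words(tokens: int) -> int:
    """Estimate the English word count for a given token budget."""
    return int(tokens * WORDS_PER_TOKEN)

print(tokens_to_words(128_000))  # 96000
print(tokens_to_words(192_000))  # 144000
```

By this estimate, o3's 200,000-token cap corresponds to about 150,000 words of input.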
Nikolay Savinov of Google DeepMind puts it simply: a larger context window isn’t a magic fix. Anyone who has waded through a huge document can relate; trying to process too much at once means missing the key details. Savinov advises trimming out the irrelevant material so the model can focus on what truly matters, a selectivity that improves both accuracy and processing efficiency.
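The trimming Savinov recommends can be as simple as filtering a document down to the passages relevant to the question before building the prompt. A minimal keyword-based sketch (the function and its parameters are illustrative; production pipelines typically use embedding similarity rather than word overlap):

```python
def trim_context(paragraphs: list[str], query_terms: list[str],
                 max_chars: int = 8000) -> str:
    """Keep only paragraphs mentioning the query terms, up to a size budget."""
    terms = {t.lower() for t in query_terms}
    kept, used = [], 0
    for p in paragraphs:
        # Crude relevance check: any query term appears as a word.
        if terms & set(p.lower().split()):
            if used + len(p) > max_chars:
                break
            kept.append(p)
            used += len(p)
    return "\n\n".join(kept)

docs = [
    "Alice met Bob in Paris.",
    "The weather that week was sunny.",
    "Bob later moved to Rome.",
]
print(trim_context(docs, ["bob"]))
```

Here the unrelated weather paragraph is dropped, so the model's context holds only material that bears on the question.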