
How Fewer Documents Can Boost AI Performance in RAG Systems

April 3, 2025

Hey there! If you’ve ever wondered how the number of documents affects the performance of AI models in Retrieval Augmented Generation (RAG) systems, you’re in for some interesting insights. A recent study by researchers at the Hebrew University of Jerusalem has shed light on this very topic. Let’s dive into what they found out.

Using the MuSiQue validation dataset, the researchers explored 2,417 questions, each linked to 20 Wikipedia paragraphs. Interestingly, only a few of these paragraphs contained relevant information. By gradually reducing the number of documents from 20 to just those that were pertinent, and expanding the remaining text to keep the length consistent, they noticed something remarkable. Most models, including open-source ones like Llama-3.1 and Gemma 2, showed up to a 10% boost in performance with fewer documents. However, Qwen2 was an exception, handling larger collections just fine.
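The reduction procedure can be sketched in a few lines. This is a simplified illustration of the idea, not the paper's actual code: `reduce_documents`, its parameters, and the character-count padding are all assumptions (the researchers expanded the retained paragraphs with surrounding Wikipedia text; here a generic filler source stands in).

```python
def reduce_documents(paragraphs, relevant_ids, pad_source):
    """Drop irrelevant paragraphs, then pad so the total context length
    matches the original full-document context.

    Hypothetical sketch: the study kept only the pertinent paragraphs
    and expanded them to hold input length constant, isolating the
    effect of document count from the effect of context length.
    """
    original_len = sum(len(p) for p in paragraphs)
    kept = [paragraphs[i] for i in relevant_ids]
    deficit = original_len - sum(len(p) for p in kept)
    if deficit <= 0:
        return kept
    # Fill the gap with placeholder text so length stays consistent.
    padding = (pad_source * (deficit // len(pad_source) + 1))[:deficit]
    return kept + [padding]
```

Holding length constant this way means any performance change can be attributed to the number of distracting documents rather than to a shorter prompt.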

This study highlights a crucial point: irrelevant documents often retrieved in RAG systems can confuse models. It suggests that finding the right balance in data retrieval strategies is key. The researchers recommend that future models should develop ways to discard contradictory information while maintaining document diversity.

While the study didn’t explore variations in prompts, it opens up exciting possibilities for further research. The dataset is publicly available for anyone interested in diving deeper. As context windows in AI models expand, the necessity of RAG systems is being debated. Yet, they continue to offer great benefits, especially for smaller, open-source models.
