Salesforce’s launch of the CRMArena-Pro benchmark makes it clear that today’s AI agents still face significant hurdles in real-world business contexts. The benchmark was designed to spotlight these challenges, showing that even state-of-the-art systems have considerable room for improvement.
For example, even advanced models like Gemini 2.5 Pro manage a 58% success rate on simple, single-turn tasks—but when dialogues stretch into multiple turns, that figure drops to just 35%. If you’ve ever found it tricky to get an AI to consistently understand your needs over a longer conversation, you’re not alone.
CRMArena-Pro was crafted to evaluate large language models (LLMs) in practical scenarios, covering key areas such as sales, customer service, and pricing under the customer relationship management (CRM) umbrella. By building on the original benchmark with more complex business functions, extended dialogues, and privacy tests, it provides a clearer picture of where these models stand.
Within a simulated Salesforce environment, the team put the models through 4,280 task instances across 19 business activities and three data protection categories. The results underline a common issue: while models can handle simple tasks relatively well, their performance in extended conversations often falls short, leaving important details overlooked.
Another key insight was the models’ reluctance to ask follow-up questions. In a closer look at 20 failed multi-turn tasks using Gemini 2.5 Pro, nearly half of the failures were due to the lack of information-gathering queries. For anyone grappling with AI miscommunications, this shows that prompting for clarity can make a tangible difference.
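To make the failure mode concrete, here is a minimal, hypothetical sketch (the field names and logic are illustrative, not part of the benchmark): an agent turn handler that asks an information-gathering question when required details are missing, rather than acting on incomplete context.

```python
# Illustrative sketch (hypothetical field names): before acting, the agent
# checks whether it has the details it needs and, if not, asks for them --
# exactly the clarifying step the failed multi-turn tasks lacked.

REQUIRED_FIELDS = {"customer_id", "issue_category"}

def next_action(known_fields: dict) -> str:
    """Return either a clarifying question or a go-ahead to act."""
    missing = REQUIRED_FIELDS - known_fields.keys()
    if missing:
        # Information-gathering turn: ask for exactly what is missing.
        return "Could you provide your " + " and ".join(sorted(missing)) + "?"
    return "ACT"  # enough context gathered; proceed with the task

print(next_action({"customer_id": "C-1042"}))
# asks for the missing issue_category before acting
```

The point is not the trivial logic but the policy it encodes: an agent that explicitly tracks what it still needs will generate the follow-up queries that, per the failure analysis, were missing in nearly half of the failed runs.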
Despite these challenges, Gemini 2.5 Pro generally outperformed other models in both B2B and B2C settings, particularly shining in workflow automation tasks like routing customer service cases—achieving an 83% success rate. That said, tasks demanding strict text interpretation or rule-following, like spotting product configuration errors or extracting detailed call logs, saw a noticeable drop in accuracy.
An earlier study conducted with Microsoft found similar trends: even the most sophisticated LLMs tend to falter as conversations grow longer, with an average performance drop of around 39%. These insights remind us that AI, promising as it is, still has a way to go in tracking how user needs evolve over the course of a conversation.
Data privacy is another area where AI models struggle. By default, these models usually don’t flag or deny sensitive requests unless specifically prompted. Introducing privacy guidelines has improved detection—GPT-4o, for example, went from a 0% to a 34.2% detection rate—but this often comes at the cost of slightly reduced task performance. Open-source options like LLaMA-3.1 also show limited adaptability when prompt nuances shift, signalling a need for further training improvements.
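The mechanism described above can be sketched in a few lines. This is not Salesforce’s implementation; the guideline wording and keyword list are illustrative assumptions, showing the general pattern of prepending an explicit privacy instruction to the system prompt and screening requests for sensitive data.

```python
# Illustrative sketch (assumed guideline text and keywords, not the
# benchmark's actual setup): an explicit privacy instruction added to the
# system prompt, plus a simple keyword screen for sensitive requests.

PRIVACY_GUIDELINE = (
    "Refuse requests that ask for personally identifiable information, "
    "credentials, or confidential internal records."
)

SENSITIVE_TERMS = {"ssn", "social security", "password", "credit card"}

def build_system_prompt(base_prompt: str) -> str:
    # Without an instruction like this, models rarely refuse on their own.
    return base_prompt + "\n\n" + PRIVACY_GUIDELINE

def flags_sensitive(user_request: str) -> bool:
    """Crude lexical check; real systems would use a trained classifier."""
    text = user_request.lower()
    return any(term in text for term in SENSITIVE_TERMS)

print(flags_sensitive("What is the customer's credit card number?"))  # True
print(flags_sensitive("When was the case last updated?"))             # False
```

Even a crude screen like this illustrates the trade-off the benchmark observed: the guideline raises refusal and detection rates, but the extra constraint can also nudge the model away from otherwise legitimate tasks.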
By including systematic data protection tests, Salesforce is taking a needed step in evaluating how AI handles one of the most critical issues in today’s digital world. Overall, while models like Gemini 2.5 Pro show real potential, they’re also a reminder that AI still faces significant obstacles in delivering consistent, reliable performance in complex, multi-turn business dialogues.