
AI Models Struggle with Unexpected Twists in YouTube Fail Videos, Research Finds

July 22, 2025

A recent study by researchers at the University of British Columbia, the Vector Institute for AI, and Nanyang Technological University shows that even the most advanced AI models can get tripped up by unexpected events. The team built a benchmark called BlackSwanSuite from more than 1,600 YouTube fail videos drawn from the Oops! dataset.

The study compared top models—including systems like GPT-4o and Gemini 1.5 Pro—with human performance. While people quickly adjust their views when something unforeseen happens, AI models tend to stick with their first impressions. In one case, an AI interpreted a scene as someone swinging a pillow, only to see later that the pillow had accidentally knocked Christmas ornaments onto a nearby woman. Even after watching the whole video, the AI clung to its initial guess.

The videos span a diverse range of scenarios—from traffic mishaps and children’s mistakes to unexpected incidents at swimming pools. Each clip was divided into three parts: setup, surprise, and aftermath. At each stage, AI models faced tasks like predicting what happens next, explaining events that weren’t immediately clear, and updating their earlier assumptions. Both commercial models such as GPT-4o and open-source efforts like LLaVA-Video and VideoLLaMA 2 were put to the test.
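The three-phase split and the three task types described above can be sketched roughly as follows. This is a hypothetical illustration of the benchmark's structure, not its actual code; the task names and prompt wording are the author's own placeholders.

```python
from dataclasses import dataclass

@dataclass
class FailClip:
    """One benchmark clip, split into the three phases the article describes."""
    setup: str      # what happens before the surprise
    surprise: str   # the unexpected event itself
    aftermath: str  # what happens afterward

def make_tasks(clip: FailClip) -> dict[str, str]:
    # Illustrative framing of the three task types; names are placeholders.
    return {
        "predict": f"After seeing only the setup ({clip.setup}), what happens next?",
        "explain": "Given the setup and aftermath, infer the surprise event that was not directly shown.",
        "revise":  "Having now watched the full clip, update your earlier prediction.",
    }

# The pillow example from the article, phrased as one clip:
clip = FailClip(
    setup="A man swings a pillow near a Christmas tree.",
    surprise="The pillow knocks ornaments off the tree.",
    aftermath="The ornaments land on a woman standing nearby.",
)
tasks = make_tasks(clip)
```

Each model would be scored on its answers at each stage, which is what makes it possible to measure whether it revises its initial interpretation once the surprise is revealed.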

For example, GPT-4o answered 65% of the detective tasks (explaining surprise events that weren't directly shown) correctly, while humans achieved 90% accuracy. When it came to revising predictions after seeing the complete video, GPT-4o's accuracy dropped to 60%, compared to 92% for human observers. The same stubbornness in holding onto initial predictions was also seen in Gemini 1.5 Pro.

The underlying issue lies in the training methods. AI models learn from patterns across millions of videos, so when they encounter events that don’t fit familiar patterns—like a garbage truck unexpectedly dropping a tree—they often misinterpret what’s happening. In one trial, swapping AI video perception for human-written scene descriptions lifted LLaVA-Video’s performance by around 10%, highlighting how much these systems depend on human insight.
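The perception-swap ablation described above amounts to running the same tasks twice, once with the model's own view of the video and once with human-written scene descriptions, and comparing accuracy. The sketch below is a toy illustration of that comparison under assumed inputs; the answer data is made up and does not come from the study.

```python
# Toy sketch of the ablation: compare task accuracy when the model perceives
# the video itself versus when it receives human-written scene descriptions.
# All answers below are illustrative placeholders, not the study's data.

def accuracy(answers: list[str], gold: list[str]) -> float:
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def ablation_gain(model_answers: list[str], oracle_answers: list[str],
                  gold: list[str]) -> float:
    base = accuracy(model_answers, gold)     # model's own video perception
    oracle = accuracy(oracle_answers, gold)  # human descriptions as input
    return oracle - base                     # positive = perception is the bottleneck

gold   = ["a", "b", "c", "d", "e"]
model  = ["a", "x", "c", "x", "x"]   # 2/5 correct with own perception
oracle = ["a", "b", "c", "x", "e"]   # 4/5 correct with human descriptions
delta = ablation_gain(model, oracle, gold)  # 0.4
```

A positive gap of this kind is what the researchers observed for LLaVA-Video: with human-written descriptions substituted in, its reasoning over the same questions improved by around 10%.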

While these findings might not affect your casual viewing, they raise important considerations for real-world applications like autonomous vehicles. In such fields, safety hinges on the ability to respond accurately to truly unexpected scenarios. The research team has made BlackSwanSuite available on platforms like GitHub, inviting further work to strengthen AI resilience in the face of unpredictability.
