Hey there! If you’re interested in how machines understand color, you’ll be excited to hear about COLORBENCH, a new benchmark from the University of Maryland. It’s designed to test how well vision-language models (VLMs) perceive, reason about, and process color. While these models have come a long way, the researchers found that even the biggest ones still struggle with color perception.
COLORBENCH puts these models through their paces in three main areas: color perception, color reasoning, and color robustness (how they cope with changes in color). It spans 11 tasks with 1,448 test cases and 5,814 image-text prompts, ranging from identifying colors and estimating their proportions to counting colored objects and dealing with tricky color illusions.
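To make that concrete, here’s a rough sketch of how a benchmark like this might be represented and scored in code. The field names, task labels, and the `query_vlm` helper are my own illustrative assumptions, not COLORBENCH’s actual schema or API.

```python
# A minimal sketch of evaluating a COLORBENCH-style benchmark.
# Field names, task labels, and `query_vlm` are illustrative assumptions,
# not the actual COLORBENCH data format or evaluation code.
from dataclasses import dataclass

@dataclass
class ColorTestCase:
    task: str          # e.g. "color_recognition", "color_counting", "color_illusion"
    image_path: str    # path to the test image
    prompt: str        # multiple-choice question shown to the model
    answer: str        # expected option letter, e.g. "B"

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to whatever VLM is being tested."""
    raise NotImplementedError

def evaluate(cases: list[ColorTestCase]) -> dict[str, float]:
    """Return per-task accuracy over the test cases."""
    correct, total = {}, {}
    for case in cases:
        prediction = query_vlm(case.image_path, case.prompt).strip().upper()
        total[case.task] = total.get(case.task, 0) + 1
        if prediction == case.answer.upper():
            correct[case.task] = correct.get(case.task, 0) + 1
    return {task: correct.get(task, 0) / n for task, n in total.items()}
```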
Interestingly, larger models do tend to do better than smaller ones, but by a narrower margin than on other benchmarks. The study evaluated 32 popular VLMs, including GPT-4o and Gemini 2, along with a range of open-source models. Surprisingly, their performance on color tasks was quite weak, often scoring below 30% on tasks like color counting and color blindness tests. Even on tasks requiring precise color extraction, results were only middling.
However, the models did better on object and color recognition tasks, likely because such examples are common in their training data. Sometimes, though, relying on color cues actively hurt: models performed better on greyscale images in tasks involving visual illusions or camouflaged objects, while for other tasks color information was essential.
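Here’s a small sketch of what that kind of color-vs-greyscale probe could look like, reusing the hypothetical `query_vlm` helper from the earlier sketch: ask the same question about the original image and a greyscale copy, then compare the answers.

```python
# Sketch of a color-vs-greyscale probe. `query_vlm` is the same
# illustrative placeholder as above, not part of COLORBENCH itself.
from PIL import Image

def ask_with_and_without_color(image_path: str, prompt: str) -> tuple[str, str]:
    # Save a greyscale copy of the image alongside the original.
    grey_path = image_path.rsplit(".", 1)[0] + "_grey.png"
    Image.open(image_path).convert("L").save(grey_path)
    # Ask the same question about both versions and return both answers.
    answer_color = query_vlm(image_path, prompt)
    answer_grey = query_vlm(grey_path, prompt)
    return answer_color, answer_grey
```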
Adding chain-of-thought (CoT) prompting made a noticeable difference, improving both accuracy and robustness to color changes; GPT-4o’s score jumped from 46.2% to 69.9% on the robustness tests. The researchers also found that performance tracked the size of the language model more closely than that of the vision encoder, which typically has only 300-400 million parameters, suggesting a need for stronger visual processing components.
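In prompting terms, the CoT variant simply asks the model to reason out loud before committing to an option. Here’s a hypothetical pair of prompts for a color question; the exact templates used in the paper may differ.

```python
# Illustrative direct vs. chain-of-thought prompts for a color question.
# The wording is an assumption; the paper's exact CoT template may differ.
QUESTION = (
    "Which color occupies the largest area in this image? "
    "(A) red (B) blue (C) green (D) yellow"
)

direct_prompt = QUESTION + "\nAnswer with the option letter only."

cot_prompt = (
    QUESTION
    + "\nThink step by step: describe the colored regions and estimate their "
      "relative areas before giving your final answer as a single option letter."
)
```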
COLORBENCH is publicly available to help develop more nuanced and resilient VLMs. Future versions will include tasks that mix color with texture, shape, and spatial relations. It’s an exciting step forward in making machines that see the world a bit more like we do.