Advancing Vision-Language Models with 3D Spatial Reasoning Capabilities

June 13, 2025

Vision-language models (VLMs) are transforming how machines combine visual and textual information. Researchers at the Italian Institute of Technology (IIT) and the University of Aberdeen are now advancing these models by sharpening their spatial reasoning—a key step for robots to navigate and interact naturally with their surroundings.

Working under the FAIR* project, a joint effort between IIT’s Social Cognition in Human-Robot Interaction group and Aberdeen’s Action Prediction Lab, the team is addressing the challenge of recognising subtle nonverbal cues such as gaze and gestures, signals that machines still struggle to pick up. One of their core objectives is to improve visual perspective taking (VPT), the ability to interpret a scene from another agent's viewpoint: for example, judging whether text remains legible from a different angle, or whether an object is partially hidden from where someone else stands.
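To make the idea concrete, the snippet below is a minimal, hypothetical sketch (not the project's actual method) of the simplest form of perspective taking: given another observer's position and orientation, re-express an object's location in their camera frame and check whether it falls in front of them and inside their field of view. All function names and the 60-degree field of view are illustrative assumptions.

```python
import numpy as np

def world_to_camera(point_w, cam_pos, cam_R):
    """Express a world-frame point in a camera's local frame.

    cam_R is the 3x3 rotation whose columns are the camera's
    right, up, and forward axes expressed in world coordinates.
    """
    return cam_R.T @ (point_w - cam_pos)

def visible_from(point_w, cam_pos, cam_R, fov_deg=60.0):
    """Rough level-1 perspective check: is the point in front of the
    other observer and inside their horizontal field of view?"""
    p = world_to_camera(point_w, cam_pos, cam_R)
    depth = p[2]                          # distance along the observer's forward axis
    if depth <= 0:
        return False                      # behind the observer
    angle = np.degrees(np.arctan2(abs(p[0]), depth))
    return angle <= fov_deg / 2

# Example: a cube at the origin, seen by an observer 2 m away looking back at it.
cube = np.zeros(3)
obs_pos = np.array([0.0, 0.0, 2.0])
obs_R = np.array([[1.0, 0.0,  0.0],
                  [0.0, 1.0,  0.0],
                  [0.0, 0.0, -1.0]])     # forward axis points toward the origin
print(visible_from(cube, obs_pos, obs_R))  # True
```

A full VPT system would also need to reason about occlusion and legibility, which is where learned models, rather than pure geometry, come in.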

To push the boundaries of current models, the researchers are combining large language models with synthetic scenes generated via NVIDIA’s Omniverse Replicator. In these controlled simulations, a simple cube is captured from multiple angles, with each view paired with descriptive text and matrices that encode the spatial relationships in the scene. Joel Currie, one of the project’s researchers, points out that this method allows for the rapid creation of diverse data pairs, setting the stage for more dynamic learning. Instead of relying solely on static geometry, this approach encourages models to learn spatial nuances from both visual and textual cues.
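As an illustration of what such image-text-matrix triples might look like, here is a small, hypothetical NumPy sketch that orbits a virtual camera around a cube and emits a camera pose matrix alongside a templated description for each view. It is not the team's Omniverse Replicator pipeline; the function names, scene layout, and caption template are assumptions made for the example.

```python
import numpy as np

def look_at(cam_pos, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a 4x4 camera-to-world pose matrix that looks at `target`."""
    forward = target - cam_pos
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2] = right, true_up, forward
    pose[:3, 3] = cam_pos
    return pose

def make_pairs(radius=2.0, n_views=8):
    """Yield (camera pose matrix, text description) pairs for views
    orbiting a cube at the origin -- a stand-in for the rendered image,
    caption, and transform produced per view in simulation."""
    for i in range(n_views):
        theta = 2 * np.pi * i / n_views
        cam_pos = radius * np.array([np.cos(theta), 0.0, np.sin(theta)])
        pose = look_at(cam_pos, target=np.zeros(3))
        caption = (f"The cube is {radius:.1f} m from the camera, "
                   f"viewed from an azimuth of {np.degrees(theta):.0f} degrees.")
        yield pose, caption

for pose, caption in make_pairs():
    print(caption)
```

In a pipeline like the one described, each pose would accompany a rendered image of the view, so the model is trained on aligned visual, textual, and geometric signals.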

The ultimate aim is to foster what’s known as embodied cognition, in which robots go beyond mere perception and begin to imagine alternative perspectives. This goes beyond improving what the models can describe in an image; it could also contribute to more natural human-robot interaction. Future work will focus on refining the realism of these simulated scenes, with the goal of transferring these insights to practical, real-world applications.
