
Google’s Bold Move: Combining Gemini and Veo AI Models

April 11, 2025

In a recent chat on the podcast Possible, hosted by LinkedIn co-founder Reid Hoffman, DeepMind’s CEO Demis Hassabis shared some exciting news. Google is planning to bring together its Gemini AI models with the Veo video-generating models. The big idea here? To boost the AI’s understanding of the physical world around us.

Hassabis noted, “We’ve always built Gemini, our foundation model, to be multimodal from the beginning,” explaining that the aim has been a digital assistant that can genuinely help in everyday, real-world situations.

The AI world is buzzing with a trend toward ‘omni’ models—platforms that can handle all sorts of media. Google’s latest version of Gemini is already skilled at creating audio, images, and text. Meanwhile, OpenAI’s ChatGPT is making waves with its ability to craft images, even in unique styles like Studio Ghibli. Amazon’s also gearing up to launch an ‘any-to-any’ model later this year.

Creating these all-encompassing models is no small feat. It requires a mountain of training data, from images and videos to audio and text. Hassabis pointed out that Veo’s video training data mainly comes from YouTube, which is a Google subsidiary. “Basically, by watching YouTube videos—a lot of YouTube videos—[Veo 2] can figure out, you know, the physics of the world,” he said.

Google has previously told TechCrunch that its models could be trained on certain YouTube content, following agreements with YouTube creators. In line with this, Google expanded its terms of service last year, partly to tap into more data for AI training.

It’s fascinating to see how these developments might shape the future of AI, and it’s clear Google is at the forefront of this transformation.
