In a recent episode of the podcast Possible, hosted by LinkedIn co-founder Reid Hoffman, Demis Hassabis, the CEO of Google DeepMind, shared notable news about the future of AI at Google: the company plans to eventually combine its Gemini AI models with its Veo video-generating models. The aim is to make Gemini better at understanding the real world.
Hassabis explained, “We’ve always built Gemini, our foundation model, to be multimodal from the beginning.” This means they’re aiming for an AI that’s not just good at one thing but can handle multiple types of media. Imagine a universal digital assistant that’s truly helpful in everyday life. That’s the vision they’re working towards.
The AI industry is moving towards models that can seamlessly synthesize different media types. Google's latest Gemini models can already generate audio as well as images and text, while OpenAI's ChatGPT can create images, including ones in the distinctive style of Studio Ghibli. Amazon, meanwhile, is gearing up to launch its own "any-to-any" model later this year.
Creating these sophisticated models requires vast amounts of training data: images, videos, audio, and text. Hassabis mentioned that Veo's video data comes mainly from YouTube, which makes sense given that Google owns the platform. "By watching YouTube videos—a lot of them—[Veo 2] can figure out the physics of the world," he noted. It's a practical way to teach an AI about real-world dynamics.
Google has previously told TechCrunch that its AI models may be trained on some YouTube content, in accordance with its agreements with YouTube creators. In 2024, the company expanded its terms of service in part to allow broader data access for AI training. This is all part of Google's strategy to build more capable and versatile AI models.