Skip to main content

Google’s new AI generates audio soundtracks from pixels

An AI generated wolf howling
Google Deep Mind

Deep Mind showed off the latest results from its generative AI video-to-audio research on Tuesday. It’s a novel system that combines what it sees on-screen with the user’s written prompt to create synced audio soundscapes for a given video clip.

The V2A AI can be paired with vide -generation models like Veo, Deep Mind’s generative audio team wrote in a blog post, and can create soundtracks, sound effects, and even dialogue for the on-screen action. What’s more, Deep Mind claims that its new system can generate “an unlimited number of soundtracks for any video input” by tuning the model with positive and negative prompts that encourage or discourage the use of a particular sound, respectively.

V2A Cars

The system works by first encoding and compressing the video input, which the diffusion model then leverages to iteratively refine the desired audio effects from background noise based on the user’s optional text prompt and from the visual input. This audio output is finally decoded and exported as a waveform that can then be recombined with the video input.

Recommended Videos

The best part is that the user doesn’t have to go in and manually (read: tediously) sync the audio and video tracks, as the V2A system does it automatically. “By training on video, audio and the additional annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts,” the Deep Mind team wrote.

V2A Wolf

The system is not yet perfected, however. For one, the output audio quality is dependent on the fidelity of the video input and the system gets tripped up when video artifacts or other distortions are present in the input. According to the Deep Mind team, syncing dialogue to the audio track remains an ongoing challenge.

V2A Claymation family

“V2A attempts to generate speech from the input transcripts and synchronize it with characters’ lip movements,” the team explained. “But the paired vide- generation model may not be conditioned on transcripts. This creates a mismatch, often resulting in uncanny lip-syncing, as the video model doesn’t generate mouth movements that match the transcript.”

The system still needs to undergo “rigorous safety assessments and testing” before the team will consider releasing it to the public. Every video and soundtrack generated by this system will be affixed with Deep Mind’s SynthID watermarks. This system is far from the only audio-generating AI currently on the market. Stability AI dropped a similar product just last week while ElevenLabs released their sound effects tool last month.

Andrew Tarantola
Andrew Tarantola is a journalist with more than a decade reporting on emerging technologies ranging from robotics and machine…
Google’s AI detection tool is now available for anyone to try
Gemini running on the Google Pixel 9 Pro Fold.

Google announced via a post on X (formerly Twitter) on Wednesday that SynthID is now available to anybody who wants to try it. The authentication system for AI-generated content embeds imperceptible watermarks into generated images, video, and text, enabling users to verify whether a piece of content was made by humans or machines.

“We’re open-sourcing our SynthID Text watermarking tool,” the company wrote. “Available freely to developers and businesses, it will help them identify their AI-generated content.”

Read more
Radiohead’s Thom Yorke among thousands of artists who issue AI protest
Thom Yorke on stage.

Leading actors, authors, musicians, and novelists are among 11,500 artists to have put their name to a statement calling for a halt to the unlicensed use of creative works to train generative AI tools like OpenAI’s ChatGPT, describing it as a “threat” to the livelihoods of creators.

The open letter, comprising just 29 words, says: “The unlicensed use of creative works for training generative AI is a major, unjust threat to the livelihoods of the people behind those works, and must not be permitted.”

Read more
The best AI chatbots to try: ChatGPT, Gemini, and more
Bing Chat shown on a laptop.

The idea of chatbots has been around since the early days of the internet. But even compared to popular voice assistants like Siri, the generated chatbots of the modern era are far more powerful.

Yes, you can converse with them in natural language. But these AI chatbots can generate text of all kinds, from poetry to code, and the results really are exciting. ChatGPT remains in the spotlight, but as interest continues to grow, more rivals are popping up to challenge it.
OpenAI ChatGPT and ChatGPT Plus

Read more