Google’s DeepMind research lab has developed a video-to-audio (V2A) technology that generates synchronized audio directly from video pixels. The tool lets editors insert text prompts in different languages to guide the audio output, and it works on traditional footage as well as AI-generated video, producing an unlimited number of soundtracks for any video input.
The V2A system first encodes the video input into a compressed representation. A diffusion model, an approach DeepMind says gave more realistic results than an autoregressive architecture, then iteratively refines audio from random noise, guided by the video frames. The output is decoded into an audio waveform and combined with the video data. Google trained the model on video, audio, and annotations so it could learn the link between visual events and the sounds that accompany them.
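For readers who want a mental model of that pipeline, here is a minimal NumPy sketch of the same four stages. Every function name, shape, and update rule below is a toy stand-in, since DeepMind has not released code or architectural details; the point is only the flow of encode, refine from noise, decode, and combine.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frames: np.ndarray) -> np.ndarray:
    """Toy stand-in for the learned video encoder: collapse each frame
    into a 16-dimensional feature vector (the "compressed representation")."""
    return frames.reshape(frames.shape[0], -1)[:, :16]

def denoise_step(audio_latent: np.ndarray, video_features: np.ndarray) -> np.ndarray:
    """One toy refinement step: pull the noisy audio latent toward a target
    derived from the video features. A real diffusion model would instead use
    a neural network conditioned on the visual (and optional text) features."""
    target = video_features.mean(axis=0)          # (16,) summary of the clip
    return audio_latent + 0.1 * (target - audio_latent)

def decode_audio(audio_latent: np.ndarray, num_samples: int = 16_000) -> np.ndarray:
    """Toy stand-in for the audio decoder: upsample the refined latent
    into a 1-D waveform."""
    coarse = audio_latent.mean(axis=1)            # (latent_len,)
    x_old = np.linspace(0.0, 1.0, coarse.size)
    x_new = np.linspace(0.0, 1.0, num_samples)
    return np.interp(x_new, x_old, coarse)

frames = rng.random((24, 32, 32, 3))              # stand-in for decoded video frames
video_features = encode_video(frames)             # 1. compress the video input
audio_latent = rng.standard_normal((64, 16))      # 2. start from random noise
for _ in range(50):                               # 3. iteratively refine, guided by the video
    audio_latent = denoise_step(audio_latent, video_features)
waveform = decode_audio(audio_latent)             # 4. decode into an audio waveform
print(waveform.shape)                             # ready to be combined with the video
```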
The model also gives users control over the output: text cues can guide the tone or style of the sound, so editors can experiment with different soundtracks and choose the best fit for their footage. However, the quality of the generated audio depends on the quality of the video input, as distortion or artifacts in the video can degrade the audio.
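To make that workflow concrete, the sketch below shows how an editor might run the same clip through several text cues and keep whichever result fits best. The generate_soundtrack function, its parameters, and the example prompts are purely illustrative assumptions; the real tool exposes no public API.

```python
from dataclasses import dataclass

@dataclass
class SoundtrackCandidate:
    prompt: str
    audio_path: str

def generate_soundtrack(video_path: str, prompt: str, seed: int) -> SoundtrackCandidate:
    """Hypothetical call that would return a generated soundtrack for the clip."""
    out = f"{video_path.rsplit('.', 1)[0]}_{seed}.wav"
    print(f"Generating audio for {video_path!r} with prompt {prompt!r} (seed={seed})")
    return SoundtrackCandidate(prompt=prompt, audio_path=out)

# Try several text cues against the same footage and keep whichever fits best.
prompts = [
    "tense orchestral score, slow build",
    "ambient street noise, distant traffic",
    "upbeat jazz drums, brushed snare",
]
candidates = [generate_soundtrack("clip.mp4", p, seed=i) for i, p in enumerate(prompts)]
for c in candidates:
    print(c.prompt, "->", c.audio_path)
```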
While the V2A technology is a promising tool for video editors, it is currently available only to DeepMind researchers while it undergoes rigorous safety evaluations and testing ahead of any public release. The tool also has not yet mastered lip-syncing: audio generated from a speech transcript can drift out of sync because the underlying video generation model is not conditioned on that transcript. DeepMind did not disclose exactly what data the V2A tool was trained on, but Google’s ownership of YouTube gives the company a potential advantage in developing and deploying this technology.
Article Source
https://petapixel.com/2024/06/21/googles-video-to-audio-tool-generates-music-from-pixels/