
Going All Out! Google Jumps into Video-to-Audio: Realistic Sound Effects Let AI Video Bid Farewell to Silent Films

PHPz · Original · 2024-06-19 09:36:24 · 313 views
The AI world is blossoming on every front, and the surprises keep coming for onlookers.

The past few days on the other side of the ocean have been wild.

The excitement over Luma hadn't even faded when Runway dropped a bombshell last night: Gen-3 Alpha. (For details, see: Runway's answer to Sora released: high fidelity, super consistency, Gen-3 Alpha stuns netizens)

Even more unexpectedly, by morning Google DeepMind had news of its own, quietly announcing progress on its video-to-audio (V2A) technology.
Although the feature is not yet open to the public, the official video demos look quite polished. Google DeepMind also emphasized that all examples were created jointly by V2A technology and Veo, their most advanced generative video model.

Audio prompt: An exciting horror movie soundtrack, footsteps echoing on the concrete. (Cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete)
In a dark abandoned warehouse, a man in black glides along like a ghost; eerie music and echoing footsteps fill the air with dread.

Audio prompt: A wolf howls in the moonlight. (Wolf howling at the moon)
As soon as the demos dropped, the comment section asked in unison: when will this be available?
Some netizens are hoping the open-source community will play savior and replicate Google's technology.
In fact, shortly after Google DeepMind's announcement, ElevenLabs, a leader in the AI audio field, stepped in and open-sourced a project that automatically dubs uploaded videos, generating suitable sound effects for them.
Link: https://elevenlabs.io/docs/api-reference/how-to-use-text-to-sound-effects

Competition in the AI world has turned fierce. Big and small players chasing one another makes for a fairer playing field, and once these technologies mature, the possibilities for AI video are endless.
AI Video Bids Farewell to Silent Films

Video generation models are advancing at an astonishing pace. Yet whether it's Sora, which stunned the world at the start of the year, or the more recent Kling, Luma, and Gen-3 Alpha, they are all, without exception, "silent films."

Google DeepMind's video-to-audio (V2A) technology makes synchronized audiovisual generation possible: it combines video pixels with natural-language text prompts to generate rich audio for the on-screen action.

In application, V2A can be paired with video generation models such as Veo to create shots with dramatic scores, realistic sound effects, or dialogue that matches the characters and tone of the video.

It can also generate soundtracks for archival material, silent films, and other traditional footage, broadening creative possibilities.

Audio prompt: Cute baby dinosaurs chirp in the jungle, accompanied by the sound of cracking eggshells. (Cute baby dinosaur chirps, jungle ambience, egg cracking)
Audio prompt: Cars skidding and engines roaring, accompanied by angelic electronic music. (cars skidding, car engine throttling, angelic electronic music)
Audio prompt: At sunset, a mellow harmonica plays over the prairie. (a slow mellow harmonica plays as the sun goes down on the prairie)
V2A can generate an unlimited number of soundtracks for any video input. Users can define a "positive prompt" to steer generation toward desired sounds, or a "negative prompt" to steer it away from undesired ones.

This flexibility gives users more control over audio output, allowing them to quickly try different audio outputs and choose the best match.
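Google has not published how its positive/negative prompts are implemented, but one common mechanism for this kind of steering in diffusion models is classifier-free guidance. The toy sketch below is purely illustrative (the function name, weights, and vectors are all assumptions, not Google's API): it combines an unconditional score estimate with scores conditioned on a positive and a negative prompt, pushing one denoising step toward the wanted sound and away from the unwanted one.

```python
# Hypothetical sketch: steering one denoising step with positive and
# negative text prompts via classifier-free-guidance-style arithmetic.
# All names and numbers here are illustrative assumptions.

def guided_score(uncond, pos, neg, w_pos=3.0, w_neg=1.5):
    """Combine unconditional, positive-prompt, and negative-prompt score
    estimates (toy 1-D vectors) into one guided denoising direction."""
    return [
        u + w_pos * (p - u) - w_neg * (n - u)
        for u, p, n in zip(uncond, pos, neg)
    ]

# Toy score estimates for a single step:
uncond = [0.0, 0.0]   # model output with no text prompt
pos    = [1.0, 0.5]   # ...conditioned on "footsteps on concrete"
neg    = [0.2, -0.4]  # ...conditioned on "music" (a sound to avoid)

print(guided_score(uncond, pos, neg))  # pushed toward pos, away from neg
```

Raising `w_pos` strengthens adherence to the desired sound; raising `w_neg` pushes the output further from the unwanted one.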

Audio prompt: A spaceship hurtles through the vastness of space, stars streaking past it at high speed, full of sci-fi feeling. (A spaceship hurtles through the vastness of space, stars streaking past it, high speed, Sci-fi)
Audio prompt: Ethereal cello atmosphere
How It Works

The research team experimented with both autoregressive and diffusion approaches to find the most scalable AI architecture. Diffusion delivered the most realistic and compelling results for synchronizing audio with video.

The V2A system first encodes the video input into a compressed representation. A diffusion model then iteratively refines the audio from random noise, guided by the visual input and any natural-language prompts, producing synchronized, realistic audio tightly aligned with those prompts. Finally, the audio output is decoded into a waveform and combined with the video data.
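The pipeline described above can be sketched structurally as follows. Every function body here is a stand-in assumption (DeepMind has not released its implementation): the point is only the shape of the loop — encode the video, start from noise, iteratively denoise under the conditioning signal, then decode.

```python
import random

# Minimal structural sketch of the described V2A pipeline.
# All function bodies are toy stand-ins, not DeepMind's models.

def encode_video(frames):
    # Stand-in: compress each frame to one conditioning value.
    return [sum(f) / len(f) for f in frames]

def denoise_step(latent, cond):
    # Stand-in: nudge the noisy latent toward the conditioning signal.
    return [l + 0.5 * (c - l) for l, c in zip(latent, cond)]

def decode_audio(latent):
    # Stand-in: map the refined latent to a "waveform".
    return latent

def v2a(frames, steps=20, seed=0):
    rng = random.Random(seed)
    cond = encode_video(frames)
    latent = [rng.gauss(0, 1) for _ in cond]  # start from pure noise
    for _ in range(steps):
        latent = denoise_step(latent, cond)
    return decode_audio(latent)

wave = v2a([[0.2, 0.4], [0.6, 0.8]])
print(wave)  # converges toward the video conditioning signal
```

In the real system the conditioning would also include the text prompt, and the decoder would be a learned model mapping audio latents back to a waveform.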
To generate higher-quality audio and guide the model toward specific sounds, the team added more information during training, including AI-generated annotations that describe the sounds in detail, as well as dialogue transcripts.

By training on video, audio, and these additional annotations, the technology learns to associate specific audio events with various visual scenes, while also responding to the information provided in the annotations or transcripts.
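The kind of training triple the passage describes might look like the sketch below. The field names are illustrative assumptions (Google has not published its data schema); it simply shows video features paired with target audio plus the two text signals the article mentions.

```python
from dataclasses import dataclass

# Hedged sketch of one training example of the kind described above.
# Field names are illustrative assumptions, not Google's schema.

@dataclass
class V2ATrainingExample:
    video_latents: list    # compressed video representation
    audio_target: list     # ground-truth audio (e.g. spectrogram frames)
    sound_annotation: str  # AI-generated description of the sounds
    transcript: str        # dialogue text, if any

example = V2ATrainingExample(
    video_latents=[0.1, 0.9],
    audio_target=[0.3, 0.7],
    sound_annotation="wolf howling at the moon, night ambience",
    transcript="",
)
print(example.sound_annotation)
```

Training on many such triples is what lets the model tie audio events to visual scenes while still honoring the text signals.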

Google emphasizes that this technology differs from existing video-to-audio solutions in that it understands raw pixels, and text prompts are optional. The system also requires no manual alignment of the generated sound with the video, greatly simplifying the creative process.
That said, Google's technology is not perfect, and the team is still working through some issues. For example, the quality of the video input directly affects the quality of the audio output: artifacts or distortion in the video can degrade the audio.

They are also refining lip synchronization.

V2A attempts to generate speech from the input transcript and synchronize it with characters' mouth movements, but if the video model is not conditioned on that transcript, the lips and speech can fall out of sync. The team is improving the technology to make lip sync more natural.
Audio prompt: Music, with the transcript "This turkey looks amazing, I'm so hungry." (Music, Transcript: "this turkey looks amazing, I'm so hungry")

Perhaps because of the many societal problems deepfakes have caused, Google DeepMind is at pains to stress that it will develop and deploy AI responsibly: before V2A is opened to the public, it will undergo rigorous safety evaluation and testing.

They have also integrated the SynthID toolkit into the V2A research, watermarking all AI-generated content to guard against misuse.

Reference links:

https://deepmind.google/discover/blog/generating-audio-for-video/

https://x.com/GoogleDeepMind/status/1802733643992850760


