# This audio went viral on the Internet! Generate realistic sound effects from text and pictures with one click: AIGC is coming to the audio industry
Recently, AIGC has been trending, and its popularity remains high. Beyond its famous name, its breakthroughs are genuinely remarkable: images, videos, and even 3D models can now be generated automatically from natural-language input. Surprising, isn't it?
In the field of audio and sound effects, however, AIGC's progress has lagged behind, mainly because high-degree-of-freedom audio generation relies on large amounts of text-audio pair data, and modeling long-term waveforms poses many difficulties. To address these problems, Zhejiang University and Peking University jointly proposed Make-An-Audio, an innovative text-to-audio generation system. It takes a natural-language description, or a prompt in any other modality (text, audio, image, or video), as input and outputs audio sound effects that match the description. Its controllability and generalization have made it hard for netizens to ignore.
In just two days, the demo video received 45K views on Twitter.
Around the start of 2023, a wave of audio synthesis work emerged, including Make-An-Audio and MusicLM: four breakthrough developments within 48 hours.
Many netizens have commented that AIGC sound-effect synthesis will change the future of film and short-video production, and more than one has sighed: "audio is all you need..."
## Audio effect demos

Without further ado, let's look at the results.

Text 1: a speedboat running as wind blows into a microphone
Generated audio 1: (audio clip, 0:09)

Text 2: fireworks pop and explode
Generated audio 2: (audio clip, 0:09)

Have you ever been troubled by repairing damaged audio? With Make-An-Audio, this becomes much easier.

Audio before repair: (audio clip, 0:09)
Audio after repair: (audio clip, 0:09)

## Generating sound effects from images

Understanding a picture and generating matching sound effects? Not impossible either.

Picture 1: (image)
Generated audio: (audio clip, 0:09)

Picture 2: (image)
Generated audio: (audio clip, 0:09)

## Generating sound effects from videos

Generating sound effects that match video content is also easy for this model.

Video 1: (video)
Generated audio: (audio clip, 0:09)

Video 2: (video)
Generated audio: (audio clip, 0:09)
## Inside the "internet celebrity" model: technical principles

An in-depth analysis of this "internet celebrity" model has to start from the objective problem of sparse audio-natural-language data. To address it, the Volcano Voice team, in collaboration with Zhejiang University and Peking University, proposed the Distill-then-Reprogram text enhancement strategy: teacher models distill natural-language descriptions for audio, and random recombination then yields dynamic training samples. Specifically, in the Distill stage, audio-captioning and audio-text retrieval models produce candidate natural-language descriptions for unlabeled audio; the similarity between each candidate and the audio is computed, and the best result above a threshold is kept as the audio's description. This approach generalizes well, and using real natural language avoids out-of-domain text at test time. "In the Reprogram stage, the team randomly samples from additional event datasets and combines them with the current training sample to obtain new concept combinations and descriptions, increasing the model's robustness to different event combinations," the research team said.

Framework diagram of the Distill-then-Reprogram text enhancement strategy
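To make the two stages concrete, here is a minimal Python sketch of the Distill-then-Reprogram idea as described above. The `captioner`, `retriever`, and `similarity_fn` callables are hypothetical stand-ins for the teacher models, not the released API.

```python
import random
import numpy as np

def distill(audio, captioner, retriever, similarity_fn, threshold=0.5):
    """Distill stage: pick a natural-language description for unlabeled audio.
    Candidates come from an audio-captioning model and an audio-text retriever
    (both hypothetical stand-ins); keep the best one above the threshold."""
    candidates = captioner(audio) + retriever(audio)
    best = max(candidates, key=lambda text: similarity_fn(audio, text))
    return best if similarity_fn(audio, best) >= threshold else None

def reprogram(sample, event_bank, rng=random):
    """Reprogram stage: overlay a randomly drawn event clip onto the current
    sample and compose their captions, yielding a new concept combination."""
    extra = rng.choice(event_bank)
    length = max(len(sample["audio"]), len(extra["audio"]))
    mixed = np.zeros(length)
    mixed[: len(sample["audio"])] += sample["audio"]
    mixed[: len(extra["audio"])] += extra["audio"]
    return {"audio": mixed, "caption": f'{sample["caption"]} and {extra["caption"]}'}
```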
As shown in the figure above, the approach transfers the successes of self-supervised image generation to audio spectrograms: a spectrogram autoencoder tackles the problem of long audio sequences, and a Latent Diffusion generative model then predicts the self-supervised representation, avoiding direct prediction of long-term waveforms.

Make-An-Audio model system framework diagram
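The following toy sketch illustrates this two-part design: compress the mel-spectrogram into a short latent sequence, then run denoising diffusion in that latent space. The shapes, the linear noise schedule, and the `denoiser` module are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SpectrogramAutoencoder(nn.Module):
    """Compresses a mel-spectrogram into a 4x-shorter latent sequence (toy sizes)."""
    def __init__(self, n_mels=80, latent_dim=32):
        super().__init__()
        self.encoder = nn.Conv1d(n_mels, latent_dim, kernel_size=4, stride=4)
        self.decoder = nn.ConvTranspose1d(latent_dim, n_mels, kernel_size=4, stride=4)

    def encode(self, mel):  # (B, n_mels, T) -> (B, latent_dim, T // 4)
        return self.encoder(mel)

    def decode(self, z):    # (B, latent_dim, T // 4) -> (B, n_mels, T)
        return self.decoder(z)

def diffusion_training_step(autoencoder, denoiser, mel, text_emb, num_steps=1000):
    """One text-conditioned denoising step in latent space instead of on waveforms."""
    z = autoencoder.encode(mel)                          # work on short latents
    t = torch.randint(0, num_steps, (z.size(0),))
    noise = torch.randn_like(z)
    alpha = 1.0 - t.float().view(-1, 1, 1) / num_steps   # toy linear schedule
    z_noisy = alpha.sqrt() * z + (1 - alpha).sqrt() * noise
    pred = denoiser(z_noisy, t, text_emb)                # predict the injected noise
    return torch.mean((pred - noise) ** 2)               # standard eps-MSE loss
```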
In addition, the team explored powerful text-conditioning strategies, including Contrastive Language-Audio Pretraining (CLAP) and language models (LLMs) such as T5 and BERT, and verified both the effectiveness and the computational friendliness of the CLAP text representation. The team also used CLAP Score for the first time to evaluate generated audio, measuring the consistency between the text and the generated scene. Combining subjective and objective evaluation, the model's effectiveness was verified on benchmark-dataset tests, demonstrating excellent zero-shot generalization.

Subjective and objective evaluation results for Make-An-Audio and baseline models
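CLAP Score can be read as the average cosine similarity between paired text and audio embeddings from a pretrained CLAP model. A minimal sketch, assuming the embeddings have already been produced by hypothetical `encode_text` / `encode_audio` calls:

```python
import torch
import torch.nn.functional as F

def clap_score(text_embs: torch.Tensor, audio_embs: torch.Tensor) -> float:
    """Mean cosine similarity between paired (N, D) text and audio embeddings."""
    t = F.normalize(text_embs, dim=-1)   # unit-normalize so dot product = cosine
    a = F.normalize(audio_embs, dim=-1)
    return (t * a).sum(dim=-1).mean().item()
```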
## Application prospects of the magic model

Make-An-Audio implements highly controllable X-to-audio AIGC synthesis for the first time, where X can be text, audio, image, or video.
For visually guided audio synthesis, Make-An-Audio uses the CLIP text encoder as its condition; because CLIP's image and text embeddings share a joint space, the model can synthesize audio directly from an image encoding.

Make-An-Audio vision-to-audio synthesis framework diagram
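A minimal sketch of that idea, reusing the latent-diffusion pieces from the earlier snippet: swap the text embedding for a CLIP image embedding and denoise from random latent noise. The sampler below is a toy update rather than a proper DDPM/DDIM step, and all model arguments (`clip_model`, `denoiser`, `autoencoder`, `vocoder`) are hypothetical stand-ins.

```python
import torch

@torch.no_grad()
def image_to_audio(image, clip_model, denoiser, autoencoder, vocoder, steps=50):
    """Synthesize audio from an image via CLIP's joint image-text space."""
    cond = clip_model.encode_image(image)   # (B, D) embedding replaces text cond
    z = torch.randn(1, 32, 64)              # start from latent noise (toy shape)
    for t in reversed(range(steps)):
        t_batch = torch.full((z.size(0),), t)
        eps = denoiser(z, t_batch, cond)    # predict noise given image condition
        z = z - eps / steps                 # toy denoising update
    mel = autoencoder.decode(z)             # latent -> mel-spectrogram
    return vocoder(mel)                     # mel -> waveform
```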
It is foreseeable that audio synthesis AIGC will play an important role in film dubbing, short-video creation, and other fields. With the help of models such as Make-An-Audio, everyone may one day become a professional sound-effects engineer, synthesizing lifelike audio and sound effects from text, video, and images anytime, anywhere.

However, Make-An-Audio is not perfect at this stage. Perhaps because of its wide-ranging data sources and inevitable sample-quality issues, side effects can occur during training, such as generating audio that does not match the text content. Make-An-Audio is therefore technically positioned as "assisting artists' creation". One thing is certain, though: the progress in the AIGC field is genuinely surprising.

Volcano Voice has long provided ByteDance's major business lines with globally competitive AI voice technology and full-stack voice product solutions, covering audio understanding, audio synthesis, virtual digital humans, conversational interaction, music retrieval, smart hardware, and more. Since its establishment in 2017, the team has focused on developing industry-leading AI voice technology and on combining AI with business scenarios efficiently to deliver greater user value. Its speech recognition and speech synthesis now cover multiple languages and dialects, and many of its technical papers have been accepted at top AI conferences. The team provides leading voice capabilities for Douyin, Jianying, Feishu, Tomato Novels, Pico, and other businesses, serving diverse scenarios such as short video, live streaming, video creation, office work, and wearable devices, and is open to external companies through the Volcano Engine.