This audio went viral on the Internet! Generate realistic sound effects from text and pictures with one click, AIGC is coming to the audio industry

AIGC (AI-generated content) has been trending recently, and interest remains high. Beyond the buzz, its breakthroughs are remarkable: images, videos, and even 3D models can now be generated automatically from natural-language input. Surprising, isn't it?

In audio and sound effects, however, AIGC has lagged behind. The main reasons are that high-degree-of-freedom audio generation depends on large amounts of paired text-audio data, and that modeling long waveforms is difficult. To address these difficulties, Zhejiang University and Peking University jointly proposed an innovative text-to-audio generation system, Make-An-Audio. It takes a natural-language description as input, or input in another modality (text, audio, image, or video), and outputs audio sound effects that match the description. Its controllability and generalization have won over many netizens.



  • Paper link: https://arxiv.org/abs/2301.12661
  • Project link: https://text-to-audio.github.io

In just two days, the demo video received 45K views on Twitter.

After New Year’s Eve 2023, a wave of audio synthesis papers emerged, including Make-An-Audio and MusicLM, with four breakthrough results appearing within 48 hours.

User comments 1

Many netizens have said that AIGC sound-effect synthesis will change the future of film and short-video production.

User comments 2

User comments 3

Other netizens sighed: "Audio is all you need..."

User comments 4

Audio effect display

Without further ado, here are the results.

Generate sound effects based on text

It turns out that this can be convenient and smooth.

Text 1: a speedboat running as wind blows into a microphone

Converted audio 1 (audio clip, 00:09)

Text 2: fireworks pop and explode

Converted audio 2 (audio clip, 00:09)

Have you ever struggled to repair damaged audio? With Make-An-Audio, this becomes much easier.

Before repair

Audio before repair (audio clip, 00:09)

After repair

Audio after repair (audio clip, 00:09)

Understanding pictures to generate sound effects is not impossible either.

Picture 1

Picture 1 converted to audio (audio clip, 00:09)

Picture 2

Picture 2 converted to audio (audio clip, 00:09)

The model can also easily generate sound effects that match video content.

Video 1

Video 1 converted to audio (audio clip, 00:09)

Video 2

Video 2 converted to audio (audio clip, 00:09)

Internal Technical Principles of the Model

A close look at what lies behind this viral model has to start with the underlying problem: paired audio and natural-language data are sparse. To address it, the Volcano Voice team, together with Zhejiang University and Peking University, proposed the Distill-then-Reprogram text enhancement strategy, which uses teacher models to obtain natural-language descriptions of audio and then builds dynamic training samples through random recombination.

Specifically, in the Distill stage, audio-to-text and audio-text retrieval models are used to find candidate natural-language descriptions for language-free audio. The matching similarity between each candidate text and the audio is computed, and the best result, subject to a threshold, is taken as the description of the audio. This approach generalizes well, and using real natural language avoids out-of-domain text at test time. In the Reprogram stage, the research team said, events are randomly sampled from additional event datasets and combined with the current training sample to obtain new concept combinations and descriptions, which increases the model's robustness to different event combinations.
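The two stages described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the team's implementation: the `Sample` type, the function names, the threshold semantics (keep the best candidate only if its score reaches the threshold), the use of concatenation to combine audio, and the "and" caption template are all assumptions made for the example. Candidate captions and their similarity scores are taken as given; in practice they would come from the teacher models.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Sample:
    audio: List[float]  # placeholder waveform (hypothetical representation)
    caption: str


def distill_caption(candidates: Sequence[Tuple[str, float]],
                    threshold: float) -> Optional[str]:
    """Pick the candidate caption with the highest similarity score.

    Returns None when there are no candidates or when even the best
    score falls below the (assumed) threshold.
    """
    if not candidates:
        return None
    text, score = max(candidates, key=lambda c: c[1])
    return text if score >= threshold else None


def reprogram(sample: Sample, event_pool: Sequence[Sample],
              rng: random.Random, max_extra: int = 2) -> Sample:
    """Combine a training sample with randomly drawn event samples.

    Concatenating audio and joining captions with "and" are
    illustrative choices, not the method from the paper.
    """
    if not event_pool:
        return sample
    k = rng.randint(1, min(max_extra, len(event_pool)))
    extras = rng.sample(list(event_pool), k)
    audio = list(sample.audio)
    parts = [sample.caption]
    for extra in extras:
        audio.extend(extra.audio)
        parts.append(extra.caption)
    return Sample(audio=audio, caption=" and ".join(parts))
```

Because `reprogram` draws new events each time it is called, the same base sample can yield a different audio-caption pair on every pass over the data, which is one way to read "dynamic training samples".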

Figure: Distill-then-Reprogram text enhancement strategy framework diagram

As shown in the figure above, self-supervised learning has been successfully transferred from images to audio spectrograms. The model uses a spectrogram autoencoder to handle long audio sequences and a latent diffusion generative model to predict the self-supervised representations, which avoids predicting long waveforms directly.
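To give a rough sense of why working in a compressed spectrogram space helps with long sequences, the following sketch compares the number of values in a raw waveform with the size of a downsampled latent. Every number here (16 kHz sample rate, hop length 256, 80 mel bins, 8x downsampling, 4 latent channels) is an assumption chosen for illustration, not a configuration taken from the paper.

```python
import math
from typing import Tuple


def latent_shape(duration_s: float,
                 sample_rate: int = 16_000,   # assumed
                 hop_length: int = 256,       # assumed
                 n_mels: int = 80,            # assumed
                 downsample: int = 8,         # assumed autoencoder factor
                 latent_channels: int = 4     # assumed
                 ) -> Tuple[int, Tuple[int, int], Tuple[int, int, int]]:
    """Return (waveform samples, spectrogram shape, latent shape)."""
    num_samples = int(duration_s * sample_rate)
    frames = math.ceil(num_samples / hop_length)
    spectrogram = (n_mels, frames)
    latent = (latent_channels, n_mels // downsample, frames // downsample)
    return num_samples, spectrogram, latent


samples, spec, latent = latent_shape(10.0)
print(samples, spec, latent, math.prod(latent))
```

Under these assumed settings, a 10-second clip is 160,000 waveform samples, but the latent that the diffusion model has to predict holds only a few thousand values.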

Figure: Make-An-Audio model system framework diagram

In addition, the team explored strong text-conditioning strategies, including Contrastive Language-Audio Pretraining (CLAP) and language models such as T5 and BERT, and verified that CLAP text representations are both effective and computationally friendly. The work also uses CLAP Score for the first time to evaluate generated audio, measuring the consistency between the text and the generated scene. Using a combination of subjective and objective evaluation, the team verified the model's effectiveness on benchmark datasets and demonstrated excellent zero-shot generalization.
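A score of this kind can be sketched as follows, assuming it is computed as the cosine similarity between a text embedding and an audio embedding in a shared space, in the way CLIP-style scores are usually defined. The article does not give the exact formula, so treat this as an assumption. The embeddings are plain lists of floats here; in practice they would come from a pretrained CLAP text encoder and audio encoder. The function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clap_style_score(text_embeddings: Sequence[Sequence[float]],
                     audio_embeddings: Sequence[Sequence[float]]) -> float:
    """Mean text-audio cosine similarity over paired embeddings."""
    if not text_embeddings or len(text_embeddings) != len(audio_embeddings):
        raise ValueError("need equally many, non-empty embedding lists")
    scores = [cosine_similarity(t, a)
              for t, a in zip(text_embeddings, audio_embeddings)]
    return sum(scores) / len(scores)
```

A higher value indicates that the generated audio lies closer to its prompt in the shared embedding space, which is what "consistency between the text and the generated scene" would mean under this assumption.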

Figure: Subjective and objective evaluation results for Make-An-Audio and baseline models

How much do you know about the application prospects of the magic model?

Overall, Make-An-Audio achieves high-quality, highly controllable audio synthesis. It also proposes "No Modality Left Behind": fine-tuning the text-conditional audio model to unlock audio synthesis from input in any modality (audio/image/video).

Figure: Make-An-Audio achieves highly controllable X-to-audio AIGC synthesis for the first time, where X can be text, audio, image, or video

For visually guided audio synthesis, Make-An-Audio uses the CLIP text encoder as the condition and relies on its joint image-text space, so that audio can be synthesized directly from image encodings.

Figure: Make-An-Audio vision-to-audio synthesis framework diagram

It is foreseeable that AIGC audio synthesis will play an important role in film dubbing, short-video creation, and other fields. With the help of models such as Make-An-Audio, it may become possible for everyone to be a professional sound-effects engineer, using text, video, and images to synthesize lifelike audio and sound effects anytime, anywhere. However, Make-An-Audio is not perfect at this stage. Perhaps because of the wide range of data sources and unavoidable sample-quality issues, training can produce side effects, such as generated audio that does not match the text. Make-An-Audio is therefore technically positioned as "assisted artist generation". One thing is certain: progress in the AIGC field is indeed surprising.

Huoshan Voice (Volcano Voice) has long provided ByteDance’s major business lines with leading AI voice technology and full-stack voice product solutions, including audio understanding, audio synthesis, virtual digital humans, dialogue interaction, music retrieval, and intelligent hardware. Since its establishment in 2017, the team has focused on developing industry-leading AI voice technology and on exploring efficient ways to combine AI with business scenarios to deliver greater user value. Its speech recognition and speech synthesis now cover multiple languages and dialects, and many of its technical papers have been accepted at top AI conferences. The team provides voice capabilities for Douyin, Jianying, Feishu, Tomato Novels, Pico, and other businesses, supporting scenarios such as short videos, live broadcasts, video creation, office work, and wearable devices, and it is open to external companies through Volcano Engine.


Statement
This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn for removal.