It only takes 3 seconds to steal your voice! Microsoft releases speech synthesis model VALL-E: Netizens exclaim that the threshold for 'telephone fraud' has been lowered again

With ChatGPT writing the script and Stable Diffusion generating the illustrations, the only thing a video still needs is a voice actor. Now that's covered too!

Recently, researchers from Microsoft released a new text-to-speech (TTS) model, VALL-E. Given only a three-second audio sample, it can imitate the speaker's voice and synthesize audio for any input text, while also preserving the speaker's emotional tone.


Paper link: https://www.php.cn/link/402cac3dacf2ef35050ca72743ae6ca7

Project link: https://valle-demo.github.io/

Code link: https://github.com/microsoft/unilm

Let's take a look at the results first. Suppose you have a 3-second recording:

[Audio sample: diversity_speaker, 0:03]

Then just enter the text "Because we do not need it." to get the synthesized voice.

[Audio sample: diversity_s1, 0:01]

Using different random seeds, the model can even produce varied personalized syntheses of the same text.

[Audio sample: diversity_s2, 0:02]

VALL-E can also preserve the acoustic environment of the recording. For example, given this voice as input:

[Audio sample: env_speaker, 0:03]

Then, given the text "I think it's like you know um more convenient too.", it outputs synthesized speech that retains the same ambient sound.

[Audio sample: env_vall_e, 0:02]

VALL-E can likewise preserve the speaker's emotion. For example, given an angry voice as input:

[Audio sample: anger_pt, 0:03]

Given the text "We have to reduce the number of plastic bags.", the synthesized speech carries the same anger.

[Audio sample: anger_ours, 0:02]

There are many more examples on the project website.

Specifically, the researchers trained the language model VALL-E on discrete codes extracted from an off-the-shelf neural audio codec model, treating TTS as a conditional language modeling task rather than continuous signal regression.
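To make the formulation concrete, here is a minimal sketch of what "TTS as conditional language modeling" means: speech becomes a sequence of discrete codec tokens, and the model predicts each target token given the phoneme sequence and an acoustic prompt. All names and token IDs here are illustrative, not Microsoft's actual API.

```python
def build_context(phoneme_ids, prompt_tokens):
    """Concatenate the phoneme sequence and the acoustic-prompt tokens into
    one conditioning sequence. The autoregressive model then predicts each
    target acoustic token a_t from p(a_t | phonemes, prompt, a_<t),
    exactly like next-token prediction in a text language model."""
    return list(phoneme_ids) + ["<sep>"] + list(prompt_tokens)

# Hypothetical phoneme IDs and codec tokens, purely for illustration.
context = build_context([101, 102, 103], [7, 7, 9])
print(context)  # [101, 102, 103, '<sep>', 7, 7, 9]
```

The point of the discrete representation is that the whole toolbox of text language modeling (prompting, sampling, large-scale pre-training) carries over unchanged.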

In the pre-training stage, VALL-E was trained on 60,000 hours of English speech, hundreds of times more data than existing systems use.

VALL-E also demonstrates in-context learning: given only a 3-second enrollment recording of an unseen speaker as an acoustic prompt, it can synthesize high-quality personalized speech.

Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity, and can also preserve the speaker's emotion and the acoustic environment of the prompt during synthesis.

Zero-shot Speech Synthesis

Over the past decade, speech synthesis has made huge breakthroughs through the development of neural networks and end-to-end modeling.

But current cascaded text-to-speech (TTS) systems usually use a pipeline of an acoustic model and a vocoder, with mel spectrograms as the intermediate representation.

Although some high-performance TTS systems can synthesize high-quality speech for single or multiple speakers, they still require clean, high-quality recordings made in a studio; large-scale data scraped from the Internet fails to meet this requirement and leads to degraded model performance.

Because training data is relatively scarce, current TTS systems still generalize poorly.

In the zero-shot setting, speech similarity and naturalness drop sharply for speakers who did not appear in the training data.

To solve the zero-shot TTS problem, existing work typically relies on methods such as speaker adaptation or speaker encoding, which require additional fine-tuning, complex hand-designed features, or heavy structural engineering.

Rather than designing a complex, specialized network for this problem, the researchers argue that, given the success of large models in text generation, the ultimate solution is to train the model on as much diverse data as possible.

VALL-E model

In text generation, large-scale unlabeled data from the Internet is fed directly into the model, and performance keeps improving as the amount of training data grows.

The researchers carried this idea over to speech synthesis: VALL-E is the first language-model-based TTS framework, leveraging massive, diverse, multi-speaker speech data.


To synthesize personalized speech, VALL-E generates the target acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and on the phoneme prompt, which constrain the speaker identity and the content respectively.


Finally, the generated acoustic tokens are fed to the corresponding neural codec decoder to synthesize the final waveform.

The discrete acoustic tokens from the audio codec model allow TTS to be treated as conditional codec language modeling, so advanced prompt-based large-model techniques (such as those used with GPTs) can be applied to TTS tasks.

During inference, different sampling strategies over the acoustic tokens can also produce diverse synthesis results for the same input.
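The diversity mechanism can be sketched in a few lines: because decoding samples from a distribution rather than taking the argmax, fixing the seed makes a run reproducible, while different seeds yield different token sequences for the same prompt. The uniform distribution below is a placeholder for the AR model's softmax probabilities; everything here is illustrative.

```python
import random

def sample_tokens(num_steps, vocab_size, seed):
    """Toy sampling-based decoder over acoustic tokens. A real model would
    sample each step from the AR transformer's output distribution; here a
    uniform placeholder distribution stands in for it."""
    rng = random.Random(seed)
    weights = [1.0] * vocab_size  # placeholder for per-step model probabilities
    return [rng.choices(range(vocab_size), weights=weights)[0]
            for _ in range(num_steps)]

# Same seed -> identical output; different seeds -> (generally) different
# token sequences, which is how the demo's diversity_s1 / diversity_s2
# samples differ for the same text.
print(sample_tokens(8, 1024, seed=0) == sample_tokens(8, 1024, seed=0))  # True
```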

The researchers trained VALL-E on the LibriLight dataset, which consists of 60,000 hours of English speech from more than 7,000 unique speakers. The raw data is audio-only, so a speech recognition model was used to generate the transcripts.

Compared with previous TTS training datasets such as LibriTTS, the dataset used in the paper contains more noisy speech and inaccurate transcriptions, but covers more diverse speakers and speaking styles (prosodies).

The researchers believe the proposed method is robust to this noise and can exploit big data to achieve good generalization.


It is worth noting that existing TTS systems are typically trained on dozens of hours of single-speaker data or hundreds of hours of multi-speaker data, hundreds of times less than VALL-E's training data.

In short, VALL-E is a brand-new language-model approach to TTS that uses audio codec codes as its intermediate representation and a large amount of diverse data to give the model strong in-context learning capabilities.

Inference: In-Context Learning via Prompting

In-context learning is a remarkable ability of text-based language models: they can predict labels for unseen inputs without any additional parameter updates.

For TTS, a model is considered to have in-context learning capability if it can synthesize high-quality speech for unseen speakers without fine-tuning.

However, existing TTS systems do not have strong in-context learning capabilities, because they either require additional fine-tuning or suffer significant degradation for unseen speakers.

For language models, prompting is necessary to enable in-context learning in the zero-shot setting.

The researchers designed the prompting and inference procedure as follows:

First, the text is converted into a phoneme sequence and the enrolled recording is encoded into an acoustic matrix, forming the phoneme prompt and the acoustic prompt; both are used in the AR and NAR models.

For the AR model, sampling-based decoding conditioned on the prompts is used, because beam search may put the LM into an infinite loop; sampling also greatly increases the diversity of the outputs.

For the NAR model, use greedy decoding to select the token with the highest probability.

Finally, the neural codec decoder generates the waveform conditioned on the eight token sequences.
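The three-stage inference just described (AR sampling for the first codebook, greedy NAR filling of the remaining seven, codec decoding of all eight sequences) can be sketched as follows. The token derivations and the "decoder" below are deterministic stand-ins for real neural networks; only the control flow mirrors the paper's description.

```python
import random

def ar_decode(num_frames, seed):
    """AR stage (sketch): sampling-based decoding of the first-codebook
    tokens. A real model samples from the AR transformer's distribution
    conditioned on the phoneme prompt and acoustic prefix."""
    rng = random.Random(seed)
    return [rng.randrange(1024) for _ in range(num_frames)]

def nar_decode(first_codebook, num_codebooks=8):
    """NAR stage (sketch): greedily (argmax) predict codebooks 2..8 in
    parallel. Here a deterministic placeholder derives each residual
    layer from the first codebook."""
    layers = [list(first_codebook)]
    for k in range(1, num_codebooks):
        layers.append([(t + k) % 1024 for t in first_codebook])
    return layers

def codec_decode(layers):
    """Stand-in for the neural codec decoder that turns the eight token
    sequences back into audio (one float per frame here)."""
    return [sum(frame) / len(frame) for frame in zip(*layers)]

layers = nar_decode(ar_decode(num_frames=5, seed=42))
wave = codec_decode(layers)
print(len(layers), len(wave))  # 8 5
```

The split matters for speed: the AR stage pays a per-frame cost only once (for the first codebook), while the other seven codebooks are filled non-autoregressively.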

The acoustic prompt need not be semantically related to the speech being synthesized, so two settings are distinguished:

VALL-E: The main goal is to generate given content for unseen speakers.

The model takes as input a text sentence, a segment of enrolled speech, and its transcription. The transcribed phonemes of the enrolled speech are prepended to the phoneme sequence of the target sentence as the phoneme prompt, and the first-layer acoustic tokens of the enrolled speech serve as the acoustic prefix. With the phoneme prompt and acoustic prefix, VALL-E generates acoustic tokens for the given text, cloning the speaker's voice.

VALL-E-continual: Uses the entire transcript and the first 3 seconds of the utterance as the phoneme and acoustic prompts respectively, and asks the model to generate the continuation.

The inference process is the same as in the VALL-E setting, except that the enrolled speech and the generated speech are semantically continuous.
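The difference between the two settings is purely in how the prompts are assembled, which a short sketch makes explicit. Function names, phoneme symbols, and token values are all hypothetical.

```python
def vall_e_prompt(target_phonemes, enrolled_phonemes, enrolled_tokens):
    """VALL-E setting (sketch): the enrolled transcript's phonemes are
    prepended as the phoneme prompt, and the enrolled speech's
    first-codebook tokens form the acoustic prefix."""
    return enrolled_phonemes + target_phonemes, enrolled_tokens

def vall_e_continual_prompt(full_phonemes, full_tokens, frames_in_3s):
    """VALL-E-continual setting (sketch): the whole transcript is the
    phoneme sequence, and only the tokens covering the first 3 seconds
    form the acoustic prompt; the model continues the same utterance."""
    return full_phonemes, full_tokens[:frames_in_3s]

# Hypothetical phonemes/tokens, purely illustrative.
ph, ac = vall_e_prompt(["HH", "AH"], ["DH", "AH"], [12, 34, 56])
print(ph)  # ['DH', 'AH', 'HH', 'AH']
```

In the first setting the prompt speech and target text are unrelated (voice cloning); in the second they belong to the same utterance, which is the configuration the paper evaluates for continuation quality.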

Experimental Section

The researchers evaluated VALL-E on the LibriSpeech and VCTK datasets, where none of the tested speakers appeared in the training corpus.

VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity, with a +0.12 comparative mean opinion score (CMOS) and a +0.93 similarity mean opinion score (SMOS) on LibriSpeech.


VALL-E also surpasses the baseline system on VCTK, with improvements of 0.11 SMOS and 0.23 CMOS, even achieving a +0.04 CMOS against ground truth, which indicates that on VCTK its synthesized speech for unseen speakers is as natural as human recordings.


Furthermore, qualitative analysis shows that VALL-E can synthesize different outputs from the same text and target speaker, which could be useful for creating pseudo-data for speech recognition tasks.

The experiments also show that VALL-E can preserve the acoustic environment (such as reverberation) and the emotion of the prompt (such as anger).

Security Risks

If powerful technology is misused, it may cause harm to society. For example, the threshold for phone fraud has been lowered again!

Due to VALL-E's potential for mischief and deception, Microsoft has not released VALL-E's code or a testing interface.

One netizen commented: call a system administrator, record the few words of their greeting, and then synthesize from them "Hello, I am the system administrator. My voice is a unique identifier and can be safely verified." I always thought this was impossible, that you couldn't pull it off with so little data. Now it seems I may have been wrong...

In the project's concluding Ethics Statement, the researchers write: "The experiments in this work were performed under the assumption that the model user is the target speaker and has obtained the speaker's consent. However, when the model is generalized to unseen speakers, it should be accompanied by speech-editing safeguards, including protocols that ensure the speaker consents to the modification and systems that detect edited speech."


The authors also note in the paper that, since VALL-E can synthesize speech that preserves a speaker's identity, it carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker.

To mitigate this risk, a detection model could be built to discriminate whether an audio clip was synthesized by VALL-E. The authors also pledge to put the Microsoft AI Principles into practice as they further develop these models.

