Only 1/4 of the data is needed to restore 100% of the details of real-life voices, using the latest supernatural dialogue speech synthesis technology from Volcano Voice!

Counting the stars and hoping for the moon, thousands of Jay fans have been waiting for 6 years. Not long ago, Jay Chou finally released a new album! Once it went online, it sparked discussions across the Internet.

While everyone was immersed in fond memories of their youth, the user who posted a viral audio clip of a conversation revealed: this conversation was actually produced by speech synthesis!

When it comes to "speech synthesis", the following may come to mind:

• The navigation voice, rich in variety but mechanical in tone: "turn left at the intersection ahead"

• The voice that announces itself as "... Credit Card Center"

• The voice that sounds the same across ten commentary videos and makes you want to swipe away the moment it starts: "Little Handsome"...

This audio has directly overturned many people's stereotypes: speech synthesis technology can already sound as natural as the audio above. The publisher of this audio is Volcano Voice, the ByteDance AI Lab Speech & Audio intelligent speech and audio team. Two further pieces of audio help explain the technical highlights.

The text entered for both pieces of audio is exactly the same: "Southern cuisine prefers dipping sauces. For example, it was my first time in Shanghai that I learned that vegetables in barbecue also need to be served with dipping sauces." But the synthesized audio is clearly different: the second piece of audio comes from the new supernatural dialogue speech synthesis technology launched by the Volcano Voice team.

Recall how people speak in everyday life. The brain needs time to think and process information, so when speaking, people involuntarily hesitate, draw out sounds, invert word order, or even change their words mid-sentence, stutter, and repeat themselves. They also deliberately stress certain words to emphasize the key information they want to convey. This produces a large number of subtle expressive details that are hard to notice and that traditional TTS struggles to capture and reproduce. Reproducing these subtleties is what makes the synthesized voice hard to tell apart from a real one, and it is the secret behind the audio above.

Specifically, the latest supernatural dialogue speech synthesis technology released by the Volcano Voice team is more realistic and natural than traditional TTS: details such as modal particles, inhalation sounds, pauses during hesitation, and stressed pronunciation are all reproduced. It needs only 1/4 of the data of a conventional voice library to restore the subtle prosodic characteristics and pronunciation habits of a real speaker, which makes the synthesized result more realistic. Professional evaluation results show that there is basically no difference between this new Volcano Voice technology and real-person recordings, and reviewers find it difficult to tell them apart. The technology has already been put into use in scenarios such as video dubbing and telephone customer service, and it will be launched on the official website of Volcano Engine voice technology in the near future.

# How on earth is such a powerful technology achieved?

According to reports, phenomena such as gasping, swallowing, involuntarily prolonging a word while thinking, and low laughter, which often occur in real communication, are called paralinguistic phenomena (paralanguage). Although they are the most realistic manifestation of how the human brain thinks and expresses itself, the traditional speech synthesis framework cannot effectively model such sparsely distributed phenomena, so its reproduction of speaking rhythm is limited and sounds too "correct".

To address these difficulties, the Volcano Voice supernatural speech synthesis technology makes breakthroughs at two levels, text and speech modeling:

• At the text level, Volcano Voice uses a style transfer model that simulates the way people speak and performs controllable colloquial rewriting, so that the text becomes more colloquial and the final result does not sound too written.

• At the speech level, the team made a breakthrough in the text analysis model and added paralanguage prediction to the input side of TTS, imitating the pronunciation characteristics of real people to achieve natural, spontaneous speech.

It is worth mentioning that, by using a TTS modeling approach with unsupervised features, the team effectively improved the stability and expressiveness of the model: with only 1/4 of the data of a conventional voice library, it achieves very natural and varied prosody.


## Committed to colloquial text, making "real-person expression" vivid on the page

Text is the input to speech synthesis, and whether its style is close to how real people talk is the first step toward a better synthesis result. However, because of deep-rooted writing habits, most text prepared for synthesis is not natural enough, or making it natural requires repeated manual adjustment, which is time-consuming and labor-intensive. To solve this problem, the Volcano Voice team adopted a two-stage solution and achieved good results:

• Phase 1: A self-supervised method uses pseudo data to pre-train the spoken language model, which reduces the amount of data required; at the same time, a pointer network structure is introduced into the model to enhance text controllability.

• Phase 2: A small amount of high-quality manually labeled data is used to fine-tune the pre-trained spoken language model, finally achieving controllable and natural spoken-style text.
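How a pointer-style copy mechanism can keep rewritten text close to its input can be sketched in a few lines. The example below is only an illustration, not the team's model: it assumes the common pointer-generator formulation, in which the next-token distribution mixes a vocabulary distribution with the attention weights over the source tokens. The function name `mix_copy_and_generate`, the gate `p_gen`, and all probabilities are hypothetical.

```python
from collections import defaultdict


def mix_copy_and_generate(source_tokens, attention, vocab_dist, p_gen):
    """Mix "generate from vocabulary" with "copy from source" probabilities.

    Assumed formulation (pointer-generator style, not confirmed by the article):
        P(w) = p_gen * P_vocab(w)
               + (1 - p_gen) * (attention mass on source positions equal to w)
    """
    if len(source_tokens) != len(attention):
        raise ValueError("need one attention weight per source token")
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must lie in [0, 1]")

    final = defaultdict(float)
    # Generation path: new words such as fillers come from the vocabulary.
    for token, prob in vocab_dist.items():
        final[token] += p_gen * prob
    # Copy path: attention mass is moved onto the source tokens themselves.
    for token, weight in zip(source_tokens, attention):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


# Hypothetical numbers for a single decoding step.
source = ["southern", "cuisine", "prefers", "dipping", "sauces"]
attention = [0.1, 0.1, 0.6, 0.1, 0.1]                  # sums to 1
vocab_dist = {"well": 0.5, "uh": 0.3, "prefers": 0.2}  # sums to 1

step = mix_copy_and_generate(source, attention, vocab_dist, p_gen=0.25)
best = max(step, key=step.get)
print(best, round(step[best], 3))
```

With a low `p_gen`, most of the probability mass stays on tokens that already appear in the source, which is one way a copy mechanism can make colloquial rewriting controllable; a higher `p_gen` leaves more room for inserted fillers.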


| Original text | Text after automated prediction |
|---|---|
| Southern cuisine prefers dipping sauces, for example it was my first time in Shanghai that I learned that vegetables in barbecue also need to be served with dipping sauce | Well, southern cuisine prefers to use dipping sauce or something, for example my first time, uh, my first time going to Shanghai, I realized that the vegetables in the barbecue must also be accompanied by dipping sauce |
| [...], the northerner said I brought half a cart of cabbage | Well, this, it's almost like when we go to the street to buy cabbage, the southerner said I want half a cabbage, and then the northerner said I want half a cabbage |
| In fact, southern cuisine places more emphasis on the taste of seasonings, that is, the chef uses seasonings to display his skills | Yes, in fact, southern cuisine pays more attention to the taste of its seasonings. In other words, the chef uses seasonings to display his skills |


## Paralanguage modeling and remarkable prosodic diversity fully upgrade voice realism

To better reproduce real speakers, and unlike traditional speech synthesis technology, Volcano Voice has conducted in-depth research on both paralanguage modeling and prosodic diversity. For paralanguage modeling, the team's synthesis technology enables the acoustic model to model a variety of paralinguistic phenomena that appear in natural speech, such as inhalation, laughter, hesitation, and self-correction, and automatically inserts them based on the semantic information of the text. The insertion process considers both plausibility and randomness, which makes the result more natural and realistic.
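The idea of balancing plausibility and randomness during insertion can be illustrated with a small sketch. It is not the team's method: the rule table below stands in for a learned predictor that would score each position from the text's semantics, and the tag names, trigger tokens, and probabilities are all hypothetical.

```python
import random

# Hypothetical insertion rules: trigger token -> (paralanguage tag, probability).
# A real system would score each position with a learned model instead.
RULES = {
    ",": ("[inhale]", 0.6),
    "basically": ("[hesitation]", 0.4),
}


def insert_paralanguage(tokens, rules, rng):
    """Return a copy of tokens with paralanguage tags sampled after triggers."""
    out = []
    for token in tokens:
        out.append(token)
        rule = rules.get(token.lower())
        if rule is None:
            continue  # plausibility: never insert where no rule applies
        tag, probability = rule
        if rng.random() < probability:  # randomness: sample, do not always insert
            out.append(tag)
    return out


tokens = "I think so , it is basically very good for the body".split()
rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(" ".join(insert_paralanguage(tokens, RULES, rng)))
```

Because insertion is sampled, the same sentence can be rendered with different paralinguistic events on different runs, while the rule table keeps tags away from positions where they would sound implausible.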

“In exploring prosodic diversity, we combined unsupervised representation learning and independently developed a highly expressive acoustic model framework. By decoupling pronunciation, prosody, and timbre, we not only reduce the amount of data required but also achieve efficient modeling of extremely low-frequency pronunciation phenomena. At the same time, we use unsupervised representation features together with phoneme-level fundamental frequency and energy information to achieve natural prosodic variation and produce high-quality conversational speech,” the Volcano Voice team concluded.
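The article does not say how phoneme-level fundamental frequency and energy are obtained. One common convention, sketched below as an assumption, is to average frame-level values over the frames aligned to each phoneme. The function name, the sample values, and the per-phoneme durations (which would normally come from a forced aligner) are hypothetical.

```python
def phoneme_level_average(frame_values, durations, skip_zeros=False):
    """Average frame-level values over each phoneme's frames.

    durations[i] is the number of frames aligned to phoneme i (assumed to
    come from a forced aligner, which this sketch does not implement).
    With skip_zeros=True, zero-valued frames are ignored, a common
    convention for F0 where 0 marks unvoiced frames.
    """
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    if sum(durations) != len(frame_values):
        raise ValueError("durations must cover every frame exactly once")

    averages = []
    start = 0
    for duration in durations:
        segment = frame_values[start:start + duration]
        if skip_zeros:
            segment = [v for v in segment if v > 0]
        # An empty segment (for example a fully unvoiced phoneme) maps to 0.0.
        averages.append(sum(segment) / len(segment) if segment else 0.0)
        start += duration
    return averages


# Hypothetical frame-level features for an utterance of three phonemes.
f0_hz = [0.0, 0.0, 210.0, 220.0, 230.0, 0.0, 180.0, 190.0]
energy = [0.1, 0.1, 0.8, 0.9, 1.0, 0.2, 0.6, 0.7]
durations = [2, 3, 3]  # frames per phoneme

phoneme_f0 = phoneme_level_average(f0_hz, durations, skip_zeros=True)
phoneme_energy = phoneme_level_average(energy, durations)
print(phoneme_f0)
print(phoneme_energy)
```

Phoneme-level values give the acoustic model one pitch and one loudness target per phoneme, which is coarser than frame-level contours but easier to predict and to control.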


| Text | Supernatural audio |
|---|---|
| I think so [inhale] it's actually very good for the body. | C.wav |
| Look at our current work, in the morning [prolonged] basically I don't eat much breakfast. | D.wav |
| Like our morning basically [stutter] are soy milk and fried dough sticks buns. | E.wav |
| He must be [slip correction], I really want to eat meat. | Copy of ParalangTest_is_000008_npy_01_new2.wav |


Volcano Voice, the ByteDance AI Lab Speech & Audio intelligent speech and audio team, has long provided leading AI voice technology and full-stack voice product solutions to businesses such as Douyin, Jianying, Tomato Novels, and Feishu, and offers its technical services to external enterprises through Volcano Engine.


Statement
This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn for removal.