Volcano Voice's latest ultra-natural conversational speech synthesis restores 100% of the detail of real human voices with only 1/4 of the data!

PHPz

Apr 08, 2023 pm 03:21 PM

Volcano Voice, Speech modeling

After years of wishing on the stars and the moon, Jay Chou's many fans finally got what they had waited six years for: not long ago he released a new album, and it sparked discussion across the Internet as soon as it went online.

While everyone was immersed in fond memories of their younger years, the person who posted a viral audio clip revealed that the conversation in it was actually produced by speech synthesis!

When it comes to "speech synthesis", the following may come to mind:

• The navigation voice, available in many varieties but always mechanical in tone: "turn left at the intersection ahead"

• The automated phone call from the "... Credit Card Center"

• The identical narrator voice heard across commentary videos, the "Little Handsome" that makes you want to swipe away as soon as it appears...

This overturns many people's stereotypes: speech synthesis can now sound as polished and natural as the audio above. The publisher of that audio is Volcano Voice, the Speech & Audio team at ByteDance AI Lab. Two further audio clips help explain the technical highlights.

The text input for both clips is exactly the same: "Southern cuisine prefers dipping sauces. For example, it was on my first trip to Shanghai that I learned that the vegetables at a barbecue also need to be served with dipping sauce." But the synthesized audio is clearly different: the second clip comes from the new ultra-natural conversational speech synthesis technology that the Volcano Voice team has just launched.

Think about how people express themselves day to day. The brain needs time to process information, so when speaking, people involuntarily hesitate, draw out sounds, invert word order, change their wording mid-sentence, stutter, and repeat themselves. They also deliberately stress certain words to highlight the key information they want to convey. The result is a large number of subtle expressive details that are hard to notice and that traditional TTS struggles to capture and restore. Reproducing these subtleties is what makes the synthesized voice hard to tell apart from a real one, and it is the secret behind the audio described above.

Specifically, the latest ultra-natural conversational speech synthesis technology released by the Volcano Voice team is more realistic and natural than traditional TTS: details such as modal particles, inhalation sounds, pauses during hesitation, and prolonged sounds are all reproduced. It needs only 1/4 of the data of a conventional voice library to restore the subtle prosodic characteristics and pronunciation habits of a real speaker, which makes the synthesized result more realistic. Professional evaluation results show that there is basically no difference between the output of this new technology and real-person recordings, and reviewers find it difficult to tell them apart. In addition, the technology is already in use in many scenarios, such as video dubbing and telephone customer service, and it will be launched on the official website of Volcano Engine Voice Technology in the near future.

How on earth is such a powerful technology achieved?

According to reports, phenomena such as gasping, swallowing, involuntarily drawing out a word while thinking, and quiet laughter, all of which occur often in real communication, are called paralinguistic phenomena (paralanguage). Although they are the most faithful reflection of how the human brain thinks and expresses itself, the traditional speech synthesis framework cannot effectively model such sparsely distributed phenomena, so the prosody it restores in speech is limited and too "correct".

To address these difficulties, Volcano Voice's ultra-natural speech synthesis technology makes breakthroughs on two levels, text and speech modeling:

• On the text level, Volcano Voice uses a style transfer model to rewrite text the way people actually speak. This controlled colloquial rewriting makes the text more conversational and keeps the final result from sounding too written.

• On the speech level, the team has made a breakthrough in the text analysis model, adding a paralanguage prediction step on the input side of TTS that imitates the pronunciation characteristics of real people to achieve natural, spontaneous speech.

It is worth mentioning that the team improved the stability and expressiveness of the model by using a TTS modeling solution based on unsupervised features. With only 1/4 of the data of a conventional voice library, it achieves very natural and varied prosody.

Committed to colloquial text, bringing "real-person expression" to life on the page

Text is the input to speech synthesis, and how close its style is to the way real people speak is the first step toward a better synthesis result. However, because of deep-rooted writing habits, most text prepared for synthesis is not natural enough, or making it natural takes repeated manual adjustment, which is time-consuming and labor-intensive. To solve this problem, the Volcano Voice team adopted a two-stage solution and achieved good results:

• Phase 1: Use a self-supervised method to pre-train the spoken-language model on pseudo data, which reduces the amount of data required; at the same time, introduce a pointer network structure into the model to make the text more controllable.

• Phase 2: Use a small amount of high-quality manually labeled data to fine-tune the pre-trained spoken-language model, and finally achieve controllable and natural spoken-language text.
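The article does not describe how the pseudo data for Phase 1 is built. As a rough illustration only, the sketch below assumes one simple recipe: take written sentences and insert filler words at random positions to produce (written, pseudo-spoken) training pairs. The filler list, the function names, and the `filler_rate` parameter are all hypothetical and are not taken from the Volcano Voice system.

```python
import random

# Hypothetical single-word filler inventory, chosen only for this sketch.
FILLERS = ["well,", "uh,", "um,", "like,"]

def make_pseudo_pair(written, filler_rate=0.3, seed=None):
    """Build one (written, pseudo_spoken) pair by inserting fillers before words."""
    rng = random.Random(seed)
    spoken = []
    for word in written.split():
        if rng.random() < filler_rate:
            spoken.append(rng.choice(FILLERS))
        spoken.append(word)
    return written, " ".join(spoken)

def strip_fillers(spoken):
    """Drop filler tokens; assumes the written text contains none of them."""
    return " ".join(w for w in spoken.split() if w not in FILLERS)

written, spoken = make_pseudo_pair(
    "southern cuisine prefers dipping sauces", filler_rate=0.5, seed=0
)
print(spoken)
# Every original word survives in order, so the pair stays aligned.
assert strip_fillers(spoken) == written
```

Because every original word is kept in order, such pairs also suit a model that copies tokens from its input, which is the general idea behind a pointer-style mechanism; whether the team's model is trained on pairs like these is not stated in the article.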

Original text compared with the text after automated prediction:

Original text: Southern cuisine prefers dipping sauces. For example, it was on my first trip to Shanghai that I learned that the vegetables at a barbecue also need to be served with dipping sauce.
Text after automated prediction: Well, southern cuisine kind of prefers dipping sauces or something. For example, my first time, uh, my first time going to Shanghai, I realized that the vegetables at a barbecue also have to come with dipping sauce.

Original text: … the northerner says I'll take half a cart of cabbage.
Text after automated prediction: Well, this is almost like when we go out to buy cabbage: the southerner says I want half a cabbage, and then the northerner says I want half a cabbage.

Original text: In fact, southern cuisine places more emphasis on the taste of seasonings; that is, the chef uses seasonings to display his skills.
Text after automated prediction: Yes, in fact, southern cuisine pays more attention to the taste of its seasonings. In other words, the chef uses seasonings to display his skills.

Paralanguage modeling and prosodic diversity: a full upgrade in voice realism

To reproduce real speakers more faithfully than traditional speech synthesis does, Volcano Voice has conducted in-depth research on both paralanguage modeling and prosodic diversity. For paralanguage modeling, the synthesis technology introduced by the team enables the acoustic model to model a variety of paralinguistic phenomena that appear in natural speech, such as inhalation, laughter, hesitation, and self-correction, and inserts them automatically based on the semantic information of the text. The insertion process takes both plausibility and randomness into account, which makes the result more natural and realistic.

"In exploring prosodic diversity, we combined unsupervised representation learning with an independently developed, highly expressive acoustic model framework. By decoupling pronunciation, prosody, and timbre, we not only reduce the amount of data required but also achieve efficient modeling of extremely low-frequency pronunciation phenomena. At the same time, we use unsupervised representation features together with phoneme-level fundamental frequency and energy information to achieve natural prosodic variation and support high-quality conversational speech generation," the Volcano Voice team concluded.
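The quote mentions phoneme-level fundamental frequency and energy without saying how they are computed. One common way to obtain such features, assumed here and not confirmed as the team's method, is to average frame-level values over the frames that belong to each phoneme. The sketch below shows that averaging; the function name, the `skip_zero` flag, and the sample numbers are hypothetical.

```python
def phoneme_level_average(frame_values, durations, skip_zero=False):
    """Average frame-level values over each phoneme's span of frames.

    durations[i] is the number of frames assigned to phoneme i.
    With skip_zero=True, zero-valued frames (unvoiced frames in an F0 track)
    are left out of the average.
    """
    if sum(durations) != len(frame_values):
        raise ValueError("durations must sum to the number of frames")
    averages, start = [], 0
    for dur in durations:
        frames = frame_values[start:start + dur]
        if skip_zero:
            frames = [v for v in frames if v > 0]
        averages.append(sum(frames) / len(frames) if frames else 0.0)
        start += dur
    return averages

# Made-up frame-level tracks for three phonemes lasting 1, 3, and 2 frames.
f0 = [0.0, 200.0, 210.0, 220.0, 0.0, 0.0]   # Hz, 0.0 marks an unvoiced frame
energy = [1.0, 4.0, 5.0, 6.0, 2.0, 0.0]
durations = [1, 3, 2]

print(phoneme_level_average(f0, durations, skip_zero=True))  # [0.0, 210.0, 0.0]
print(phoneme_level_average(energy, durations))              # [1.0, 5.0, 1.0]
```

Collapsing frames to one value per phoneme gives a model a compact prosody target that lines up with the text, which is why this granularity is often chosen; the article itself does not explain the team's reasoning.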

Examples of text with paralanguage tags and the corresponding ultra-natural audio:

Text: I think so <inhale> it's actually very good for the body.
Audio: C.wav

Text: Look at our current work, in the morning <prolonged> basically I don't eat much breakfast.
Audio: D.wav

Text: Like our morning basically <stuck> are soy milk, fried dough sticks, buns.
Audio: E.wav

Text: He must be <slip correction>, I really want to eat meat.
Audio: ParalangTest_is_000008_npy_01_new2 copy.wav
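The paralanguage modeling described earlier inserts phenomena such as inhalation or hesitation into the text while balancing plausibility and randomness. A minimal sketch of that idea follows. It assumes the plausible positions and their probabilities are already known (in the real system they would come from a model that reads the text's semantics, which is not reproduced here) and then samples whether each tag is actually inserted. The tag names, the `candidates` format, and the function name are hypothetical.

```python
import random

def insert_paralanguage(tokens, candidates, seed=None):
    """Insert paralanguage tags before selected tokens.

    candidates maps a token index to (tag, probability). Restricting insertion
    to these positions stands in for plausibility; sampling each one with its
    probability stands in for randomness.
    """
    rng = random.Random(seed)
    out = []
    for i, tok in enumerate(tokens):
        if i in candidates:
            tag, prob = candidates[i]
            if rng.random() < prob:
                out.append(f"<{tag}>")
        out.append(tok)
    return out

tokens = "I think so it's actually very good for the body".split()
# Hand-written candidates for illustration: an inhale before "it's" is likely,
# a hesitation before "good" is possible but less likely.
candidates = {3: ("inhale", 0.9), 6: ("hesitate", 0.2)}

for seed in range(3):
    print(" ".join(insert_paralanguage(tokens, candidates, seed=seed)))
```

Running the loop with different seeds yields different renderings of the same sentence, while tags never appear outside the candidate positions.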

Volcano Voice, the Speech & Audio team at ByteDance AI Lab, has long provided leading AI voice technology and full-stack voice product solutions to businesses such as Douyin, Jianying, Tomato Novels, and Feishu, and it offers its technical services to external enterprises through Volcano Engine.

Statement

This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to have it deleted.
