
My ears aren't wrong, the voice is just that real: the Seed-TTS technology behind ByteDance Doubao speech synthesis, revealed

WBOY
2024-06-26 20:37:12

Seed-TTS is a large speech generation model recently released by the ByteDance Doubao model team. The speech it generates is almost **indistinguishable** from that of real people, and it can even reproduce pronunciation **flaws**. In imitating human speech in particular, it performs **excellently** in both **fidelity** and **fluency**.

For example, given a piece of speech, Seed-TTS can generate new speech from a text while carrying over the voice characteristics of the original material.

Original material (prompt): [audio clip]

Seed-TTS generated Chinese speech: [audio clip]

Suddenly, there was laughter around me. I looked at them, puffed out my chest in high spirits, shook my fleshy arms, and chuckled: "The flesh on my body is there to cover up my overwhelming charm. Otherwise, wouldn't I scare you all?"

It can also generate English speech that still "reproduces" the characteristics of the Chinese speaker.

Seed-TTS generated English speech: [audio clip]

Suddenly, there was a burst of laughter beside me. I looked at them, stood up straight with high spirit, shook the slightly fleshy arms, and smiled lightly, saying, "The flesh on my body is to hide my bursting charm. Otherwise, wouldn't it scare you?"

It can also bring out a character's "feel" in the voice:

Hey, do you also want a sweet romance? "A Little Smile Is Lovely" is your best choice. The male and female leads are the campus heartthrob and the campus belle. They got to know each other through a game and then met in person, without a single misunderstanding along the way. It is so sweet that I can't help breaking into an "auntie smile" whenever I think of it~

Little fool, well... it's a very cute and friendly name, a bit "unique", but I'm a little curious: why did you choose this nickname for me?
Beyond generating a "single" voice, Seed-TTS can even act as a "storyteller", voicing each character and emotion according to the plot of a novel and the traits of its different characters.

"Is this pill... some kind of knockout drug or aphrodisiac? Why does it smell so much like the things my second sister told me about? Hey, you're not... plotting something against me, are you?" Han Li was stunned for a long while after hearing this and suddenly felt like coughing up blood. This girl's train of thought was far too hard to follow; she could actually connect the Yingxiang Pill with an aphrodisiac. Han Li did not know whether to admire her caution or to cry out three times at being wronged for no reason. "It seems what you said is true. Still, I have to take it to my second sister for testing before I use it. After all, we girls have to be careful." "Cough, cough. Uh, as you wish." Han Li was speechless and could only cough a few times to hide the embarrassment on his face. He now felt he had better keep his distance from this little vixen, or sooner or later she would frustrate him to death. "Humph, but if this medicine really works as well as you say, then you pass! If Senior Brother runs into any trouble in the Mo Mansion from now on, you can come to Caihuan for help. I only need to collect a small reward, and I'll be sure to solve the problem for you completely." "All right, Junior Sister. If I ever need something, I will definitely come to you for help." Han Li had regained his composure and answered with a smile on his face, while in his heart he thought viciously: "As if I would ever go to a little money-grubber like you."

For more demos and details of the method, see the original paper and the demo page:
  • Paper link: https://arxiv.org/abs/2406.02430
  • Demo page: https://bytedancespeech.github.io/seedtts_tech_report/

Before the technical report was released, parts of the Seed-TTS technology had already been live in consumer-facing products for some time and had earned genuine praise from users. The technology is also offered externally as the Doubao speech synthesis model and the Doubao voice replication model, as commercial technical services.

Below, the team shares the technical highlights of this work, its research value, and the challenges it overcame.

A foundation model for speech generation

Q: Seed-TTS has caught the attention of some industry insiders. Which piece of recognition impressed you most?

A: There is a professor who worked on speech recognition and later joined a company. He is an industry insider I admire very much. At an academic conference not long ago, we showed the Seed-TTS demo. After seeing it, he told us that he had recently been looking into what could be done in speech generation, and that he now felt there was nothing left to do in this area. Although I think there is still room for improvement, I was very happy to hear it.

Q: Why were you happy?

A: Plenty of people will tell you that you are doing well, but this professor was looking for research topics in this area at the time. In the process he saw our results, gave positive feedback, and felt our results were already good enough that he needed to look for other problems. For us, that is very high recognition.

Q: Compared with previous work, how is Seed-TTS different?

A: It is a foundation model for speech generation, which makes it slightly different from most speech generation models. Specifically, traditional TTS is a single-task model, whereas we want a foundation model to handle any task, produce any voice, and let us control many dimensions at once, such as dialect, a real person's speaking habits, and even pronunciation flaws such as swallowed syllables.

Whatever ways of speaking exist in the world, whether English, Japanese, Chinese, or dialects within a language such as the Shaanxi and Henan dialects of Chinese, and whether happy, sad, crying, shouting, or angry: as long as humans produce it, we want the model to be able to produce it.

Q: Have all of these ideas been achieved?

A: A large part has been achieved. Of course, there are some things it cannot do yet, but technology keeps moving forward. Just as today's language models serve as a foundation with deep understanding at the text level, we hope to make this model a true "foundation" as well.

Q: Where do the challenges of building a "foundation model" lie?

A:
First, the modeling of detail has to be good. In the past, TTS was easy to deploy as a broadcast-style system, but it sounded like a "machine voice". Modeling speech so that it sounds human requires a great deal of detail. Humans are especially sensitive to human voices: if a synthesized dog bark or cat meow is slightly unnatural, people may not notice, but a small flaw in human speech immediately sounds "mechanical".

Second, it requires both high naturalness and high stability. Most mainstream TTS systems of the past two years were built on prior knowledge and duration models, with every phone defined in advance, which limits expressiveness at a fundamental level. Remove these, and stability and naturalness problems appear, which is another challenge.

Third, data coverage has to be very broad. We want to replicate anyone's voice and all kinds of languages and dialects, including imperfections in human pronunciation such as swallowed syllables and non-standard pronunciation. To reconstruct these features and reproduce such "imperfections", data coverage must be high. Previously, the industry used data on the order of hundreds or thousands of hours, with some models at the order of tens of thousands of hours; Seed-TTS uses far more data than that. Data at this scale also raises the problem of balancing quality against quantity, which is another difficulty.

Fourth, model design. At such a scale, how to design a model that performs well in every respect is also a major challenge.

Finally, there is the engineering challenge. As mentioned above, our data volume is large and our model complexity is high, which naturally brings engineering problems that few people have solved before.

Q: From a technical perspective, what is the value of solving these challenges?

A:
  • Language models lean toward text and diffusion models lean toward images, while speech has attributes of both. Which of the two is better suited to modeling speech is a question we have to answer.
  • Speech and text have many similarities. How to design the representation of speech so that it is better suited to language-model-style modeling is also a problem that needs to be solved.
  • How to use reinforcement learning to integrate various kinds of subjective and objective preference information into the generation system is another of the problems.
  • There are many other highlights, including the stability of autoregressive speech generation models. In addition, through this study we are also trying to look at TTS problems from a perspective outside the TTS field.

Q: You mentioned research on language models and diffusion models. What conclusions can we draw from them?

A:
Seed-TTS provides not only a technical solution based on a language model, but also a second, diffusion-based solution that does away with the duration model entirely, which is a first in the industry.
In addition, after extensive comparison of the two systems, we found that the language model is relatively friendly to streaming, while the diffusion model is better suited to editing. I believe the two will continue to converge in the future.
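
To illustrate why token-by-token generation lends itself to streaming, here is a minimal Python sketch. It is not Seed-TTS code: `fake_autoregressive_decoder` and `stream_chunks` are hypothetical names, and the decoder is a stand-in that simply counts. The point is only that a decoder that emits one token per step lets a consumer start working on the first chunk before the rest of the utterance exists.

```python
from typing import Iterable, Iterator, List


def fake_autoregressive_decoder(n_tokens: int) -> Iterator[int]:
    """Hypothetical stand-in for a token-by-token speech decoder."""
    for step in range(n_tokens):
        # A real model would sample the next speech token here.
        yield step


def stream_chunks(tokens: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Group a token stream into chunks, emitting each chunk as soon as it is full."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    buffer: List[int] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) == chunk_size:
            yield buffer  # a downstream vocoder could synthesize this chunk now
            buffer = []
    if buffer:
        yield buffer  # flush the final partial chunk


chunks = list(stream_chunks(fake_autoregressive_decoder(7), chunk_size=3))
```

In this toy setup the first chunk is ready after three decoding steps rather than seven. A model that refines the whole sequence jointly has no such natural cut point, but it sees the full utterance at once, which is consistent with the remark above that diffusion is better suited to editing.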
Q: For these two systems, what technical difficulties does Seed-TTS specifically solve?


A:
For the language model system, the main problems solved are the speech tokenizer and stability.
For language-model-based modeling, speech tokenization is a core component. There are currently both continuous and discrete tokenizers available, and the team has explored them extensively. We found that the design of the information carried by each token has a critical impact on the performance and stability of the whole model in every respect. This includes not only what information a token carries and its frame rate, but also how to tokenize the audio and how to turn the tokens back into sound. These questions have not been explored much in the industry so far.
On the stability of the language model, we explored token design, model design, decoding strategy, and data preparation, and genuinely met the requirements of industrial use and real applications.
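
As a rough illustration of why frame rate and per-token information matter for a discrete tokenizer, the Python sketch below computes how many tokens a language model would have to generate for a clip, and the nominal bitrate of the token stream. The `TokenizerConfig` class and all numbers in it are hypothetical and are not Seed-TTS's actual settings; the arithmetic assumes one token per codebook per frame.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerConfig:
    """Hypothetical discrete speech tokenizer settings (illustrative only)."""
    frame_rate_hz: float  # token frames per second of audio
    num_codebooks: int    # tokens emitted per frame
    codebook_size: int    # number of entries in each codebook


def tokens_for(duration_s: float, cfg: TokenizerConfig) -> int:
    """Number of tokens needed to represent `duration_s` seconds of audio."""
    frames = math.ceil(duration_s * cfg.frame_rate_hz)
    return frames * cfg.num_codebooks


def bitrate_bps(cfg: TokenizerConfig) -> float:
    """Nominal information rate of the token stream in bits per second."""
    return cfg.frame_rate_hz * cfg.num_codebooks * math.log2(cfg.codebook_size)


# Two made-up configurations: a compact one and an information-rich one.
compact = TokenizerConfig(frame_rate_hz=25.0, num_codebooks=1, codebook_size=4096)
rich = TokenizerConfig(frame_rate_hz=50.0, num_codebooks=8, codebook_size=1024)

compact_tokens = tokens_for(10.0, compact)
rich_tokens = tokens_for(10.0, rich)
```

Under these assumed settings, ten seconds of audio becomes 250 tokens in the compact configuration and 4000 in the rich one. Shorter sequences are easier for a language model to generate stably, but each token must then carry more of what is needed to reconstruct the sound, which is the trade-off the answer above describes.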

For the pure diffusion system, because the separate duration model is removed, the difficulty is likewise concentrated on stability. After many attempts, we achieved very good metrics on this front as well.

Q: Regarding "speech and text models have many similarities", what insight does this offer?


A:
Viewed through the lens of large text models, the training of speech generation models can likewise be divided into pretraining, instruction fine-tuning, and post-training.
Pretraining improves the model's basic capabilities, which shows up specifically in its in-context learning abilities, such as timbre continuation and voice cloning.
Instruction fine-tuning mainly uses instructions to make the speech generation process more controllable, much like a director giving requests to an actor: speak faster or slower, how to move the listener. We have built all of this in.

Finally, we also found that reinforcement learning can improve the model along many dimensions by integrating various kinds of subjective and objective preference information into the generation system, covering stability, controllability, expressiveness, naturalness, and more. Not many people in the industry are exploring this.

On top of the above, we also explored self-distillation using synthetic data and saw very good gains. This is relatively common for text LLMs but had been explored comparatively little in the speech field.

Q: You mentioned three times that “some issues are less explored in the industry”. What caused this phenomenon?

A: On the one hand, earlier research in speech generation was relatively self-contained, and the field accumulated a lot of traditional experience that may no longer apply under the current AIGC wave. From a broader perspective, speech generation has a great deal in common with text and image generation, and the rapid development of large text models and image generation has given us many new ideas. Because new ideas take time to spread, there is still relatively little exploration in the industry.

On the other hand, many researchers work at universities and lack the necessary resources. A great deal of systems engineering is involved here. We were not only able to do it, but also explored it in detail and found some models that balance stability, expressiveness, and computational complexity. But is this the best that can be done? That may still require further exploration.

Q: Are there any milestone moments in the entire research process?

A: The basic results came out last year. Since then, we have iterated extensively on real cases. The work included finding real cases, various kinds of post-training, and solving deployment problems in these scenarios (such as stability, first-packet latency, concurrency, and compute cost). Compared with then, the results have improved a great deal.

How far have large speech generation models come?

Q: Looking back now, what is the value of the entire study?

A: In terms of the value of Seed-TTS itself, voice is not merely a tool but the most direct form of human interaction. Take the move from silent films to talkies: a seemingly small change was a huge leap for the industry. Emotional connection between people relies heavily on voice. When a child calls you "daddy", for example, the emotional connection is completely different from reading the word as text.

If we want to move toward true AI, the naturalness of speech is a key component. In the past, the machines we imagined all had machine voices, such as MOSS in "The Wandering Earth". If AI is really to be like your assistant and partner, the emotional connection that voice brings is essential. Jarvis in "Iron Man" is remembered by many people because he was voiced by a real person.

In addition, on the application side, voice has many use cases, such as novels and e-books, character design, video translation, virtual characters, broadcasting, and actors' performances. People who stutter or cannot vocalize can also express themselves with the help of voice technology. Wherever voice is more than a pure information medium, there is room for application. This is also our motivation for building a good foundation model.

Q: Some practitioners treat the scaling law as an article of "faith". For speech generation models, what happens when we scale up the data and the model?

A: Even at a very large scale, we keep seeing gains as we continue to scale up. In general, as the scale increases, we are pleasantly surprised to see the model keep acquiring new capabilities.

Q: According to your observations, where is this limit?

A: So far we still see gains every time, so we definitely need to keep exploring. What we have shown, however, is that with the right model design we can break with traditional TTS thinking. In the past, we relied on small amounts of high-quality data; now we keep increasing the scale and obtain greater gains.

Q: What lessons does GPT-4o hold for us?

A: It is a unified model for generation and understanding. It sets a higher bar for speech technology, requiring a single model to listen, speak, and think at the same time. This places many new demands on our work.

Q: What is the current development stage of large models in the speech field?

A: On the one hand, we hope the model can have the expressiveness and control of a professional actor. Most of the time, the speech the model generates is not much different from that of real people. In films and TV dramas, however, actors express emotion very intensely and the information density is high, so the model is not yet fully on par there. We want to fill in all of these corner cases.

On the other hand, there is the handling of details, including processing and optimizing bad cases to cover uncommon long-tail situations.

Large model work requires the participation of a large number of outstanding talents

Q: In this release of Seed-TTS, colleagues from all over the world have participated. Why are so many people participating?

A: As the industry develops, collaboration among many people is inevitable. Pushing a large model to its limits while meeting the needs of industrial deployment cannot be sustained by one or two ideas; many people have to take part. All of the participants are highly professional. For example, our data needs to be processed by colleagues who specialize in it. Likewise, deployment involves many details and requires colleagues who specialize in evaluation and engineering support. They all made great contributions.

Among the mainstream players in frontier AI research, a single project now has a very large number of participants, with specialists responsible for each stage. Such dense, complex collaboration and precise coordination also places very high demands on organizational ability.

Q: What is the team atmosphere in your opinion?

A: I think it comes down to "drive" and "attention to detail". The drive shows in everyone taking the initiative to get things done. The work itself was a self-driven process, born of curiosity and of the idea of changing the industry. This atmosphere feels more like a start-up and is less common in large companies.

Q: You also mentioned that the team is "picky about details". What do you mean by that?

A: It means being picky about details in real-world scenarios. In generative work it is easy to produce a beautiful demo, but in actual use the system runs into all kinds of detailed problems. To make sure the model consistently generates high-quality output and meets user needs, we place very strict requirements on system stability and robustness, which takes repeated polishing so that every detail is of high quality. For the demo, by contrast, we did not do much optimization.

Q: Do we have any internal debate about "not doing too much demo optimization"?

A: Yes, especially among younger colleagues. After all, everyone wants to show their best side. But we would rather deliver results that hold up in deployment, so that users do not discover in actual use a big gap between the product and the demo. That is how to truly change the industry.

Q: Is the relevant technology currently applied in Doubao App?

A: Some of the related technology has been in use for a while. We only present something publicly after users have approved of it in real scenarios. Some other technology is going through final work before going live.

Q: What keywords can summarize our team?

A: The first is professionalism. This shows in many areas, including data, infrastructure, and model design. We attend to the details of every stage with great professionalism and strive for the best possible performance from the perspective of industrial deployment.

The second is focus and drive. To reach our goals, focus and drive are indispensable, so everyone is deeply invested. When the results actually arrive, everyone feels a sense of accomplishment and gains confidence.

The third is unity. When working as a team, no one is territorial and collaboration is very smooth. This makes me feel very comfortable, and it is rare in large companies.

Q: What qualities does the team look for in the people it hopes to attract?

A: First, whether our values align. Ability is certainly one aspect, but more importantly, we hope to find like-minded partners so that everyone can achieve self-realization. Collaboration under shared values is naturally smooth.

Second, diversity of background. The methods used across the various fields of AI are now similar, and everyone is gradually converging in the same direction. Experience in reinforcement learning, visual recognition, audio recognition, and other fields therefore plays a crucial role in generation. We hope colleagues from different professional backgrounds will join. I myself come from speech understanding and switched to TTS.

Finally, initiative, the ability to learn, and high standards for one's work. Generative tasks have many unique characteristics, and we hope candidates can find where the task and their own experience meet; the ability to learn actively is essential here. At the same time, we hope to build the best technology and products in the industry, which requires everyone to keep moving forward with this vision in mind every day.



The above is what the Seed-TTS team members shared. The team is still continuing to recruit outstanding talents.

If you also have ideals and enthusiasm for large-model technology and identify with the atmosphere of the Doubao model team, visit the team's official website at team.doubao.com or follow the team's official public account to learn more about technical progress, team stories, and recruitment information.

The ByteDance Top Seed Talent Plan is recruiting. We hope to continue to attract and recruit top talent with ambitious goals and the aspiration to "change the world with technology". Join us and you will work alongside the best scientists and engineers on the industry's top technical challenges.
