Your ears are fine, the voice is just that real: the Seed-TTS technology behind ByteDance Doubao speech synthesis revealed


WBOY

Jun 26, 2024 pm 08:37 PM

ByteDance, industry, Doubao model

Seed-TTS is a large speech generation model recently released by the ByteDance Doubao model team. The speech it generates is almost **indistinguishable** from that of real people, and it can even reproduce pronunciation **flaws**. It performs **excellently** at learning to imitate human speech, in terms of both **fidelity** and **fluency**.

For example, given a piece of speech, Seed-TTS can generate new speech from text while carrying over the voice characteristics of the original material.

Original material (prompt) and the Chinese speech generated by Seed-TTS:

Suddenly, there was laughter around me. I looked at them, puffed out my chest in high spirits, shook my fleshy arms, and chuckled: "The flesh on my body is there to cover up my overwhelming charm. Otherwise, wouldn't I scare you all?"

English speech can also be generated, and it still "reproduces" the characteristics of the Chinese speaker.

Seed-TTS generated English speech:


Suddenly, there was a burst of laughter beside me. I looked at them, stood up straight with high spirit, shook the slightly fleshy arms, and smiled lightly, saying, "The flesh on my body is to hide my bursting charm. Otherwise, wouldn't it scare you?"

Seed-TTS can also bring out a character's "feeling" in the voice:

Hey, do you also want a sweet romance? "A Little Smile Is Lovely" is your best choice. The male and female leads are the campus heartthrob and the campus beauty. They got to know each other through a game and then met in person, with no misunderstandings along the way. It is so sweet that I can't help breaking into an "auntie smile" just thinking about it~

Little fool, well... it's a very cute and friendly name, a bit "unique", but I'm a little curious, why did you choose this nickname for me?

Beyond generating a "single" voice, Seed-TTS can even act as a "storyteller", voicing each character and emotion according to the plot of a novel and the traits of its different characters.

"Is this pill... a drug, or an aphrodisiac, or something like that? Why does its scent seem so similar to what Second Sister described? Well, don't tell me... you're plotting against me?" Han Li was stunned for a long time after hearing this. He suddenly felt like vomiting blood. This girl's thoughts were too elusive; she could actually associate Yingxiang Pills with aphrodisiacs. Alas, Han Li didn't know whether to admire her caution or to cry out three times at being wronged for no reason. "It seems like what you said is true. However, I still have to take it to my second sister for testing before using it. After all, we girls must be careful." "Cough, cough, uh, it's up to you." Han Li was speechless and could only cough a few times to cover up the embarrassment on his face. He now felt that he had better stay away from this little goblin; otherwise, he would be depressed to death by her at some point. "Humph, but if this medicine is as effective as you say, then you have passed the test! If senior brother has any difficulties in Mo Mansion from now on, you can come to Caihuan for help. As long as I collect a small reward, I will definitely be able to help you solve the problem completely." "Okay, junior sister, if your senior brother needs something, I will definitely ask you for help." Han Li returned to his normal state and responded with a smile on his face, but in his heart he thought viciously: "As if I would ever come looking for a little money-grubber like you."

For more demonstrations and the underlying principles, see the original paper and the demo page:

Paper link: https://arxiv.org/abs/2406.02430
Demo page: https://bytedancespeech.github.io/seedtts_tech_report/

Before the technical report was released, parts of the Seed-TTS technology had already been live in consumer-facing products for some time and had earned genuine praise from many users. It is also offered externally as the Doubao speech synthesis model and the Doubao voice cloning model for commercial technical services.

Below, the team shares the technical highlights, research value, and challenges overcome.

A base model for speech generation

Q: Seed-TTS has been noticed by some insiders. What kind of recognition impressed you?

A: There is a professor who works on speech recognition and later joined a company. He is an industry insider I admire very much. At an academic conference not long ago, we showed the Seed-TTS demo. After watching it, he told us that he had recently been looking into what could be done in speech generation, and that after seeing the demo he felt there was nothing left to do in this direction. Although I think there is still room for improvement, I was very happy to hear that.

Q: Why were you happy?

A: People often tell you that you are doing well, but this professor was actually looking for research topics in this area at the time. He saw our results during that period, gave positive comments, and felt they were already so good that he needed to look for other problems. That is really high recognition for us.

Q: Compared with previous work, how is Seed-TTS different?

A: It is a base model for speech generation, which sets it slightly apart from most speech generation models. Specifically, traditional TTS is a single-task model, whereas we hope a base model can perform any task and produce any sound, while letting us control many dimensions at once, such as dialect, a real person's speaking habits, and even speech flaws such as swallowed words.

Any way of speaking that exists in the world, be it English, Japanese, Chinese, or dialects within a language such as the Shaanxi and Henan dialects of Chinese, whether happy, sad, crying, or angry: as long as humans produce it, we want the model to be able to produce it too.

Q: Have all the above ideas been achieved?

A: A large part has been achieved. Of course, there are things it still cannot do, but technology is always moving forward. Just as today's language models serve as a base with deep understanding at the text level, we hope to truly make ours a "base" as well.

Q: Where do the challenges of building a "base model" lie?

A: The first is that the modeling of detail has to be good.

In the past, it was easy to build TTS as an announcement system, but it sounded like a "machine voice". Building a base model that sounds human requires a great deal of detail. Humans are especially sensitive to human voices: if the bark of a puppy or the meow of a kitten is slightly unnatural, we may not notice, but any problem in human speech immediately sounds "mechanical".

Second, it requires high naturalness and high stability. Most mainstream TTS systems of the past two years relied on prior knowledge and duration models that define a duration for each phone, which limits expressiveness at the lowest level. Removing them introduces stability and naturalness problems, which is another challenge.
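To make the point about per-phone duration models concrete, here is a minimal sketch of the generic idea: each phone is assigned a duration in frames and the sequence is expanded accordingly (often called a length regulator in duration-based TTS). This is not Seed-TTS code; the function name, phones, and durations are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def length_regulate(phones: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Expand phones to frame level by repeating each one `duration` times."""
    if len(phones) != len(durations):
        raise ValueError("need exactly one duration per phone")
    frames: List[str] = []
    for phone, n_frames in zip(phones, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([phone] * n_frames)
    return frames


# Hypothetical phones and predicted durations (in frames).
phones = ["n", "i", "h", "ao"]
durations = [2, 3, 1, 4]
frames = length_regulate(phones, durations)
print(frames)       # frame-level phone sequence
print(len(frames))  # total frame count is fixed by the durations
```

Because the total length and the timing of every phone are fixed before any audio is generated, the output is stable but rhythm cannot vary freely, which is the trade-off the answer describes.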

Third, the data coverage must be very broad. We want to replicate anyone's voice across languages and dialects, including imperfections in human pronunciation such as swallowed words and non-standard pronunciation. To reconstruct these features and restore such "imperfections", data coverage must be high. Previously, the industry typically used hundreds or thousands of hours of data, with some models reaching tens of thousands of hours; Seed-TTS uses far more than that. Data at this scale also raises the problem of balancing quality against quantity, which is another difficulty.

Fourth, model design. At such a scale, designing a model that performs well in every respect is also a major challenge.

Finally, there is the engineering challenge. As mentioned above, our data volume is large and the model complexity is high, which naturally brings engineering problems that few people have solved before.

Q: From a technical perspective, what is the value of solving these challenges?

A: Language models and diffusion models favor text and images respectively, while speech has attributes of both. Which of the two is better suited to modeling speech is a question we have to answer.

Speech and text have many similarities. How to design the representation of speech to make it more suitable for language model modeling is also a problem that needs to be solved.

How to use reinforcement learning to integrate various subjective and objective preference information into the generation system is also one of the problems.

Q: You mentioned research on language models and diffusion models. What conclusions can we draw from them?

A: Seed-TTS provides not only a language-model-based solution but also a diffusion-based solution that does away with the duration model entirely, which is a first in the industry.

In addition, after extensive comparison of the two systems, we found that the language model is relatively friendly to streaming, while the diffusion model is better suited to editing. I believe the two will continue to converge in the future.
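One way to see why an autoregressive language model suits streaming: tokens are produced left to right, so a chunk can be handed to playback as soon as it is complete instead of waiting for the whole utterance. The sketch below is a toy illustration under that assumption; `stream_chunks` and `toy_next_token` are hypothetical names, and the toy function stands in for a real model.

```python
from typing import Callable, Iterator, List


def stream_chunks(
    next_token: Callable[[List[int]], int],
    max_tokens: int,
    chunk_size: int,
    eos: int = -1,
) -> Iterator[List[int]]:
    """Yield fixed-size token chunks as soon as they have been generated."""
    history: List[int] = []
    chunk: List[int] = []
    for _ in range(max_tokens):
        token = next_token(history)  # one autoregressive step
        if token == eos:
            break
        history.append(token)
        chunk.append(token)
        if len(chunk) == chunk_size:
            yield chunk              # downstream can start decoding audio now
            chunk = []
    if chunk:
        yield chunk                  # flush the final partial chunk


def toy_next_token(history: List[int]) -> int:
    # Stand-in for a real model: emits 0..6, then end-of-sequence.
    return len(history) if len(history) < 7 else -1


chunks = list(stream_chunks(toy_next_token, max_tokens=100, chunk_size=3))
print(chunks)
```

In this toy setup the first chunk is available after only `chunk_size` generation steps, regardless of how long the full utterance turns out to be.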

Q: For these two systems, what technical difficulties does Seed-TTS specifically solve?

A: For the language model system, it mainly addresses the speech tokenizer and stability.

For language model modeling, speech tokenization is a core component. Both continuous and discrete tokenizers are currently available, and the team has explored them extensively. We found that the design of the information a token carries has a critical impact on the performance and stability of the whole model in every respect. This covers not only what information each token holds and the frame rate, but also how to tokenize the audio and how to turn tokens back into sound. These questions have not been explored much in the industry so far.
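As a rough illustration of what a discrete speech tokenizer does, the sketch below uses plain vector quantization: each feature frame is mapped to the index of its nearest codebook vector, and the frame rate determines how many tokens an utterance becomes. This is a generic technique shown under stated assumptions, not the tokenizer used by Seed-TTS; the codebook, features, and frame rates are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature frame to the index of its nearest codebook vector."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Turn tokens back into (approximate) feature frames."""
    return [list(codebook[t]) for t in tokens]


def token_count(seconds: float, frame_rate_hz: float) -> int:
    """Number of tokens an utterance yields at a given frame rate."""
    return round(seconds * frame_rate_hz)


# Hypothetical 2-D features and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]
tokens = quantize(features, codebook)
print(tokens)                        # [0, 1, 2]
print(dequantize(tokens, codebook))  # lossy reconstruction of the features
print(token_count(10, 25), token_count(10, 50))  # 250 500
```

Doubling the frame rate doubles the sequence the language model must generate, while the codebook decides what information survives in each token, which is why both choices affect performance and stability.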

On the stability of the language model, we explored token design, model design, decoding strategy, and data preparation, and genuinely met the requirements of industrial applications.

For the pure diffusion system, since the extra duration model is removed, the difficulty also centers on stability. After many attempts, we achieved very good metrics on this system as well.

Q: Regarding "speech and text models have many similarities", what insight does this give us?

A: By analogy with large text models, training a speech generation model can also be divided into pre-training, instruction fine-tuning, and post-training.

Pre-training improves the basic capabilities of the model, which shows up in its in-context learning abilities such as timbre continuation and voice cloning.

Instruction fine-tuning mainly uses instructions to make the speech generation process more controllable, much like a director giving requests to an actor: speak faster or slower, or in a way that moves the audience. We have integrated all of these.

Finally, we also found that reinforcement learning can improve the model along many dimensions, integrating various subjective and objective preference signals into the generation system, including stability, controllability, expressiveness, and naturalness. Not many people in the industry are exploring this.

On top of the above, we also explored self-distillation with synthetic data and obtained very good gains. This is relatively common in text LLMs but had been explored less in the speech field.

Q: You mentioned three times that “some issues are less explored in the industry”. What caused this phenomenon?

A: On the one hand, earlier research on speech generation was relatively isolated, and much of the industry's traditional experience may no longer apply under the current AIGC trend. From a broader perspective, speech generation has a lot in common with text and image generation, and the rapid progress of large text models and image generation has given us many new ideas. Since new ideas take time to spread, there is still relatively little exploration in the industry.

On the other hand, many researchers work at universities and lack the relevant resources. A great deal of systems engineering is involved here. Not only were we able to do it, we also explored it in detail and found models that balance stability, expressiveness, and computational complexity. But is this the best that can be done? That may still need further exploration.

Q: Are there any milestone moments in the entire research process?

A: The basic results were already in place last year. Since then, we have iterated extensively on real cases. The work includes finding real cases, various kinds of post-training, and solving deployment problems in real scenarios (such as stability, first-packet latency, concurrency, and compute cost). Compared with then, the results have improved a lot.

How far have large speech generation models come?

Q: Looking back now, what is the value of the entire study?

A: As for the value of Seed-TTS itself, voice is not merely a tool but the most direct form of human interaction. For example, the move from silent films to talkies looks like a small change but was a huge leap for the industry. Emotional connection between people relies heavily on voice. When a child calls you "daddy", the emotional connection is completely different from reading the word as text.

If we want to move toward true AI, natural-sounding speech is a key component. In the past, the machines we imagined all had machine voices, such as Moss in "The Wandering Earth". If AI is really to be like your assistant and partner, the emotional connection that voice brings is essential. Jarvis in "Iron Man" is remembered by many people precisely because he was voiced by a real person.

In terms of applications, there are many scenarios for speech, such as novels and e-books, character design, video translation, virtual characters, broadcasting, and acting performance. People who stutter or cannot speak can also express themselves with the help of speech technology. Wherever voice is more than a pure information medium, there is room for application. This is also our motivation for making the base model good.

Q: Scaling law has been regarded as "faith" by some practitioners. For speech generation models, what is the result after we scale the data and model?

A: Even at a very large scale, we keep seeing gains as we continue to scale up. In general, as the scale increases, we are pleasantly surprised to see the model keep acquiring new capabilities.

Q: According to your observations, where is the limit?

A: So far we still see gains every time, so we definitely need to keep exploring. However, we have shown that with the right model design we can break with the traditional thinking in TTS. In the past, we relied on a small amount of high-quality data; now we keep increasing the scale and obtain greater gains.

Q: What lessons does GPT-4o hold for us?

A: It is a unified model for generation and understanding. It places higher demands on speech technology, requiring one model to listen, speak, and think at the same time. This sets many new requirements for our work.

Q: What is the current development stage of large models in the speech field?

A: On the one hand, we hope the model can express and control speech like a professional actor. Most of the time, the speech generated by the model is not much different from that of real people. However, in films and TV dramas actors express emotions very intensely and the information density is relatively high, so the model does not fully match them yet. We want to cover these corner cases.

On the other hand, there is the handling of details, including fixing and optimizing bad cases to solve uncommon long-tail situations.

Large model work requires the participation of a large number of outstanding talents

Q: In this release of Seed-TTS, colleagues from all over the world have participated. Why are so many people participating?

A: As the industry develops, collaboration among many people is inevitable. Taking a large model to its limit while meeting the needs of industrial deployment cannot be supported by one or two ideas; many people must take part. All the participants are highly specialized. For example, our data needs to be processed by colleagues who specialize in it. The deployment process also involves many details and requires colleagues who specialize in evaluation and engineering support. They all made great contributions.

Among the mainstream players in frontier AI research, a single project has a very large number of participants, with specialists responsible for each stage. Such dense, complex collaboration among talented people, and the precise coordination it requires, also places high demands on organizational ability.

Q: What is the team atmosphere in your opinion?

A: I think it comes down to "drive" and "attention to detail". "Drive" shows in everyone taking the initiative to get things done. It is a self-driven process in itself, born of curiosity and the wish to change the industry. This atmosphere is more like that of a start-up and is rare in large companies.

Q: You also mentioned that the team will "pick out details". How do you understand this?

A: This is about picking at details in real scenarios. In generation work, it is easy to make a beautiful demo, but in real applications the system runs into all kinds of detailed problems. To ensure the model always generates high-quality output that meets user needs, we place very strict requirements on system stability and robustness, which takes repeated polishing to get every detail right. For the demo, by contrast, we did not do much optimization.

Q: Is there any internal debate about "not doing too much demo optimization"?

A: Yes, especially among younger colleagues. After all, everyone wants to show their best side. But we still want results that can be deployed, so that users do not discover a big gap between the product and the demo when they actually use it, and so that we truly change the industry.

Q: Is the relevant technology currently applied in Doubao App?

A: Some of the related technologies have been in use for a while. We only present them publicly after users have approved of them in real scenarios. Some other technologies are going through final launch work.

Q: What keywords can summarize our team?

A: The first is professionalism. This shows in many areas, including data, infrastructure, and model design. We pay professional attention to the details of every stage and strive for the best possible performance from the perspective of industrial deployment.

The second word is focus and drive. In order to achieve our goals, focus and drive are indispensable. Therefore, everyone is very invested. When the results are actually achieved, everyone feels a sense of accomplishment and gains confidence.

The third word is unity. When working in a team, everyone has no sense of territoriality and the cooperation is very smooth. This makes me feel very comfortable, which is rare in large companies.

Q: What qualities does the team look for in the people it hopes to attract?

A: First, whether our values align. Ability is certainly one aspect, but more importantly we hope to find like-minded partners so that everyone can achieve self-realization. Collaboration built on shared values is naturally smooth.

Second, diversity of background. The methods used across the various fields of AI are now similar, and everyone is gradually converging in the same direction. Experience in reinforcement learning, visual recognition, audio recognition, and other fields therefore plays a crucial role in generation. We hope colleagues from different professional backgrounds will take part. I myself come from speech understanding and switched to TTS.

Finally, initiative, the ability to learn, and high standards for one's work. Generative tasks have many unique features, and we hope candidates can find where the task and their own experience meet, for which the ability to learn actively is essential. At the same time, we hope to build the best technology and products in the industry, and we need colleagues to keep moving forward every day with this vision in mind.

The above is what the Seed-TTS team members shared. The team is still continuing to recruit outstanding talents.

If you also have ideals and enthusiasm for large model technology and identify with the atmosphere of the Doubao model team, visit the team's official website at team.doubao.com or follow the team's official public account to learn more about technical progress, team stories, and recruitment information.

ByteDance Top Seed Talent Plan is recruiting. We hope to continue to attract and recruit top talents with ambitious goals and ambitions to "change the world with technology." Join us and you will work with the best scientists and engineers to participate in the industry's top technical challenges and tackle difficult problems.


