Chen Gen: Meta Takes the Lead and Launches Large AI Model MMS
By Chen Gen
How many languages do you know? There are more than 7,000 languages spoken in the world, yet most of us know only a handful, and today's computer speech recognition technology covers little more than 100 of them. To many people, even that sounds like an astronomical figure, but Meta's newly open-sourced speech model has pushed far beyond it.
Since choosing a different path from OpenAI and Google, Meta has gone deeper and deeper into open-source large models. Recently, Meta open-sourced a new AI speech model on GitHub: Massively Multilingual Speech (MMS). It can identify more than 4,000 spoken languages, roughly 40 times the coverage of previously known technology, and it expands speech-to-text and text-to-speech support from about 100 languages to more than 1,100. Most notably, MMS works in both directions: it supports ASR, converting speech to text, and TTS, converting text to speech.
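To make the ASR side concrete, here is a minimal sketch of running MMS speech recognition through the Hugging Face transformers integration of the open-source release. The facebook/mms-1b-all checkpoint, the language code, and the silent placeholder audio are illustrative assumptions, not a pipeline described in this article:

```python
# A minimal sketch, assuming the Hugging Face transformers MMS integration
# and the "facebook/mms-1b-all" checkpoint; the audio is a silent
# placeholder for a real 16 kHz mono recording.
import numpy as np
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

processor = AutoProcessor.from_pretrained("facebook/mms-1b-all")
model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")

# MMS uses small per-language adapters; switch tokenizer and model to French
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

audio = np.zeros(16000, dtype=np.float32)  # 1 second of silence as a stand-in
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

ids = torch.argmax(logits, dim=-1)[0]
print(processor.decode(ids))
```

Swapping in another of the 1,100+ supported language codes only requires changing the adapter, which is what makes the per-language coverage cheap at inference time.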
Meta's official blog specifically mentions Tatuyo, a language spoken by only a few hundred people. Supporting such a language has little everyday utility, but it is a valuable aid for research. So how do you find, and effectively refine, a dataset for a language spoken by only a few hundred people?
Meta said that to collect audio data across thousands of languages, it turned to an unconventional source: recordings of religious texts. "We turned to religious texts, such as the Bible, that have been translated into many different languages and whose translations have been widely studied for text-based language translation research. These translations have publicly available audio recordings of people reading them in different languages."
At the same time, in training MMS, Meta applied the company's self-supervised speech representation learning model, wav2vec 2.0, which lets the machine learn from raw audio without relying on labeled training data; with it, speech recognition models can be trained on much less labeled data.
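As an illustration of what these self-supervised representations look like in practice, the sketch below extracts frame-level features from a pretrained wav2vec 2.0 checkpoint; such features are what a downstream ASR head is fine-tuned on with comparatively little labeled data. The facebook/wav2vec2-base checkpoint and dummy audio are assumptions for illustration, and this is not Meta's training code:

```python
# A sketch (not Meta's training code): extracting self-supervised speech
# representations from a pretrained wav2vec 2.0 model.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = np.zeros(16000, dtype=np.float32)  # placeholder 16 kHz audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # One 768-dim vector per ~20 ms frame of audio
    features = model(**inputs).last_hidden_state
print(features.shape)  # e.g. (1, 49, 768) for one second of audio
```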
Regarding the model bias this approach might introduce, Meta said: "While this data comes from a specific domain and is often read by male speakers, our analysis shows that our models perform equally well for male and female voices. And while the content of the recordings is religious, our analysis shows that this does not bias the model toward producing more religious language."
When using a wav2vec 2.0 model with 1B parameters to train a multilingual speech recognition model covering more than 1,100 languages, the developers found that performance does degrade as the number of languages grows, but only very slightly: going from 61 languages to 1,107, the character error rate rises by only about 0.4%, while language coverage grows more than 17-fold.
On this point, Meta also made a detailed comparison with OpenAI's Whisper. The model trained on MMS data achieved half Whisper's word error rate with far less training data: only 45k hours of labeled data, roughly a tenth of what Whisper required, while supporting more than ten times as many languages, a substantial improvement. Meta also acknowledged that the new model is not perfect: "For example, there is some risk that the speech-to-text model may mistranscribe select words or phrases. However, we still believe that collaboration across the AI community is critical to developing AI technology responsibly." Meta has open-sourced the relevant models and code so that others in the research community can build on this work.
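For context, the word error rate (WER) being halved here is the standard ASR metric: the word-level edit distance (substitutions, deletions, and insertions) between the model's transcript and a reference, divided by the number of reference words. A small self-contained illustration, not Meta's evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between first i reference words
    # and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```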
Meta has not fully mapped out the future of large speech models, but it hopes a single model will eventually handle multiple speech tasks across all languages. "We trained separate models for speech recognition, speech synthesis, and language identification, but we have reason to believe that in the future, one model will be able to do all of these tasks and more, leading to better overall performance," Meta said.
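On the speech synthesis side, a minimal sketch of MMS text-to-speech is below, assuming the VITS-based integration in Hugging Face transformers and the facebook/mms-tts-eng English checkpoint; both names reflect the open-source release rather than anything stated in this article:

```python
# A minimal sketch of MMS text-to-speech, assuming the transformers VITS
# integration and the "facebook/mms-tts-eng" (English) checkpoint.
import torch
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer("Hello from a massively multilingual speech model.",
                   return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # shape (1, num_samples)

print(waveform.shape, model.config.sampling_rate)  # 16 kHz audio output
```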
Looking to the future, Meta hopes to expand MMS's coverage to support more languages and to improve its handling of dialects, further breaking down language barriers so that people in every corner of the world can communicate naturally through speech. It is an ambitious vision, and we believe that day will come sooner or later.