All of Douyin is speaking in dialect: two key technologies help you "understand" local dialects

PHPz (reposted)
2023-10-12 20:13:07 (1404 views)

During the National Day holiday, Douyin's "A dialect proves you are an authentic hometown native" campaign drew enthusiastic participation from netizens across the country. The topic topped the Douyin challenge list, with more than 50 million views.

This "Local Dialect Awards" campaign spread quickly online, thanks in large part to Douyin's newly launched automatic dialect translation function. When creators record short videos in their native dialect, they can turn on "automatic subtitles" and select "convert to Mandarin subtitles". The dialect speech in the video is then recognized automatically and converted into Mandarin subtitles, so that netizens from other regions can easily follow all kinds of "encrypted Mandarin". Netizens in Fujian tested it themselves and reported that even the Minnan (Southern Fujian) dialect, known for its "different pronunciation" from place to place, can be translated accurately, exclaiming that "the days when Minnan could do whatever it wanted on Douyin are gone."

As we all know, training speech recognition and machine translation models requires large amounts of training data, but dialects are passed on mainly as spoken language, and very little dialect data is available for model training. So how did the Volcano Engine technical team, which provided technical support for this feature, make a breakthrough?

Dialect recognition stage

For a long time, the Huoshan Voice team has provided intelligent video subtitle solutions based on speech recognition for popular video platforms. Simply put, the technology automatically converts the speech and lyrics in a video into text to assist video creation.

In the process, the technical team found that traditional supervised learning relies heavily on manually labeled data. This is a problem both for the continued optimization of high-resource languages and for the cold start of low-resource languages. Take high-resource languages such as Mandarin Chinese and English: although video platforms provide abundant speech data for business scenarios, once the labeled data reaches a certain scale, the return on further annotation is very low. The team therefore had to consider how to make effective use of millions of hours of unlabeled data to further improve speech recognition for high-resource languages.

For relatively niche languages or dialects, data labeling is expensive because of limited resources and manpower. With very little labeled data (on the order of 10 hours), supervised training performs very poorly and may even fail to converge. Purchased data, meanwhile, often does not match the target scenario and cannot meet business needs.

To address this, the team adopted the following solutions:

  1. Low resource dialect self-supervision

Building on the Wav2vec 2.0 self-supervised learning technique, the team proposed Efficient Wav2vec to achieve dialect ASR with very little labeled data. To address the slow training and unstable results of Wav2vec 2.0, they made improvements in two areas. First, they use filterbank features instead of the raw waveform, which reduces computation and shortens the sequence length, and they lower the frame rate at the same time, doubling training efficiency. Second, equal-length data streams and an adaptive continuous mask greatly improve the stability and results of training.
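The article does not describe how the adaptive continuous mask works. As a rough illustration only, the sketch below masks contiguous spans of frames in the style of Wav2vec 2.0 and, as an assumption of this example, adapts the span length to the frame rate so that each span covers about the same duration of audio. The function name `span_mask` and all parameter values are hypothetical.

```python
import random


def span_mask(num_frames, mask_prob, span_ms, frame_ms, seed=0):
    """Return a list of booleans; True marks a masked frame.

    Hypothetical sketch: each frame starts a masked span with probability
    mask_prob, and the span length in frames is derived from a target
    duration (span_ms) and the frame shift (frame_ms).
    """
    rng = random.Random(seed)
    # Adapt the span length to the frame rate (an assumption of this sketch).
    span = max(1, round(span_ms / frame_ms))
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < mask_prob:
            # Mask a contiguous run, truncated at the end of the sequence.
            for t in range(start, min(start + span, num_frames)):
                mask[t] = True
    return mask


# One second of audio at two assumed frame shifts: 10 ms and 40 ms.
fine = span_mask(num_frames=100, mask_prob=0.05, span_ms=200, frame_ms=10)
coarse = span_mask(num_frames=25, mask_prob=0.05, span_ms=200, frame_ms=40)
print(len(fine), sum(fine))
print(len(coarse), sum(coarse))
```

Lowering the frame rate shortens the sequence (25 frames instead of 100 here), which is where the savings in computation would come from; the mask then has to cover fewer, longer frames to hide a comparable amount of audio.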

The experiment was carried out on 50,000 hours of unlabeled speech and 10 hours of labeled speech in Cantonese. The results show that, compared with Wav2vec 2.0, Efficient Wav2vec (w2v-e) achieves a 5% relative reduction in CER with both the 100M and 300M parameter models, while halving the training overhead.
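For readers unfamiliar with the metric, CER (character error rate) is conventionally the character-level edit distance between the reference and the recognized text, divided by the reference length. The sketch below is a minimal illustration of that convention and of what a relative reduction means; the function names are illustrative and the article does not say which scoring tool the team used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, e.g. 0.05 for a 5% relative drop."""
    return (baseline - improved) / baseline


# Two of four characters differ, so the CER is 0.5.
print(cer("你好世界", "你好时间"))
# An illustrative drop from 20.0% to 19.0% CER is a 5% relative reduction.
print(relative_reduction(20.0, 19.0))
```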

Further, the team used the CTC model fine-tuned from the self-supervised pre-trained model as a seed model to pseudo-label the unlabeled data, and then used that data to train an end-to-end LAS model with fewer parameters. This not only migrates the model to a different structure but also reduces the amount of inference computation, and the result can be deployed directly on a mature end-to-end inference engine. The technique has been applied successfully to two low-resource dialects, achieving a character error rate below 20% with only 10 hours of annotated data.
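The article does not say how the CTC seed model turns its outputs into pseudo-labels. A common choice is greedy CTC decoding: take the best label per frame, collapse consecutive repeats, and drop the blank symbol. The sketch below assumes that approach; the vocabulary and scores are made up for illustration.

```python
def ctc_greedy_decode(frame_scores, vocab, blank=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    out = []
    prev = None
    for idx in best:
        # Emit a symbol only when it differs from the previous frame's label
        # and is not the blank symbol.
        if idx != prev and idx != blank:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)


# Hypothetical three-symbol vocabulary; index 0 is the CTC blank.
vocab = ["<blank>", "a", "b"]
frame_scores = [
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.7, 0.2],  # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.2, 0.6, 0.2],  # a
    [0.1, 0.1, 0.8],  # b
    [0.1, 0.2, 0.7],  # b (repeat, collapsed)
    [0.8, 0.1, 0.1],  # blank
]
print(ctc_greedy_decode(frame_scores, vocab))
```

The decoded text would then serve as the training target for the smaller LAS model.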

Figure: Comparison of model parameters and CER

Figure: Implementation process of ASR based on unsupervised training

  2. Large-scale dialect pre-training and fine-tuning

Once supervised data annotation is complete, how to keep optimizing the ASR model becomes an important research question, and semi-supervised and unsupervised learning have been very popular in recent years. The main idea of unsupervised pre-training is to make full use of unlabeled data sets to extend the labeled data sets, so that better recognition results can be achieved with only a small amount of labeled data. The algorithm works as follows:

(1) First, train a seed model on manually annotated supervised data.

(2) Then use this model to pseudo-label the unlabeled data. Not every prediction will be accurate, so some strategy is needed to filter out low-value data.

(3) Next, combine the generated pseudo-labels with the original labeled data and train jointly on the merged data.

(4) Because a large amount of unsupervised data is added during training, more general representations can often be obtained even though the pseudo-labels are of lower quality than supervised labels. The pre-trained model obtained from this large-scale training is then fine-tuned on carefully hand-annotated dialect data. This retains the strong generalization of the pre-trained model while improving recognition of dialects.
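The pseudo-labeling, filtering, and merging steps above can be sketched as follows. This is a minimal illustration, not the team's pipeline: the article does not name its filtering strategy, so a simple confidence threshold is assumed, and `seed_model`, `min_confidence`, and the toy data are hypothetical.

```python
def pseudo_label(seed_model, unlabeled, min_confidence=0.9):
    """Label each unlabeled utterance and keep only confident predictions."""
    kept = []
    for audio in unlabeled:
        text, confidence = seed_model(audio)
        # Assumed filtering strategy: drop predictions below a threshold.
        if confidence >= min_confidence:
            kept.append((audio, text))
    return kept


def build_training_set(labeled, unlabeled, seed_model, min_confidence=0.9):
    """Merge the human-labeled pairs with the filtered pseudo-labeled pairs."""
    return list(labeled) + pseudo_label(seed_model, unlabeled, min_confidence)


# Toy stand-in for a trained seed model: maps an utterance id to (text, confidence).
toy_predictions = {
    "utt3": ("hello", 0.97),
    "utt4": ("wrld", 0.42),  # low confidence, filtered out
}


def toy_seed_model(audio):
    return toy_predictions[audio]


labeled = [("utt1", "good morning"), ("utt2", "thank you")]
merged = build_training_set(labeled, ["utt3", "utt4"], toy_seed_model)
print(merged)
```

Joint training on the merged set and the final fine-tuning on dialect data would follow, and are outside the scope of this sketch.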

The average CER (character error rate) across the five dialects fell from 35.3% to 17.21%:

| Training setup | Average CER | Cantonese | Minnan | Beijing | Central Plains Mandarin | Southwest Mandarin |
|---|---|---|---|---|---|---|
| Single dialect | 35.3 | 14.05 | 61.56 | 48.87 | 41.29 | 10.7 |
| 100wh pre-training + mixed-dialect fine-tuning | 17.21 | 13.14 | 22.84 | 19.60 | 19.50 | 10.95 |
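As a quick sanity check, the per-dialect CER values can be averaged to see that they agree with the two reported averages, and the relative reduction can be computed from them. The sketch assumes the ten per-dialect values are read as listed in the code; it does not depend on which dialect each value belongs to.

```python
# Per-dialect CER values (%) as reported in the article.
single_dialect = [14.05, 61.56, 48.87, 41.29, 10.7]
mixed_finetune = [13.14, 22.84, 19.60, 19.50, 10.95]


def mean(values):
    return sum(values) / len(values)


avg_single = mean(single_dialect)   # reported as 35.3
avg_mixed = mean(mixed_finetune)    # reported as 17.21
relative = (avg_single - avg_mixed) / avg_single

print(f"single-dialect average: {avg_single:.2f}")
print(f"mixed fine-tuning average: {avg_mixed:.2f}")
print(f"relative reduction: {relative:.1%}")
```

Under these assumptions the averages match the reported figures, and the average CER is roughly halved.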

Dialect translation stage

Normally, training a machine translation model requires a large corpus. However, dialects are usually passed on in spoken form, and the number of dialect speakers is decreasing year by year. Both factors make dialect data harder to collect, which in turn makes it difficult to improve dialect machine translation.

To address the shortage of dialect data, the Huoshan Translation team proposed the multilingual translation models mRASP (multilingual Random Aligned Substitution Pre-training) and mRASP2. mRASP2 introduces contrastive learning, supplemented by an alignment augmentation method, to bring monolingual and bilingual corpora under a unified training framework. This makes full use of the available corpora to learn better language-independent representations, thereby improving multilingual translation performance.

Paper: https://arxiv.org/abs/2105.09501

The contrastive learning task is based on a classic assumption: the encoded representations of synonymous sentences in different languages should lie close together in high-dimensional space. Because such sentences have the same meaning, the output of the "encoding" process should be the same. For example, "早上好" and "Good morning" mean the same thing to someone who understands both Chinese and English, which corresponds to their encoded representations being adjacent in high-dimensional space.

Redesigned training objective

On top of the traditional cross-entropy loss, mRASP2 adds a contrastive loss and trains in a multi-task fashion. In the figure, the orange arrows indicate the part trained with the traditional cross-entropy loss (CE loss) for machine translation; the black part corresponds to the contrastive loss (CTR loss).
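To make the multi-task objective more concrete, here is a minimal sketch of a contrastive loss over sentence representations, combined with a cross-entropy term. It follows the general InfoNCE pattern (pull a sentence toward its translation, push it away from other sentences) rather than reproducing the paper's exact formulation; the toy vectors, the temperature, and the weight are assumptions of this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closest to its positive."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def total_loss(ce_loss, ctr_loss, weight=1.0):
    """Multi-task objective: cross-entropy plus weighted contrastive loss."""
    return ce_loss + weight * ctr_loss


# Toy 2-d sentence representations (hypothetical values).
zh_good_morning = [0.9, 0.1]
en_good_morning = [0.8, 0.2]   # translation of the anchor: should be close
en_unrelated = [-0.2, 0.9]     # unrelated sentence: should be far

ctr = contrastive_loss(zh_good_morning, en_good_morning, [en_unrelated])
print(ctr)
print(total_loss(ce_loss=2.0, ctr_loss=ctr, weight=1.0))
```

Minimizing the contrastive term moves translations of the same sentence toward each other in representation space, which is the assumption the section above describes.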

The word-alignment data augmentation method, also known as Aligned Augmentation (AA), was developed from the Random Aligned Substitution (RAS) method of mRASP.

In the diagram, Figure (a) shows the augmentation of a parallel corpus and Figure (b) shows the augmentation of a monolingual corpus. In Figure (a), the original English words are replaced with the corresponding Chinese words; in Figure (b), the original Chinese words are replaced with English, French, Arabic, and German words. mRASP's RAS is equivalent to the first replacement method, which needs only a bilingual synonym dictionary, while the second method needs a synonym dictionary covering multiple languages. Note that when using alignment augmentation, you can choose to use only the method in Figure (a) or only the method in Figure (b).
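The dictionary-based substitution described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the function name `aligned_substitute`, the replacement probability, and the tiny dictionaries are all hypothetical.

```python
import random


def aligned_substitute(tokens, dictionary, replace_prob=0.3, seed=0):
    """Randomly replace tokens with dictionary synonyms from other languages."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        candidates = dictionary.get(tok)
        if candidates and rng.random() < replace_prob:
            # Pick one of the available synonyms in another language.
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out


# (a) Parallel-corpus style: a bilingual English-to-Chinese dictionary.
bilingual = {"love": ["爱"], "singing": ["唱歌"]}
print(aligned_substitute(["I", "love", "singing"], bilingual, replace_prob=1.0))

# (b) Monolingual-corpus style: a multilingual dictionary for Chinese words.
multilingual = {"唱歌": ["singing", "chanter", "singen"]}
print(aligned_substitute(["我", "喜欢", "唱歌"], multilingual, replace_prob=1.0))
```

Tokens without a dictionary entry are left unchanged, so the output always has the same length as the input.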

Experimental results show that mRASP2 achieves improved translation effects in supervised, unsupervised, and zero-resource scenarios. Among them, the average improvement of supervised scenarios is 1.98 BLEU, the average improvement of unsupervised scenarios is 14.13 BLEU, and the average improvement of zero-resource scenarios is 10.26 BLEU.

This method has achieved significant performance improvements in a wide range of scenarios, and can greatly alleviate the problem of insufficient training data for low-resource languages.

Closing thoughts

Dialects and Mandarin complement each other, and both are important expressions of traditional Chinese culture. As a way of expression, dialect carries people's emotions and ties to their hometowns. Short videos combined with dialect translation can help users appreciate cultures from different regions across the country without barriers.

Currently, Douyin's "Dialect Translation" function supports Cantonese, Min, Wu (Beijing), Southwest Mandarin (Sichuan), Central Plains Mandarin (Shaanxi, Henan), and others. More dialects will reportedly be supported in the future, so let's wait and see.

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to request removal.