# Can't find a Chinese speech pre-trained model? Chinese versions of Wav2vec 2.0 and HuBERT have arrived
Speech pre-training models such as Wav2vec 2.0 [1], HuBERT [2], and WavLM [3], through self-supervised learning on tens of thousands of hours of unlabeled speech data (e.g., Libri-light), have significantly improved the performance of downstream speech tasks such as automatic speech recognition (ASR), text-to-speech (TTS), and voice conversion (VC). However, these models have no publicly available Chinese versions, which makes them inconvenient to apply in Chinese speech research.
WenetSpeech [4] is a multi-domain speech dataset with more than 10,000 hours of data, jointly released by the Audio, Speech and Language Processing Research Group (ASLP@NPU) of Northwestern Polytechnical University, Mobvoi, and Hillshell. To fill the gap in Chinese speech pre-training models, we have open-sourced Chinese versions of the Wav2vec 2.0 and HuBERT models, trained on 10,000 hours of WenetSpeech data.
To verify the quality of the pre-trained models, we evaluated them on the ASR task. Experimental results show that, on an ASR task with 100 hours of supervised data, the speech representations learned by the pre-trained models bring a significant performance improvement over traditional FBank acoustic features. With only 100 hours of supervised data, they can even achieve results comparable to models trained on 1,000 hours of supervised data.
Model link: https://github.com/TencentGameMate/chinese_speech_pretrain
## Model introduction

### Wav2vec 2.0 model

Figure 1: Wav2vec 2.0 model structure (Baevski et al., 2020)
Wav2vec 2.0 [1] is an unsupervised speech pre-training model published by Meta in 2020. Its core idea is to construct self-supervised training targets through Vector Quantization (VQ), heavily mask the input, and then train with a contrastive learning loss. The model structure is shown in Figure 1 above. A feature extractor based on a Convolutional Neural Network (CNN) encodes the raw audio into a sequence of frame features, and the VQ module converts each frame feature into a discrete feature q, which serves as the self-supervised target. At the same time, the frame feature sequence is masked and fed into a Transformer [5] model to obtain the contextual representation C. Finally, the contrastive loss pulls the contextual representation at each masked position closer to its corresponding discrete feature q, i.e., the positive sample pair. In the original paper, the Wav2vec 2.0 BASE model uses a 12-layer Transformer and is trained on the 960-hour LibriSpeech corpus, while the LARGE model uses a 24-layer Transformer and is trained on 60,000 hours of Libri-light data. In terms of training time, the BASE model takes 1.6 days on 64 V100 GPUs, and the LARGE model takes 5 days on 128 V100 GPUs. In the downstream ASR evaluation, even with only 10 minutes of supervised data, the system still achieves a Word Error Rate (WER) of 4.8%.
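To make the contrastive objective concrete, below is a minimal PyTorch sketch of an InfoNCE-style loss over masked positions, in the spirit of the description above. It is not the Fairseq implementation: the tensor layout, the within-utterance negative sampling, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, mask_indices,
                     num_negatives=100, temperature=0.1):
    """InfoNCE-style loss over masked positions (illustrative sketch).

    context:      (T, D) Transformer outputs C for one utterance
    quantized:    (T, D) discrete targets q from the VQ module
    mask_indices: (M,)   positions that were masked in the input
    """
    c = context[mask_indices]      # (M, D) predictions at masked positions
    pos = quantized[mask_indices]  # (M, D) positive targets

    # Sample distractors from the quantized targets at other masked
    # positions (a full implementation would exclude the positive itself).
    M = mask_indices.numel()
    neg_idx = torch.randint(0, M, (M, num_negatives))
    negs = pos[neg_idx]            # (M, K, D)

    # Cosine similarity between each prediction and its candidates;
    # the positive candidate is placed at index 0.
    candidates = torch.cat([pos.unsqueeze(1), negs], dim=1)  # (M, 1+K, D)
    logits = F.cosine_similarity(c.unsqueeze(1), candidates, dim=-1) / temperature
    targets = torch.zeros(M, dtype=torch.long)
    return F.cross_entropy(logits, targets)
```

A full implementation would also add the diversity loss that encourages uniform use of the quantizer codebooks, as in the original paper.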
### HuBERT model

Figure 2: HuBERT model structure (Hsu et al., 2021)
HuBERT [2] is a model published by Meta in 2021. Its structure is similar to Wav2vec 2.0; the difference lies in the training method. Wav2vec 2.0 discretizes speech features during training to obtain its self-supervised targets, whereas HuBERT obtains training targets by running K-means clustering on MFCC features or on HuBERT features. HuBERT is trained iteratively: for the BASE model, the first iteration clusters MFCC features, and the second iteration clusters intermediate-layer features extracted with the first-iteration HuBERT model; the LARGE and XLARGE models use features extracted with the second-iteration BASE model for clustering. According to the experimental results of the original paper, HuBERT outperforms Wav2vec 2.0, especially when the downstream task has very little supervised training data, such as 1 hour or 10 minutes.

## Experimental configuration

We use the 10,000 hours of Chinese data in the WenetSpeech [4] train_l set as unsupervised pre-training data. The data come mainly from YouTube and podcasts and cover a variety of recording conditions, background noises, and speaking styles; they span ten major domains, including audiobooks, narration, documentaries, TV series, interviews, news, readings, speeches, variety shows, and others. We trained Wav2vec 2.0 and HuBERT models with the Fairseq toolkit [6], following the model configurations of [1, 2]; each pre-trained model comes in two sizes, BASE and LARGE. For the BASE models we use 8 A100 GPUs with gradient accumulation of 8, simulating training on 64 GPUs; for the LARGE models we use 16 A100 GPUs with gradient accumulation of 8, simulating training on 128 GPUs.

## Downstream speech recognition task verification

To verify the effect of the pre-trained models on the downstream ASR task, we follow the Conformer [10] experimental configuration of the ESPnet toolkit [7, 8, 9]: the pre-trained model serves as a feature extractor, the hidden-layer representations it produces for the input speech are combined by a learnable weighted sum (sketched below), and the resulting speech representation replaces traditional FBank features as the input to the Conformer ASR model.

We first used the Aishell 178-hour training set as supervised data and compared the Character Error Rate (CER) obtained with FBank features, Wav2vec 2.0 BASE/LARGE model features, and HuBERT BASE/LARGE model features. We additionally compared the result of training on the 10,000 hours of Chinese data in the WenetSpeech train_l set, evaluated on the Aishell test set. The training data use speed perturbation (0.9, 1.0, and 1.1 times) and SpecAugment data augmentation; decoding uses beam search, and a Transformer-based language model is used for rescoring.

Table 1: Character error rate (CER%) results of different models on the Aishell test set

The results in Table 1 show that pre-training on ten thousand hours of unsupervised data significantly improves downstream ASR performance. In particular, with the HuBERT LARGE model we obtain a relative CER improvement of about 30% on the test set, the best result reported under 178 hours of supervised training data.
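Before turning to the WenetSpeech experiments, here is a minimal sketch of the learnable weighted sum over hidden layers mentioned above, assuming the pre-trained model returns one tensor per Transformer layer. The class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class WeightedSumFeatures(nn.Module):
    """Combine the hidden layers of a pre-trained speech model into a
    single representation that replaces FBank features downstream."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar per hidden layer, normalized with softmax.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: list of (batch, frames, dim) tensors, one per layer.
        stacked = torch.stack(hidden_states, dim=0)         # (L, B, T, D)
        weights = torch.softmax(self.layer_weights, dim=0)  # (L,)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)
```

Since the pre-trained model is used as a feature extractor in this setup, only these scalar weights (together with the Conformer ASR model) are learned on the downstream task.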
We then used the 100 hours of Chinese data in the WenetSpeech train_s set as supervised data and again compared the Character Error Rate (CER) obtained with FBank features, Wav2vec 2.0 BASE/LARGE model features, and HuBERT BASE/LARGE model features. We additionally compared FBank-feature models trained on the 1,000 hours of the WenetSpeech train_m set and the 10,000 hours of the train_l set. In this setting the training data use neither speed perturbation nor SpecAugment, decoding uses beam search, and no language-model rescoring is applied.

Table 2: Character error rate (CER%) results of different models on the WenetSpeech test set

As the results in Table 2 show, pre-training on ten thousand hours of unsupervised data greatly improves downstream ASR results. In particular, with HuBERT LARGE as the speech representation extractor, the ASR model trained on 100 hours of supervised data outperforms the model trained on 1,000 hours of FBank features, and even approaches the model trained on 10,000 hours of data.

For more experimental results on downstream speech tasks, please follow the GitHub link (https://github.com/TencentGameMate/chinese_speech_pretrain). Everyone is welcome to use the Chinese speech pre-trained models we provide in their research and to explore applications of speech pre-training in Chinese and related scenarios.
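As a quick start, the released checkpoints can be loaded as feature extractors through the HuggingFace transformers library. A minimal sketch follows; the checkpoint identifier `TencentGameMate/chinese-hubert-base` is an assumption based on the repository name, so please check the GitHub page for the exact model names and download instructions.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, HubertModel

# Checkpoint identifier is an assumption; see the GitHub repository
# for the exact model names.
MODEL_NAME = "TencentGameMate/chinese-hubert-base"

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = HubertModel.from_pretrained(MODEL_NAME)
model.eval()

# 16 kHz mono waveform; here a 1-second dummy signal for illustration.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(inputs.input_values, output_hidden_states=True)

# outputs.hidden_states is a tuple of per-layer (1, frames, dim) tensors
# that can be combined with the weighted sum shown earlier.
print(len(outputs.hidden_states), outputs.hidden_states[-1].shape)
```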