
An in-depth look at how Huoshan Voice brought unsupervised pre-training to production through 'algorithm optimization + engineering innovation'

WBOY (forwarded) | 2023-04-08 12:44:18

For a long time, Volcano Engine has provided intelligent video subtitling solutions based on speech recognition technology for popular video platforms. Simply put, this feature uses AI to automatically convert the speech and lyrics in a video into text to assist video creation. However, with the rapid growth of platform users and the demand for richer and more diverse languages, the supervised learning techniques traditionally used have increasingly hit their limits, putting the team in real difficulty.

As is well known, traditional supervised learning relies heavily on manually annotated data, both for the continuous optimization of major languages and for the cold start of minority languages. Taking major languages such as Mandarin Chinese and English as an example, although the video platform provides sufficient speech data for business scenarios, once the supervised data reaches a certain scale the ROI of further annotation becomes very low, so engineers inevitably need to consider how to effectively use hundreds of thousands of hours of unlabeled data to further improve recognition performance for major languages.

For relatively niche languages or dialects, data labeling is expensive due to limited resources and manpower. When labeled data is very scarce (on the order of 10 hours), supervised training performs very poorly and may even fail to converge; meanwhile, purchased data often does not match the target scenario and cannot meet business needs.

The Volcano Engine speech team therefore urgently needed to study how to make full use of large amounts of unlabeled data at the lowest possible labeling cost, improve recognition with only a small amount of labeled data, and put the result into real business use. Unsupervised pre-training thus became the key to extending the video platform's ASR (Automatic Speech Recognition) capability to minority languages.

Although academia has made significant progress in unsupervised speech pre-training in recent years, including Wav2vec 2.0 [1] and HuBERT [2], there are few industrial deployments to draw on. Overall, the Volcano Voice team believes three factors hinder the adoption of unsupervised pre-training:

  1. Large model size and high inference overhead. Exploiting large amounts of unlabeled data requires unsupervised pre-training with a larger model to obtain high-quality speech representations, but deploying such a model online directly incurs high inference costs.
  2. Unsupervised pre-training only learns speech representations. To achieve the desired accuracy, the resulting acoustic model must be decoded jointly with a language model trained on large amounts of plain text, which is incompatible with an end-to-end ASR inference engine.
  3. Unsupervised pre-training is expensive, slow and unstable. Taking Wav2vec 2.0 as an example, pre-training a 300M-parameter model for 600,000 steps on 64 V100 GPUs takes up to half a month. In addition, because of differences in data distribution, training on business data is prone to divergence.

To address these three pain points, the team carried out algorithm improvements and engineering optimizations, forming a complete and easily reproducible deployment plan. This article introduces the solution in detail in terms of the implementation process, algorithm optimization and engineering optimization.

Implementation process

The figure below shows the implementation process of unsupervised pre-training for low-resource-language ASR, which can be roughly divided into three stages: data collection, seed model training and model migration.

ASR implementation process based on unsupervised pre-training

Specifically, the first stage, data collection, gathers unlabeled speech, labeled speech and plain-text data in the target language through language-based traffic routing, procurement and other means.

The second stage, seed model training, is the classic "unsupervised pre-training plus supervised fine-tuning" process. This stage produces an acoustic model, usually fine-tuned with the Connectionist Temporal Classification (CTC [3]) loss. Combined with a language model trained on plain text, the acoustic model forms a complete speech recognition system that can achieve good recognition results. It is called a seed model because it is not suitable for direct deployment: Volcano Engine prefers to deploy end-to-end models such as LAS (Listen, Attend and Spell [4]) or RNN-T (Recurrent Neural Network Transducer [5]) online.
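As a rough illustration of the "supervised fine-tuning" step, the sketch below fine-tunes a stand-in encoder with PyTorch's CTC loss. The encoder, vocabulary size and hyper-parameters are hypothetical placeholders rather than the team's actual configuration; in practice the encoder would be initialized from the unsupervised pre-training checkpoint.

```python
import torch
import torch.nn as nn

FEAT_DIM, HIDDEN_DIM, VOCAB_SIZE = 80, 768, 60  # hypothetical sizes

# Stand-in for the pre-trained context network; in practice this would be
# loaded from the unsupervised pre-training checkpoint.
encoder = nn.Sequential(
    nn.Linear(FEAT_DIM, HIDDEN_DIM),
    nn.ReLU(),
    nn.Linear(HIDDEN_DIM, HIDDEN_DIM),
)
# A linear head maps encoder states to per-frame vocabulary logits; CTC then
# aligns the frame-level predictions with the (shorter) label sequence.
ctc_head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(ctc_head.parameters()), lr=1e-5)

def fine_tune_step(feats, feat_lens, labels, label_lens):
    """One CTC fine-tuning step on a batch of (features, transcript) pairs."""
    log_probs = ctc_head(encoder(feats)).log_softmax(-1)    # (B, T, V)
    loss = ctc_loss(log_probs.transpose(0, 1),              # CTCLoss wants (T, B, V)
                    labels, feat_lens, label_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: 2 utterances, 200 frames each, transcripts of 20 tokens.
feats = torch.randn(2, 200, FEAT_DIM)
feat_lens = torch.full((2,), 200, dtype=torch.long)
labels = torch.randint(1, VOCAB_SIZE, (2, 20))   # 0 is reserved for blank
label_lens = torch.full((2,), 20, dtype=torch.long)
print(fine_tune_step(feats, feat_lens, labels, label_lens))
```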

The main reason is that LAS/RNN-T has excellent end-to-end modeling capability, has surpassed the traditional CTC model in recent years, and is increasingly used in industry. Volcano Engine has done a great deal of optimization work on the inference and deployment of end-to-end speech recognition models and has formed a relatively mature solution that supports many businesses. If the end-to-end inference engine can be reused without losing accuracy, the operation and maintenance cost of the engine can be significantly reduced.

With this in mind, the team designed the third stage: model migration. It mainly draws on the idea of knowledge distillation: the seed model pseudo-labels the unlabeled data, and a LAS model with far fewer parameters is then trained on these pseudo labels, simultaneously achieving migration of the model structure and compression of inference computation. The effectiveness of the whole process was verified on Cantonese ASR, with results shown in the following table:

Cantonese ASR experimental results

First, the team purchased 1,000 hours (1kh) of off-the-shelf data for comparison. Training the LAS model directly on it performed poorly, with a character error rate (CER) as high as 44.2%. After analysis, Volcano Engine concluded that the main reason was the domain mismatch between the purchased data (conversation) and the business test set (video). Preliminary experiments with wav2vec 2.0 showed a similar phenomenon.

Compared with pre-training on the purchased data, pre-training on data consistent with the target domain reduced the CER on the business test set from 42.0% to 29.4%; when the unlabeled business data accumulated to 50kh and the model parameters grew from 100M to 300M, the CER dropped further to 23.1%.

Finally, Volcano Engine verified the effect of model migration: the seed model, combined with a Cantonese language model, decoded 50kh of unlabeled data to obtain pseudo labels, on which a LAS model was trained. The pseudo-label-trained LAS model essentially preserves the recognition accuracy of the CTC seed model while cutting the number of parameters by one third, and it can be deployed online directly on the mature end-to-end inference engine.

Comparison of model parameters and CER
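The model-migration (pseudo-label distillation) step can be pictured with the minimal sketch below: a stand-in "seed model decoder" produces pseudo transcripts for unlabeled audio, and a much smaller attention-based encoder-decoder is trained on them with ordinary cross-entropy under teacher forcing. All module names, sizes and the random decoder output are hypothetical placeholders, not the production system.

```python
import torch
import torch.nn as nn

FEAT_DIM, HIDDEN, VOCAB = 80, 256, 60   # hypothetical sizes

def seed_model_decode(feats: torch.Tensor) -> torch.Tensor:
    """Placeholder for decoding with the CTC seed model plus language model;
    here it simply returns random token ids so the sketch runs end to end."""
    return torch.randint(1, VOCAB, (feats.size(0), 20))

class TinyAttentionASR(nn.Module):
    """A very small encoder-decoder standing in for the production LAS model."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, HIDDEN)
        self.seq2seq = nn.Transformer(d_model=HIDDEN, nhead=4,
                                      num_encoder_layers=2,
                                      num_decoder_layers=2, batch_first=True)
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, feats, prev_tokens):
        src = self.proj(feats)                       # (B, T_src, H)
        tgt = self.embed(prev_tokens)                # (B, T_tgt, H)
        causal = self.seq2seq.generate_square_subsequent_mask(tgt.size(1))
        return self.out(self.seq2seq(src, tgt, tgt_mask=causal))

student = TinyAttentionASR()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

unlabeled_feats = torch.randn(4, 200, FEAT_DIM)     # toy unlabeled batch
pseudo_labels = seed_model_decode(unlabeled_feats)  # (4, 20)

# Teacher forcing: predict token t from pseudo-label tokens < t.
logits = student(unlabeled_feats, pseudo_labels[:, :-1])
loss = criterion(logits.reshape(-1, VOCAB), pseudo_labels[:, 1:].reshape(-1))
optimizer.zero_grad(); loss.backward(); optimizer.step()
print(loss.item())
```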

Finally, with the model structure and parameter count unchanged, the team used 50kh of unlabeled business data plus 10 hours of labeled business data to reach a CER of 23.0%, a 48% relative reduction compared with the baseline model. Having solved the problems of online computation cost and engine compatibility, the focus turned to the core unsupervised pre-training technology in the pipeline. For wav2vec 2.0, Volcano Engine carried out optimization along two dimensions: algorithm and engineering.

Algorithm optimization

wav2vec 2.0, a self-supervised pre-training model proposed by Meta AI in 2020, opened a new chapter in unsupervised speech representation learning. Its core idea is to discretize the input features with a quantization module, randomly mask part of the input features in a BERT-like fashion, and optimize the main body of the model with a contrastive learning objective.

wav2vec 2.0 model structure diagram (source: wav2vec 2.0, Figure 1 [1])

Two problems were encountered when training the wav2vec 2.0 model on business data: first, training efficiency is low, with the 300M model taking more than ten days on 64 GPUs; second, training is unstable and prone to divergence. Volcano Engine therefore proposed Efficient wav2vec to alleviate both problems.

For the problem of low training efficiency, the team accelerated training by reducing the model's frame rate: the raw-waveform input was replaced with filterbank features, and the frame rate was changed from the original 20 ms to 40 ms. This greatly reduces the computation of the convolutional feature extractor and also shortens the sequence length inside the Transformer, thereby improving training efficiency. The problem of unstable training was addressed by analyzing how unsupervised pre-training learns and weighing this against the actual characteristics of the business data. The contrastive learning loss can be expressed by the following formula:

\mathcal{L}_m = -\log \frac{\exp\big(\mathrm{sim}(c_t, q_t)/\kappa\big)}{\sum_{\tilde{q} \in Q_t} \exp\big(\mathrm{sim}(c_t, \tilde{q})/\kappa\big)}

where sim(·,·) denotes cosine similarity and κ is a temperature constant.

For each frame t, c_t denotes the encoder (context network) output for that frame and q_t its quantized output. In addition, several other frames are sampled as negative examples, so the current frame together with the negative frames forms a dynamically constructed candidate set (vocabulary) Q_t.

The objective of contrastive learning is to maximize the similarity between the current frame's encoding and its own quantized output, while minimizing its similarity to the quantized outputs of other frames. Clearly, the similarity between negative and positive samples and the number of negative samples directly determine how well contrastive learning works. In practice, business utterances are short on average, and drawing only about 50 negative samples from within one sentence is far from sufficient. Moreover, since adjacent speech frames are highly similar, the masked regions need to be kept contiguous to make representation reconstruction harder.
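A minimal PyTorch sketch of this contrastive objective is shown below; tensor shapes, the number of negatives and the temperature are illustrative assumptions (the actual wav2vec 2.0 loss also includes a codebook diversity term [1]).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c, q_pos, q_neg, temperature=0.1):
    """c:     (B, T, D)    context network outputs c_t
       q_pos: (B, T, D)    quantized targets q_t for the same frames
       q_neg: (B, T, K, D) K quantized distractors sampled from other frames"""
    pos = F.cosine_similarity(c, q_pos, dim=-1).unsqueeze(-1)                  # (B, T, 1)
    neg = F.cosine_similarity(c.unsqueeze(2).expand_as(q_neg), q_neg, dim=-1)  # (B, T, K)
    logits = torch.cat([pos, neg], dim=-1) / temperature                       # (B, T, 1+K)
    # The positive candidate sits at index 0, so the loss is a softmax
    # cross-entropy pushing c_t towards q_t and away from the distractors.
    targets = torch.zeros(logits.shape[:-1], dtype=torch.long)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Toy shapes: batch 2, 50 frames, 256-dim representations, 50 distractors.
B, T, D, K = 2, 50, 256, 50
print(contrastive_loss(torch.randn(B, T, D), torch.randn(B, T, D),
                       torch.randn(B, T, K, D)))
```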

To address these two issues, Volcano Engine proposed two improvements (a small code sketch follows the list):

  1. Equal-length data stream: during pre-training, the entire training set is treated as one long audio stream formed by concatenating all utterances end to end, and each training sample is a fixed-length segment cut from this stream. This guarantees a sufficient number of negative samples and keeps the sequence length inside the context encoding network consistent across frame rates, thereby making training more robust.
  2. Adaptive continuous mask: to reduce the impact of data noise on training, a smaller mask length is chosen and every masked region is forced to be contiguous, with the audio duration covered by a masked region kept equivalent across frame rates. This lowers the difficulty of contrastive learning on noisy data while adapting to different frame rates.
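A rough sketch of the two ideas follows; all lengths, ratios and feature dimensions are illustrative assumptions rather than the team's actual hyper-parameters.

```python
import torch

FRAME_RATE_MS = 40   # Efficient wav2vec frame rate (vs. the original 20 ms)
CHUNK_FRAMES = 400   # fixed training-sample length in frames (assumed value)
MASK_SPAN_MS = 200   # assumed audio duration covered by one mask span
MASK_RATIO = 0.5     # assumed fraction of frames to mask

def equal_length_chunks(utterances):
    """Concatenate all utterances into one stream and cut fixed-length samples."""
    stream = torch.cat(utterances, dim=0)            # (total_frames, feat_dim)
    n_chunks = stream.size(0) // CHUNK_FRAMES
    return stream[: n_chunks * CHUNK_FRAMES].reshape(
        n_chunks, CHUNK_FRAMES, stream.size(1))

def contiguous_mask(n_frames):
    """Sample contiguous mask spans whose audio duration is frame-rate independent."""
    span = MASK_SPAN_MS // FRAME_RATE_MS             # span length in frames
    n_spans = int(n_frames * MASK_RATIO) // span
    mask = torch.zeros(n_frames, dtype=torch.bool)
    starts = torch.randperm(n_frames - span)[:n_spans]
    for s in starts.tolist():                        # spans may overlap; a real
        mask[s : s + span] = True                    # implementation would resample
    return mask

# Toy data: 10 utterances of varying length with 80-dim filterbank features.
utts = [torch.randn(torch.randint(50, 300, (1,)).item(), 80) for _ in range(10)]
chunks = equal_length_chunks(utts)       # every training sample has equal length
mask = contiguous_mask(CHUNK_FRAMES)     # frames to mask within one sample
print(chunks.shape, int(mask.sum()), "frames masked")
```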

Comparing wav2vec 2.0 (w2v) and Efficient wav2vec (w2v-e) on business data yields the results in the table below (all models were trained on 64 V100 GPUs):

Comparison of wav2vec 2.0 (w2v) and Efficient wav2vec (w2v-e) on business data

It can be seen that the improved Efficient wav2vec delivers a stable 5% improvement over the original wav2vec 2.0, while nearly doubling training efficiency.

Engineering Optimization

Although Efficient wav2vec nearly doubled training efficiency at the algorithm level, the 300M model's large communication volume still caused fluctuations in training communication and low multi-machine scaling efficiency. The Volcano Engine speech team summarized the work as follows: "To improve the communication efficiency of model pre-training under synchronous gradient updates, we implemented bucketed group-communication optimization in the communication backend on top of the BytePS distributed training framework, bringing a 10% improvement in data-parallel efficiency; we also implemented an adaptive parameter reordering (Parameter Reorder) strategy to address the waiting caused by the mismatch between the order of parameter definition and the order of gradient updates." On top of these optimizations, combined with techniques such as gradient accumulation, the single-card scaling efficiency of the 300M model rose from 55.42% to 81.83%, multi-machine scaling efficiency rose from 60.54% to 91.13%, and a model that previously took 6.5 days to train can now be trained in 4 days, a roughly 40% reduction in time.
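The BytePS-level bucket and parameter-reorder optimizations are internal to the team's framework, but the general idea of trading communication for computation can be sketched with stock PyTorch: accumulate gradients over several micro-batches before synchronizing, and tune the gradient-bucket size when wrapping the model in DistributedDataParallel. Model sizes and step counts below are illustrative.

```python
import torch
import torch.nn as nn

# Toy stand-in for the 300M pre-training model.
model = nn.Sequential(nn.Linear(80, 1024), nn.ReLU(), nn.Linear(1024, 80))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

ACCUM_STEPS = 4  # accumulate 4 micro-batches per optimizer step, so gradient
                 # synchronization happens 4x less often

def train(batches):
    optimizer.zero_grad()
    for step, (x, y) in enumerate(batches, start=1):
        loss = nn.functional.mse_loss(model(x), y) / ACCUM_STEPS
        loss.backward()  # under DistributedDataParallel, the first
                         # ACCUM_STEPS-1 backwards would run inside
                         # `model.no_sync()` to skip the all-reduce
        if step % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()

# With DDP, communication buckets can also be tuned, e.g.
# torch.nn.parallel.DistributedDataParallel(model, bucket_cap_mb=100),
# the stock analogue of the bucketed group-communication described above.
train([(torch.randn(8, 80), torch.randn(8, 80)) for _ in range(8)])
```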

In addition, to support the large-model, big-data scenarios to be explored in the future, the Volcano Engine speech team further built a set of atomic capabilities for very large models. First, a local OSS (optimizer state sharding) technique was implemented, which avoids the inter-machine scaling-efficiency problem while removing most of the redundant memory occupied by optimizer states; next, lazy initialization of buckets in synchronous gradient communication was supported, cutting GPU memory usage by an amount equal to twice the model's parameter size, which greatly lowers peak memory and suits very large models where GPU memory is tight; finally, model parallelism and pipeline parallelism were supported on top of data parallelism, with verification and customized support completed on 1B- and 10B-parameter models. This series of optimizations lays a solid foundation for training large models on big data.
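The "local OSS" mentioned above is the team's own optimizer state sharding inside BytePS; as an analogy only, the sketch below shows the same idea with PyTorch's stock ZeroRedundancyOptimizer, which partitions optimizer states across data-parallel workers so that each rank keeps only a shard. Model size and hyper-parameters are illustrative, and the script is meant to be launched with torchrun.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # e.g. `torchrun --nproc_per_node=4 zero_sketch.py`
    dist.init_process_group(backend="gloo")
    model = DDP(nn.Sequential(nn.Linear(80, 1024), nn.ReLU(), nn.Linear(1024, 80)))

    # Optimizer states (AdamW moments) are sharded across ranks instead of
    # being fully replicated, removing most of the redundant optimizer memory.
    optimizer = ZeroRedundancyOptimizer(model.parameters(),
                                        optimizer_class=torch.optim.AdamW,
                                        lr=5e-4)

    x, y = torch.randn(8, 80), torch.randn(8, 80)
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    if dist.get_rank() == 0:
        print("one sharded-optimizer step done, loss =", loss.item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```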

Currently, using this low-resource ASR implementation process, two low-resource languages have successfully gone live for video subtitling and content moderation services. Beyond speech recognition, the wav2vec 2.0-based pre-trained model has also brought significant gains on many other downstream tasks, including audio event detection, language identification and emotion detection, and will gradually be applied to related businesses such as video content moderation, recommendation and analysis, audio routing, and e-commerce customer-service sentiment analysis. The adoption of unsupervised pre-training will significantly reduce the labeling cost of all kinds of audio data, shorten the labeling cycle, and enable rapid response to business needs.

Summary and Outlook

In practice, Volcano Engine has worked out a low-resource-language ASR deployment solution based on wav2vec 2.0 that resolves the problem of high inference overhead and connects seamlessly with the end-to-end engine. To address wav2vec 2.0's core problems of low training efficiency and instability, Efficient wav2vec was proposed: compared with wav2vec 2.0, downstream performance improves by 5% and pre-training time is halved; combined with the engineering optimizations, the final pre-training time is 70% shorter than the original. In the future, Volcano Engine will continue to explore three directions:

  1. Unsupervised algorithm upgrades: since wav2vec 2.0, research on unsupervised speech pre-training has flourished, and the team will keep following the latest work and absorbing it into business scenarios. At this stage the focus is on unsupervised models such as HuBERT [2], MAE [6] and data2vec [7], exploring their performance on downstream tasks. Going forward, unsupervised models will be improved along two lines: designing efficient, adaptive unsupervised schemes for specific business scenarios, and designing general unsupervised models that improve a wide range of downstream tasks.
  2. Multilingual and multimodal: there is already considerable research combining unsupervised pre-training with multiple languages, such as XLSR [8]. Building on this, Volcano Engine proposed S3Net [9], which partitions the pre-trained model into multiple sparse sub-networks that model different languages, effectively alleviating mutual interference between languages (language interference) and bringing significant gains for high-resource languages. Existing work focuses mainly on the audio encoder, while today's mainstream end-to-end models adopt an encoder-decoder structure, i.e. joint audio-text multimodal modeling. The team judges that audio-only pre-training can no longer satisfy the needs of end-to-end models, and will explore audio-text multimodal pre-training, including joint modeling of massive non-parallel audio and text with end-to-end models, as well as purely unsupervised multimodal pre-training.
  3. Big data and big models: the performance of the existing model is close to saturation at the 100,000-hour scale. Starting from a model trained on 100,000 hours of labeled Chinese data, the team used 1 million hours of unlabeled data for NST [10] training and achieved a relative 7% CER reduction on the general test set; at the same time the model's generalization improved markedly, with the average CER on a 20-domain test set dropping by 15%. Fully absorbing data on the order of millions of hours requires a larger model, and Volcano Engine has made preliminary progress on models at the 1B-parameter level. Large models have a high performance ceiling, but the accompanying problem is that they are hard to deploy. To bring large models into actual business, various model compression methods such as matrix decomposition, weight pruning and knowledge distillation will be tried, aiming for compression with as little accuracy loss as possible.

Huoshan Voice (Volcano Voice) has long served ByteDance's business lines with cutting-edge speech technology, which is opened up through Volcano Engine to provide industry-leading AI speech capabilities and full-stack speech product solutions, including audio understanding, audio synthesis, virtual digital humans, conversational interaction, music retrieval, smart hardware and more. Currently, Volcano Engine's speech recognition and speech synthesis cover multiple languages and dialects, many of its technical papers have been accepted at top AI conferences, and it provides leading speech capabilities for Douyin, Jianying, Feishu, Tomato Novels, Pico and other businesses, serving diverse scenarios such as short video, live streaming, video creation, office work and wearable devices.

References

[1] Baevski, A., Zhou, Y., Mohamed, A. and Auli, M., 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, pp.12449-12460.

[2] Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R. and Mohamed, A., 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, pp.3451-3460.

[3] Graves, A., Fernández, S., Gomez, F. and Schmidhuber, J., 2006, June. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376).

[4] Chan, W., Jaitly, N., Le, Q. and Vinyals, O., 2016, March. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4960-4964). IEEE.

[5] Graves, A., 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

[6] He, K., Chen, X., Xie, S., Li, Y., Dollár, P. and Girshick, R., 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000-16009).

[7] Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J. and Auli, M., 2022. Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555.

[8] Conneau, A., Baevski, A., Collobert, R., Mohamed, A. and Auli, M., 2020. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.

[9] Lu, Y., Huang, M., Qu, X., Wei, P. and Ma, Z., 2022, May. Language adaptive cross-lingual speech representation learning with sparse sharing sub-networks. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6882-6886). IEEE.

[10] Park, D.S., Zhang, Y., Jia, Y., Han, W., Chiu, C.C., Li, B., Wu, Y. and Le, Q.V., 2020. Improved noisy student training for automatic speech recognition. arXiv preprint arXiv:2005.09629.


Statement: This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn for deletion.