
Use NVIDIA Riva to quickly deploy enterprise-level Chinese voice AI services and optimize and accelerate them

WBOY (Original)
2024-06-10 21:57:48

1. Riva Overview

1. Overview


Riva is an SDK launched by NVIDIA for building real-time speech AI services. It is highly customizable and GPU-accelerated. Many pre-trained models are provided on NGC; they are ready to use out of the box and can be deployed directly with the ASR and TTS pipelines Riva provides.

To meet the needs of specific domains or to develop customized functions, users can also use NeMo to retrain or fine-tune these models, further improving model performance and adapting them to user needs.

Riva Skills is a highly customizable tool that leverages GPUs to accelerate real-time streaming speech recognition and speech synthesis, and it can handle thousands of concurrent requests. It supports multiple deployment platforms, including on-premises, cloud, and edge.

2. Riva ASR


For speech recognition, Riva uses highly accurate SOTA models such as Citrinet, Conformer, and NVIDIA's own FastConformer developed in NeMo. Currently, Riva supports more than ten monolingual models and also supports multilingual speech recognition, including English-Spanish, English-Chinese, and English-Japanese.

Through customized functions, the accuracy of the model can be further improved. For example, support for specific industry terminology, accents or dialects, and customization for noisy environments can help improve speech recognition performance.

Riva's overall framework can be applied to a variety of scenarios, such as customer service and conference systems. In addition to general scenarios, Riva's services can also be customized according to the needs of different industries, such as CSP, education, finance and other industries.

3. ASR Pipeline & Customization

In the Riva ASR pipeline, there are several customizable modules, which can be divided into three categories according to the difficulty of customization.


First, the orange box covers customizations that can be done on the client at inference time. For example, Riva supports a hot-word function: by adding product names or proper nouns at inference time, the model can recognize these specific words more accurately. This feature is natively supported by Riva and requires neither retraining the model nor restarting the Riva server.

In the purple box are some customizations that can be made when deploying. For example, Riva's streaming recognition provides two modes: latency optimization or throughput optimization, which can be selected according to business needs to obtain better performance. In addition, during the deployment process, the pronunciation dictionary can also be customized. With a customized pronunciation dictionary, you can ensure the correct pronunciation of a specific term, name, or industry jargon and improve the accuracy of speech recognition.

The green box covers customizations performed during training, that is, training and tuning done on the server side. For example, in the text normalization phase at the start of training, processing for specific text can be added. The acoustic model can also be fine-tuned or retrained to handle accents, noise, and other issues in specific business scenarios, making the model more robust. You can likewise retrain language models, fine-tune punctuation models, inverse text normalization, and so on.

The above are the customizable parts of Riva.

4. Riva TTS


The Riva TTS process is shown on the right side of the figure above and includes the following modules:

  • The first step is text normalization.
  • The second step is G2P (grapheme-to-phoneme), which converts the basic units of text into the basic units of pronunciation, for example words into phonemes.
  • The third step is spectrogram synthesis, which converts the phoneme sequence into an acoustic spectrogram.
  • The fourth step is audio synthesis, performed by a vocoder, which converts the spectrogram obtained in the previous step into audio.

In the figure above, taking the sentence "Hello World" as an example, the text first enters the text normalization module to be standardized, for example through case normalization. It then enters the G2P module, which converts the text into a phoneme sequence. Next, the spectrogram synthesis module produces a spectrogram through a neural network. Finally, the vocoder converts the spectrogram into the final audio.
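The four-stage pipeline above can be pictured with a toy sketch. This is purely illustrative: the mini-lexicon and stand-in functions are invented for the example, and Riva's actual modules are neural networks, not lookup tables.

```python
# Toy sketch of the TTS front-end stages described above. The mini-lexicon
# is invented for the example; Riva's real G2P and synthesis modules are
# neural models, not lookup tables.

def normalize(text: str) -> str:
    """Step 1: text normalization (reduced here to case folding)."""
    return text.lower()

# Hypothetical mini-lexicon for the G2P step (ARPABET-style symbols).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def g2p(text: str) -> list:
    """Step 2: grapheme-to-phoneme conversion via lexicon lookup."""
    phonemes = []
    for word in text.split():
        phonemes.extend(LEXICON.get(word, ["<unk>"]))
    return phonemes

# Steps 3 and 4 (spectrogram synthesis and the vocoder) would consume
# this phoneme sequence to produce a spectrogram and then audio.
phonemes = g2p(normalize("Hello World"))
```

In a real system the lexicon lookup is replaced by a learned G2P model so that out-of-vocabulary words still receive a plausible pronunciation.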

Riva provides streaming TTS using a combination of the popular FastPitch and HiFi-GAN models. It currently supports multiple languages, including English, Mandarin Chinese, Spanish, Italian, and German.

5. TTS Pipeline & Customization


Riva's TTS pipeline provides two ways to customize. The first is Speech Synthesis Markup Language (SSML), the easier of the two: through simple configuration, the pitch, rate, volume, etc. of the synthesized speech can be adjusted. This is usually the choice when you want to change the pronunciation of a specific word.

Another way is to fine-tune or retrain a FastPitch or HiFi-GAN model. Both models can be fine-tuned or retrained using your own specific data.

2. Latest Updates to the Chinese Speech Recognition Models

1. Overview


Over the past year, Riva has made some updates and improvements to the Chinese model. Here are some of the important updates.

First, the Chinese speech recognition (ASR) model continues to be optimized. The latest ASR models can be found at the corresponding links.

Second, support for a Unified Model was introduced. This means that speech recognition and punctuation prediction can be done in a single inference pass.

Third, support for a mixed Chinese-English model has been added, meaning the model can handle both Chinese and English speech input.

In addition, some new modules and features have been introduced, including neural-network-based Voice Activity Detection (VAD) and Speaker Diarization modules, as well as Chinese inverse text normalization (ITN). Details of these models can be found at the corresponding links.

2. Word Boosting

In addition, we also provide detailed tutorials for Chinese. The first is a tutorial on hot words (word boosting).


Hot words adjust the weight of specific words during recognition to make their recognition more accurate. The tutorial shows a Chinese-model example using the hot word "Wangyue", the title of an ancient poem, which is given a weight of 20. Next, the add_word_boosting_to_config method provided by Riva is used to add the desired words and their scores to the client configuration. The configured request is then sent to the ASR server to obtain recognition results with the hot words applied.

When configuring hot words, two parameters need to be set: boosted_lm_words and boosted_lm_score. boosted_lm_words is the list of words whose recognition accuracy we wish to improve; boosted_lm_score is the score assigned to these words, usually between 20 and 100.


Beyond the basic configuration, Riva's hot-word function also supports more advanced usage. For example, the weights of multiple words can be raised at the same time: in the example, we set weights of 20 and 30 for the words "five G" and "four G" respectively.

We can also use word boosting to suppress certain words by assigning them negative weights, thereby reducing their probability of occurrence. In the example, a Chinese character ("she") is given a score of -100, so the model will tend not to output that character. In theory, any number of hot words can be set without affecting latency. It is also worth noting that boosting is implemented on the client side and has no impact on the server side.
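The effect of positive and negative boosting scores can be pictured with a toy rescoring function. This is only a conceptual sketch of how a boost biases candidate scores; it is not Riva's decoder logic, and the candidate transcripts and base scores below are invented for the example.

```python
def apply_word_boosting(candidates, boosts):
    """Add the boost score of every boosted word appearing in a candidate
    transcript to that candidate's base score, then rank the candidates.

    candidates: list of (transcript, base_score) pairs
    boosts:     dict mapping word -> boost score (negative to suppress)
    """
    rescored = []
    for text, score in candidates:
        for word, boost in boosts.items():
            if word in text:
                score += boost
        rescored.append((text, score))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Invented candidates: without boosting, the second would win.
candidates = [("打开五G开关", -3.0), ("打开四季开关", -2.0)]
ranked = apply_word_boosting(candidates, {"五G": 20.0, "四季": -100.0})
```

With the positive boost, the "五G" hypothesis outranks its homophone; the large negative score pushes the unwanted word to the bottom, mirroring the suppression behavior described above.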

3. Finetuning Conformer AM

The second tutorial is about how to fine-tune the Conformer acoustic model.


Fine-tuning the ASR model is done with the NeMo toolkit. After configuring an NGC account, you can use the NGC download command to directly download the pre-trained Chinese model provided by Riva. In this example, the fifth version of the Chinese ASR model was downloaded. After the download completes, the pre-trained model needs to be loaded.

First, import the required packages and set the model-path parameter to the path of the model just downloaded. Next, use NeMo's ASRModel.restore_from function to obtain the model's configuration; its target field gives the class of the original ASR model. Then use the import_class_by_path function to get the actual model class. Finally, call that class's restore_from method to load the ASR model parameters from the specified path.


After loading the model, you can fine-tune it with the training script provided by NeMo. This example trains the CTC model using the script speech_to_text_ctc.py. Parameters to configure include train_ds.manifest_filepath, the path to the JSON manifest of the training data, as well as whether to use the GPU, the optimizer, and the maximum number of training epochs.
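A fine-tuning invocation along these lines might look roughly as follows. This is a sketch assuming NeMo's Hydra-style command-line overrides; the paths and hyperparameter values are placeholders rather than values from the talk, and the exact option names should be checked against the NeMo documentation.

```shell
python speech_to_text_ctc.py \
    model.train_ds.manifest_filepath=train_manifest.json \
    model.validation_ds.manifest_filepath=val_manifest.json \
    trainer.devices=1 \
    trainer.accelerator="gpu" \
    trainer.max_epochs=50 \
    model.optim.name="adamw" \
    model.optim.lr=1e-4
```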

After training, the model can be evaluated. When evaluating, be sure to set the use_cer parameter to true, because for Chinese we use character error rate (CER) as the metric. Once training and evaluation are complete, the nemo2riva command converts the NeMo model into a Riva model, which can then be deployed with Riva's Quickstart tool.
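Since the evaluation metric for Chinese is character error rate, a minimal self-contained sketch of how CER is computed may help. This is a generic edit-distance implementation, not NeMo's internal code.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance between the two
    character sequences, divided by the reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Single-row dynamic-programming edit distance.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + cost)    # substitution / match
    return dp[len(hyp)] / len(ref)
```

For example, a hypothesis that gets one of four characters wrong has a CER of 0.25; word error rate (WER), used for English, applies the same edit distance over words instead of characters.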

3. Riva TTS (Text-to-Speech) Service

Next, we will introduce the Riva TTS service.

1. Demo


This demonstration shows how the customization functions provided by Riva TTS make the synthesized speech more natural.

Next we will introduce the two customization methods provided by Riva TTS.

2. SSML


The first is the SSML (Speech Synthesis Markup Language) mentioned earlier, which is configured through markup. Through SSML, the prosody of TTS output, including pitch and rate, can be adjusted, as can the volume.

As shown in the figure above, for the first sentence, "Today is a sunny day", the prosody pitch is set to 2.5. For the second sentence, two configurations were made: its rate was set to high, and its volume was increased by 1 dB. In this way a customized result is obtained.
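Written as SSML, the two configurations described above might look roughly like this. This is a sketch based on the standard SSML prosody element; the exact attribute spellings and supported value ranges should be checked against the Riva TTS documentation, and the second sentence's text is a placeholder.

```xml
<speak>
  <prosody pitch="2.5">Today is a sunny day</prosody>
  <prosody rate="high" volume="+1dB">...second sentence here...</prosody>
</speak>
```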

3. TTS Finetuning using NeMo


In addition to SSML, you can also use NeMo tools to fine-tune or retrain Riva TTS's FastPitch or HiFi-GAN models.

Riva provides tutorials, and some pre-trained models are also available on NGC (see link in the image above).

The figure shows an example of fine-tuning the HiFi-GAN model. Use the hifigan_finetune.py command and configure parameters such as model configuration name, batch size, maximum number of iteration steps, and learning rate. Set the dataset path required to fine-tune HiFi-GAN by setting the train_dataset parameter. If you downloaded a pretrained model from NGC, you can also use the init_from_pretrained_model parameter to load the pretrained model. This way the HiFi-GAN model can be retrained.
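Such a fine-tuning call might take roughly the following shape. The values are placeholders and the exact override names should be checked against the NeMo tutorial linked above; `tts_hifigan` is used here only as an example name for a pretrained model from NGC.

```shell
python hifigan_finetune.py \
    --config-name=hifigan.yaml \
    model.train_ds.dataloader_params.batch_size=32 \
    trainer.max_steps=10000 \
    model.optim.lr=1e-5 \
    train_dataset=./manifest_train.json \
    +init_from_pretrained_model=tts_hifigan
```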

4. Riva Quickstart Tool

The customized model can be deployed using the Quickstart tool.

1. Preparation


Before starting, you need to register an NGC account, ensure your GPU is supported by Riva, and have a Docker environment installed.

Once preparations are complete, download Riva Quickstart via the link provided. If the NGC CLI has been configured, you can also use the NGC CLI to download Riva Quickstart directly.

2. Server startup and shutdown

After downloading Riva Quick Start, you can use the script provided to initialize, start and shut down the server.


Take the latest version of Riva (2.13.1) as an example. After the download is complete, simply run riva_init.sh, riva_start.sh, or riva_stop.sh to initialize, start, or shut down the server respectively.

If you want to use a Chinese model, just set the language code to zh-CN, and the tool will automatically download the corresponding pre-trained model. You can start the service to use the Chinese ASR (automatic speech recognition) and TTS (text-to-speech) functions.
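Put together, the deployment steps described above look roughly like the following. The NGC resource path and config.sh variable names here are approximate and may differ between Quickstart versions, so treat this as a sketch and check the Quickstart guide.

```shell
# Download Riva Quickstart via the NGC CLI (version 2.13.1 as an example)
ngc registry resource download-version nvidia/riva/riva_quickstart:2.13.1
cd riva_quickstart_v2.13.1

# In config.sh, set the language code to Chinese, e.g.:
#   asr_language_code=("zh-CN")
#   tts_language_code=("zh-CN")

bash riva_init.sh    # initialize: download and optimize the models
bash riva_start.sh   # start the server
bash riva_stop.sh    # shut the server down
```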

3. Riva Client


Once the server starts successfully, you can use the riva_start_client.sh script provided by Riva to call the service. For offline speech recognition, run the riva_asr_client command and specify the path of the audio file to recognize. For streaming speech recognition, use riva_streaming_asr_client. For speech synthesis, use riva_tts_client to send the text to be synthesized to the server you just started.
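The client calls described above look roughly like this. The audio paths and text are placeholders, and flag names may vary by version, so check each binary's --help output.

```shell
bash riva_start_client.sh                                  # enter the client container

riva_asr_client --audio_file=/path/to/audio.wav            # offline recognition
riva_streaming_asr_client --audio_file=/path/to/audio.wav  # streaming recognition
riva_tts_client --text="..." --audio_file=/tmp/output.wav  # speech synthesis
```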

5. Reference resources

The following are some Riva-related document resources:


Riva Official Documentation: This document provides detailed information about Riva, including installation, configuration, and usage guides, covering all aspects of Riva.

Riva Quick Start User Guide: This guide provides users with detailed instructions for Riva Quick Start, including installation and configuration steps, as well as answers to frequently asked questions. If you encounter any problems using Riva Quick Start, you can find the answers in this user guide.

Riva Release Notes: This document provides updated information about Riva's latest models. You can find out what's new and improved in each version here.

The above resources will help users better understand and use Riva.

The above is the content shared this time, thank you all.

6. Question and Answer Session

Q1: What is the relationship between Riva and Triton? Are there some functional overlaps?

A1: Yes, Riva uses NVIDIA Triton as its inference framework and is built with additional development on top of Triton.

Q2: Has Riva actually been applied in the RAG field, or in any open-source projects?

A2: Riva should currently focus mainly on the field of Speech AI.

Q3: Is there any relationship between Riva and Nemo?

A3: Riva is more focused on deployment. Models trained with NeMo can be deployed with Riva; NeMo can also be used for fine-tuning and training, and the fine-tuned models can then be deployed in Riva.

Q4: Can models trained by other frameworks be applied?

A4: Training with other frameworks is temporarily not supported, or requires some additional development work.

Q5: Can Riva deploy models from the PyTorch or TensorFlow training framework?

A5: Riva currently mainly supports models trained with NeMo. NeMo itself is developed on top of PyTorch.

Q6: If I customize a new model in Nemo, do I need to write deployment code in Riva?

A6: For self-developed models, if you want to support them in Riva, you need to do some additional development.

Q7: Can Riva be used with small memory GPUs?

A7: You can refer to the adaptation platform related documents provided by Riva, which includes the adaptation of different types of GPUs.

Q8: How to quickly try Riva?

A8: You can try Riva by downloading the Riva Quickstart toolkit directly on NGC.

Q9: If Riva wants to support Chinese dialects, will Riva need customized training?

A9: Yes. You can take your own dialect data, fine-tune the pre-trained model provided by Riva, and then deploy it in Riva.

Q10: Are there any overlaps or differences in the positioning of Riva and Tensor LM?

A10: Riva's acceleration actually uses TensorRT. Riva is a product built on TensorRT and Triton.

