


Demystifying some of the AI-based voice enhancement techniques used in real-time calls
Background Introduction
Now that real-time audio and video communication (RTC) has become an indispensable part of people's lives and work, the technologies behind it keep evolving to handle complex, multi-scenario problems, such as how to deliver a clear and realistic listening experience when many devices, many speakers, and many noise sources are present at once.
As the flagship international conference in speech signal processing, ICASSP (International Conference on Acoustics, Speech and Signal Processing) has long represented the most cutting-edge research directions in acoustics. ICASSP 2023 accepted a number of papers on speech enhancement algorithms, and four of them come from the Volcano Engine RTC audio team, covering speaker-specific speech enhancement, echo cancellation, multi-channel speech enhancement, and sound quality restoration. This article introduces the core scenario problems addressed by these four papers and their technical solutions, and shares the team's thinking and practice in voice noise reduction, echo cancellation, and interfering-voice elimination.
"Speaker-specific enhancement based on frequency band segmentation recurrent neural network"
Paper address:
https://www.php.cn/link/73740ea85c4ec25f00f9acbd859f861d
Real-time speaker-specific speech enhancement raises several difficulties. First, capturing the full audio bandwidth increases the modeling burden. Second, compared with non-real-time scenarios, it is harder for a real-time model to lock onto the target speaker, so improving the information exchange between the speaker embedding vector and the speech enhancement model is a key challenge. Inspired by human auditory attention, Volcano Engine proposes a Speaker Attentive Module (SAM) that injects speaker information and fuses it with a single-channel speech enhancement model, the Band-Split Recurrent Neural Network (BSRNN), to build a speaker-specific speech enhancement system that serves as a post-processing module for the echo cancellation model, with the two models optimized jointly as a cascade.
Model framework structure
Band-split recurrent neural network (BSRNN)
The Band-Split Recurrent Neural Network (Band-split RNN, BSRNN) is a SOTA model for full-band speech enhancement and music separation; its structure is shown in the figure above. BSRNN consists of three modules: the band-split module, the band and sequence modeling module, and the band-merge module. The band-split module first divides the spectrum into K frequency bands; after batch normalization (BN), the features of each band are compressed to a common feature dimension C by K fully connected (FC) layers. The features of all bands are then concatenated into a three-dimensional tensor and processed by the band and sequence modeling module, which uses GRUs to alternately model the time and band dimensions of the feature tensor. Finally, the processed features pass through the band-merge module to produce the spectral mask used as the output; multiplying this mask with the input spectrum yields the enhanced speech. To build a speaker-specific speech enhancement model, we add a speaker attention module after each band and sequence modeling block.
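To make the band-split step more concrete, here is a minimal PyTorch-style sketch of a band-split front end as described above; the band widths, feature dimension C, and layer choices are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BandSplit(nn.Module):
    """Split a complex spectrogram into K sub-bands and project each band
    to a shared feature dimension C (illustrative band layout)."""
    def __init__(self, band_widths, feature_dim=128):
        super().__init__()
        self.band_widths = band_widths
        # One BatchNorm + fully connected layer per band; each frame of a band
        # contributes 2*w features (real and imaginary parts of its w bins).
        self.norms = nn.ModuleList(nn.BatchNorm1d(2 * w) for w in band_widths)
        self.fcs = nn.ModuleList(nn.Linear(2 * w, feature_dim) for w in band_widths)

    def forward(self, spec):
        # spec: complex STFT, shape [batch, freq_bins, frames]
        outputs, start = [], 0
        for norm, fc, w in zip(self.norms, self.fcs, self.band_widths):
            band = spec[:, start:start + w, :]
            band = torch.cat([band.real, band.imag], dim=1)   # [B, 2w, T]
            band = norm(band)                                 # batch normalization
            band = fc(band.transpose(1, 2))                   # [B, T, C]
            outputs.append(band)
            start += w
        # Stacked tensor [B, K, T, C] for the band/sequence modeling module,
        # which runs GRUs alternately over the T (time) and K (band) axes.
        return torch.stack(outputs, dim=1)

# Example: a 257-bin spectrum split into four illustrative bands
split = BandSplit(band_widths=[32, 32, 64, 129], feature_dim=128)
feats = split(torch.randn(1, 257, 100, dtype=torch.complex64))  # -> [1, 4, 100, 128]
```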
Speaker Attention Mechanism Module (SAM)
The structure of the Speaker Attentive Module (SAM) is shown in the figure above. The core idea is to use the speaker embedding vector e as an attractor for the intermediate features of the speech enhancement model: the correlation s between e and the intermediate features at all time frames and frequency bands is computed and called the attention value. This attention value is then used to scale and regularize the intermediate features h. The specific formulas are as follows:
First, e and h are transformed into k and q through a fully connected layer and a convolution, respectively:
Then k and q are multiplied to obtain the attention value:
Finally, the original features are scaled by this attention value:
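Since the original formulas appear as images in the source article, the following is only a hedged sketch of the scaling mechanism SAM describes; the fully connected and 1×1-convolution projections, the sigmoid, and the scaling by √C are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpeakerAttentiveModule(nn.Module):
    """Sketch of a speaker-attention block: a speaker embedding e acts as an
    attractor; its correlation with intermediate features h (over all time
    frames and bands) scales the features. Layer choices are illustrative."""
    def __init__(self, feature_dim=128, embed_dim=256):
        super().__init__()
        self.key = nn.Linear(embed_dim, feature_dim)         # e -> k (fully connected)
        self.query = nn.Conv1d(feature_dim, feature_dim, 1)  # h -> q (1x1 convolution)

    def forward(self, h, e):
        # h: intermediate features [B, C, N] (N = time * band positions)
        # e: speaker embedding     [B, E]
        k = self.key(e).unsqueeze(-1)                  # [B, C, 1]
        q = self.query(h)                              # [B, C, N]
        s = torch.sigmoid((q * k).sum(1, keepdim=True)
                          / q.shape[1] ** 0.5)         # attention values in (0, 1)
        return h * s                                   # scale the original features

sam = SpeakerAttentiveModule()
out = sam(torch.randn(2, 128, 500), torch.randn(2, 256))  # same shape as h
```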
Model training data
For model training data, we used the speaker-specific speech enhancement track of the 5th DNS Challenge together with the high-quality speech of DiDiSpeech; after data cleaning we obtained about 3,500 clean speech recordings. For cleaning, we used a pre-trained ECAPA-TDNN [1] speaker recognition model to remove residual interfering-speaker speech from the data, and a pre-trained model that won first place in the 4th DNS Challenge to remove residual noise. During training we generated more than 100,000 four-second utterances, added reverberation to these audios to simulate different channels, and randomly mixed them with noise and interfering voices, covering four interference scenarios: a single noise, two noises, noise plus interfering speech, and interfering speaker only. The levels of the noisy speech and the target speech were also randomly scaled to simulate inputs of different loudness.
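A minimal sketch of this kind of mixture simulation is shown below; the SNR and level ranges and the way the four scenarios are sampled are illustrative assumptions, not the exact training recipe.

```python
import numpy as np

def scale_to_snr(reference, interferer, snr_db):
    """Return the interferer scaled so the reference/interferer energy ratio is snr_db."""
    gain = np.sqrt(np.sum(reference ** 2)
                   / (np.sum(interferer ** 2) * 10 ** (snr_db / 10) + 1e-8))
    return gain * interferer

def simulate_example(target, noises, interferers, rng):
    """Build one noisy/clean training pair for one of the four interference scenarios.
    All clips are assumed to have the same length (e.g. 4 s)."""
    scenario = rng.choice(["one_noise", "two_noises", "noise_plus_speech", "speech_only"])
    mix = target.copy()
    n_noise = {"one_noise": 1, "two_noises": 2, "noise_plus_speech": 1, "speech_only": 0}[scenario]
    for _ in range(n_noise):
        noise = noises[rng.integers(len(noises))]
        mix = mix + scale_to_snr(target, noise, rng.uniform(-5, 20))
    if scenario in ("noise_plus_speech", "speech_only"):
        interf = interferers[rng.integers(len(interferers))]
        mix = mix + scale_to_snr(target, interf, rng.uniform(-5, 20))
    # Randomly rescale both signals to simulate inputs of different loudness.
    gain = 10 ** (rng.uniform(-25, 0) / 20)
    return gain * mix, gain * target
```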
"Technical Solution for Integrating Specific Speaker Extraction and Echo Cancellation"
Paper address:
https://www.php.cn/link/7c7077ca5231fd6ad758b9d49a2a1eeb
Echo cancellation in loudspeaker (hands-free) scenarios has always been an extremely complex and crucial problem. To extract a high-quality, clean near-end speech signal, Volcano Engine proposes a lightweight echo cancellation system that combines signal processing and deep learning. On top of personalized deep noise suppression (pDNS), we further built a personalized acoustic echo cancellation (pAEC) system, which consists of a pre-processing module based on digital signal processing, a two-stage model based on deep neural networks, and a speaker-specific speech extraction module based on BSRNN and SAM.
Overall framework of speaker-specific echo cancellation
Linear echo cancellation pre-processing module based on digital signal processing
The pre-processing module mainly includes two parts, time delay compensation (TDC) and linear echo cancellation (LAEC), both of which operate on sub-band features.
Sub-band signal-processing-based linear echo cancellation framework
Delay compensation
TDC is based on sub-band cross-correlation: a delay is first estimated in each sub-band separately, and a voting method then determines the final time delay.
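A minimal sketch of sub-band cross-correlation delay estimation with voting might look like the following; using magnitude envelopes and a simple arg-max over lags is an assumption for illustration.

```python
import numpy as np

def estimate_delay(mic_sub, ref_sub, max_delay):
    """Per-sub-band delay estimation via cross-correlation, then voting.

    mic_sub, ref_sub: [frames, bands] sub-band magnitude sequences.
    Returns one delay (in frames) per sub-band plus the voted final delay."""
    n_frames, n_bands = mic_sub.shape
    delays = np.zeros(n_bands, dtype=int)
    for b in range(n_bands):
        corr = [np.dot(mic_sub[d:, b], ref_sub[:n_frames - d, b])
                for d in range(max_delay + 1)]       # correlation at each candidate lag
        delays[b] = int(np.argmax(corr))
    # Voting: the most frequent per-band delay becomes the final estimate.
    final = int(np.bincount(delays).argmax())
    return delays, final
```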
Linear Echo Cancellation
LAEC is an NLMS-based sub-band adaptive filtering method consisting of two filters: a pre-filter and a post-filter. The post-filter adaptively updates its parameters with a dynamic step size, while the pre-filter serves as a stable backup of the post-filter. The error signal that is finally used is decided by comparing the residual energies output by the pre-filter and the post-filter.
LAEC processing flow chart
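The core of such a scheme, the per-sub-band NLMS update and the choice between pre-filter and post-filter outputs, can be sketched as follows; the step-size handling and the selection rule are simplified assumptions.

```python
import numpy as np

def nlms_step(w, x_buf, mic, mu, eps=1e-8):
    """One NLMS update for a single sub-band: w is the adaptive filter, x_buf the
    most recent reference (far-end) samples, mic the microphone sample."""
    echo_hat = np.dot(w, x_buf)                      # estimated echo
    err = mic - echo_hat                             # residual after echo removal
    w = w + mu * err * x_buf / (np.dot(x_buf, x_buf) + eps)
    return w, err

def laec_select(pre_err, post_err):
    """Pick the error signal with the lower residual energy, as in the
    pre-/post-filter decision described above (illustrative rule)."""
    return pre_err if np.sum(pre_err ** 2) < np.sum(post_err ** 2) else post_err
```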
Two-stage model based on a convolutional recurrent network (CRN)
We propose decoupling the pAEC task into two sub-tasks, echo suppression and specific speaker extraction, to reduce the modeling burden. The post-processing network therefore consists of two neural modules: a lightweight CRN-based module for preliminary echo cancellation and noise suppression, and a pDNS-based post-processing module for better reconstruction of the near-end speech signal.
The first stage: CRN-based lightweight module
The lightweight CRN-based module consists of a band compression module, an encoder, two dual-path GRUs, a decoder, and a band decomposition module. We also introduce a voice activity detection (VAD) module for multi-task learning, which helps improve the model's perception of near-end speech. The CRN takes the compressed magnitude spectrum as input and outputs a preliminary complex ideal ratio mask (cIRM) of the target signal together with a near-end VAD probability.
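For illustration, applying a predicted cIRM to the input spectrum can be sketched as below; compressing the mask with tanh is a common choice and an assumption here, not necessarily what the paper does.

```python
import torch

def apply_cirm(spec, mask_real, mask_imag):
    """Apply a complex ratio mask to a complex STFT: compress the predicted
    real/imaginary mask components, then multiply with the input spectrum."""
    mask = torch.complex(torch.tanh(mask_real), torch.tanh(mask_imag))
    return spec * mask
```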
Second stage: post-processing module based on pDNS
The pDNS module in this stage includes the band-split recurrent neural network (BSRNN) and the speaker attention module (SAM) introduced above, and is cascaded after the lightweight CRN module. Since our pDNS system already performs well on the speaker-specific speech enhancement task, we use pre-trained pDNS parameters to initialize the second stage of the model, which further processes the output of the first stage.
Cascade system training optimization loss function
We train the two-stage model with a cascade optimization scheme so that the first stage predicts the near-end speech and the second stage predicts the near-end speech of the specific speaker. We also include a near-end speaker voice activity detection penalty to strengthen the model's ability to recognize near-end speech. The loss function is defined as follows:
where the first two terms compare the STFT features predicted by the first and second stages with the STFT features of the near-end speech and of the near-end specific speaker's speech, respectively, and the last term compares the predicted and target VAD states.
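Since the original formula is given as an image in the source article, the following LaTeX is only a hedged reconstruction of such a cascaded loss from the description above; the spectral distance, the VAD loss term, and the weight λ are assumptions.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{spec}}\big(\hat{S}_{1},\, S_{\mathrm{near}}\big)
\;+\; \mathcal{L}_{\mathrm{spec}}\big(\hat{S}_{2},\, S_{\mathrm{near}}^{\mathrm{spk}}\big)
\;+\; \lambda\,\mathrm{BCE}\big(\hat{v},\, v\big)
```

Here $\hat{S}_{1}$ and $\hat{S}_{2}$ denote the STFT features predicted by the two stages, $S_{\mathrm{near}}$ and $S_{\mathrm{near}}^{\mathrm{spk}}$ the STFT features of the near-end speech and of the near-end specific speaker, and $\hat{v}$, $v$ the predicted and target VAD states.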
Model training data
To let the echo cancellation system handle echoes collected from multiple devices with varied reverberation and noise, we obtained 2,000 hours of training data by mixing echoes with clean speech. The echo data uses the far-end single-talk data of the AEC Challenge 2023, the clean speech comes from the DNS Challenge 2023 and LibriSpeech, and the RIR set used to simulate near-end reverberation comes from the DNS Challenge. Because the echoes in the AEC Challenge 2023 far-end single-talk data contain a small amount of noise, using them directly as echo can easily distort the near-end speech. To alleviate this, we adopted a simple but effective data cleaning strategy: a pre-trained AEC model processes the far-end single-talk data, clips with high residual energy are identified as noisy, and the cleaning process shown below is iterated repeatedly.
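The iterative cleaning loop can be sketched as follows; the residual-energy threshold, the iteration count, the `aec_model` interface, and the idea of re-training between passes are illustrative assumptions.

```python
import numpy as np

def clean_echo_set(clips, aec_model, energy_ratio_thresh=0.1, n_iters=3):
    """Iteratively filter far-end single-talk clips: clips whose AEC residual
    keeps a large fraction of the input energy are treated as noisy and dropped.
    `clips` is a list of (mic, ref) waveform pairs; `aec_model(mic, ref)` is an
    assumed callable returning the residual signal."""
    kept = list(clips)
    for _ in range(n_iters):
        survivors = []
        for mic, ref in kept:
            residual = aec_model(mic, ref)                    # residual after echo removal
            ratio = np.sum(residual ** 2) / (np.sum(mic ** 2) + 1e-8)
            if ratio < energy_ratio_thresh:                   # low residual -> clean echo clip
                survivors.append((mic, ref))
        kept = survivors
        # In practice the AEC model could be re-trained on `kept` between iterations.
    return kept
```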
Effect of the cascade optimization scheme
This speech enhancement system, which fuses echo cancellation and specific speaker extraction, was evaluated on the ICASSP 2023 AEC Challenge blind test set [2], where its advantages in both subjective and objective metrics were verified: it achieved a subjective opinion score of 4.44 (Subjective-MOS) and a speech recognition accuracy of 82.2% (WAcc).
##"Multi-channel speech enhancement based on Fourier convolution attention mechanism"
Paper address:
https://www.php.cn/link/373cb8cd58cad5f1309b31c56e2d5a83
Beam weight estimation based on deep learning is one of the mainstream approaches to multi-channel speech enhancement: a network estimates beam weights that are used to filter the multi-channel signal and obtain clean speech. In beam weight estimation, spectral and spatial information play a role similar to that of the spatial covariance matrix in traditional beamforming algorithms. However, many existing neural beamformers cannot estimate beam weights optimally. To address this, Volcano Engine proposes a Fourier Convolutional Attention Encoder (FCAE), which provides a global receptive field along the frequency axis and strengthens the extraction of contextual features on that axis. We also propose an FCAE-based Convolutional Recurrent Encoder-Decoder (CRED) structure to capture spectral context and spatial information from the input features.
Model framework structure
Beam weight estimation network
The CRED structure we use is shown in the figure above. FCAE is the Fourier convolutional attention encoder, and FCAD is the decoder symmetric to FCAE. The recurrent module uses a Deep Feedforward Sequential Memory Network (DFSMN) to model the temporal dependencies of the sequence, reducing model size without hurting performance. The skip connections use serial channel attention and spatial attention modules to further extract cross-channel spatial information and to connect deep and shallow features, which eases the flow of information through the network.
FCAE Structure
The structure of the Fourier Convolutional Attention Encoder (FCAE) is shown in the figure above. Inspired by the Fourier convolution operator [3], the module exploits the fact that updating any single point in the discrete Fourier transform domain affects the entire signal in the original domain: by applying an FFT along the frequency axis of the features, it obtains a global receptive field on the frequency axis and thereby strengthens the extraction of contextual features along that axis. In addition, we introduce a spatial attention module and a channel attention module to further enhance the expressive power of the convolutions, extract useful joint spectral-spatial information, and help the network learn features that distinguish clean speech from noise. With only 0.74M parameters, the network achieves excellent multi-channel speech enhancement performance.
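A hedged sketch of a Fourier convolution applied along the frequency axis is shown below; the 1×1 transform-domain convolution and the channel sizes are assumptions, and the attention modules are omitted for brevity.

```python
import torch
import torch.nn as nn

class FrequencyFourierConv(nn.Module):
    """Sketch of a Fourier convolution on the frequency axis: an FFT along the
    frequency dimension, a 1x1 convolution on the real/imaginary parts in the
    transform domain, and an inverse FFT. Because every transform-domain point
    depends on all frequency bins, the operator has a global receptive field
    along frequency. Channel sizes are illustrative."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):
        # x: [batch, channels, time, freq]
        n_freq = x.shape[-1]
        z = torch.fft.rfft(x, dim=-1)                  # FFT along the frequency axis
        z = torch.cat([z.real, z.imag], dim=1)         # [B, 2C, T, F//2+1]
        z = torch.relu(self.conv(z))                   # transform-domain convolution
        zr, zi = torch.chunk(z, 2, dim=1)
        z = torch.complex(zr, zi)
        return torch.fft.irfft(z, n=n_freq, dim=-1)    # back to the frequency axis
```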
Model training data
For the data set, we used the open-source data provided by the ConferencingSpeech 2021 challenge. The clean speech data includes AISHELL-1, AISHELL-3, VCTK, and LibriSpeech (train-clean-360); data with a signal-to-noise ratio above 15 dB were selected to generate the multi-channel mixtures, and MUSAN and AudioSet were used as noise data sets. To simulate real multi-room reverberation, the open-source data were convolved with more than 5,000 room impulse responses generated by varying room size, reverberation time, and the positions of sound and noise sources, yielding more than 60,000 multi-channel training samples.
"Sound quality restoration system based on two-stage neural network model"
Paper address:
https://www.php.cn/link/e614f646836aaed9f89ce58e837e2310
Beyond enhancing the speech of specific speakers, cancelling echoes, and enhancing multi-channel audio, Volcano Engine has also made some attempts at sound quality restoration. During real-time communication, different forms of distortion degrade the speech signal, reducing its clarity and intelligibility. Volcano Engine proposes a two-stage model that uses a staged divide-and-conquer strategy to repair the various distortions that affect speech quality.
Model framework structure
The figure below shows the overall framework of the two-stage model. The first-stage model mainly repairs missing parts of the spectrum, while the second-stage model mainly suppresses noise, reverberation, and possible artifacts introduced by the first stage.
First-stage model: Repairing Net
The first-stage model adopts the Deep Complex Convolution Recurrent Network (DCCRN) [4] architecture, which includes three parts: an encoder, a temporal modeling module, and a decoder. Inspired by image inpainting, we introduce gated complex-valued convolutions and gated complex-valued transposed convolutions to replace the complex-valued convolutions and transposed convolutions in the encoder and decoder. To further improve the naturalness of the repaired audio, we introduce a Multi-Period Discriminator and a Multi-Scale Discriminator for auxiliary training.
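The gating idea can be sketched as follows; a real-valued version is shown for brevity, whereas the paper uses complex-valued convolutions, and the layer hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution sketch: one branch produces features, a parallel branch
    produces a sigmoid gate that decides which features pass through."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3), stride=(1, 2), padding=(1, 1)):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel, stride, padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel, stride, padding)

    def forward(self, x):
        # Element-wise gating of the feature branch by the gate branch.
        return self.feature(x) * torch.sigmoid(self.gate(x))
```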
Second stage model: Denoising Net
The second-stage model adopts the S-DCCRN architecture, which includes an encoder, two lightweight DCCRN sub-modules, and a decoder; the two lightweight DCCRN sub-modules perform sub-band and full-band modeling, respectively. To improve the model's time-domain modeling ability, we replace the LSTM in the DCCRN sub-modules with a Squeezed Temporal Convolutional Module (STCM).
Model training data
The clean audio, noise, and reverberation used to train the restoration models all come from the 2023 DNS Challenge data set, with 750 hours of clean audio and 170 hours of noise in total. For the data augmentation of the first-stage model, we convolve full-band audio with randomly generated filters, zero out audio samples in random 20 ms windows, and randomly downsample the audio to simulate spectrum loss; in addition, the magnitude spectrum and the waveform samples are multiplied by random scales. For the second-stage data augmentation, we convolve the data produced in the first stage with various room impulse responses to obtain audio with different levels of reverberation.
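A minimal sketch of the first-stage spectrum-loss augmentation is shown below; the filter length, the number of zeroed windows, and the downsampling rates are assumptions for illustration.

```python
import numpy as np
import scipy.signal

def simulate_spectrum_loss(audio, sr=48000, rng=np.random.default_rng()):
    """Simulate bandwidth/spectrum loss for the repair net (illustrative sketch)."""
    # 1) Convolve with a short random filter to distort the spectral envelope.
    audio = scipy.signal.fftconvolve(audio, rng.standard_normal(16) * 0.1, mode="same")
    # 2) Zero out random 20 ms windows to simulate dropped samples.
    win = int(0.02 * sr)
    for _ in range(rng.integers(1, 5)):
        start = rng.integers(0, max(1, len(audio) - win))
        audio[start:start + win] = 0.0
    # 3) Randomly downsample and upsample back to remove high-frequency content.
    target_sr = int(rng.choice([8000, 16000, 24000]))
    low = scipy.signal.resample_poly(audio, target_sr, sr)
    audio = scipy.signal.resample_poly(low, sr, target_sr)
    return audio
```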
Audio processing effect
In the ICASSP 2023 AEC Challenge, the Volcano Engine RTC audio team won first place on both the non-personalized AEC (general echo cancellation) and personalized AEC (speaker-specific echo cancellation) tracks. Its results on double-talk echo suppression, double-talk near-end speech protection, near-end single-talk background noise suppression, overall subjective audio quality, and final speech recognition accuracy were significantly better than those of the other participating teams, reaching an internationally leading level.
Let's take a look at the speech enhancement effects of Volcano Engine RTC in different scenarios after applying the above technical solutions.
Echo cancellation under different signal-to-echo ratio scenarios
The following two examples show the comparative effects of the echo cancellation algorithm before and after processing in different signal-to-echo energy ratio scenarios.
Medium signal-to-echo ratio scenario
Ultra-low signal-to-echo ratio scenarios pose the greatest challenge to echo cancellation: the high-energy echo must be removed effectively while the weak target speech is preserved as much as possible. In the sample below, the non-target speaker's voice (the echo) almost completely drowns out the target (female) speaker's voice, making it difficult to identify.
Ultra-low signal-to-echo ratio scenario
Speaker extraction under different background interference speaker scenarios
The following two examples show the effects of the specific speaker extraction algorithm before and after processing under noise and background-speaker interference scenarios.
In the following sample, the specific speaker's voice is mixed with both doorbell-like noise and background interfering voices. AI noise reduction alone can only remove the doorbell noise, so speaker-specific extraction is also needed to eliminate the interfering voices.
Target speaker with background interfering voices and noise
When the voiceprint of the target speaker is very close to that of the background interfering voice, the challenge for the specific speaker extraction algorithm is greater, which tests its robustness. In the following sample, the target speaker and the background interfering voice are two similar female voices.
Target female voice mixed with interfering female voice
Summary and Outlook
The above introduces some of the deep-learning-based solutions and results of the Volcano Engine RTC audio team in speaker-specific noise reduction, echo cancellation, and multi-channel speech enhancement. Many challenges remain, such as how to adapt voice noise reduction to new noise scenarios, how to repair multiple types of distortion in broader sound quality restoration settings, and how to run lightweight, low-complexity models on a wide range of devices. These challenges will be the focus of our next research directions.