Artificial Intelligence: Speech Recognition Technology
Today I'd like to introduce some basics of speech recognition; I hope you find it helpful!
Speech is the sound humans produce with their vocal organs; it carries meaning and is used for communication.
How speech is stored in a computer: speech is stored as waveform files. Changes in the speech are reflected in the waveform, from which parameters such as sound intensity and duration can be obtained.
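As a minimal sketch of the idea above, the snippet below uses only Python's standard `wave` module: it synthesizes one second of a 440 Hz tone (a stand-in for recorded speech), stores it as a WAV waveform, reads it back, and recovers duration (sound length) and RMS amplitude (a proxy for sound intensity). The tone and parameters are made up for illustration.

```python
import io
import math
import struct
import wave

# Synthesize one second of a 440 Hz tone and store it as a 16-bit
# mono WAV waveform in memory.
SAMPLE_RATE = 16000
samples = [int(20000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
           for n in range(SAMPLE_RATE)]

buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 16-bit PCM
    w.setframerate(SAMPLE_RATE)
    w.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Read the waveform back and derive parameter information from it.
buf.seek(0)
with wave.open(buf, "rb") as w:
    frames = w.getnframes()
    rate = w.getframerate()
    data = struct.unpack("<%dh" % frames, w.readframes(frames))

duration = frames / rate                               # sound length in seconds
rms = math.sqrt(sum(x * x for x in data) / len(data))  # sound intensity (RMS)
print(f"duration={duration:.2f}s rms={rms:.0f}")
```

For a pure sine wave the RMS is amplitude divided by the square root of 2, so here it comes out near 14 142.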
Spectral parameters: the Fourier spectrum and Mel-frequency cepstral coefficients (MFCC), used mainly to capture differences in speech content and timbre so that the speech can be recognized.
Speech recognition, simply put, is the process of automatically converting speech content into text. It is a core technology for human-machine interaction.
Involved fields: acoustics, artificial intelligence, digital signal processing, psychology, etc.
Input of speech recognition: an audio signal sequence (e.g., the waveform of a sound file).
Output of speech recognition: a text sequence.
Speech recognition requires four parts: feature extraction, an acoustic model, a language model, and the decoding and search algorithm.
Feature extraction: extracts the signal to be analyzed from the raw signal. This stage mainly includes pre-processing operations such as amplitude normalization, frequency response correction, framing, windowing, and start/end point detection; it provides the feature vectors required by the acoustic model.
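Framing and windowing, two of the pre-processing steps named above, can be sketched in a few lines of NumPy. The frame length and hop size below (400 and 160 samples, i.e., 25 ms frames with a 10 ms hop at 16 kHz) are typical values, not something the article prescribes.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames and apply a
    Hamming window to each frame (25 ms frames, 10 ms hop at 16 kHz)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n)])
    return frames * np.hamming(frame_len)   # windowing reduces edge leakage

sig = np.random.randn(16000)    # one second of fake audio
frames = frame_signal(sig)
print(frames.shape)             # → (98, 400)
```

Each row of the result is one windowed frame, ready for spectral analysis (e.g., an FFT) in the feature-extraction pipeline.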
Acoustic model: analyzes speech parameters (formant frequencies, amplitude, etc.) as well as the linear prediction coefficients of the speech.
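The "linear prediction parameters" mentioned above can be estimated with the classic autocorrelation method and Levinson-Durbin recursion. The sketch below is a generic textbook implementation, not the article's specific method; it checks itself by recovering the coefficients of a synthetic second-order autoregressive signal.

```python
import numpy as np

def lpc(signal, order=8):
    """Estimate linear prediction coefficients via the autocorrelation
    method and the Levinson-Durbin recursion.
    Returns a = [1, a1, ..., a_order] with x[n] ~ -sum(a[k] * x[n-k])."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    a = [1.0]
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        err *= 1.0 - k * k                  # prediction error shrinks
    return np.array(a)

# Synthetic AR(2) signal: x[n] = 1.3 x[n-1] - 0.7 x[n-2] + noise,
# so the true LPC coefficients are [1, -1.3, 0.7].
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
x = np.zeros(4000)
for n in range(2, 4000):
    x[n] = 1.3 * x[n - 1] - 0.7 * x[n - 2] + e[n]

coeffs = lpc(x, order=2)
print(coeffs)   # close to [1, -1.3, 0.7]
```

In a real recognizer, such coefficients (or features derived from them) describe the vocal-tract shape frame by frame.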
Language model: based on linguistic theory, computes the probability of the possible word sequences for a speech segment.
Decoding and search algorithm: builds a search space from the acoustic model, the pronunciation dictionary, and the language model, finds the most likely path through it, and outputs the final text once decoding completes.
A complete speech recognition system includes: preprocessing, feature extraction, acoustic model training, language model training, and speech decoder.
4.1 Preprocessing
Processes the raw input signal: filters out background noise and unimportant information, finds the start and end points of the speech (endpoint detection), splits the signal into frames, and boosts (pre-emphasizes) the high-frequency part of the signal.
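Two of these steps can be sketched directly: pre-emphasis is a one-line high-pass filter, and endpoint detection can be approximated by thresholding frame energy. The filter coefficient (0.97) and energy threshold below are conventional illustrative values, not ones given by the article.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Boost the high-frequency part: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def endpoints(signal, frame=160, threshold=0.1):
    """Crude energy-based endpoint detection: return sample indices of
    the first and last frame whose mean energy exceeds the threshold."""
    n = len(signal) // frame
    energy = np.array([np.mean(signal[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n)])
    active = np.where(energy > threshold)[0]
    if len(active) == 0:
        return None
    return int(active[0] * frame), int((active[-1] + 1) * frame)

# Silence, then a loud tone, then silence again.
sig = np.concatenate([np.zeros(1600),
                      np.sin(np.linspace(0, 100, 3200)),
                      np.zeros(1600)])
emphasized = pre_emphasis(sig)      # flattens the spectral tilt
print(endpoints(sig))               # → (1600, 4800)
```

Real systems use more robust voice-activity detection (zero-crossing rate, adaptive thresholds, or learned models), but the idea of trimming leading and trailing silence is the same.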
4.2 Feature Extraction
The most commonly used features are Mel-frequency cepstral coefficients (MFCC), because they offer good noise immunity and robustness.
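The "Mel" in MFCC refers to a perceptual frequency scale. A full MFCC pipeline (FFT, mel filter bank, log, DCT) is beyond this sketch, but the mel mapping itself is one formula; the filter-bank edges below, spaced uniformly in mel over 0-8000 Hz, show how the scale packs filters more densely at low frequencies, where human hearing discriminates better.

```python
import math

def hz_to_mel(f):
    """Map a frequency in Hz onto the perceptual mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 12 filter-bank edge frequencies, uniformly spaced in mel over 0-8000 Hz.
lo, hi = hz_to_mel(0), hz_to_mel(8000)
edges_mel = [lo + i * (hi - lo) / 11 for i in range(12)]
edges_hz = [round(mel_to_hz(m)) for m in edges_mel]
print(edges_hz)
```

Note how the gaps between successive edges in Hz grow toward the top of the range even though the mel spacing is constant.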
4.3 Acoustic model training
The acoustic model's parameters are trained on the feature parameters of the training speech corpus, so that input speech can be matched against the acoustic model during recognition to obtain corresponding results. Mainstream speech recognition systems generally model acoustics with hidden Markov models (HMMs).
4.4 Language model training
The language model is used to predict which word sequence is more likely to be correct.
4.5 Speech decoder
The decoder performs the recognition step itself: given the input speech signal, it builds a search space from the trained HMM acoustic model, the language model, and the pronunciation dictionary, then uses a search algorithm to find the most likely path through it, i.e., the most plausible string of words.
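The core search inside such a decoder is typically the Viterbi algorithm. The sketch below runs Viterbi over a deliberately tiny two-state HMM ("sil" vs. "speech") with made-up transition and emission probabilities; real decoders search over thousands of phone-level HMM states constrained by the pronunciation dictionary and language model.

```python
import math

states = ["sil", "speech"]
# Invented transition log-probabilities between the two states.
log_trans = {("sil", "sil"): math.log(0.7), ("sil", "speech"): math.log(0.3),
             ("speech", "sil"): math.log(0.2), ("speech", "speech"): math.log(0.8)}
# Invented emission log-probabilities (acoustic scores) for 4 frames.
log_emit = [{"sil": math.log(0.9), "speech": math.log(0.1)},
            {"sil": math.log(0.2), "speech": math.log(0.8)},
            {"sil": math.log(0.1), "speech": math.log(0.9)},
            {"sil": math.log(0.9), "speech": math.log(0.1)}]

def viterbi(log_emit, log_trans, states):
    """Find the most likely state path through the frames."""
    # delta[s]: best log-prob of any path ending in state s (uniform prior).
    delta = {s: math.log(0.5) + log_emit[0][s] for s in states}
    back = []
    for frame in log_emit[1:]:
        prev_delta, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev_delta[p] + log_trans[(p, s)])
            delta[s] = prev_delta[best] + log_trans[(best, s)] + frame[s]
            ptr[s] = best
        back.append(ptr)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(log_emit, log_trans, states))
# → ['sil', 'speech', 'speech', 'sil']
```

The search correctly labels the quiet first and last frames as silence and the loud middle frames as speech; scaling this idea up to word sequences is what the decoder's "search space" is for.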
5. Speech recognition usage scenarios
Speech recognition is widely used in daily life and is mainly divided into closed and open applications.
Closed application: mainly refers to the application of specific control instructions.
Typical examples are smart-home commands: controlling light switches, water heaters, temperature settings, or air conditioners by voice, which greatly enriches our daily life.
Open applications: a vendor provides speech recognition as a service, generally deployed in a public or private cloud, with an SDK that lets customers call the speech recognition service.
Common scenarios include input methods, real-time output of conference subtitles, video editing subtitle configuration, etc.