How to develop speech recognition
Using deep learning to give machines a genuine understanding of natural language has long been a focus of attention. Playing music without searching for it, turning on the lights without lifting a finger, an air conditioner that understands your voice... such scenes have appeared in many films and TV shows, and they capture the idea of the "smart life" in many people's minds. Against this backdrop, amid the surge in artificial intelligence development, natural language processing has become a battleground for major enterprises and research institutions.
The voice interaction market has drawn in Internet giants, well-known hardware companies, e-commerce platforms, traditional home appliance manufacturers, and a range of AI start-ups. In 2017 in particular, the popularity of voice interaction products at home and abroad, typified by smart speakers, greatly stimulated the application and development of voice interaction technology.
Recently, the most talked-about piece of smart home hardware has undoubtedly been the Xiaomi AI speaker. Once launched, it made a splash in the market: many media outlets called it "the speaker with the best interactive experience", "the conscience of the smart speaker industry", and "currently the most popular piece of smart hardware"... In this editor's opinion, the Xiaomi AI speaker is excellent, yes, but not to the extent the hype suggests. Judging from the experience of people around me, its speech recognition is not particularly outstanding and differs little from mainstream competitors currently on the market. Its biggest advantage lies in Xiaomi's ecosystem: through the speaker you can control Xiaomi desk lamps, robot vacuums, floor fans, and other compatible home devices. There is no doubt that this brings people one step closer to the smart life.
In automobiles and smart mobile devices, voice interaction has become very popular. While driving, people often cannot free their hands and should not operate their phones, so in-car voice has become a necessity and a standard feature of the Internet of Vehicles. In today's climate of intense hype around smart connectivity and driverless cars, a new car without some "black technology" like voice recognition hardly dares to show its face. Ford's SYNC system, an in-vehicle multimedia, communication, and entertainment system designed to work with mobile phones and digital media players, is a successful case of voice interaction in in-car systems and has been deployed across multiple Ford model lines. After Apple launched the Siri intelligent voice assistant with the iPhone 4S, Google launched the Google Now voice search and question-answering service in Android, Microsoft brought voice technology to Windows Phone, and Samsung followed in due course with Bixby.
Speech recognition technology also has a place in finance. Recently, China Construction Bank opened an automated branch in Shanghai's Huangpu District where robots serve customers. The robots are equipped with facial recognition software, can answer most customer questions, and can handle most of the routine business of an ordinary high-street bank; human assistance and other professional services are also available for personalized needs. Customers are received by robots that communicate through voice recognition, answer questions, and can complete most of what human staff can, including opening accounts, transferring money, and making investments.
In addition, in the new retail field, the application of intelligent voice technology is also constantly expanding. For example, on December 18, 2017, iFlytek and Red Star Macalline announced a strategic cooperation plan. In the future, the intelligent shopping guide robot "Meimei" developed by iFlytek will be launched in Red Star Macalline stores nationwide.
Beyond voice interaction, speech-to-text is another hot area of speech recognition. Early on it was a favorite of journalists, who used it to transcribe interviews and speeches and greatly improve their efficiency. Nowadays the feature is being embraced by ordinary users as well: it lets the elderly, and young people too lazy to type, dictate instead of typing.
Today, the influx of capital, policy support, and repeated market expansion have steadily matured voice technology, and the global voice market has entered a golden period of real-world deployment. According to industry statistics, the intelligent voice industry approached 6 billion yuan in 2016 and was expected to exceed 10 billion yuan in 2017, a year-on-year increase of about 69%.
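The cited figures are internally consistent, as a quick sanity check shows (values in billions of yuan, rounded as in the statistics above):

```python
# Sanity-check the cited market figures: roughly 6 billion yuan in 2016
# with ~69% year-on-year growth implies a 2017 scale above 10 billion.
size_2016 = 6.0            # billions of yuan, approximate, as cited
growth = 0.69              # year-on-year growth rate claimed for 2017
size_2017 = size_2016 * (1 + growth)
print(f"Implied 2017 market size: {size_2017:.2f} billion yuan")  # → 10.14
```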
In contrast to its proliferation across so many fields, the underlying technology of speech recognition has advanced rather slowly, and it consequently runs into many problems in practical applications.
Many companies now claim speech recognition accuracy of 97% or even 98%, but in real applications the results are often unsatisfactory. A telling example: the Chinese speech recognition system developed at the IBM T. J. Watson Research Center ranked first for three consecutive years in the evaluation sponsored by DARPA in the United States. When recognizing CCTV's "News Network" program, its error rate was under 5%, but on other content the gap was very large. In practice, the recognition rate is mainly affected by the following factors:
For Chinese speech recognition, dialects and accents reduce the recognition rate.
Strong noise in public places greatly degrades recognition. Even in a laboratory environment, typing on a keyboard or moving the microphone becomes background noise.
Interruptions: if a speaker pauses mid-sentence, the machine may fail to connect the context and recover a coherent meaning.
There is also the issue of spoken language, which involves both natural language understanding and acoustics. The ultimate goal of speech recognition technology is a "human-computer conversation" as natural as a "human-to-human conversation". But once users speak to the machine the way they speak to people, the non-standard grammar and irregular word order of spontaneous speech make semantic analysis and understanding difficult.
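Accuracy claims like the 97% and 98% figures above are conventionally measured as word error rate (WER): the edit distance between the recognized transcript and a reference transcript, divided by the reference length. A minimal sketch in Python (the example sentences are made up for illustration):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the classic Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a five-word reference: WER = 1/5.
print(word_error_rate("turn on the bedroom light",
                      "turn on the bedroom lights"))  # → 0.2
```

A "97% recognition rate" on clean read speech thus corresponds to a 3% WER, and the degradations listed above (accents, noise, interruptions) show up directly as extra substitutions, deletions, and insertions in this alignment.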
Some have pointed out that issues such as accents and new vocabulary can be addressed through data collection in real deployments: as the amount of data grows, these problems gradually diminish.
Other problems, such as interruptions, require various deep learning models, including DNNs, CNNs, and BLSTMs (bidirectional long short-term memory networks), along with new algorithms, to be solved step by step.
Applying a technology is usually an iterative process: go live first, then collect data from the actual scenario to evaluate and optimize the model and improve the user experience, repeating until the best results are reached. Other AI technologies are similar. Many adopters of AI today idealize its capabilities and expect immediate results upon introduction; when reality falls short, the gap leads to disappointment and abandonment. Intelligent voice technology has indeed reached the level of widespread application, but when actually deploying it we must fully understand the difficulties we may encounter and be prepared for a protracted battle.
It is predictable that in the next five to ten years speech recognition will be applied ever more widely and a variety of speech recognition products will appear on the market; people will also adapt their speaking habits to the various recognition systems. Building a speech recognition system comparable to a human is not possible in the short term and remains a great challenge for mankind; we can only advance step by step toward better systems. It is hard to predict when a system as capable as a human can be built, just as in the 1960s no one could have predicted the impact that today's VLSI technology would have on our society.
The above is the detailed content of How to develop speech recognition. For more information, please follow other related articles on the PHP Chinese website!