
In the next ten years, AI speech recognition will develop in these five directions

Forwarded by 王林 | 2023-04-11

Author | Migüel Jetté

Compiled by | bluemin

Editor | Chen Caixian

In the past two years, Automatic Speech Recognition (ASR) has made important strides in commercial use. One indicator is the successful launch of multiple enterprise-level ASR models built entirely on neural networks, such as those behind Alexa, Rev, AssemblyAI, and ASAPP. In 2016, Microsoft Research published an article announcing that its model had reached human-level performance (as measured by word error rate) on the 25-year-old "Switchboard" dataset. ASR accuracy continues to improve, reaching human-level performance on more datasets and use cases.

Image source: Awni Hannun's blog post "Speech Recognition is not Solved"
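
For reference, the word error rate (WER) mentioned above is simply the word-level edit distance between the ASR hypothesis and a reference transcript, divided by the number of reference words. Below is a minimal illustrative sketch of the computation; the function and the example sentence pair are invented for illustration and are not from the original article:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed here with a word-level Levenshtein (edit distance) table."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One deleted word out of six reference words -> WER of about 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```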

With ASR recognition accuracy greatly improved, application scenarios are becoming more and more common. We believe ASR has not yet reached the peak of its commercial use, and much of the research and market potential in this field remains to be explored. We predict that AI speech research and commercial systems will focus on the following five areas over the next ten years:

1. Multilingual ASR models

“Over the next decade, we will deploy truly multilingual models in production, enabling developers to build applications that understand anyone in any language, truly unleashing the power of speech recognition to the world.”

Source: the paper "Unsupervised cross-lingual representation learning for speech recognition" by Alexis Conneau et al., 2020

Today's commercial ASR models are trained mainly on English datasets and therefore achieve higher accuracy on English input. Academia and industry have long had a stronger interest in English because of data availability and market demand. Although recognition accuracy for commercially popular languages such as French, Spanish, Portuguese, and German is also reasonable, there is clearly a long tail of languages with limited training data and relatively low ASR output quality.

In addition, most commercial systems support a single language at a time, which does not fit the multilingual scenarios common in many societies. Multilingualism can take the form of back-to-back languages, such as media programming in bilingual countries. Amazon has made great strides on this problem by recently launching a product that integrates language identification (LID) with ASR. In contrast, translanguaging (also known as code-switching) occurs when an individual combines words and grammar from two languages in the same sentence; this is an area where academia continues to make interesting progress.

Just as the field of natural language processing has adopted a multilingual approach, we will see ASR follow suit in the next decade. As we learn how to leverage emerging end-to-end technologies, we will train large-scale multilingual models that transfer learning across multiple languages. Meta's XLS-R is a good example: in one demo, users could speak any of 21 languages without specifying which, and the model would translate the speech into English. By understanding and exploiting the similarities between languages, these smarter ASR systems will deliver high-quality recognition for low-resource and mixed-language use cases and enable commercial-grade applications.
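
To make this concrete, the short sketch below shows how a developer can already query a single multilingual checkpoint today without declaring the input language. The Hugging Face transformers pipeline is used purely for illustration; the checkpoint name and audio path are illustrative choices, not something prescribed by this article:

```python
from transformers import pipeline

# Illustrative multilingual checkpoint; any ASR model trained on many
# languages and published on the Hugging Face hub is used the same way.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# No language tag is passed; the model infers the language from the audio.
result = asr("meeting_recording.wav")  # placeholder audio file path
print(result["text"])
```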

2. Rich standardized output objects

“In the next ten years, we believe that commercial ASR systems will output richer transcription objects containing far more than simple words. Furthermore, we anticipate that this richer output will be endorsed by standards bodies like the W3C, so that all APIs return similarly structured output. This will further unlock the potential of speech applications for everyone in the world.”

" Although the National Institute of Standards and Technology (NIST) has a long tradition of exploring "rich transcription," it has only scratched the surface in incorporating it into a standardized and scalable format for ASR output. The concept of rich transcription initially involved capitalization, punctuation, and diaryization, but to some extent expanded to speaker roles and a range of nonverbal speech events. Anticipated innovations include transcribing overlapping speech from different speakers, varying emotions and other paralinguistic features, as well as a range of non-linguistic and even non-human speech scenes and events, as well as transcribing text-based or linguistic diversity. Tanaka et al. depict a scenario in which a user may wish to choose among transcription options of varying richness, and obviously the amount and nature of the additional information we predict is specifiable, depending on the downstream application.

Traditional ASR systems generate a lattice of multiple hypotheses while recognizing spoken words, and these hypotheses have proven highly beneficial for human-assisted transcription, spoken dialogue systems, and information retrieval. Including n-best information in a rich output format will encourage more users to adopt ASR systems and improve the user experience. While no standard currently exists for structuring or storing the additional information generated, or potentially generated, during speech decoding, CallMiner's Open Voice Transcription Standard (OVTS) is a solid step in this direction, making it easy for enterprises to explore and adopt multiple ASR vendors.

We predict that future ASR systems will produce richer output in standard formats, supporting more powerful downstream applications. For example, an ASR system might output the full lattice of hypotheses, and an application could use this additional data to make intelligent automated suggestions when the transcript is edited. Similarly, ASR transcriptions that include extra metadata such as detected regional dialects, accents, ambient noise, or emotion can enable more powerful search applications.
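
As a rough sketch of what such a standardized rich transcription object might look like, the structure below combines word timings, confidences, n-best alternatives, speaker labels, and a paralinguistic tag. All field names here are hypothetical; no current standard prescribes them:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class WordToken:
    text: str            # normalized written form, e.g. "$5"
    start: float         # start time in seconds
    end: float           # end time in seconds
    confidence: float    # decoder confidence in [0, 1]
    alternatives: List[str] = field(default_factory=list)  # n-best hypotheses

@dataclass
class Segment:
    speaker: str                                     # diarization label, e.g. "spk_0"
    emotion: str                                     # paralinguistic tag, e.g. "neutral"
    words: List[WordToken] = field(default_factory=list)

@dataclass
class RichTranscript:
    language: str                                    # detected language, e.g. "en-US"
    segments: List[Segment] = field(default_factory=list)

transcript = RichTranscript(
    language="en-US",
    segments=[Segment(
        speaker="spk_0",
        emotion="neutral",
        words=[WordToken("Hello", 0.00, 0.42, 0.97, alternatives=["Hallo"])],
    )],
)
```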

3. Large-scale ASR for everyone

“In this decade, large-scale ASR (i.e., private, affordable, reliable, and fast) will become part of everyone's daily lives. These systems will be able to search videos, index all the media content we engage with, and make every video accessible to hearing-impaired consumers around the world. ASR will be the key to making every piece of audio and video accessible and actionable.”

We probably all use audio and video software heavily: podcasts, social media streams, online videos, live group chats, Zoom meetings, and more. Yet very little of this content is actually transcribed. Today, content transcription is already one of the largest markets for ASR APIs, and it will grow exponentially over the next decade given ASR's improving accuracy and affordability. That said, ASR transcription is currently used only for specific applications (broadcast video, certain conferences and podcasts, etc.). As a result, many people cannot access this media content and struggle to find relevant information after a broadcast or event.

In the future, this will change. As Matt Thompson predicted in 2010, at some point ASR will become cheap and widespread enough that we will experience what he called the "Speakularity." We predict that nearly all audio and video content will eventually be transcribed and made instantly accessible, storable, and searchable at scale. But ASR's development will not stop there; we also expect this content to become actionable. Every piece of audio or video we consume or engage with should provide additional context, such as automatically generated insights from a podcast or meeting, or automatic summaries of key moments in a video, and we expect NLP systems to handle such processing routinely.
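
As a toy illustration of what "searchable at scale" can mean once audio and video carry transcripts with timestamps, a simple inverted index is enough to jump from a query word to the exact moments it was spoken. The data below is invented for the example; a production system would use a real search engine:

```python
from collections import defaultdict

# (media_id, start_seconds, transcript text) -- invented sample segments.
segments = [
    ("podcast_001", 12.4, "we trained a multilingual speech model"),
    ("meeting_007", 305.0, "the speech model missed the speaker change"),
]

# Map each word to the (media id, timestamp) locations where it occurs.
index = defaultdict(list)
for media_id, start, text in segments:
    for word in text.lower().split():
        index[word].append((media_id, start))

def search(word: str):
    """Return every (media_id, timestamp) where the word was spoken."""
    return index.get(word.lower(), [])

print(search("speech"))  # [('podcast_001', 12.4), ('meeting_007', 305.0)]
```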

4. Human-machine collaboration

“By the end of this decade, we will have evolving ASR systems that are like living organisms, learning continuously with human help or self-supervision. These systems will learn from diverse real-world sources, pick up new words and language variants in real time rather than asynchronously, debug themselves, and automatically monitor different usages.”

As ASR goes mainstream and covers more and more use cases, human-machine collaboration will play a key role. Training ASR models illustrates this well. Today, open-source datasets and pre-trained models lower the barrier to entry for ASR vendors, but the training process remains fairly simple: collect data, annotate it, train a model, evaluate the results, improve the model. This process is slow and, in many cases, error-prone due to tuning difficulties or insufficient data. Garnerin et al. observed that missing metadata and inconsistent representation across corpora make it difficult to guarantee equal accuracy in ASR performance, which is also the problem Reid and Walker tried to address when developing their metadata standard.

In the future, humans will efficiently supervise ASR training through intelligent tooling and play an increasingly important role in accelerating machine learning. Human-in-the-loop approaches place human reviewers inside the machine learning feedback loop, allowing continuous review and adjustment of model results. This will make machine learning faster and more efficient, resulting in higher-quality output. Earlier this year, we discussed how improvements in ASR allow Rev's human transcribers (called "Revvers") to post-edit ASR drafts, making them more productive. Revvers' transcriptions can in turn be fed directly back into the ASR model to improve it, forming a virtuous cycle.
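
The loop described above can be sketched schematically as follows. Every callable here (the model, the human review step, the training and evaluation routines) is a placeholder; the point is the shape of the feedback loop, not any particular implementation:

```python
def human_in_the_loop_round(model, unlabeled_audio, human_review, train, evaluate):
    """One schematic round: the model drafts transcripts, humans correct them,
    and the corrections become training data for the next model."""
    corrected_pairs = []
    for audio in unlabeled_audio:
        draft = model.transcribe(audio)          # ASR produces a first-pass draft
        final = human_review(audio, draft)       # a human post-edits the draft
        corrected_pairs.append((audio, final))   # the correction is new training data
    model = train(model, corrected_pairs)        # retrain / fine-tune on corrections
    return model, evaluate(model)                # measure progress before the next round
```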

One area where human language experts remain integral to ASR is inverse text normalization (ITN), which converts recognized word strings (like "five dollars") into their expected written form (like "$5"). Pusateri et al. proposed a hybrid approach based on "hand-crafted grammars and statistical models," and Zhang et al. continued along these lines by constraining RNNs with hand-crafted FSTs.
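
For the dollar example above, a toy rule-based sketch of ITN might look like the following. Real systems instead combine hand-crafted FST grammars with statistical models, as the cited work describes; this snippet covers only a single pattern for illustration:

```python
import re

# A deliberately tiny vocabulary; real ITN grammars cover numbers, dates,
# currencies, addresses, and much more.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

def itn_dollars(text: str) -> str:
    """Rewrite patterns like 'five dollars' as '$5' (toy grammar only)."""
    pattern = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\s+dollars?\b")
    return pattern.sub(lambda m: f"${NUMBER_WORDS[m.group(1)]}", text)

print(itn_dollars("it costs five dollars"))  # it costs $5
```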

5. Responsible ASR

“As with all AI systems, future ASR systems will adhere to stricter AI ethics principles so that the system treats everyone equally, has a higher degree of explainability, is accountable for its decisions, and respects the privacy of users and their data.”

Future ASR systems will follow four principles of AI ethics: fairness, explainability, respect for privacy, and accountability.

Fairness: Fair ASR systems recognize speech equally well regardless of the speaker's background, socioeconomic status, or other characteristics. Building such systems requires identifying and reducing bias in our models and training data (a minimal sketch of per-group accuracy measurement appears after these four principles). Fortunately, governments, NGOs, and businesses are already working to create the infrastructure for identifying and mitigating bias.

Explainability: ASR systems will no longer be "black boxes": they will explain, on request, how data was collected and analyzed, how the model performs, and how outputs were produced. This additional transparency allows better human oversight of model training and performance. Like Gerlings et al., we view explainability from the perspective of a range of stakeholders (including researchers, developers, customers, and, in Rev's case, transcriptionists). Researchers may want to know why erroneous text was output in order to mitigate the problem, while transcriptionists may want evidence of why the ASR system "thinks" what it does, to help them judge its output, especially in noisy conditions where ASR may "hear" better than humans. Weitz et al. took important first steps toward explainability for end users in the context of audio keyword recognition. Laguarta and Subirana have incorporated clinician-guided explanations into a speech biomarker system for Alzheimer's disease detection.

Respect for privacy: A person's voice is considered "personal data" under various U.S. and international laws, so the collection and processing of voice recordings are subject to strict personal privacy protections. At Rev, we already provide data security and control capabilities, and future ASR systems will further respect the privacy of both user data and models. In many cases, this will likely involve pushing the ASR model to the edge (on the device or in the browser). Voice privacy challenges are driving research in this area, and many jurisdictions, such as the European Union, have initiated legislative efforts. The field of privacy-preserving machine learning promises to bring attention to this critical aspect of the technology so that it can be widely accepted and trusted by the public.

Accountability: We will monitor the ASR system to ensure it adheres to the first three principles. This in turn requires the investment of resources and infrastructure to design and develop the necessary monitoring systems and to take action in response to findings. Companies deploying ASR systems will be responsible for their use of the technology and make specific efforts to adhere to ASR ethical principles. It is worth mentioning that humans, as designers, maintainers, and consumers of ASR systems, will be responsible for implementing and enforcing these principles—yet another example of human-machine collaboration.
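
As noted under the fairness principle above, one concrete way to operationalize it is to report accuracy per speaker group instead of a single aggregate number, treating a large gap between groups as a signal of bias. The sketch below uses the open-source jiwer library for WER; the groups and sentences are invented for illustration:

```python
from collections import defaultdict
import jiwer  # open-source WER toolkit (pip install jiwer)

# (speaker_group, reference transcript, ASR hypothesis) -- invented data.
results = [
    ("group_a", "turn the lights on", "turn the lights on"),
    ("group_a", "call my sister",     "call my sister"),
    ("group_b", "turn the lights on", "turn the light on"),
    ("group_b", "call my sister",     "call my system"),
]

per_group = defaultdict(lambda: {"refs": [], "hyps": []})
for group, ref, hyp in results:
    per_group[group]["refs"].append(ref)
    per_group[group]["hyps"].append(hyp)

# A large WER gap between groups points to bias in the model or its data.
for group, data in per_group.items():
    print(group, jiwer.wer(data["refs"], data["hyps"]))
```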

Reference links:
https://thegradient.pub/the-future-of-speech-recognition/
https://awni.github.io/speech-recognition/
