
Google Recorder adds automatic speaker labeling, once again widening its feature gap with iOS Voice Memos

WBOY
2023-04-10 19:31

In 2019, Google launched Recorder, an Android recording app for its Pixel phones that is comparable to Voice Memos on iOS and supports recording, managing, and editing audio files. Since then, Google has steadily added machine-learning-based features to Recorder, including speech recognition, audio event detection, automatic title suggestions, and smart scrolling.

However, when a recording is long and contains multiple speakers, the transcript alone can be inconvenient to use, because speech recognition by itself cannot tell who said each sentence. At this year's Made by Google event, Google announced an automatic speaker-labeling feature for the Recorder app. The feature adds anonymous speaker tags (such as "Speaker 1" or "Speaker 2") to the recognized text in real time, which greatly improves the readability and usefulness of transcripts. The technology behind the feature is called speaker diarization, and Google first presented its diarization system, Turn-to-Diarize, at ICASSP 2022.


Left: a transcript with speaker labels turned off. Right: the same transcript with speaker labels turned on.

System Architecture

Google's Turn-to-Diarize system combines several highly optimized models and algorithms to perform real-time speaker diarization of hours-long audio on mobile devices with very limited computing resources. The system consists of three main components: a speaker turn detection model that detects changes of speaker, a speaker encoder model that extracts the voice characteristics of each speaker, and a multi-stage clustering algorithm that assigns speaker labels efficiently. All components run entirely on the user's device and do not rely on any server connection.
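To make the data flow concrete, here is a minimal Python sketch of how the three stages fit together. The class and function names (SpeakerTurn, diarize, and the detector/encoder/clusterer objects) are hypothetical placeholders for illustration, not Google's actual on-device API.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class SpeakerTurn:
    start_frame: int                 # first acoustic frame of the turn
    end_frame: int                   # one past the last frame of the turn
    embedding: Sequence[float]       # d-vector for this turn
    speaker_id: int = -1             # anonymous label assigned by clustering

def diarize(frames, turn_detector, speaker_encoder, clusterer) -> List[SpeakerTurn]:
    """Run the three on-device stages: turn detection -> embedding -> clustering."""
    # 1. Detect speaker-turn boundaries in the acoustic feature stream.
    boundaries = turn_detector.detect(frames)            # e.g. [(0, 420), (420, 980), ...]

    # 2. Extract one speaker embedding (d-vector) per detected turn.
    turns = [SpeakerTurn(s, e, speaker_encoder.embed(frames[s:e])) for s, e in boundaries]

    # 3. Cluster the embeddings and assign anonymous speaker labels.
    labels = clusterer.cluster([t.embedding for t in turns])
    for turn, label in zip(turns, labels):
        turn.speaker_id = int(label)                     # 0 -> "Speaker 1", 1 -> "Speaker 2", ...
    return turns
```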


Architecture diagram of the Turn-to-Diarize system.

Speaker Turn Detection

The first component of the system is a speaker turn detection model based on the Transformer Transducer (T-T). The model converts a sequence of acoustic features into a text sequence interleaved with a special speaker-turn token, <st>, where each <st> marks a speaker change event. Earlier Google papers used speaker-specific tokens that represented the identity of a particular speaker; because the <st> token is not tied to any specific identity, the new approach applies to a much wider range of scenarios.

For most applications, the output of the diarization system is not shown to the user on its own, but is combined with the output of the speech recognition model. Because the speech recognition model is already optimized for word error rate during training, the speaker turn detection model can tolerate a higher word error rate and instead focus on predicting the <st> token accurately. Building on this observation, Google proposed a new token-based loss function that detects speaker turn events accurately with a much smaller model.
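Once the recognized text is interleaved with speaker-turn tokens, splitting it into per-speaker turns is straightforward. Below is a minimal sketch, assuming the turn token surfaces as the literal string "<st>"; in the real system it is a special model output symbol, not necessarily plain text.

```python
def split_into_turns(transcript: str, turn_token: str = "<st>") -> list[str]:
    """Split recognized text into speaker turns at each turn token."""
    turns = [t.strip() for t in transcript.split(turn_token)]
    return [t for t in turns if t]                 # drop empty fragments

text = ("hi how can I help you <st> I'd like to schedule a test drive "
        "<st> sure what time works for you")
for i, turn in enumerate(split_into_turns(text)):
    print(f"Turn {i + 1}: {turn}")
# Turn 1: hi how can I help you
# Turn 2: I'd like to schedule a test drive
# Turn 3: sure what time works for you
```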

Extracting Speaker Embeddings

After the audio is segmented at speaker-turn events, the system uses the speaker encoder model to extract a speaker embedding, known as a d-vector, for each speaker segment. In Google's earlier work, speaker embeddings were generally extracted from fixed-length audio windows. The new system improves on this in several ways. First, it avoids extracting embeddings from segments that contain more than one speaker, which improves the overall quality of the embeddings. Second, each embedding covers a comparatively long speech segment, so it carries more information about the corresponding speaker's voice. Finally, the resulting embedding sequence is shorter, which makes the subsequent clustering step much less computationally expensive.
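As an illustration of turn-level embeddings, the sketch below pools frame-level speaker-encoder outputs for one turn into a single L2-normalized d-vector. The pooling choice (mean plus normalization) and the dimensions are assumptions for illustration, not details of Google's encoder.

```python
import numpy as np

def turn_dvector(frame_embeddings: np.ndarray) -> np.ndarray:
    """Pool frame-level speaker-encoder outputs of one turn (shape: [frames, dim])
    into a single L2-normalized d-vector."""
    pooled = frame_embeddings.mean(axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-8)

# One embedding per turn keeps the sequence short: a one-hour recording with a
# few hundred turns produces a few hundred d-vectors, instead of the thousands
# a fixed sliding window would produce, so the clustering step stays cheap.
frames = np.random.randn(250, 256)      # e.g. 250 encoder frames, 256-dim outputs
print(turn_dvector(frames).shape)       # (256,)
```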

Multi-stage clustering

The last step of diarization is to cluster the speaker embedding sequence produced in the previous steps. Since recordings made with the Recorder app can range from a few seconds to as long as 18 hours, a key challenge for the clustering algorithm is handling embedding sequences of very different lengths.

To this end, Google's multi-stage clustering strategy combines the strengths of several clustering algorithms. For short sequences, it uses agglomerative hierarchical clustering (AHC). For medium-length sequences, it uses spectral clustering and estimates the number of speakers with the maximal eigen-gap criterion. For long sequences, it first compresses the sequence with AHC as a pre-clustering step and then applies spectral clustering, which reduces the computational cost of the clustering step. Throughout streaming processing, previous clustering results are dynamically cached and reused, so the time and space complexity of each clustering call is bounded by a constant.
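A rough sketch of this length-dependent dispatch using scikit-learn is shown below. The thresholds, the affinity computation, and the eigen-gap speaker counting are illustrative assumptions, not the parameters used in Recorder.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, SpectralClustering

def estimate_num_speakers(affinity: np.ndarray, max_speakers: int = 8) -> int:
    """Estimate the number of speakers from the largest eigen-gap of the affinity matrix."""
    eigvals = np.sort(np.linalg.eigvalsh(affinity))[::-1]        # descending eigenvalues
    gaps = eigvals[:max_speakers - 1] - eigvals[1:max_speakers]
    return int(np.argmax(gaps)) + 1

def multi_stage_cluster(embeddings: np.ndarray,
                        short_len: int = 20,
                        long_len: int = 500) -> np.ndarray:
    """Length-dependent clustering: AHC for short sequences, spectral clustering for
    medium ones, AHC pre-clustering followed by spectral clustering for long ones.
    Embeddings are assumed to be L2-normalized; all thresholds are placeholders."""
    n = len(embeddings)
    if n <= short_len:
        # Short: agglomerative hierarchical clustering with a distance threshold.
        ahc = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0)
        return ahc.fit_predict(embeddings)

    affinity = np.clip(embeddings @ embeddings.T, 0, None)       # cosine-style affinity
    if n <= long_len:
        # Medium: spectral clustering with eigen-gap speaker counting.
        k = estimate_num_speakers(affinity)
        return SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(affinity)

    # Long: compress with AHC first, then run spectral clustering on the
    # pre-cluster centroids to bound the cost of the expensive step.
    pre = AgglomerativeClustering(n_clusters=long_len).fit_predict(embeddings)
    centroids = np.stack([embeddings[pre == c].mean(axis=0) for c in range(long_len)])
    cent_aff = np.clip(centroids @ centroids.T, 0, None)
    k = estimate_num_speakers(cent_aff)
    cent_labels = SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(cent_aff)
    return cent_labels[pre]                                       # map each embedding to its centroid's label
```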

The multi-stage clustering strategy is a key optimization for on-device applications, where resources such as CPU, memory, and battery are usually limited. It keeps power consumption low even after processing several hours of audio, and its constant complexity bound can be tuned per device model to balance accuracy and performance.


Schematic diagram of the multi-stage clustering strategy.

Real-time correction and user annotation

Because Turn-to-Diarize is a streaming, real-time system, the predicted speaker labels become more accurate as the model processes more audio. The Recorder app therefore keeps correcting previously predicted speaker labels while the user is recording, so the labels shown on the current screen are always the most accurate ones available.
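Conceptually, the correction step only needs to diff the newest clustering result against what is already on screen and re-render the turns whose labels changed. A minimal sketch with hypothetical data structures:

```python
def corrections(displayed: dict[int, int], latest: list[int]) -> dict[int, int]:
    """Return {turn_index: new_label} for every turn whose label changed or is new."""
    changed = {}
    for idx, new_label in enumerate(latest):
        if displayed.get(idx) != new_label:
            changed[idx] = new_label
    return changed

displayed = {0: 0, 1: 0, 2: 1}          # labels the user currently sees, per turn
latest    = [0, 1, 1, 1]                # re-clustering result after more audio arrived
print(corrections(displayed, latest))   # {1: 1, 3: 1} -> update only these turns on screen
```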

The Recorder interface also lets users rename the speaker tags in each recording, for example renaming "Speaker 2" to "Car Dealership", which makes transcripts easier to read and remember.


Recorder allows users to rename speaker tags to improve readability.

Future Work

Google's latest Pixel phones ship with its custom Google Tensor chip, and the current diarization system runs mainly on the chip's CPU. In the future, Google plans to move the system to the TPU block of Google Tensor to further reduce power consumption. Google also hopes to extend the feature to languages other than English by using multilingual speaker encoders and speech recognition models.

