EEG synthesis of natural speech! LeCun shares new results from a Nature sub-journal, and the code is open source


王林

Apr 17, 2024 pm 07:01 PM


The latest progress in brain-computer interfaces has been published in a Nature sub-journal, and LeCun, one of the three giants of deep learning, also shared it.


This time, neural signals are used to synthesize speech, helping people with aphasia caused by neurological deficits regain the ability to communicate.

A research team from New York University has reportedly developed a new differentiable speech synthesizer. A lightweight convolutional neural network encodes speech into a series of interpretable speech parameters (such as pitch, loudness, and formant frequencies), and the differentiable synthesizer resynthesizes the speech from these parameters.

By mapping neural signals to these speech parameters, the researchers built a neural speech decoding system that is highly interpretable, works with small amounts of data, and can generate natural-sounding speech.

The researchers collected data from a total of 48 subjects and ran experiments on it, providing validation of speech decoding that can inform future high-accuracy brain-computer interfaces.

The results show that the framework can handle both high and low spatial sampling densities and can process ECoG signals from either the left or the right hemisphere, showing strong speech decoding capabilities.


Speech decoding of neural signals is difficult!

Previously, Musk's company Neuralink successfully implanted electrodes in a subject's brain, allowing simple cursor control for functions such as typing.

However, neural-speech decoding is generally considered to be more complex.

Most attempts to develop neural speech decoders and other high-precision brain-computer interface models rely on a special kind of data: electrocorticography (ECoG) recordings, usually collected from epilepsy patients during their treatment.

Electrodes implanted in patients with epilepsy are used to record cortical activity while the patients speak. These data have high spatial and temporal resolution and have helped researchers obtain a series of remarkable results in the field of speech decoding.

However, speech decoding of neural signals still faces two major challenges.

The data available to train a personalized neural-to-speech decoding model is very limited in duration, usually only about ten minutes, while deep learning models often require large amounts of training data.
Human pronunciation is highly variable. Even when the same person says the same word repeatedly, the speed, intonation and pitch change, which adds complexity to the representation space the model must build.

Early attempts to decode neural signals into speech mainly relied on linear models. These models usually did not require huge training data sets and were highly interpretable, but their accuracy was very low.

More recently, many attempts based on deep neural networks, especially convolutional and recurrent architectures, have targeted two key dimensions: modeling an intermediate latent representation of speech and improving the quality of the synthesized speech. For example, some studies decode cortical activity into an articulatory (mouth movement) space and then convert it into speech. Although the decoding performance is strong, the reconstructed voice sounds unnatural.

On the other hand, some methods successfully reconstruct natural-sounding speech by using a WaveNet vocoder, generative adversarial networks (GANs), and similar techniques, but their accuracy is limited.

A recent study published in Nature, in a patient with an implanted device, produced speech waveforms that are both accurate and natural by using quantized HuBERT features as an intermediate representation space and a pre-trained speech synthesizer to convert these features into speech.

However, HuBERT features cannot represent speaker-specific acoustic information and can only produce a fixed, uniform speaker voice, so an additional model is needed to convert this generic voice into a specific patient's voice. Furthermore, this study and most previous attempts adopted non-causal architectures, which may limit their use in practical brain-computer interface applications that require temporally causal operation.

Building a Differentiable Speech Synthesizer

The research team from New York University's Video Lab and Flinker Lab introduced a new framework for decoding electrocorticography (ECoG) signals into speech. The framework constructs a low-dimensional latent representation, which is generated by a speech encoding and decoding model using only the speech signal.

△Neural Speech Decoding Framework

Specifically, the framework consists of two parts:

One part is the ECoG decoder, which converts the ECoG signal into interpretable acoustic speech parameters (such as pitch, voicing, loudness, and formant frequencies);

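The other part is the speech synthesizer, which converts these speech parameters into a spectrogram.

The researchers built a differentiable speech synthesizer, which allows the synthesizer to take part in training alongside the ECoG decoder so that the two are jointly optimized to reduce the spectrogram reconstruction error.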
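The benefit of a differentiable synthesizer can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a two-parameter "synthesizer" (hypothetical `loudness` and `mix` parameters acting on made-up voiced and unvoiced spectral shapes) with hand-derived gradients, and it optimizes the speech parameters directly rather than the weights of an ECoG decoder. It only shows how a spectrogram reconstruction error can be propagated back through the synthesizer.

```python
# Toy illustration with assumed shapes and values (not the published model).
VOICED = [1.0, 0.2, 0.0, 0.0]    # hypothetical voiced spectral shape
UNVOICED = [0.1, 0.1, 0.5, 0.5]  # hypothetical unvoiced spectral shape


def synthesize(loudness, mix):
    """Spectrum = loudness * (mix * voiced + (1 - mix) * unvoiced)."""
    return [loudness * (mix * v + (1.0 - mix) * u)
            for v, u in zip(VOICED, UNVOICED)]


def reconstruction_loss(loudness, mix, target):
    """Mean squared error between the synthesized and target spectra."""
    spec = synthesize(loudness, mix)
    return sum((s - t) ** 2 for s, t in zip(spec, target)) / len(target)


def gradients(loudness, mix, target):
    """Analytic gradients of the loss with respect to both parameters."""
    n = len(target)
    spec = synthesize(loudness, mix)
    g_loud = g_mix = 0.0
    for s, t, v, u in zip(spec, target, VOICED, UNVOICED):
        residual = s - t
        g_loud += 2.0 * residual * (mix * v + (1.0 - mix) * u) / n
        g_mix += 2.0 * residual * loudness * (v - u) / n
    return g_loud, g_mix


def fit(target, loudness=1.0, mix=0.5, lr=0.05, steps=500):
    """Plain gradient descent on the spectrum reconstruction error."""
    for _ in range(steps):
        g_loud, g_mix = gradients(loudness, mix, target)
        loudness -= lr * g_loud
        mix -= lr * g_mix
    return loudness, mix


target = synthesize(2.0, 0.7)  # stands in for a reference spectrum
start_loss = reconstruction_loss(1.0, 0.5, target)
loudness, mix = fit(target)
final_loss = reconstruction_loss(loudness, mix, target)
print(start_loss, final_loss)
```

In a full system an automatic-differentiation framework would compute these gradients and pass them further back into the decoder; the hand-written `gradients` function here is only a stand-in for that step.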

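This low-dimensional latent space is highly interpretable. Combined with a lightweight pre-trained speech encoder that generates reference speech parameters, it helps the researchers build an efficient neural speech decoding framework and overcome the severe scarcity of data in the field of neural speech decoding.

The framework can generate natural speech that is very close to the speaker's own voice. The ECoG decoder component can be swapped among different deep learning architectures, and it also supports causal operation.

The researchers collected and processed ECoG data from 48 neurosurgical patients, using several deep learning architectures (including convolutional, recurrent, and Transformer networks) as the ECoG decoder.

The framework demonstrated high accuracy across these models, with the convolutional (ResNet) architecture performing best. The framework proposed in the paper achieves high accuracy even when restricted to causal operation and a relatively low sampling density (low density, 10 mm spacing). The researchers also demonstrated effective speech decoding from both the left and right hemispheres, extending neural speech decoding to the right hemisphere.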

△Differentiable speech synthesizer architecture

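The differentiable speech synthesizer makes the speech re-synthesis task very efficient: high-fidelity audio that matches the original sound can be synthesized from a very small amount of speech data. The synthesizer borrows from the principles of the human speech production system and divides speech into two parts: Voice (used to model vowels) and Unvoice (used to model consonants). For the Voice part, a fundamental frequency signal is first used to generate harmonics, which are then filtered by a filter composed of the formant peaks F1 to F6 to obtain the spectral characteristics of the vowel part.

For the Unvoice part, the researchers filter white noise with a corresponding filter to obtain the corresponding spectrum. A learnable parameter controls the mixing ratio of the two parts. The mixture is then amplified according to the loudness signal, and background noise is added to obtain the final speech spectrum.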
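The signal flow described above can be sketched for a single spectral frame. This is a simplified forward pass under stated assumptions, not the authors' implementation: the Gaussian peak shapes, the bandwidths, the parameter values, and the function names are all hypothetical, and only two formants are used instead of six.

```python
import math


def harmonic_excitation(freqs, f0, n_harmonics=40, width=20.0):
    """Narrow spectral peaks at integer multiples of the fundamental f0."""
    return [sum(math.exp(-0.5 * ((f - k * f0) / width) ** 2)
                for k in range(1, n_harmonics + 1))
            for f in freqs]


def resonance_filter(freqs, centers, bandwidth):
    """Sum of Gaussian-shaped peaks, one per center frequency (assumed shape)."""
    return [sum(math.exp(-0.5 * ((f - c) / bandwidth) ** 2) for c in centers)
            for f in freqs]


def synthesize_frame(freqs, f0, formants, noise_center, mix, loudness,
                     background=1e-3):
    """Return one spectral frame from a handful of speech parameters."""
    # Voiced part: harmonics of f0 shaped by the formant filter.
    harmonics = harmonic_excitation(freqs, f0)
    formant_gain = resonance_filter(freqs, formants, bandwidth=100.0)
    voiced = [h * g for h, g in zip(harmonics, formant_gain)]
    # Unvoiced part: white noise has a flat expected spectrum, so the
    # filtered noise spectrum is represented by the filter shape itself.
    unvoiced = resonance_filter(freqs, [noise_center], bandwidth=1000.0)
    # Mix the two parts, scale by loudness, add a background-noise floor.
    return [loudness * (mix * v + (1.0 - mix) * u) + background
            for v, u in zip(voiced, unvoiced)]


freqs = [50.0 * i for i in range(160)]  # 0 Hz to 7950 Hz
voiced_frame = synthesize_frame(freqs, f0=150.0, formants=[600.0, 1200.0],
                                noise_center=4000.0, mix=1.0, loudness=1.0)
unvoiced_frame = synthesize_frame(freqs, f0=150.0, formants=[600.0, 1200.0],
                                  noise_center=4000.0, mix=0.0, loudness=1.0)
```

Every operation here is a smooth function of `f0`, the formant centers, `mix`, and `loudness`, which is the property that lets a synthesizer of this kind be trained with gradients.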

△Speech encoder and ECoG decoder

Research results

1. Temporally causal speech decoding results

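First, the researchers directly compared the speech decoding performance of different model architectures: convolutional (ResNet), recurrent (LSTM), and Transformer (3D Swin). It is worth noting that each of these models can operate either non-causally or causally in time.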

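Whether a decoding model is causal has major implications for brain-computer interface (BCI) applications: a causal model uses only past and present neural signals to generate speech, whereas a non-causal model also uses future neural signals, which is not feasible in real-time applications.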
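The distinction can be made concrete with a one-dimensional temporal filter. The sketch below is a generic illustration, not code from the paper; the function name and the zero-padding at the edges are assumptions. In causal mode the output at time `t` depends only on samples up to `t`, while in non-causal mode a centered window also reads samples after `t`.

```python
def temporal_filter(signal, kernel, causal=True):
    """Apply a 1-D temporal filter to a list of samples.

    causal=True:  output[t] uses signal[t - K + 1 .. t] (past and present).
    causal=False: output[t] uses a window centered on t (includes the future).
    Samples outside the signal are treated as zero.
    """
    k = len(kernel)
    offset = k - 1 if causal else k // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - offset + j
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


kernel = [1.0, 1.0, 1.0]
quiet = [0.0, 0.0, 0.0, 0.0, 0.0]
later_event = [0.0, 0.0, 0.0, 1.0, 0.0]  # differs from `quiet` only at t = 3

# The causal outputs before t = 3 are identical for both inputs.
print(temporal_filter(quiet, kernel)[:3],
      temporal_filter(later_event, kernel)[:3])
# The non-causal output at t = 2 already reacts to the sample at t = 3.
print(temporal_filter(later_event, kernel, causal=False)[2])
```

A real-time decoder can only be built from the first kind of operation, because the sample at `t = 3` does not exist yet when the output for `t = 2` has to be produced.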

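They therefore focused on comparing the performance of the same model under non-causal and causal operation.

They found that even the causal version of the ResNet model is comparable to the non-causal version, with no significant difference between the two. Likewise, the causal and non-causal versions of the Swin model perform similarly, but the causal version of the LSTM model performs significantly worse than its non-causal version.

The researchers reported the average decoding accuracy (N=48) for several key speech parameters, including voice weight (used to distinguish vowels from consonants), loudness, pitch f0, the first formant f1, and the second formant f2. Accurately reconstructing these speech parameters, especially pitch, voice weight, and the first two formants, is critical for accurate speech decoding and for reconstruction that naturally mimics the participant's voice.

The results show that both non-causal and causal models can obtain reasonable decoding results, which provides positive guidance for future research and applications.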

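2. Speech decoding from left- and right-hemisphere neural signals and the effect of spatial sampling density

The researchers further compared speech decoding results for the left and right hemispheres. Most studies focus on the left hemisphere, which dominates speech and language function, and little attention has been paid to decoding language information from the right hemisphere.

In response, they compared decoding performance for participants' left and right hemispheres to verify the possibility of using the right hemisphere for speech restoration.

Of the 48 subjects in the study, 16 had ECoG signals recorded from the right hemisphere.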


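Comparing the performance of the ResNet and Swin decoders, they found that the right hemisphere also supports stable speech decoding, with decoding performance differing only slightly from that of the left hemisphere.

This means that for patients who have lost language ability because of left-hemisphere damage, using neural signals from the right hemisphere to restore language may be a feasible solution.

Next, they also investigated the effect of electrode sampling density on speech decoding.

Previous studies mainly used higher-density electrode grids (0.4 mm), whereas the electrode grids commonly used in clinical practice have a lower density (LD, 1 cm). Five participants used hybrid (HB) electrode grids, which are mainly low-density but incorporate additional electrodes. The remaining 43 participants were sampled at low density. The decoding performance for the hybrid samples (HB) is similar to that for the conventional low-density samples (LD). This shows that the model can learn speech information from the cerebral cortex at different spatial sampling densities, and it also suggests that the sampling density commonly used in clinical practice may be sufficient for future brain-computer interface applications.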

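3. Contributions of different brain regions in the left and right hemispheres to speech decoding

The researchers also examined how much the speech-related brain regions contribute to the speech decoding process, which provides an important reference for implanting speech restoration devices in the left or right hemisphere in the future. They used occlusion analysis to evaluate the contribution of different brain regions to speech decoding.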
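The article does not describe how the occlusion analysis was carried out, so the following is only a generic sketch of the idea: silence the electrodes of one region at a time and record how much a decoding score drops. The region names, electrode names, weights, and the `toy_score` function are all hypothetical stand-ins for a trained decoder and its evaluation metric.

```python
def occlusion_contributions(score_fn, recordings, regions):
    """Score drop caused by zeroing each region's electrodes in turn.

    score_fn:   callable mapping {electrode: samples} to a decoding score.
    recordings: {electrode: list of samples}.
    regions:    {region name: list of electrode names}.
    """
    baseline = score_fn(recordings)
    contributions = {}
    for region, electrodes in regions.items():
        masked = {name: ([0.0] * len(sig) if name in electrodes else sig)
                  for name, sig in recordings.items()}
        contributions[region] = baseline - score_fn(masked)
    return contributions


# Hypothetical stand-in for "decode, then score against reference speech".
WEIGHTS = {"e1": 0.6, "e2": 0.3, "e3": 0.1}


def toy_score(recordings):
    """Weighted mean absolute amplitude per electrode (illustrative only)."""
    return sum(WEIGHTS[name] * sum(abs(x) for x in sig) / len(sig)
               for name, sig in recordings.items())


recordings = {"e1": [1.0, -1.0], "e2": [1.0, 1.0], "e3": [-1.0, 1.0]}
regions = {"region_a": ["e1"], "region_b": ["e2", "e3"]}
result = occlusion_contributions(toy_score, recordings, regions)
print(result)
```

A larger drop for a region means the decoder relied on it more, which is how a comparison between regions or hemispheres could be read off such an analysis.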

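By comparing the causal and non-causal models of the ResNet and Swin decoders, they found that the auditory cortex contributes more in the non-causal models. This supports the need to use causal models in real-time speech decoding applications, where neural feedback signals cannot be exploited.

In addition, the contribution of the sensorimotor cortex, especially its ventral part, is similar in the right and left hemispheres, suggesting that implanting a neural prosthesis in the right hemisphere may be a feasible solution.

Finally, in summary, this study brings a series of advances for brain-computer interfaces, but the researchers also noted several limitations of the current model. For example, the decoding process requires speech training data paired with ECoG recordings, so it may not be applicable to patients with aphasia.

In the future, they hope to develop model architectures that can handle non-grid data and make better use of multi-patient, multimodal EEG data.

Research on brain-computer interfaces is still at an early stage. With continued iteration of hardware and rapid progress in deep learning, the brain-computer interfaces seen in science fiction films will move ever closer to reality.

Paper link: https://www.nature.com/articles/s42256-024-00824-8
GitHub link: https://github.com/flinkerlab/neural_speech_decoding

More generated speech examples: https://xc1490.github.io/nsd/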


Statement

This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to have it removed.
