ByteDance's large-model simultaneous interpretation agent debuts with interpretation quality comparable to human interpreters

Whether the input is a tongue twister with rapid speech and tricky pronunciation, refined classical Chinese, or improvised, free-flowing casual conversation, the model produces accurate, idiomatic translations smoothly and naturally.

In recent years, artificial intelligence (AI), especially AI based on large language models (LLMs), has been developing at an astonishing pace. These models have demonstrated outstanding abilities across a variety of natural language processing tasks. However, despite breakthroughs in many fields, simultaneous interpretation (SI), which represents the highest level of human language skill, remains a problem that has not been fully solved.

Traditional simultaneous interpretation software on the market usually uses a cascaded approach: automatic speech recognition (ASR) runs first, and machine translation (MT) then runs on its output. This approach has a significant problem: error propagation. Errors made in the ASR stage directly degrade the subsequent translation, so errors accumulate. In addition, because of strict low-latency requirements, traditional simultaneous interpretation systems usually rely on small, weaker models, which become a bottleneck in complex and varied real-world scenarios.
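To make the error-propagation problem concrete, here is a toy sketch of a cascaded pipeline. The functions `toy_asr` and `toy_mt`, the utterance ids, and both lookup tables are hypothetical stand-ins rather than any real system. They only illustrate that the translation stage never sees the audio, so it cannot repair a recognition error.

```python
# Toy cascade: the MT stage only ever sees the ASR text, never the audio.
# Every name and table here is hypothetical and exists only for illustration.

# Pretend recognition results, keyed by an utterance id.
ASR_OUTPUT = {
    "clean": "the ising model describes magnetism",
    "noisy": "the icing model describes magnetism",  # "ising" misheard as "icing"
}

# Tiny English-to-Chinese term table used by the toy MT stage.
TERM_TABLE = {
    "ising model": "伊辛模型",
    "icing model": "糖霜模型",
}


def toy_asr(utterance_id: str) -> str:
    """Stage 1: return the (possibly wrong) transcript for an utterance."""
    return ASR_OUTPUT[utterance_id]


def toy_mt(text: str) -> str:
    """Stage 2: substitute known terms; it cannot recover what ASR got wrong."""
    for source, target in TERM_TABLE.items():
        text = text.replace(source, target)
    return text


def cascade(utterance_id: str) -> str:
    """Run ASR, then MT on whatever text ASR produced."""
    return toy_mt(toy_asr(utterance_id))


clean = cascade("clean")
noisy = cascade("noisy")
```

In this sketch the MT stage behaves correctly on both inputs, yet the "noisy" result still carries the wrong term, because the mistake was fixed in place before translation began.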

Researchers from the ByteDance Research team have introduced an end-to-end simultaneous interpretation agent: Cross Language Agent - Simultaneous Interpretation (CLASI). Its output approaches the level of professional human interpreters, demonstrating great potential and advanced technical capability. CLASI adopts an end-to-end architecture to avoid the error propagation of cascaded models, and builds on the speech understanding capabilities of the Doubao foundation model and the Doubao large-model speech team. It can also acquire knowledge from external sources, resulting in a simultaneous interpretation system that is comparable to human performance.

  • Paper address: https://byteresearchcla.github.io/clasi/technical_report.pdf
  • Display page: https://byteresearchcla.github.io/clasi/

Demonstration

Video demo: A few unscripted videos give a first impression of CLASI. All subtitles were produced and recorded in real time. Whether the input is a tongue twister with rapid speech and tricky pronunciation, refined classical Chinese, or improvised casual conversation, the model produces accurate, idiomatic translations smoothly and naturally. CLASI also performs well on the task it was designed for: translating conference settings.

  • Impromptu conversation: Constellations
  • Reading: Chibi Fu
  • Tongue twisters

Quantitative comparison: The researchers invited professional simultaneous interpreters to manually evaluate Chinese-to-English and English-to-Chinese translation in four different domains, using the same metric that is applied to human interpreters: the proportion of valid information (as a percentage). As the figure shows, CLASI is significantly ahead of all commercial systems and open-source state-of-the-art (SOTA) systems, and on some test sets it reaches or exceeds the level of human interpreters (the average level of human simultaneous interpretation is generally taken to be about 80%).
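As a rough illustration of how a percentage of this kind can be computed, the sketch below assumes that evaluators give a yes/no judgment per fragment of the source speech. That binary scheme, the function name, and the sample judgments are assumptions for illustration; the actual evaluation guidelines are in the technical report.

```python
from typing import Sequence

# Average human level cited in the article, in percent.
HUMAN_REFERENCE = 80.0


def valid_information_proportion(judgments: Sequence[bool]) -> float:
    """Return the percentage of fragments judged as validly conveyed."""
    if not judgments:
        raise ValueError("need at least one judged fragment")
    return 100.0 * sum(judgments) / len(judgments)


# Hypothetical judgments for a talk split into 10 fragments: 8 conveyed, 2 lost.
example = [True, True, False, True, True, True, False, True, True, True]
score = valid_information_proportion(example)
reaches_human_level = score >= HUMAN_REFERENCE
```

With these made-up judgments the score lands exactly on the 80% reference, which shows how a single lost fragment in a short talk can move a system above or below the human average.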

System Architecture

In terms of system architecture, CLASI adopts an architecture based on LLM agents (left in the figure below) that defines simultaneous interpretation as a series of simple, coordinated operations: reading the audio stream, retrieval (optional), reading memory, updating memory, producing output, and so on. The entire process is controlled autonomously by a large language model, which achieves an effective balance between real-time performance and translation quality. The system can flexibly adjust the processing strategy of each step according to actual needs, so that the translated content stays accurate and coherent while information is conveyed efficiently. The underlying model of CLASI is an encoder-conditioned LLM, pre-trained on massive amounts of unsupervised and supervised data. The system architecture of CLASI is shown in the figure below.

Figure 1: Diagram of the overall CLASI workflow. In step 1, CLASI processes the current input audio. The retriever is then (optionally) activated to fetch relevant information from the user-defined knowledge base. In this example, the translation pair for "Ising model" in the knowledge base helps the model generate the correct translation. In step 3, CLASI loads the transcription (optional) and translation from the memory of the previous round. Next (steps 4 and 5), CLASI may use chain of thought (CoT) to output the transcription (optional) and the translation, and then updates its memory. Finally, it returns to step 1 to process the next round of speech.
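The five steps in Figure 1 can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the authors' implementation: text strings stand in for audio chunks, `stub_model` is a hypothetical placeholder for the LLM, the retriever is a plain substring match, and the knowledge-base entry is an assumed example of the translation pair mentioned in the caption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple

# (transcription, translation) produced in one round
Round = Tuple[str, str]
# model(chunk, previous rounds, retrieved term pairs) -> (transcription, translation)
Model = Callable[[str, List[Round], Dict[str, str]], Round]


@dataclass
class Memory:
    """Transcriptions and translations carried over from earlier rounds."""
    rounds: List[Round] = field(default_factory=list)


def retrieve(chunk: str, knowledge_base: Dict[str, str]) -> Dict[str, str]:
    """Step 2 (optional): keep the term pairs whose source term occurs in the chunk."""
    return {src: tgt for src, tgt in knowledge_base.items() if src in chunk}


def run_agent(chunks: Iterable[str], model: Model,
              knowledge_base: Dict[str, str]) -> List[str]:
    memory = Memory()
    outputs: List[str] = []
    for chunk in chunks:                                  # step 1: read current input
        hints = retrieve(chunk, knowledge_base)           # step 2: optional retrieval
        context = list(memory.rounds)                     # step 3: load previous rounds
        transcript, translation = model(chunk, context, hints)  # step 4: generate
        memory.rounds.append((transcript, translation))   # step 5: update memory
        outputs.append(translation)
    return outputs                                        # looping is "return to step 1"


def stub_model(chunk: str, context: List[Round], hints: Dict[str, str]) -> Round:
    """Hypothetical placeholder for the LLM: only substitutes retrieved terms."""
    translation = chunk
    for src, tgt in hints.items():
        translation = translation.replace(src, tgt)
    return chunk, translation


kb = {"伊辛模型": "Ising model"}
result = run_agent(["伊辛模型 is simple", "it has spins"], stub_model, kb)
```

The point of the sketch is the control flow: each round sees the new input, any retrieved term pairs, and the memory of earlier rounds, and only then appends to memory. In CLASI these decisions are made by the LLM itself rather than by fixed code.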

Figure 2: Structure diagram of CLASI. In round r, CLASI takes as input the current audio stream, the memory from the previous round (r-1), and the retrieved knowledge (if any). CLASI generates a response according to the given instructions and then updates the memory. CLASI also outputs the cut-off timestamp of the last complete semantic chunk. In the example given, the content before the phrase "just before" is considered a complete semantic chunk, so the cut-off timestamp falls right before that phrase.
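One plausible use of the cut-off timestamp is to decide which part of the buffered audio is finished and which part must be carried into the next round. The article does not describe this buffer handling, so the sketch below is an assumption. The function name, the list-of-samples buffer, and the tiny sample rate are all hypothetical.

```python
from typing import List, Tuple


def split_at_cutoff(buffer: List[float], buffer_start_s: float,
                    cutoff_s: float, sample_rate: int
                    ) -> Tuple[List[float], List[float], float]:
    """Split buffered samples at the cut-off timestamp.

    Returns (committed, remaining, new_start): samples before the cut-off,
    samples kept for the next round, and the start time of the kept part.
    """
    index = round((cutoff_s - buffer_start_s) * sample_rate)
    index = max(0, min(index, len(buffer)))  # clamp to the buffer bounds
    new_start = buffer_start_s + index / sample_rate
    return buffer[:index], buffer[index:], new_start


# Hypothetical round: 3 seconds of audio starting at t = 2.0 s, 10 samples per second.
samples = [0.0] * 30
committed, remaining, new_start = split_at_cutoff(samples, 2.0, 3.5, 10)
```

Here the first 1.5 seconds are treated as a complete semantic chunk, and the remaining 1.5 seconds wait for more context before being translated.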

Experimental Results

Table 1: In the human evaluation of Valid Information Proportion (VIP), CLASI significantly outperformed all competing products and achieved more than 78% in both language directions. In general, the accuracy of human simultaneous interpretation can be taken to be above 70% and can ideally reach 95%; the researchers use 80% as the average standard for high-level human interpreters.

Case Analysis

Chinese to English:

English to Chinese:

CLASI's translations are clearly better than those of commercial systems in many respects.

Summary

Researchers from the ByteDance Research team proposed CLASI, a simultaneous interpretation agent based on the Doubao large model. Thanks to large-scale pre-training and imitation learning, CLASI significantly outperforms existing automatic simultaneous interpretation systems in human evaluation, nearly reaching the level of human simultaneous interpretation.

1. The researchers propose a data-driven read-write strategy that imitates professional human interpreters. This strategy balances translation quality and latency without requiring complex hand-crafted design. Unlike most commercial systems, which frequently rewrite their output during translation to improve quality, this strategy guarantees that all output is deterministic while maintaining high quality.

2. Human interpreters usually need to prepare the content of a simultaneous interpretation session in advance. Inspired by this, the researchers introduced a multi-modal retrieval-augmented generation (MM-RAG) process that gives the LLM domain-specific knowledge in real time. The proposed module further improves translation quality with minimal computational overhead during inference.

3. The researchers worked closely with professional human simultaneous interpreters to develop a new human evaluation strategy, Valid Information Proportion (VIP), and published detailed guidelines. They also released a manually annotated multi-domain test set for long-form speech translation that is closer to real-world scenarios.
