Deepening AI voice and multimodal technology to deliver a localized intelligent interactive experience

王林
2023-09-17 13:21:10

With the development of 5G and artificial intelligence, intelligent voice technology has entered people's daily lives through a wide range of smart terminal products, bringing greater convenience and new possibilities. As a provider of smart terminal products and mobile Internet services in emerging markets, Transsion continues to innovate in artificial intelligence, advancing the research and application of AI voice technology, exploring more localized user scenarios, and bringing a full-scenario intelligent interactive experience to users in emerging markets.

Transsion has built its own underlying AI voice capabilities in speech recognition, semantic understanding, speech synthesis, natural language processing, knowledge graphs and related areas. It has established an advantage in voice data for minority languages and has made major breakthroughs in multilingual voice assistants, digital humans, and voice forgery detection. Since the beginning of this year, Transsion's AI Technology Department has posted a series of results: strong placings in the ICASSP 2023 Spoken Language Understanding (SLU) Challenge and the IJCAI 2023 Audio Deepfake Detection (ADD) Challenge, and an academic paper on digital human multimodal interaction published at ICME 2023, a flagship international multimedia conference.

Building multilingual voice assistants for a local voice-interaction content ecosystem

The voice assistant is a standard smartphone application. Its core technologies are voice interaction and natural language understanding, and its purpose is to help users complete tasks more quickly and efficiently. To meet the demand for local-language voice interaction in emerging markets, Transsion has long invested in multilingual voice assistant technology, focusing on understanding the needs of local users and turning them into technical solutions. Through this research and development it has accumulated substantial technical capability and practical experience.

At ICASSP 2023, a leading international conference, Transsion's AI Technology Department competed in the Spoken Language Understanding (SLU) Challenge. On the strength of its speech recognition and semantic understanding performance, the team won first place in the offline voice assistant sub-track with an accuracy of 71.97%. Its entry paper, "A Two-Stage System for Spoken Language Understanding", was also included by the IEEE (Institute of Electrical and Electronics Engineers).

Colleagues from Transsion’s AI technology department shared research results at ICASSP 2023

Today's voice assistants mainly target mainstream languages and offer little coverage of niche languages, specific user groups and other specialized segments. To serve the local accents and minority languages of users in emerging markets such as Africa and South Asia, Transsion has built a localized, low-cost, high-quality corpus production system that draws on its very large base of mobile phone users, addressing the scarcity of corpora and data for minority languages. On this basis, Transsion develops multilingual voice assistants adapted to the linguistic and cultural characteristics of local users, making it easier for them to interact with their phones by voice in their own languages. Transsion's multilingual voice assistant technology currently supports voice interaction and natural language understanding in English, French, Hausa, Arabic, Swahili and other languages, covering more than 100 usage scenarios such as calling contacts, quick app launch, music playback, and WhatsApp messaging and chat.

To meet local users' needs for everyday services, Transsion will continue to apply its multilingual AI voice assistant technology to more scenarios in daily life, travel, study and work. The aim is to build a cross-language AI content service ecosystem so that intelligent voice services reach every aspect of local life and benefit more speakers of minority languages.

AI digital human technology empowers Transsion’s multi-scenario business

As interactive intelligence technology develops at an accelerating pace, digital humans are moving from technical innovation to industrial application, with roles in entertainment, education, healthcare and other fields. Transsion has embraced this opportunity, invested early in digital human technology, and established complete end-to-end in-house technology and engineering capabilities. Transsion's digital human system includes both 2D real-person and 3D realistic digital humans. Drawing on its data resources and its capabilities in multilingual speech recognition, speech synthesis, voice wake-up, natural language understanding and digital humans, the system has developed localized characteristics and an industry-leading position in multilingual voice dialogue, character design and appearance, and intelligent scene interaction. In January this year, Transsion's digital human system received the authoritative standard certification for the digital human field issued by the China Academy of Information and Communications Technology. It is also the only "interactive dialogue" digital human system from a Chinese mobile phone manufacturer to have passed the academy's evaluation.

To make virtual avatars more lifelike and to synthesize realistic, expressive digital human videos, Transsion's AI Technology Department developed its own end-to-end technology. While optimizing the quality of digital human video generation, the team proposed a new framework built on the U-Net network, a densely-connected U-Net structure, and introduced the CLIP encoder so that text semantic information can be used to improve the animation of the digital human's mouth. The method also introduces a probability density map of facial keypoints, which adds modal information to the network and improves generation quality. Together these advances make the digital human's face more realistic and finely detailed while improving the consistency between speech and lip shape, and the team describes the generation quality as reaching an academically leading level. The related paper, "CPNet: Exploiting CLIP-based Attention Condenser and Probability Map Guidance for High-fidelity Talking Face Generation", was accepted by ICME 2023 (IEEE International Conference on Multimedia and Expo), a flagship international multimedia conference.

Transsion's digital human system is already used in multiple business scenarios. It serves as a smart shopping guide in overseas mobile phone stores, giving customers guidance when choosing a phone, and it can also provide smart voice assistant functions for various smart terminal products to improve the user experience. In the future, Transsion will make further use of "AI digital human" technology across a variety of business scenarios, explore new business forms such as digital human voice assistants and customer service systems, and bring users a new intelligent interactive experience.

Continuing to build underlying AI voice capabilities

With AI technology developing rapidly, algorithmically generated and forged audio can pass for the real thing, and ordinary users find it very hard to tell genuine audio from fake. To maintain the credibility of information and safeguard public security, voice forgery detection has become crucial and has emerged as a new research direction in artificial intelligence. Focusing on the business scenarios of its smart terminal products and guided by the needs of local users, Transsion continues to extend its underlying AI voice capabilities into new technical fields and has made major breakthroughs in voice forgery detection.

At IJCAI 2023 (the 32nd International Joint Conference on Artificial Intelligence), Transsion's AI Technology Department won second place in the "Manipulation Region Location" track of the Second Audio Deepfake Detection Challenge (ADD). During the competition, the department independently developed AI models and algorithms that can accurately identify and locate tampered speech within an audio recording, helping to protect the originality and authenticity of digital audio and offering new ideas for AI applications and information security. The related academic paper was published at the IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis (DADA 2023).

As a next step, Transsion's AI Technology Department will continue to explore applications of audio deepfake detection in Transsion's smart terminal products, such as call fraud checks that protect user privacy and security, and will keep improving the user experience.

In the future, Transsion will continue to invest in AI voice and multimodal technology. Focusing on the needs of its core businesses ("mobile phones, Internet services, home appliances and digital accessories") and drawing on deep insight into emerging markets and local consumers, it aims to provide users with smart life experiences that meet their needs and to form a localized AI content service ecosystem that serves multilingual, multi-scenario, personalized and intelligent applications.

Statement:
This article is reproduced from sohu.com. In case of infringement, please contact admin@php.cn for removal.