
Low-Quality Multimodal Data Fusion: Multiple Institutions Jointly Publish a Survey
The AIxiv column is where this site publishes academic and technical content. In recent years, the AIxiv column has received more than 2,000 submissions, covering top laboratories at major universities and companies around the world and effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

Multimodal fusion is one of the fundamental tasks in multimodal intelligence.

The motivation of multimodal fusion is to jointly exploit the useful information in different modalities to improve the accuracy and stability of downstream tasks. Traditional multimodal fusion methods often rely on high-quality data and struggle to cope with the complex, low-quality multimodal data found in real-world applications.

The survey "Multimodal Fusion on Low-quality Data: A Comprehensive Survey", jointly released by Tianjin University, Renmin University of China, the Agency for Science, Technology and Research (Singapore), Sichuan University, Xidian University, and Harbin Institute of Technology (Shenzhen), introduces the challenges of fusing multimodal data from a unified perspective and systematically reviews existing fusion methods for low-quality multimodal data as well as potential future directions in this field.
arXiv link:
http://arxiv.org/abs/2404.18947
awesome-list link:
https://github.com/QingyangZhang/awesome-low-quality-multimodal-learning

Traditional multimodal fusion models

Humans perceive the world by fusing information from multiple modalities.

Humans are able to process such low-quality multimodal signals and perceive the environment even when some modalities are unreliable.

Although multimodal learning has made great progress, multimodal machine learning models still lack the ability to effectively fuse the low-quality multimodal data encountered in the real world. In practice, the performance of traditional multimodal fusion models degrades significantly in the following scenarios:

(1) Noisy multimodal data: some features of certain modalities are corrupted by noise and lose their original information. In the real world, unknown environmental factors, sensor failures, and signal loss during transmission can all introduce noise and thereby undermine the reliability of multimodal fusion models.

(2) Missing multimodal data: due to various practical factors, some modalities of the collected multimodal samples may be missing. For example, in the medical domain, the multimodal data composed of a patient's various physiological examination results may be severely incomplete; some patients may never have undergone a particular examination.

(3) Imbalanced multimodal data: heterogeneous encoding properties and differences in information quality across modalities give rise to imbalanced learning between modalities. During fusion, the model may rely too heavily on certain modalities and ignore the potentially useful information contained in others.

(4) Dynamic low-quality multimodal data: owing to the complexity and variability of the application environment, modality quality changes dynamically across samples, time, and space. The occurrence of low-quality modality data is often hard to predict in advance, which poses challenges for multimodal fusion.

To fully characterize the nature of low-quality multimodal data and how to handle it, this article summarizes current machine learning methods for low-quality multimodal fusion, systematically reviews the development of the field, and highlights open problems that call for further research.


Figure 1. Schematic classification of low-quality multimodal data; yellow and blue represent two modalities, and deeper color indicates higher quality

Denoising methods in multimodal fusion

Problem definition:

Noise is one of the most common causes of multimodal data quality degradation.

This article mainly focuses on two types of noise:

(1) Modality-specific multimodal noise at the feature level. This type of noise may be caused by factors such as sensor errors (e.g., instrument errors in medical diagnosis) or environmental factors (e.g., rain and fog in autonomous driving), and is confined to particular features within a specific modality.

(2) Cross-modal noise at the semantic level. This type of noise arises from the misalignment of high-level semantics between modalities and is harder to handle than feature-level multimodal noise. Fortunately, thanks to the complementarity and information redundancy across modalities, combining information from multiple modalities has proven to be an effective denoising strategy during multimodal fusion.

Method classification:

Feature-level multimodal denoising methods depend heavily on the specific modalities involved in the task at hand.

This article takes multimodal image fusion as a running example. In multimodal image fusion, mainstream denoising methods include weighted fusion and joint variational approaches.

Weighted fusion methods assume that feature noise is random while the clean data follows a specific distribution, and suppress the influence of noise through weighted summation;

Joint variational methods extend traditional single-modal variational image denoising: they cast denoising as an optimization problem and exploit complementary information from multiple modalities to improve the denoising result.

Semantic-level cross-modal noise results from weakly aligned or misaligned multimodal sample pairs.
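As a concrete illustration, the weighted-fusion idea can be sketched as an inverse-variance combination of two noisy observations: if each modality's noise is assumed zero-mean Gaussian with known variance, weighting each modality by its inverse noise variance suppresses the noisier source. This is a minimal sketch, not a method from the survey; real systems would have to estimate the noise variances rather than assume them.

```python
import numpy as np

def inverse_variance_fusion(x_a, x_b, var_a, var_b):
    """Weighted fusion of two noisy observations of the same scene.
    The modality with higher noise variance receives the smaller weight."""
    w_a = (1.0 / var_a) / (1.0 / var_a + 1.0 / var_b)
    return w_a * x_a + (1.0 - w_a) * x_b

# Toy example: two modalities observe the same clean signal with different noise.
rng = np.random.default_rng(0)
clean = np.ones((8, 8))
x_a = clean + rng.normal(0.0, 0.1, size=(8, 8))   # low-noise modality
x_b = clean + rng.normal(0.0, 0.5, size=(8, 8))   # high-noise modality
fused = inverse_variance_fusion(x_a, x_b, 0.1 ** 2, 0.5 ** 2)
```

With these settings the fused estimate leans on the low-noise modality while still averaging in the other, so its error stays well below that of the noisier input.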

For example, in multimodal object detection with paired RGB and thermal images, the same object appears in both modalities, but due to sensor differences its precise position and pose may differ slightly between modalities (weak alignment), which makes accurate localization challenging.

In content understanding for social media, the semantic information carried by the image and text modalities of a sample (e.g., a Weibo post) may differ greatly or even be completely unrelated (complete misalignment), which poses an even greater challenge to multimodal fusion. Approaches to handling cross-modal semantic noise include rule-based filtering, model-based filtering, and noise-robust model regularization.
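A minimal sketch of model-based filtering: score each image-text pair by the cosine similarity of embeddings from some pretrained bi-encoder and drop pairs below a threshold. The embeddings and the threshold value here are illustrative assumptions, not taken from the survey.

```python
import numpy as np

def filter_misaligned_pairs(img_emb, txt_emb, threshold=0.2):
    """Keep image-text pairs whose embedding cosine similarity is at least
    `threshold`; low-similarity pairs are treated as cross-modal semantic noise.
    img_emb, txt_emb: (n_pairs, dim) arrays from pretrained encoders."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = (img * txt).sum(axis=1)        # per-pair cosine similarity
    return sims >= threshold, sims

# Pair 0 is perfectly aligned; pair 1 is orthogonal (completely misaligned).
img_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
txt_emb = np.array([[1.0, 0.0], [1.0, 0.0]])
keep, sims = filter_misaligned_pairs(img_emb, txt_emb)
```

In practice the threshold trades off data cleanliness against the amount of training data retained.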

Future Outlook:

Although handling data noise has long been studied extensively in classical machine learning tasks, how to jointly exploit cross-modal complementarity and consistency to mitigate the impact of noise in multimodal settings remains an open research problem.

In addition, unlike traditional feature-level denoising, how to handle semantic-level noise during the pre-training and inference of large multimodal models is an interesting and highly challenging question.


Table 1. Classification of multimodal fusion methods for noisy data

Fusion methods for missing multimodal data

Problem definition:

Multimodal data collected in real scenarios is often incomplete. Due to factors such as storage-device damage and unreliable data transmission, multimodal data frequently and unavoidably loses part of its modality information.

For example, in recommender systems, a user's browsing history and credit rating constitute multimodal data; however, due to permission and privacy constraints, it is often impossible to collect complete user information from all modalities when building a multimodal learning system.

In medical diagnosis, because some hospitals have limited equipment and certain examinations are expensive, the multimodal diagnostic data of different patients is often highly incomplete.

Method classification:

According to whether the missing modality data must be explicitly completed, fusion methods for missing multimodal data can be divided into:

(1) Completion-based multimodal fusion methods

Completion-based methods include model-agnostic completion, e.g., directly filling a missing modality with zeros or with the mean computed from the observed modalities;

graph- or kernel-based completion: instead of directly completing the raw multimodal data, these methods construct a graph or kernel for each modality, learn similarity or correlation information between sample pairs, and then complete the missing data;

and direct completion at the raw feature level: some methods use generative models, such as generative adversarial networks (GANs) and their variants, to directly complete the missing features.
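The model-agnostic zero/mean filling described above can be sketched in a few lines; `impute_missing_modality` is a hypothetical helper operating on one modality's feature matrix, not an API from the survey.

```python
import numpy as np

def impute_missing_modality(X, missing_mask, strategy="mean"):
    """Fill the rows of one modality's feature matrix that are missing.
    X: (n_samples, n_features); missing_mask: True where the modality is absent."""
    X = X.copy()
    if strategy == "zero":
        X[missing_mask] = 0.0
    else:  # feature-wise mean over the samples where the modality is observed
        X[missing_mask] = X[~missing_mask].mean(axis=0)
    return X

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [0.0, 0.0]])                 # third sample's modality is missing
missing = np.array([False, False, True])
X_filled = impute_missing_modality(X, missing, strategy="mean")
```

Such filling is cheap but ignores cross-modal correlations, which is exactly the gap the graph/kernel and generative approaches above try to close.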

(2) Completion-free multimodal fusion methods.

Unlike completion-based methods, completion-free methods focus on using the information contained in the observed modalities to learn the best possible fused representation. Such methods typically impose constraints on the shared representation to be learned, so that it reflects the complete information of the observable modality data, thereby bypassing explicit completion during multimodal fusion.
Future Outlook:
Although many methods have been proposed to address incomplete multimodal data fusion in classical machine learning tasks such as clustering and classification, deeper challenges remain.
For example, the quality assessment of the completed data in missing-modality completion schemes is often overlooked.

In addition, strategies that use prior knowledge of missing-data locations to mask the missing modality can hardly make up for the information gap and imbalance caused by the missing modality.

Table 2. Classification of fusion methods for missing multimodal data

Balanced multimodal fusion methods

Problem definition:

In multimodal learning, joint training is usually used to integrate data from different modalities and improve the model's overall performance and generalization. However, this widely adopted joint-training paradigm with a unified learning objective ignores the heterogeneity of data across modalities.

On the one hand, the heterogeneity of modalities in data sources and forms gives them different characteristics, for example in convergence speed. This makes it difficult for all modalities to be processed and learned well simultaneously, which complicates multimodal joint learning;

On the other hand, this difference is also reflected in the quality of unimodal data. Although all modalities describe the same concept, they differ in how much information they carry about the target event or object. Deep neural networks trained with maximum-likelihood objectives exhibit greedy learning behavior, so multimodal models often rely on high-quality modalities that are highly discriminative and easier to learn, while under-modeling the information in other modalities.

To address these challenges and improve the learning quality of multimodal models, research on balanced multimodal learning has recently received wide attention.

Method classification:

Depending on the aspect being balanced, related methods can be divided into methods based on characteristic differences and methods based on quality differences.

(1) The widely used multimodal joint-training framework often ignores the inherent differences in the learning properties of unimodal data, which can hurt model performance. Methods based on characteristic differences start from the differences in each modality's learning characteristics and attempt to address the problem through learning objectives, optimization, and architecture.

(2) Recent studies further find that multimodal models often rely heavily on certain high-quality modalities while ignoring others, leaving some modalities under-learned. Methods based on quality differences start from this observation and try to promote the balanced utilization of different modalities through learning objectives, optimization methods, model architecture, and data augmentation.

Table 3. Classification of balanced multimodal data fusion methods

Future outlook:

Balanced multimodal learning methods mainly target the differences in learning characteristics or data quality across modalities caused by the heterogeneity of multimodal data, proposing solutions from perspectives such as learning objectives, optimization methods, model architecture, and data augmentation.

Balanced multimodal learning is a fast-growing field, with many theoretical and applied directions still unexplored. For example, current methods are mostly limited to typical multimodal tasks, predominantly discriminative tasks and a few generative tasks.

In addition, large multimodal models must also combine modality data of varying quality and face the same imbalance problem. It is therefore worthwhile to extend existing research, or design new solutions, for large multimodal model scenarios.

Dynamic multimodal fusion methods

Problem definition:

Dynamic multimodal data means that modality quality changes dynamically with different input samples and scenarios. For example, in autonomous driving, the system perceives the road and targets through RGB and infrared sensors. Under good lighting conditions, the RGB camera better supports the intelligent system's decisions because it captures rich texture and color information about the target;

however, at night when light is insufficient, the perception provided by the infrared sensor is more reliable. Enabling the model to automatically sense quality changes across modalities, and thus fuse them accurately and stably, is the core task of dynamic multimodal fusion methods.
Method classification:

Dynamic multimodal fusion methods can be roughly divided into three categories:

(1) Heuristic dynamic fusion methods:

Heuristic dynamic fusion methods rely on the algorithm designer's understanding of the application scenario and are generally realized by introducing a dynamic fusion mechanism.

For example, in multimodal object detection with cooperating RGB/thermal signals, researchers heuristically designed an illumination-aware module that dynamically evaluates the illumination of the input image and adjusts the fusion weights of the RGB and thermal modalities according to the light intensity: under bright conditions the decision relies mainly on the RGB modality, and under dark conditions mainly on the thermal modality.
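The illumination-aware gating can be sketched as follows; the brightness-based score is a simplified stand-in for the learned illumination module described in the literature, and the [0, 255] pixel range is an assumption.

```python
import numpy as np

def illumination_score(rgb_img):
    """Crude illumination estimate in [0, 1]: mean pixel brightness of the
    RGB image (pixel values assumed to lie in [0, 255])."""
    return float(np.clip(rgb_img.mean() / 255.0, 0.0, 1.0))

def fuse_rgb_thermal(feat_rgb, feat_thermal, rgb_img):
    """Brighter scene -> larger weight on the RGB features;
    darker scene -> larger weight on the thermal features."""
    w = illumination_score(rgb_img)
    return w * feat_rgb + (1.0 - w) * feat_thermal

feat_rgb, feat_thermal = np.ones(4), np.zeros(4)
daylight = np.full((8, 8), 255.0)      # fully lit scene -> trust RGB
night = np.zeros((8, 8))               # dark scene -> trust thermal
fused_day = fuse_rgb_thermal(feat_rgb, feat_thermal, daylight)
fused_night = fuse_rgb_thermal(feat_rgb, feat_thermal, night)
```

The gate is hand-designed, which is exactly what distinguishes heuristic fusion from the learned mechanisms in the next two categories.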

(2) Attention-based dynamic fusion methods:

Attention-based dynamic fusion methods mainly focus on representation-level fusion. The attention mechanism is inherently dynamic, so it lends itself naturally to dynamic multimodal fusion.

Mechanisms such as self-attention, spatial attention, channel attention, and Transformers are widely used to build multimodal fusion models. Driven by the task objective, such methods automatically learn how to fuse dynamically. Attention-based fusion can, to some extent, adapt to dynamic low-quality multimodal data without explicit or heuristic guidance.
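The core computation behind such methods is sketched below as single-head cross-modal attention, with the learned projection matrices omitted for brevity; this is a generic sketch, not a specific architecture from the survey.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys, values):
    """Tokens of one modality (queries) attend to tokens of another modality
    (keys/values); each output row is a convex combination of value rows,
    with weights that depend dynamically on the input features."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_k) similarity scores
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return attn @ values                     # (n_q, d_v)

rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(2, 4))    # queries from one modality
image_tokens = rng.normal(size=(3, 4))   # keys/values from the other modality
out = cross_modal_attention(text_tokens, image_tokens, image_tokens)
```

Because the attention weights are recomputed per input, low-quality tokens can receive small weights without any explicit quality signal.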

(3) Uncertainty-aware dynamic fusion methods:

Uncertainty-aware dynamic fusion methods often have clearer, more interpretable fusion mechanisms. Unlike complex attention-based fusion, they rely on uncertainty estimates of the modalities (such as evidence, energy, or entropy) to adapt to low-quality multimodal data.

Specifically, uncertainty estimates can characterize quality changes in each modality of the input: when a modality of the input sample becomes low quality, the model's decision based on that modality becomes more uncertain, providing clear guidance for the subsequent fusion mechanism. Moreover, compared with heuristics and attention mechanisms, uncertainty-aware dynamic fusion can come with sound theoretical guarantees.
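A minimal decision-level sketch: weight each modality's class probabilities by the inverse of its predictive entropy, so the more uncertain modality contributes less. Entropy is just one of the uncertainty measures mentioned above, and this simple inverse rule is illustrative rather than a specific published method.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a categorical distribution."""
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())

def uncertainty_weighted_fusion(probs_a, probs_b, eps=1e-12):
    """Decision-level fusion: lower entropy (more confident) -> larger weight."""
    inv_a = 1.0 / (entropy(probs_a) + eps)
    inv_b = 1.0 / (entropy(probs_b) + eps)
    w_a = inv_a / (inv_a + inv_b)
    fused = w_a * probs_a + (1.0 - w_a) * probs_b
    return fused / fused.sum()             # renormalize for safety

confident = np.array([0.98, 0.01, 0.01])   # reliable modality
uncertain = np.array([0.34, 0.33, 0.33])   # degraded modality
fused = uncertainty_weighted_fusion(confident, uncertain)
```

The fused prediction follows the confident modality, illustrating how an uncertainty signal gives the fusion rule an explicit, interpretable handle on modality quality.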

Future Outlook:

Although the superiority of uncertainty-aware dynamic fusion has been demonstrated both experimentally and theoretically in traditional multimodal fusion tasks, the idea of dynamic fusion also has great potential for exploration and application in state-of-the-art multimodal models (not limited to fusion models, e.g., CLIP/BLIP).

In addition, dynamic fusion mechanisms with theoretical guarantees are often limited to the decision level; how to make them work at the representation level is also worth exploring.


Statement
This article is reproduced from 机器之心 (Machine Heart).