
Low-quality multi-modal data fusion, multiple institutions jointly published a review paper

Multimodal fusion is one of the fundamental tasks in multimodal intelligence.

The motivation behind multimodal fusion is to jointly exploit the effective information from different modalities to improve the accuracy and stability of downstream tasks. Traditional multimodal fusion methods typically rely on high-quality data and struggle to adapt to the complex, low-quality multimodal data encountered in real applications.

The survey "Multimodal Fusion on Low-quality Data: A Comprehensive Survey", jointly released by Tianjin University, Renmin University of China, the Agency for Science, Technology and Research (A*STAR, Singapore), Sichuan University, Xidian University, and Harbin Institute of Technology (Shenzhen), introduces the challenges of fusing multimodal data from a unified perspective and reviews existing fusion methods for low-quality multimodal data as well as promising directions for future development.
arXiv link:
http://arxiv.org/abs/2404.18947
awesome-list link:
https://github.com/QingyangZhang/awesome-low-quality-multimodal-learning

Traditional multimodal fusion models

Humans perceive the world by fusing information from multiple modalities.

Humans can still process such low-quality multimodal signals and perceive the environment even when the signals of some modalities are unreliable.

Although multimodal learning has made great progress, multimodal machine learning models still lack the ability to effectively fuse low-quality multimodal data in the real world. In practice, the performance of traditional multimodal fusion models degrades significantly in the following scenarios:

(1) Noisy multimodal data: some features of some modalities are corrupted by noise and lose their original information. In the real world, unknown environmental factors, sensor failures, and signal loss during transmission can all introduce noise, undermining the reliability of multimodal fusion models.

(2) Incomplete multimodal data: due to various practical factors, some modalities of the collected multimodal samples may be missing. For example, in the medical field, the multimodal data formed by a patient's various physiological examinations may be severely incomplete, since some patients may never have undergone a particular examination.

(3) Imbalanced multimodal data: the heterogeneous encoding properties of different modalities and the differences in their information quality give rise to imbalanced learning across modalities. During fusion, the model may rely too heavily on certain modalities and ignore potentially useful information contained in the others.

(4) Dynamically low-quality multimodal data: because application environments are complex and changeable, modality quality varies dynamically across samples, time, and space. The occurrence of low-quality modality data is often hard to predict in advance, which poses challenges for multimodal fusion.

To fully characterize the nature of low-quality multimodal data and the ways of handling it, this article summarizes current machine learning methods for low-quality multimodal fusion, systematically reviews the development of the field, and looks ahead to problems that call for further research.


Figure 1. Schematic of the categories of low-quality multimodal data. Yellow and blue denote two modalities; deeper colors indicate higher quality.

Denoising methods in multimodal fusion

Problem definition:

Noise is one of the most common causes of multimodal data quality degradation.

This article mainly focuses on two types of noise:

(1) Feature-level, modality-specific noise. This type of noise may be caused by factors such as sensor errors (e.g., instrument errors in medical diagnosis) or environmental factors (e.g., rain and fog in autonomous driving), and it is confined to certain features within a specific modality.

(2) Semantic-level cross-modal noise. This type of noise arises from the misalignment of high-level semantics across modalities and is harder to handle than feature-level multimodal noise. Fortunately, owing to the complementarity and informational redundancy across modalities, combining information from multiple modalities for denoising has proven to be an effective strategy during multimodal fusion.

Method classification:

Feature-level multimodal denoising methods depend heavily on the specific modalities involved in the task at hand.

This article takes multimodal image fusion as its main example. In multimodal image fusion, the mainstream denoising methods are weighted fusion and joint variational approaches.

Weighted fusion methods assume that feature noise is random while the true data follows a specific distribution, and they suppress the influence of noise through weighted summation.
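To make the idea concrete, here is a minimal sketch (my own illustration, not code from the survey) of inverse-variance weighted fusion: if each modality observes the same latent signal corrupted by zero-mean Gaussian noise of known variance, weighting each modality by the inverse of its noise variance yields the minimum-variance combination. The variance estimates are assumed to be given, e.g. from sensor calibration or a learned estimator.

```python
import numpy as np

def inverse_variance_fusion(features, noise_vars):
    """Fuse per-modality features by inverse-variance weighting.

    features:   list of same-shape arrays, one per modality.
    noise_vars: per-modality noise-variance estimates (assumed known).
    Under a zero-mean Gaussian noise model, these weights give the
    minimum-variance combination of the modalities.
    """
    weights = 1.0 / np.asarray(noise_vars, dtype=float)
    return np.average(np.stack(features), axis=0, weights=weights)

# Toy example: two modalities observing the same latent signal.
rng = np.random.default_rng(0)
signal = rng.normal(size=128)
mod_a = signal + rng.normal(scale=0.1, size=128)  # low-noise modality
mod_b = signal + rng.normal(scale=1.0, size=128)  # high-noise modality
fused = inverse_variance_fusion([mod_a, mod_b], [0.1**2, 1.0**2])
print("MSE modality a:", np.mean((mod_a - signal) ** 2))
print("MSE fused:     ", np.mean((fused - signal) ** 2))
```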

Joint variational methods extend traditional single-modality variational image denoising: they cast denoising as an optimization problem and exploit complementary information from multiple modalities to improve the result.

Semantic-level cross-modal noise, by contrast, stems from weakly aligned or misaligned multimodal sample pairs.

For example, in multimodal object detection with paired RGB and thermal images, sensor differences mean that although the same target appears in both modalities, its precise position and pose may differ slightly between them (weak alignment), which makes accurate position estimation challenging.

In content-understanding tasks on social media, the semantic information carried by the image and text modalities of a sample (such as a Weibo post) may differ greatly or even be completely unrelated (complete misalignment), which poses an even greater challenge for multimodal fusion. Approaches for handling cross-modal semantic noise include rule-based filtering, model-based filtering, and noise-robust model regularization.
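As a hedged illustration of model-based filtering, the sketch below scores image-text pairs with embeddings from some pretrained cross-modal encoder and discards pairs whose similarity falls below a threshold. The encoder and the threshold value are assumptions for illustration, not a specific method from the survey.

```python
import numpy as np

def filter_noisy_pairs(image_embs, text_embs, threshold=0.2):
    """Model-based filtering of semantically misaligned pairs.

    image_embs, text_embs: (N, d) embeddings from any pretrained
    cross-modal encoder (hypothetical here); threshold is illustrative.
    Returns the indices of pairs judged well-aligned enough to keep.
    """
    a = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    b = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = np.sum(a * b, axis=1)           # cosine similarity per pair
    return np.where(sims >= threshold)[0]  # drop weakly aligned pairs
```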

Future outlook:

Although the handling of data noise has long been studied extensively in classic machine learning tasks, how to jointly exploit the complementarity and consistency between modalities to mitigate the impact of noise in multimodal scenarios remains an open research problem.

In addition, unlike traditional feature-level denoising, how to handle semantic-level noise during the pre-training and inference of large multimodal models is an interesting and highly challenging question.


Table 1. Classification of multimodal fusion methods for noise

Fusion methods for missing multimodal data

Problem definition:

Multimodal data collected in real scenarios are often incomplete. Owing to factors such as storage-device damage and unreliable data transmission, multimodal data frequently and unavoidably lose part of their modality information.

For example, in recommender systems, a user's browsing history and credit rating constitute multimodal data, but due to permission and privacy constraints it is often impossible to collect the information of all modalities for every user when building a multimodal learning system.

In medical diagnosis, limited equipment in some hospitals and the high cost of specific examinations mean that the multimodal diagnostic data of different patients are often highly incomplete.

Method classification:

According to "Whether it is necessary to explicitly correct missing multi-mode Based on the classification principle of "completing modal data", missing multi-modal data fusion methods can be divided into:

(1) Completion-based multimodal fusion methods.

Completion-based multimodal fusion methods include model-agnostic completion, for example directly filling missing modalities with zeros or with mean values (see the sketch after this list);

graph- or kernel-based completion: rather than directly learning to complete the raw multimodal data, these methods construct a graph or kernel for each modality, learn similarity or correlation information between sample pairs, and use it to complete the missing data;

and direct completion at the raw feature level: some methods employ generative models, such as the generative adversarial network (GAN) and its variants, to complete the missing features directly.
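To make the model-agnostic strategies concrete, here is a minimal imputation sketch (my own illustration, assuming a binary mask that records which modalities were observed):

```python
import numpy as np

def impute_missing(batch, mask, strategy="mean"):
    """Model-agnostic completion of missing modalities.

    batch: (N, M, d) array: N samples, M modalities, d-dim features;
           missing entries may hold arbitrary placeholder values.
    mask:  (N, M) boolean array, True where a modality was observed.
    Assumes every modality is observed in at least one sample when
    strategy == "mean".
    """
    filled = batch.copy()
    for m in range(batch.shape[1]):
        missing = ~mask[:, m]
        if strategy == "zero":
            filled[missing, m] = 0.0
        else:
            # Fill with that modality's mean over the observed samples.
            filled[missing, m] = batch[mask[:, m], m].mean(axis=0)
    return filled
```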

(2) Completion-free multimodal fusion methods.

Unlike completion-based methods, completion-free methods focus on exploiting the useful information contained in the observed modalities to learn the best possible fused representation. Such methods often impose constraints on the unified representation to be learned so that it reflects the full information of the observable modality data, thereby bypassing the completion step during multimodal fusion.
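A completion-free fusion step can be as simple as averaging whichever modality representations are present. The sketch below is an illustrative simplification, not a method from the survey; real methods typically add further constraints on the fused representation.

```python
import numpy as np

def masked_mean_fusion(encodings, mask):
    """Completion-free fusion: average only the observed modalities.

    encodings: (N, M, d) per-modality representations (missing slots
               may contain arbitrary values; they are masked out).
    mask:      (N, M) boolean array, True where the modality exists.
    """
    m = mask[:, :, None].astype(encodings.dtype)
    summed = (encodings * m).sum(axis=1)
    counts = np.clip(m.sum(axis=1), 1.0, None)  # avoid divide-by-zero
    return summed / counts
```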
Future outlook:
Although many methods have been proposed to tackle the incomplete multimodal data fusion problem in classic machine learning tasks such as clustering and classification, some deeper challenges remain.
For example, the quality assessment of the completed data in missing-modality completion schemes is often overlooked.

In addition, strategies that use prior knowledge of missing-data locations to mask out the missing modalities can hardly compensate for the information gap and information imbalance that the missing modalities cause.

Table 2. Classification of fusion methods for missing multimodal data

Balanced multimodal fusion methods

Problem definition:

In multimodal learning, joint training is usually adopted to integrate data from different modalities and improve the model's overall performance and generalization. However, this widely adopted joint-training paradigm, which uses a single unified learning objective, ignores the heterogeneity of the data from different modalities.

On the one hand, the heterogeneity of different modalities in data sources and forms gives them different characteristics, such as convergence speed. This makes it difficult for all modalities to be processed and learned well at the same time, complicating multimodal joint learning;

on the other hand, this difference is also reflected in the quality of the unimodal data. Although all modalities describe the same concept, they differ in how much information they carry about the target event or object. Deep neural networks trained with maximum-likelihood objectives learn greedily, so multimodal models often come to rely on the high-quality modalities that carry strong discriminative information and are easier to learn, while under-modeling the information in the other modalities.

To address these challenges and improve the learning quality of multimodal models, research on balanced multimodal learning has recently attracted widespread attention.

Method classification:

Depending on the aspect being balanced, related methods can be divided into methods based on characteristic differences and methods based on quality differences.

(1) The widely used multimodal joint-training framework often ignores the inherent differences in the learning properties of unimodal data, which can harm model performance. Methods based on characteristic differences start from the differences in each modality's learning characteristics and try to resolve the problem through learning objectives, optimization, and architecture.

(2) Recent research further finds that multimodal models often rely heavily on certain high-quality, information-rich modalities while ignoring the others, so that not all modalities are learned sufficiently. Methods based on quality differences start from this observation and promote balanced utilization of the different modalities through learning objectives, optimization methods, model architectures, and data augmentation, as sketched below.

Table 3. Classification of balanced multimodal data fusion methods

Future outlook:

Balanced multimodal learning methods mainly target the differences in learning characteristics or data quality across modalities that arise from the heterogeneity of multimodal data. These methods propose solutions from perspectives such as learning objectives, optimization methods, model architectures, and data augmentation.

Balanced multimodal learning is currently a flourishing field, and many theoretical and applied directions remain to be fully explored. For example, current methods are mainly limited to typical multimodal tasks, which are mostly discriminative, with only a few generative tasks.

In addition, large multimodal models must also combine modality data of differing quality, so the same imbalance problem objectively arises there. Accordingly, extending existing research or designing new solutions for large multimodal model scenarios is a promising direction.

Dynamic multimodal fusion methods

Problem definition:

Dynamic multimodal data means that modality quality changes dynamically across input samples and scenarios. For example, in autonomous driving, the system perceives the road surface and targets through RGB and infrared sensors. Under good lighting conditions the RGB camera better supports the intelligent system's decisions, because it captures the target's rich texture and color information;

however, at night, when light is insufficient, the perception information provided by the infrared sensor is more reliable. Enabling the model to automatically sense changes in the quality of different modalities, and thereby fuse them accurately and stably, is the core task of dynamic multimodal fusion.
Method classification:

Dynamic multimodal fusion methods can be roughly divided into three categories:

(1) Heuristic dynamic fusion methods. Heuristic dynamic fusion relies on the algorithm designer's understanding of the multimodal model's application scenario and is generally realized by introducing a dynamic fusion mechanism.

For example, in multimodal object detection with cooperating RGB and thermal signals, researchers heuristically designed an illumination-aware module that dynamically evaluates the lighting conditions of the input image and adjusts the fusion weights of the RGB and thermal modalities according to light intensity: under strong illumination, decisions rely mainly on the RGB modality, and otherwise mainly on the thermal modality.
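A minimal sketch of such a heuristic follows, with a hand-written illumination score and gating rule; the illumination-aware modules in the literature are small learned networks, so everything here is an illustrative stand-in.

```python
import numpy as np

def illumination_weight(rgb_image):
    """Crude illumination score: mean brightness mapped to [0, 1].

    A published illumination-aware module would be a small learned
    network; this hand-written proxy only illustrates the idea.
    """
    return float(np.clip(rgb_image.mean() / 255.0, 0.0, 1.0))

def heuristic_fusion(rgb_feat, thermal_feat, rgb_image):
    """Weight RGB features by illumination, thermal by its complement."""
    w = illumination_weight(rgb_image)
    return w * rgb_feat + (1.0 - w) * thermal_feat
```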

(2) Attention-based dynamic fusion methods. Attention-based dynamic fusion focuses mainly on representation-level fusion. The attention mechanism is inherently dynamic, so it lends itself naturally to multimodal dynamic fusion tasks.

Mechanisms such as self-attention, spatial attention, channel attention, and the Transformer are widely used in building multimodal fusion models. Driven by the task objective, such methods automatically learn how to fuse dynamically, and attention-based fusion can adapt to dynamically low-quality multimodal data to some extent even without explicit or heuristic guidance.
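As a bare-bones illustration, the sketch below derives soft fusion weights from the modality features themselves through a learned scoring vector, which is attention over modalities reduced to its simplest form (the parameterization is my own simplification):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fusion(modality_feats, score_weights):
    """Attention over modalities: weights are computed from the data.

    modality_feats: (M, d) array, one feature vector per modality.
    score_weights:  (d,) learned scoring vector (here just given).
    """
    scores = modality_feats @ score_weights  # (M,) relevance scores
    attn = softmax(scores)                   # dynamic, input-dependent
    return attn @ modality_feats             # (d,) fused representation
```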

(3) Uncertainty-aware dynamic fusion methods. Uncertainty-aware dynamic fusion methods usually come with clearer and more interpretable fusion mechanisms. Unlike the complex fusion patterns of attention-based methods, uncertainty-aware dynamic fusion relies on uncertainty estimates for each modality (such as evidence, energy, or entropy) to adapt to low-quality multimodal data.

Specifically, uncertainty estimates can characterize the quality changes of each modality of the input data: when the quality of a modality drops, the uncertainty of decisions based on that modality rises, providing clear guidance for the design of the subsequent fusion mechanism. Moreover, compared with heuristics and attention mechanisms, uncertainty-aware dynamic fusion methods can offer sound theoretical guarantees.
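For instance, one simple uncertainty-aware rule (an illustration of the general idea, not a specific published method) weights each modality's class-probability output by the inverse of its predictive entropy:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector."""
    return -np.sum(p * np.log(p + eps))

def uncertainty_aware_fusion(prob_list):
    """Fuse per-modality class probabilities by confidence.

    prob_list: list of (C,) probability vectors, one per modality.
    Lower entropy (= lower uncertainty) earns a larger fusion weight.
    """
    weights = np.array([1.0 / (entropy(p) + 1e-8) for p in prob_list])
    weights /= weights.sum()
    fused = sum(w * p for w, p in zip(weights, prob_list))
    return fused / fused.sum()  # renormalize for numerical safety
```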

Future outlook:

Although the superiority of uncertainty-aware dynamic fusion methods has been demonstrated both experimentally and theoretically on traditional multimodal fusion tasks, the idea of dynamic fusion also holds great potential for exploration and application in state-of-the-art multimodal models (not limited to fusion models, e.g., CLIP/BLIP).

In addition, dynamic fusion mechanisms with theoretical guarantees are often limited to the decision level; how to make them work at the representation level is also worth thinking about and exploring.
