
CVPR 2024 | A general image fusion model based on MoE, adding 2.8% parameters to complete multiple tasks



AIxiv is a column where this site publishes academic and technical content. Over the past few years, it has carried more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to contribute or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com.


  • Paper title: Task-Customized Mixture of Adapters for General Image Fusion
  • Paper link: https://arxiv.org/abs/2403.12494
  • Code link: https://github.com/YangSun22/TC-MoA

Figure 1: The dominance strength of the fusion results across different fusion tasks.

## Research background and motivation

Image fusion aims to integrate the complementary information of multi-source images, captured by different sensors in the same scene, into a single image. It is typically used to extract the important information from the source images and improve visual quality.

Currently, general image fusion mainly covers multi-modal, multi-exposure, and multi-focus image fusion, and these tasks exhibit different fusion mechanisms. Multi-exposure image fusion (MEF) focuses on converting an image sequence with multiple exposure levels into a single high-quality, fully exposed image; each source image contributes its own lighting and structural information to the fused result. Visible-infrared image fusion (VIF) is a type of multi-modal image fusion (MMF) that aims to fuse complementary information from the infrared and visible modalities into a robust, information-rich fused image; infrared images provide more intensity information, while visible images provide more texture and gradient information. Multi-focus image fusion (MFF) aims to generate a fully focused image from a series of partially focused images; each clear region of the fused image usually needs to be learned from only one source image. It can therefore be observed that MEF and VIF are relatively balanced fusions of multiple sources, whereas MFF has a far more extreme multi-source bias, often showing a polarized selection for each region of the image.

With the rapid development of deep learning, great progress has been made in image fusion in recent years, but most existing methods focus on a single fusion scenario. They usually adopt task-specific strategies, such as a complex network or a loss function designed for one particular task, which makes them impossible to apply directly to other tasks. Considering that the essence of different fusion tasks is the same, namely integrating important information from multiple source images, some recently proposed methods try to handle multiple fusion tasks with a unified model and build a general image fusion framework. However, these methods either suffer from task-dominant bias or sacrifice task individuality for multi-task commonality, resulting in suboptimal performance. This motivates us to explore a more compatible fusion paradigm that can adapt dynamically to different fusion scenarios.

To tackle this challenge, inspired by the powerful feature representation capability of pre-trained foundation models, we introduce a foundation model as a frozen encoder to extract the complementary features of the multi-source images. Unlike most existing methods, we draw on the idea of Mixture of Experts (MoE) and treat each expert as an efficient, fine-tuned adapter that performs adaptive visual-feature prompt fusion on top of the foundation model. Task-specific routing networks tailor a mixture of these adapters to generate task-specific fusion prompts for the different sources, forming the new Task-Customized Mixture of Adapters (TC-MoA) architecture. In addition, we design a mutual information regularization to constrain the fusion prompts, ensuring their complementarity across sources. Notably, the fusion prompts show significant task bias and differences in modality dominance strength. As shown in Figure 1, MFF prompts have larger color differences than those of VIF and MEF, indicating that feature selection is more bipolar in the intensity bias of the dominant source. Our model effectively perceives the fusion-strength bias between different fusion tasks within a single model and is therefore compatible with a wider range of fusion tasks.

Extensive experiments verify the superiority of our method in general image fusion, including multi-modal, multi-exposure, and multi-focus fusion. More importantly, TC-MoA shows remarkable controllability and generalization, even to unseen fusion tasks, fully demonstrating its potential in a wider range of fusion scenarios.

## Main contributions

  • We propose a unified general image fusion model that provides a new Task-Customized Mixture of Adapters (TC-MoA) for adaptive multi-source image fusion, benefiting from dynamically aggregating the effective information of each source.
  • We propose a mutual information regularization for the adapters, which enables the model to more accurately identify the dominance strength of the different source images.
  • To the best of our knowledge, we are the first to propose an MoE-based flexible adapter. By adding only 2.8% learnable parameters, our model can handle many fusion tasks. Extensive experiments demonstrate its advantages over competing methods, together with significant controllability and generalization.

## Core method

As shown in Figure 2, given a pair of source images, the network integrates the complementary information from the different sources to obtain the fused image. We feed the source images into a ViT network and obtain their tokens through a patch-embedding layer. The ViT consists of an encoder for feature extraction and a decoder for image reconstruction, both composed of Transformer blocks.

In the encoder and decoder, a TC-MoA module is inserted after every fixed number of Transformer blocks, and the network progressively modulates the fusion result through these modules. Each TC-MoA consists of a task-specific router bank, a task-shared adapter bank, and a prompt fusion layer F, and it operates in two main stages: prompt generation and prompt-driven fusion. For ease of exposition, we take VIF as an example, assume the input comes from a VIF dataset, and use G to denote the corresponding task-specific router.
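To make this structure concrete, the following is a minimal sketch (not the authors' implementation) of interleaving lightweight trainable modules with a frozen Transformer backbone. The class names, the `insert_every` interval, and all dimensions are illustrative assumptions; the printed trainable-parameter fraction is only for this toy configuration, not the paper's 2.8% figure.

```python
import torch
import torch.nn as nn

class TinyFusionAdapter(nn.Module):
    """Stand-in for a TC-MoA-style module: a small trainable bottleneck on top of frozen features."""
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, tokens):
        return tokens + self.up(self.act(self.down(tokens)))

class FrozenViTWithAdapters(nn.Module):
    """Frozen Transformer blocks with a lightweight module inserted every `insert_every` blocks."""
    def __init__(self, dim=256, depth=8, insert_every=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True) for _ in range(depth)]
        )
        for p in self.blocks.parameters():   # the pre-trained backbone stays frozen
            p.requires_grad_(False)
        self.adapters = nn.ModuleDict(
            {str(i): TinyFusionAdapter(dim) for i in range(insert_every - 1, depth, insert_every)}
        )

    def forward(self, tokens):
        for i, blk in enumerate(self.blocks):
            tokens = blk(tokens)
            if str(i) in self.adapters:      # progressive modulation by the inserted modules
                tokens = self.adapters[str(i)](tokens)
        return tokens

model = FrozenViTWithAdapters()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction in this toy setup: {trainable / total:.3%}")
```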

Figure 2: Overall architecture of TC-MoA.

Prompt generation. First, multi-source features are obtained for subsequent processing: the network in front of the j-th TC-MoA extracts the prompt-generation features of each source, and we concatenate them as the representation of multi-source token pairs. This allows tokens from different sources to exchange information in the subsequent network. However, operating directly on the high-dimensional concatenated features would introduce a large number of unnecessary parameters, so we reduce the feature dimension to obtain the processed multi-source feature Φ.
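A minimal sketch of this step, assuming two token streams of equal shape; the layer name `down_proj` and all dimensions are illustrative, not the paper's exact values.

```python
import torch
import torch.nn as nn

dim, reduced = 256, 64
down_proj = nn.Linear(2 * dim, reduced)        # dimensionality reduction of the concatenated pair

tokens_a = torch.randn(1, 196, dim)            # tokens of source A before the j-th TC-MoA
tokens_b = torch.randn(1, 196, dim)            # tokens of source B

pair = torch.cat([tokens_a, tokens_b], dim=-1) # multi-source token pairs, shape (1, 196, 512)
phi = down_proj(pair)                          # processed multi-source feature Φ, shape (1, 196, 64)
print(phi.shape)
```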

Then, according to the task that Φ belongs to, we select the corresponding task-specific router from the router bank to customize the routing scheme, i.e., to decide which adapters in the adapter bank each pair of source tokens should be dispatched to.


Finally, we take a weighted sum of the adapters' outputs to obtain the fusion prompts. Each router has its own task preference and customizes an appropriate mixture of adapters, from which the prompts are generated.
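The sketch below shows one way a task-specific router could mix a task-shared adapter bank into per-source prompts. The soft (dense) routing, the final softmax over the two sources, and all shapes are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdapterBankMoE(nn.Module):
    """Task-shared adapter bank mixed by a task-specific router (soft MoE over token pairs)."""
    def __init__(self, in_dim=64, num_adapters=4, num_tasks=3):
        super().__init__()
        self.adapters = nn.ModuleList(   # shared across all tasks
            [nn.Sequential(nn.Linear(in_dim, 32), nn.GELU(), nn.Linear(32, 2)) for _ in range(num_adapters)]
        )
        self.routers = nn.ModuleList(    # one router per task (e.g., VIF / MEF / MFF)
            [nn.Linear(in_dim, num_adapters) for _ in range(num_tasks)]
        )

    def forward(self, phi, task_id):
        weights = F.softmax(self.routers[task_id](phi), dim=-1)        # (B, N, num_adapters)
        outputs = torch.stack([a(phi) for a in self.adapters], dim=-2) # (B, N, num_adapters, 2)
        mixed = (weights.unsqueeze(-1) * outputs).sum(dim=-2)          # weighted sum of adapter outputs
        prompts = F.softmax(mixed, dim=-1)                             # per-source prompts that sum to 1
        return prompts[..., 0:1], prompts[..., 1:2]                    # prompt for source A, prompt for source B

moe = AdapterBankMoE()
phi = torch.randn(1, 196, 64)
p_a, p_b = moe(phi, task_id=0)                       # task 0 stands for VIF in this toy example
print(p_a.shape, float((p_a + p_b).mean()))          # the two prompts are complementary per token
```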

Prompt-driven fusion. The task-customized prompts are subject to mutual information regularization (MIR), which guarantees their complementarity across sources; the prompts can therefore be regarded as estimates of the proportion of important information contributed by each source. Through the element-wise product of the multi-source features and the prompts, we retain complementary information while removing redundant information. Then, considering that the feature representation should carry a source-dependent bias (for example, visible versus infrared), we introduce input-independent learnable parameters for each source, i.e., source encodings s. After the features are modulated by the prompts and source encodings, we obtain the refined source features and pass them through the fusion layer F to obtain the fused feature.
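A hedged sketch of how the prompts and source encodings could modulate the source tokens before the fusion layer F; the element-wise product, the additive source encodings, and the linear fusion layer are assumptions consistent with the description above, not the authors' exact code.

```python
import torch
import torch.nn as nn

dim = 256
fusion_layer = nn.Linear(2 * dim, dim)            # fusion layer F over the refined token pair
source_encoding_a = nn.Parameter(torch.zeros(dim))  # input-independent source encoding s_A
source_encoding_b = nn.Parameter(torch.zeros(dim))  # input-independent source encoding s_B

tokens_a = torch.randn(1, 196, dim)
tokens_b = torch.randn(1, 196, dim)
p_a = torch.rand(1, 196, 1)                       # prompt for source A
p_b = 1.0 - p_a                                   # complementary prompt for source B (see MIR below)

refined_a = tokens_a * p_a + source_encoding_a    # keep the information the prompt deems important
refined_b = tokens_b * p_b + source_encoding_b
fused = fusion_layer(torch.cat([refined_a, refined_b], dim=-1))   # fused feature from layer F
print(fused.shape)
```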

Finally, we obtain a fused feature driven by task-customized prompts. To encourage the model to extract important information progressively, the features passed to the next Transformer block are a combination of the fused feature and the original source features, weighted by a hyperparameter.
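One plausible form of this progressive update (an assumption for illustration, since the exact formula is not reproduced here) blends the fused feature back into each source branch with a hyperparameter λ:

```python
import torch

lam = 0.1                              # hyperparameter: how much fused information is injected per block
tokens_a = torch.randn(1, 196, 256)
tokens_b = torch.randn(1, 196, 256)
fused = torch.randn(1, 196, 256)       # fused feature from the current TC-MoA

# each source branch absorbs a fraction of the fused feature before the next Transformer block
next_a = (1 - lam) * tokens_a + lam * fused
next_b = (1 - lam) * tokens_b + lam * fused
```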

Mutual information regularization. To ensure that the model dynamically retains complementary information while discarding redundant information from the multi-source features, we impose a regularization constraint on the prompts. Assuming the feature representation changes linearly, MIR constrains the prompts of the different sources to remain complementary.
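The exact regularizer is given in the paper; as a hedged sketch consistent with the complementarity described above, one could penalize prompt pairs whose element-wise sum deviates from one:

```python
import torch

def mir_loss(p_a: torch.Tensor, p_b: torch.Tensor) -> torch.Tensor:
    """Toy regularizer (illustrative assumption): the two sources' prompts should behave like
    proportions of important information and therefore sum to one at every token."""
    return ((p_a + p_b) - 1.0).abs().mean()

p_a = torch.rand(1, 196, 1)
p_b = torch.rand(1, 196, 1)
print(mir_loss(p_a, p_b))   # minimizing this drives the prompts toward p_a + p_b = 1
```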

## Experimental results

Qualitative and quantitative experiments. As shown in Figures 3-5 and Tables 1-3, qualitative and quantitative comparisons on the three fusion tasks show that our method surpasses previous general fusion methods. Compared with task-specific methods, our method also reaches state-of-the-art performance on all tasks and even leads on some of them (e.g., VIF), which demonstrates the superiority of the proposed approach.

Figure 3: Qualitative comparison on the LLVIP dataset for the VIF task.

Figure 4: Qualitative comparison on the MEFB dataset for the MEF task.

Figure 5: Qualitative comparison on the MFF task datasets.

Table 1: Quantitative comparison on the VIF task (LLVIP dataset).

Table 2: Quantitative comparison on the MEF task.

Table 3: Quantitative comparison on the MFF task.

Figure 6: Controllability and generalization to unknown tasks.

## Controllability and generalization
As shown in Figure 6, by controlling the hyperparameters α and β of the fusion prompts, we can respectively control the strength of the model's feature selection over the complementary information of the source images (region level) and the similarity between the fused image and a particular source image (image level). The prompts can be linearly transformed to ultimately generate a customized fused image. For known tasks such as multi-exposure fusion, we can obtain customized fusion results that best match human perception; for unknown tasks, we can tune the most appropriate fusion parameters and generalize the model to them.
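As a hedged illustration of this inference-time control (α and β are the knobs described above; the exact transformation in the paper may differ), the complementary prompts can be rescaled and shifted before the fusion step:

```python
import torch

def customize_prompts(p_a: torch.Tensor, alpha: float = 1.0, beta: float = 0.0):
    """Illustrative sketch, not the paper's exact control formula:
    alpha sharpens or softens region-level selection; beta shifts the image-level
    balance toward source A. The pair is kept complementary after the transform."""
    p_a_new = (alpha * (p_a - 0.5) + 0.5 + beta).clamp(0.0, 1.0)
    return p_a_new, 1.0 - p_a_new

p_a = torch.rand(1, 196, 1)
sharper_a, sharper_b = customize_prompts(p_a, alpha=2.0)           # more bipolar region selection
biased_a, biased_b = customize_prompts(p_a, alpha=1.0, beta=0.3)   # pull the result toward source A
```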


Statement: This article is reproduced from jiqizhixin.com. In case of any infringement, please contact admin@php.cn for removal.