CVPR 2024 | A general image fusion model based on MoE, adding 2.8% parameters to complete multiple tasks


The AIxiv column is where this site publishes academic and technical content. In the past few years, the AIxiv column has carried more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please feel free to submit it or contact us. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com.


  • Paper link: https://arxiv.org/abs/2403.12494
  • Code link: https://github.com/YangSun22/TC-MoA
  • Paper title: Task-Customized Mixture of Adapters for General Image Fusion


Figure 1: Dominant strength of the fusion results across different fusion tasks

## Research background and motivation

The purpose of image fusion is to integrate the complementary information of multi-source images, captured by different sensors of the same scene, into a single image. It is commonly used to extract important information from images and improve visual quality.

General image fusion currently covers multi-modal, multi-exposure, and multi-focus image fusion, among others, and these tasks exhibit different fusion mechanisms. Multi-exposure image fusion (MEF) focuses on converting an image sequence with multiple exposure levels into a single high-quality, fully exposed image; each source image contributes its own lighting and structural information to the fused result. Visible-infrared image fusion (VIF) is a type of multi-modal image fusion (MMF) that aims to fuse complementary information from the infrared and visible modalities into a robust, information-rich fused image: infrared images provide more intensity information, while visible images provide more texture and gradient information. Multi-focus image fusion (MFF) aims to generate a fully focused image from a series of partially focused images, where each sharp region of the fused image usually only needs to be learned from one source image. It can therefore be observed that MEF and VIF fuse their sources on relatively equal terms, whereas MFF has a much more extreme imbalance between sources, often showing a polarized selection for a given region of the image.

With the rapid development of deep learning, great progress has been made in image fusion in recent years, but most existing methods focus on a single fusion scenario and adopt task-specific strategies, such as a complex network or a loss function designed for one particular task, which prevents them from being applied directly to other tasks. Considering that the essence of the different fusion tasks is the same, namely integrating important information from multiple source images, some recently proposed methods try to handle multiple fusion tasks with a unified model and build a general image fusion framework. However, these methods either suffer from a dominant-task bias or sacrifice task individuality for multi-task commonality, resulting in suboptimal performance. This motivates us to explore a more compatible fusion paradigm that can adapt dynamically to different fusion scenarios.

To address this challenge, inspired by the powerful feature representation of pre-trained foundation models, we introduce a foundation model as a frozen encoder to extract the complementary features of the multiple source images. Unlike most existing methods, we draw on the idea of Mixture of Experts (MoE) and treat each expert as an efficient, fine-tuned adapter that performs adaptive visual-feature prompt fusion on top of the foundation model. Task-specific routing networks tailor a mixture of these adapters to generate task-specific fusion prompts for the different sources, forming the new Task-Customized Mixture of Adapters (TC-MoA) architecture. In addition, we design a mutual information regularization to constrain the fusion prompts, ensuring complementarity with respect to the different sources. Notably, the fusion prompts show significant differences in task bias and modality dominance: as shown in Figure 1, the MFF prompts exhibit larger color differences than those of VIF and MEF, indicating that MFF's feature selection is more bipolar in the intensity bias of the dominant modality. Our model effectively perceives these differences in fusion strength between tasks within a single model and is therefore compatible with a wider range of fusion tasks.

Extensive experiments verify the superiority of our method in general image fusion, including multi-modal, multi-exposure, and multi-focus fusion. More importantly, TC-MoA shows notable controllability and generalization even on unknown fusion tasks, fully demonstrating its potential in a wider range of fusion scenarios.

Main Contributions

  • We propose a unified general image fusion model that provides a new task-customized mixture of adapters (TC-MoA) for adaptive multi-source image fusion, benefiting from dynamically aggregating the effective information of the respective sources.
  • We propose a mutual information regularization for the adapters, which enables the model to identify the dominant intensity of the different source images more accurately.
  • To the best of our knowledge, we are the first to propose a MoE-based flexible adapter. By adding only 2.8% learnable parameters, our model can handle many fusion tasks. Extensive experiments demonstrate its advantages over competing methods, along with significant controllability and generalization.

Core method

As shown in Figure 2, given a pair of source images, the network integrates complementary information from the different sources to obtain the fused image. We feed the source images into a ViT network and obtain their tokens through a patch-embedding layer. The ViT consists of an encoder for feature extraction and a decoder for image reconstruction, both composed of Transformer blocks.

In the encoder and decoder, a TC-MoA is inserted after every fixed number of Transformer blocks, and the network progressively modulates the fusion result through these TC-MoA modules. Each TC-MoA consists of a task-specific router bank, a task-shared adapter bank, and a prompt fusion layer F, and it operates in two main stages: prompt generation and prompt-driven fusion. For ease of exposition, we take VIF as an example and assume the input comes from the VIF dataset.
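Below is a minimal PyTorch sketch of how adapter modules of this kind could be interleaved with a frozen ViT-style backbone. All names (`FrozenBlock`, `TCMoAStub`, `FusionEncoder`, `insert_every`) and the block layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FrozenBlock(nn.Module):
    """Stand-in for a pre-trained ViT Transformer block (weights kept frozen)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        for p in self.parameters():
            p.requires_grad = False  # the backbone stays frozen; only adapters are trained

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

class TCMoAStub(nn.Module):
    """Placeholder for the trainable TC-MoA module sketched further below."""
    def forward(self, x_a, x_b, task_id):
        return x_a, x_b

class FusionEncoder(nn.Module):
    """Frozen ViT-style encoder with a TC-MoA inserted after every `insert_every` blocks."""
    def __init__(self, dim=256, depth=8, insert_every=4):
        super().__init__()
        self.blocks = nn.ModuleList([FrozenBlock(dim) for _ in range(depth)])
        self.tc_moas = nn.ModuleDict({
            str(i): TCMoAStub() for i in range(depth) if (i + 1) % insert_every == 0
        })

    def forward(self, x_a, x_b, task_id):
        # x_a, x_b: token sequences of the two source images, shape (B, N, dim)
        for i, blk in enumerate(self.blocks):
            x_a, x_b = blk(x_a), blk(x_b)
            if str(i) in self.tc_moas:
                # progressively modulate both streams with the fusion module
                x_a, x_b = self.tc_moas[str(i)](x_a, x_b, task_id)
        return x_a, x_b

enc = FusionEncoder()
out_a, out_b = enc(torch.randn(1, 196, 256), torch.randn(1, 196, 256), task_id=0)
```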

Figure 2: Overall architecture of TC-MoA

Prompt generation. First, multi-source features are obtained for subsequent processing. The network up to the j-th TC-MoA extracts prompt-generation features from each source, and we concatenate the features of each multi-source token pair so that tokens from different sources can exchange information in the subsequent network. However, operating directly on the high-dimensional concatenated features would introduce a large number of unnecessary parameters, so we apply a dimensionality-reduction layer to obtain the processed multi-source feature Φ.
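A minimal sketch of this concatenate-then-reduce step is shown below, assuming a simple linear down-projection; the class name `PromptFeature` and the layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class PromptFeature(nn.Module):
    """Concatenate token pairs from two sources and reduce the dimension before routing."""
    def __init__(self, dim=256, reduced_dim=64):
        super().__init__()
        # Down-projection avoids the parameter cost of operating on the full 2*dim feature.
        self.down = nn.Sequential(nn.Linear(2 * dim, reduced_dim), nn.GELU())

    def forward(self, f_a, f_b):
        # f_a, f_b: (B, N, dim) tokens extracted before the j-th TC-MoA
        pair = torch.cat([f_a, f_b], dim=-1)   # (B, N, 2*dim) multi-source token pairs
        return self.down(pair)                 # (B, N, reduced_dim) processed feature Φ

phi = PromptFeature()(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```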


Then, according to the task to which Φ belongs, we select a task-specific router from the router bank to customize the routing scheme, i.e., to decide which adapters in the adapter bank each pair of source tokens should enter.


Finally, we take a weighted sum of the adapter outputs to obtain the fusion prompt. Each router has its own task preference for customizing a suitable adapter mixture, and the prompt is generated from this mixture.
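The routing and mixing step can be sketched in PyTorch as follows. The bottleneck adapter design, the softmax gating, and the sigmoid used to map prompts into (0, 1) are assumptions for illustration; the paper's actual adapter and router architectures may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """A lightweight bottleneck adapter acting as one 'expert'."""
    def __init__(self, in_dim, out_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)

class TaskCustomizedMoA(nn.Module):
    """Task-specific routers over a shared adapter bank, producing per-token fusion prompts."""
    def __init__(self, in_dim=64, prompt_dim=2, num_adapters=4, num_tasks=3):
        super().__init__()
        self.adapters = nn.ModuleList([Adapter(in_dim, prompt_dim) for _ in range(num_adapters)])
        # One router per task: maps the multi-source feature to adapter weights.
        self.routers = nn.ModuleList([nn.Linear(in_dim, num_adapters) for _ in range(num_tasks)])

    def forward(self, phi, task_id):
        # phi: (B, N, in_dim) reduced multi-source features
        weights = F.softmax(self.routers[task_id](phi), dim=-1)            # (B, N, E)
        expert_out = torch.stack([a(phi) for a in self.adapters], dim=-1)  # (B, N, prompt_dim, E)
        # Weighted sum of adapter outputs -> per-token fusion prompt.
        prompt = torch.einsum("bnpe,bne->bnp", expert_out, weights)
        # Interpret prompts as per-source importance; squash to (0, 1) (an assumption).
        return torch.sigmoid(prompt)

moa = TaskCustomizedMoA()
prompts = moa(torch.randn(2, 196, 64), task_id=0)  # (2, 196, 2): one weight per source
```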


Prompt-driven fusion. The task-customized prompts are subject to mutual information regularization (MIR), which guarantees complementarity across sources, so the prompts serve as estimates of the proportion of important information in each source. Taking the element-wise product of the multi-source features and the prompts retains complementary information while removing redundancy. Then, considering that the feature representation should carry a source-dependent bias (e.g., visible versus infrared), we introduce input-independent learnable parameters for each source, i.e., a source encoding s. After the features are modulated by the prompts and source encodings, we obtain the refined source features, which the fusion layer F merges into the fused feature.
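A sketch of this prompt-driven fusion step, assuming a simple linear layer for F and additive source encodings; names and shapes are illustrative rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class PromptDrivenFusion(nn.Module):
    """Modulate each source's tokens by its prompt, add a learnable source
    encoding, and merge the refined features through a fusion layer F."""
    def __init__(self, dim=256):
        super().__init__()
        # Input-independent, learnable source encodings (one per source).
        self.source_enc = nn.Parameter(torch.zeros(2, 1, 1, dim))
        self.fuse = nn.Linear(2 * dim, dim)  # stand-in for the prompt fusion layer F

    def forward(self, f_a, f_b, prompt):
        # f_a, f_b: (B, N, dim); prompt: (B, N, 2), per-source importance in (0, 1)
        p_a, p_b = prompt[..., 0:1], prompt[..., 1:2]
        # Element-wise product keeps complementary information and suppresses redundancy.
        r_a = f_a * p_a + self.source_enc[0]
        r_b = f_b * p_b + self.source_enc[1]
        return self.fuse(torch.cat([r_a, r_b], dim=-1))  # fused feature, (B, N, dim)

fused = PromptDrivenFusion()(torch.randn(2, 196, 256), torch.randn(2, 196, 256),
                             torch.rand(2, 196, 2))
```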


Finally, we obtain a fused feature shaped by the task-customized prompts. To encourage the model to extract important information progressively, the feature passed to the next Transformer block is defined as a mixture of the source feature and the fused feature, controlled by a hyperparameter.
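One plausible form of this progressive update is a convex blend of the source feature and the fused feature, shown below; the blending rule and the value of the hyperparameter `lam` are assumptions based on the description, not the paper's exact formula.

```python
import torch

def progressive_update(f_src, f_fused, lam=0.1):
    """Blend the fused feature back into a source stream before the next
    Transformer block; lam is a hyperparameter controlling the mixing ratio."""
    return (1.0 - lam) * f_src + lam * f_fused

# usage sketch
f_a = torch.randn(2, 196, 256)       # tokens of source A
f_fused = torch.randn(2, 196, 256)   # fused feature from the TC-MoA
f_a_next = progressive_update(f_a, f_fused)  # fed to the next Transformer block
```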


Mutual information regularization (MIR). To ensure that the model dynamically retains complementary information while discarding redundant information from the multi-source features, we impose a regularization constraint on the prompts: assuming the feature representation changes linearly, MIR constrains the prompts of the different sources to be complementary.
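As a rough illustration, one way to encode such a complementarity constraint is to push the per-token prompts of the two sources to sum to one; the sketch below uses this reading, which may differ from the paper's exact regularizer.

```python
import torch

def mutual_information_regularization(prompt):
    """Complementarity constraint on the fusion prompts (an assumed form):
    under a linear model of the features, the importance assigned to the
    two sources at each token is encouraged to sum to one."""
    # prompt: (B, N, 2), importance estimates for source A and source B
    return ((prompt.sum(dim=-1) - 1.0) ** 2).mean()

loss_mir = mutual_information_regularization(torch.rand(2, 196, 2))
```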


Experimental results

Qualitative and quantitative experiments. As shown in Figures 3-5 and Tables 1-3, qualitative and quantitative comparisons on three fusion tasks show that our method surpasses previous general fusion methods. Compared with task-specific methods, it also achieves state-of-the-art performance on all tasks and even leads on some of them (e.g., VIF), demonstrating the superiority of the proposed method.

Figure 3: Qualitative comparison on the LLVIP dataset (VIF task)

Figure 4: Qualitative comparison on the MEFB dataset (MEF task)

Figure 5: Qualitative comparison on the MFF task datasets

Table 1: Quantitative comparison on the VIF task (LLVIP dataset). Table 2: Quantitative comparison on the MEF task

Table 3: Quantitative comparison on the MFF task

Figure 6: Controllability and generalization to unknown tasks

## Controllability and generalization
As shown in Figure 6, by controlling the hyperparameters α and β of the fusion prompts, we can control the strength of feature selection over the complementary information of the source images (region level) and the similarity between the fused image and a given source image (image level), respectively. The prompts can be transformed linearly to generate a customized fused image. For known tasks, such as multi-exposure fusion, we can obtain customized fusion results that best match human perception; for unknown tasks, we can modulate the most appropriate fusion parameters and thereby generalize the model to them.
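A hypothetical sketch of such prompt customization at inference time: here `alpha` rescales the per-region selection strength around 0.5 and `beta` biases the prompts toward one source; the exact linear transformation used by TC-MoA may differ.

```python
import torch

def customize_prompt(prompt, alpha=1.0, beta=0.0):
    """Linearly adjust fusion prompts at inference time (an assumed mapping).
    alpha sharpens or softens region-level selection; beta leans the result
    toward source A (positive) or source B (negative) at the image level."""
    # prompt: (B, N, 2); keep the adjusted values in a valid [0, 1] range.
    shift = torch.tensor([beta, -beta], device=prompt.device)
    adjusted = alpha * (prompt - 0.5) + 0.5 + shift
    return adjusted.clamp(0.0, 1.0)

# e.g. exaggerate region-level selection and lean slightly toward source A
p = torch.rand(1, 196, 2)
p_custom = customize_prompt(p, alpha=1.5, beta=0.1)
```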
