BAT method: AAAI 2024's first multi-modal target tracking universal bidirectional adapter

Object tracking is one of the basic tasks of computer vision. In recent years, single-modality (RGB) object tracking has made significant progress. However, due to the limitations of a single imaging sensor, we need to introduce multi-modal images (such as RGB, infrared, etc.) to make up for this shortcoming to achieve all-weather target tracking in complex environments. The application of such multi-modal images can provide more comprehensive information and enhance the accuracy and robustness of target detection and tracking. The development of multimodal target tracking is of great significance for realizing higher-level computer vision applications.

However, existing multi-modal tracking tasks also face two main problems:

  1. Because annotating multi-modal tracking data is costly, most existing datasets are limited in size and insufficient to support building effective multi-modal trackers;
  2. Because different imaging modalities differ in their sensitivity to objects, the dominant modality in the open world changes dynamically, and the dominance relationship between multi-modal data is not fixed.

Many multi-modal trackers pre-train on RGB sequences and then fully fine-tune on multi-modal scenes. This approach is costly in time, inefficient, and limited in performance.

Beyond full fine-tuning, some recent methods take inspiration from parameter-efficient fine-tuning in natural language processing (NLP) and introduce parameter-efficient prompt tuning to multi-modal tracking. These methods freeze the backbone network parameters and add a small set of learnable parameters.

Typically, these methods treat one modality (usually RGB) as the primary modality and the other as the auxiliary modality. However, this approach ignores the dynamic dominance relationship between multi-modal data, so it cannot fully exploit the complementary effects of multi-modal information in complex scenes, which limits tracking performance.

Figure 1: Different dominant modalities in complex scenarios.

To solve these problems, researchers from Tianjin University proposed the Bidirectional Adapter for Multimodal Tracking (BAT). Unlike traditional methods, BAT does not rely on a fixed dominant modality and a fixed auxiliary modality. Instead, it dynamically extracts effective information, so it performs better when a modality shifts from auxiliary to dominant. The innovation of this method is that it can adapt to different data characteristics and task requirements, thereby improving the representation ability of the base model in downstream tasks. With BAT, the researchers hope to provide a more flexible and efficient multi-modal tracking solution for research and applications in related fields.

BAT consists of two modality-specific branches, each a base model encoder with shared parameters, and a universal bidirectional adapter. During training, BAT does not fully fine-tune the base model but adopts a step-by-step training method. Each modality-specific branch is initialized from the base model with its parameters frozen, and only the newly added bidirectional adapters are trained. Each branch learns prompt information from the other modality and combines it with the feature information of its own modality to enhance its representation ability. The two modality-specific branches interact through the universal bidirectional adapter, dynamically fusing dominant and auxiliary information with each other to fit the non-fixed association between modalities. This design adapts the model without modifying the pre-trained parameters of the base model, improving its representation ability and adaptability.

The universal bidirectional adapter adopts a lightweight hourglass structure and can be embedded into each layer of the base model's transformer encoder without introducing a large number of learnable parameters. By adding only a small number of trainable parameters (0.32M), the universal bidirectional adapter has a lower training cost and achieves better tracking performance than fully fine-tuned methods and prompt learning-based methods.
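
To make the parameter budget concrete, the sketch below counts the weights of an hourglass adapter built from the three linear projections the article describes (down, middle, up). The token dimension, bottleneck dimension, encoder depth, and the choice to share one set of adapter weights across both directions are assumptions made for illustration; the article does not state them. Under these assumed values the total lands close to the reported 0.32M, but that should be read as a plausibility check, not as the authors' configuration.

```python
# Illustrative parameter count for a three-layer hourglass adapter.
# Every size below is an assumption for this sketch, not a value from the article.

def adapter_params(d_t: int, d_e: int, bias: bool = True) -> int:
    """Parameters in the down (d_t->d_e), middle (d_e->d_e) and up (d_e->d_t) projections."""
    weights = d_t * d_e + d_e * d_e + d_e * d_t
    biases = (d_e + d_e + d_t) if bias else 0
    return weights + biases

D_T = 768                # assumed token dimension (ViT-Base style backbone)
D_E = 8                  # assumed bottleneck dimension
NUM_LAYERS = 12          # assumed encoder depth
ADAPTERS_PER_LAYER = 2   # one at the self-attention stage, one at the MLP stage

per_adapter = adapter_params(D_T, D_E)
total = per_adapter * ADAPTERS_PER_LAYER * NUM_LAYERS
print(per_adapter, total, f"{total / 1e6:.2f}M")
```

Because the bottleneck is so narrow, the count is dominated by the two `d_t * d_e` terms, which is why an hourglass shape keeps the adapter small even when it is inserted into every encoder layer.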

The paper "Bi-directional Adapter for Multi-modal Tracking":

Paper link: https://arxiv.org/abs/2312.10611

Code link: https://github.com/SparkTempest/BAT

Main Contributions

  • We are the first to propose an adapter-based visual prompt framework for multi-modal tracking. Our model perceives the dynamic changes of the dominant modality in open scenes and fuses multi-modal information effectively in an adaptive manner.
  • To the best of our knowledge, we are the first to propose a universal bidirectional adapter for the base model. It has a simple and efficient structure and effectively realizes multi-modal cross-prompt tracking. By adding only 0.32M learnable parameters, our model is robust for multi-modal tracking in open scenarios.
  • We conduct an in-depth analysis of the impact of our universal adapter at different layers. We also explore a more efficient adapter architecture in experiments and verify our advantages on multiple RGBT tracking datasets.

Core method

As shown in Figure 2, we propose a multi-modal tracking visual prompt framework based on a bidirectional Adapter (BAT). The framework has a dual-stream encoder structure with an RGB stream and a thermal infrared (TIR) stream, and each stream uses the same base model parameters. The bidirectional Adapter is set up in parallel with the dual-stream encoder layers to cross-prompt multi-modal data between the two modalities.

The method does not fully fine-tune the base model. Instead, it transfers the pre-trained RGB tracker to multi-modal scenes efficiently by learning only a lightweight bidirectional Adapter, achieving strong multi-modal complementarity and tracking accuracy.

Figure 2: Overall architecture of BAT.

First, for each modality, the template frame (the initial crop of the target object in the first frame) and the search frame (a subsequent tracking image) are converted into tokens. The template and search tokens are concatenated and passed to the N-layer dual-stream transformer encoder.

The bidirectional adapter is set up in parallel with the dual-stream encoder layers to learn feature prompts from one modality for the other. Finally, the output features of the two branches are added and fed into the prediction head H to obtain the final tracking result box B.
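
The data flow described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: tokens are reduced to one scalar each, the encoder layer, adapter, and head are toy callables, and a single prompt is added per layer (the article places adapters at both the self-attention and MLP stages). The function name `dual_stream_forward` and all toy values are hypothetical.

```python
from typing import Callable, List, Sequence

Tokens = List[float]  # toy stand-in: one scalar per token

def dual_stream_forward(
    rgb: Tokens,
    tir: Tokens,
    layers: Sequence[Callable[[Tokens], Tokens]],
    adapter: Callable[[Tokens], Tokens],
    head: Callable[[Tokens], Tokens],
) -> Tokens:
    """Run both streams through shared layers, exchanging prompts at every layer."""
    for layer in layers:
        # Prompts are computed from each stream's current tokens, in both directions.
        prompt_for_rgb = adapter(tir)
        prompt_for_tir = adapter(rgb)
        # Each stream adds the prompt from the other modality to its own layer output.
        rgb = [x + p for x, p in zip(layer(rgb), prompt_for_rgb)]
        tir = [x + p for x, p in zip(layer(tir), prompt_for_tir)]
    # The two branch outputs are added before the prediction head.
    fused = [a + b for a, b in zip(rgb, tir)]
    return head(fused)

# Toy components: the same (frozen) layer is used by both streams.
double = lambda tokens: [2.0 * x for x in tokens]
half = lambda tokens: [0.5 * x for x in tokens]
identity = lambda tokens: list(tokens)

result = dual_stream_forward([1.0, 2.0], [3.0, 4.0], [double, double], half, identity)
print(result)  # [25.0, 37.5]
```

Neither stream is treated as primary in this loop: both receive a prompt from the other at every layer, which is the structural difference from methods that fix RGB as the dominant modality.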

The bidirectional adapter adopts a modular design and is embedded in both the multi-head self-attention stage and the MLP stage, as shown on the right side of Figure 2, which details the structure designed to transfer feature prompts from one modality to the other. The adapter consists of three linear projection layers. With tn denoting the number of tokens in each modality, the input tokens are first reduced to dimension de by a down projection, then pass through a linear projection layer, and are finally projected back up to the original dimension dt and fed as feature prompts into the transformer encoder layers of the other modality.

Through this simple structure, the bidirectional adapter can effectively pass feature prompts between the two modalities to achieve multi-modal tracking.
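
A minimal sketch of the three-projection hourglass follows, written with plain Python lists so it runs without dependencies. The class name `HourglassAdapter`, the helper functions, the random initialization, and the tiny sizes are all hypothetical. The sketch also assumes one set of adapter weights is reused for both directions and applies no nonlinearity, because the article only mentions three linear projections.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]

def make_linear(d_in: int, d_out: int, rng: random.Random) -> Tuple[Matrix, List[float]]:
    """Return (weight, bias) for a linear map from d_in to d_out features."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]
    return weight, [0.0] * d_out

def apply_linear(tokens: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Apply the same linear map to every token (row) independently."""
    return [
        [sum(x * weight[i][j] for i, x in enumerate(tok)) + bias[j]
         for j in range(len(bias))]
        for tok in tokens
    ]

class HourglassAdapter:
    """Down projection (d_t -> d_e), linear layer (d_e -> d_e), up projection (d_e -> d_t)."""

    def __init__(self, d_t: int, d_e: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.down = make_linear(d_t, d_e, rng)
        self.mid = make_linear(d_e, d_e, rng)
        self.up = make_linear(d_e, d_t, rng)

    def __call__(self, tokens: Matrix) -> Matrix:
        hidden = apply_linear(tokens, *self.down)
        hidden = apply_linear(hidden, *self.mid)
        return apply_linear(hidden, *self.up)

T_N, D_T, D_E = 4, 6, 2  # toy sizes: tokens per modality, token dim, bottleneck dim
adapter = HourglassAdapter(D_T, D_E)
rgb_tokens = [[0.1 * (r + c) for c in range(D_T)] for r in range(T_N)]
tir_tokens = [[0.05 * (r - c) for c in range(D_T)] for r in range(T_N)]

# The prompt computed from one modality is added to the features of the other.
prompt_for_rgb = adapter(tir_tokens)
prompt_for_tir = adapter(rgb_tokens)
rgb_prompted = [[x + p for x, p in zip(tok, pr)]
                for tok, pr in zip(rgb_tokens, prompt_for_rgb)]
print(len(prompt_for_rgb), len(prompt_for_rgb[0]))  # 4 6
```

The prompt has the same shape as the features it is added to (tn tokens of dimension dt), so it can be injected into the other stream without changing the frozen encoder.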

Since the transformer encoder and prediction head are frozen, only the parameters of the newly added adapter need to be optimized. Notably, unlike most traditional adapters, our bidirectional adapter functions as a cross-modal feature prompt for dynamically changing dominant modalities, ensuring good tracking performance in the open world.
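
The freezing scheme can be illustrated with a toy parameter registry. This is a sketch under stated assumptions: the parameter names, scalar values, the `adapter.` prefix convention, and the plain SGD update are hypothetical stand-ins, chosen only to show that an update step leaves the encoder and head untouched. In a deep learning framework the same effect is usually obtained by disabling gradients on the frozen modules.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Param:
    value: float
    trainable: bool = True

def freeze_all_but(params: Dict[str, Param], prefix: str) -> None:
    """Mark only parameters whose name starts with `prefix` as trainable."""
    for name, param in params.items():
        param.trainable = name.startswith(prefix)

def sgd_step(params: Dict[str, Param], grads: Dict[str, float], lr: float) -> None:
    """Apply a gradient step to trainable parameters only."""
    for name, param in params.items():
        if param.trainable:
            param.value -= lr * grads[name]

# Hypothetical scalar "parameters" standing in for weight tensors.
params = {
    "encoder.layer0.weight": Param(1.0),
    "head.weight": Param(-2.0),
    "adapter.down.weight": Param(0.5),
    "adapter.up.weight": Param(0.75),
}

freeze_all_but(params, "adapter.")
grads = {name: 1.0 for name in params}  # pretend every parameter received a gradient
sgd_step(params, grads, lr=0.25)

trainable = sorted(name for name, p in params.items() if p.trainable)
print(trainable)  # ['adapter.down.weight', 'adapter.up.weight']
```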

Experimental results

As shown in Table 1, the comparison on the RGBT234 and LasHeR datasets shows that our method outperforms state-of-the-art methods in both accuracy and success rate. As shown in Figure 3, the comparison with state-of-the-art methods under different scene attributes of the LasHeR dataset also demonstrates the superiority of the proposed method.

These experiments show that our dual-stream tracking framework and bidirectional Adapter successfully track targets in most complex environments, adaptively extract effective information from dynamically changing dominant-auxiliary modalities, and achieve state-of-the-art performance.

Table 1: Overall performance on the RGBT234 and LasHeR datasets.

Figure 3: Comparison of BAT and competing methods under different attributes in the LasHeR dataset.

Experiments demonstrate the effectiveness of our method in dynamically prompting effective information from changing dominant-auxiliary modalities in complex scenarios. As shown in Figure 4, compared with related methods that fix the dominant modality, our method tracks the target effectively even when RGB is completely unavailable, and it tracks much better when both RGB and TIR provide effective information in subsequent scenes. Our bidirectional Adapter dynamically extracts effective target features from both the RGB and TIR modalities, captures more accurate target response locations, and eliminates interference from the RGB modality.

Figure 4: Visualization of tracking results.

We also evaluate our method on an RGBE tracking dataset. As shown in Figure 5, compared with other methods on the VisEvent test set, our method produces the most accurate tracking results in different complex scenarios, demonstrating the effectiveness and generalization of our BAT model.

Figure 5: Tracking results on the VisEvent dataset.

Figure 6: Attention weight visualization.

We visualize the attention weights of different layers on the tracked target in Figure 6. Compared with the baseline-dual method (a dual-stream framework initialized with the base model parameters), our BAT effectively drives the auxiliary modality to learn more complementary information from the dominant modality, while maintaining the effectiveness of the dominant modality as the network depth increases, thereby improving overall tracking performance.

Experiments show that BAT successfully captures multi-modal complementary information and achieves sample-adaptive dynamic tracking.

Statement
This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn for removal.