Surpassing the CVPR 2024 method, DynRefer achieves multiple SOTAs in regional-level multi-modal recognition tasks

To achieve high-precision region-level multi-modal understanding, this paper proposes a dynamic-resolution scheme that simulates the human visual cognitive system.

The authors of this article are from the LAMP Laboratory of the University of Chinese Academy of Sciences. First author Zhao Yuzhong is a doctoral student at UCAS (class of 2023), and co-author Liu Feng is a direct-entry doctoral student at UCAS (class of 2020). Their main research directions are vision-language models and visual object perception.

Introduction

DynRefer significantly improves region-level multi-modal recognition by simulating the human visual cognitive process. By introducing the dynamic-resolution mechanism of the human eye, DynRefer completes region recognition, region attribute detection, and region-level captioning with a single model, and achieves SOTA performance on all of these tasks. In particular, it reaches 115.7 CIDEr on region-level captioning on the RefCOCOg dataset, significantly higher than CVPR 2024 methods such as RegionGPT, GlaMM, Osprey, and Alpha-CLIP.


  • Paper title: DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution
  • Paper link: https://arxiv.org/abs/2405.16071
  • Paper code: https://github.com/callsys/DynRefer


Motivation

The region-level multi-modal task converts specified image regions into language descriptions consistent with human preferences. Humans have a resolution-adaptive ability when performing such tasks: the area of interest is perceived at high resolution, while non-attended areas are perceived at low resolution. Current region-level multi-modal large language models, however, usually adopt a fixed-resolution encoding scheme: they encode the entire image and then extract regional features via RoI-Align. This approach lacks the resolution adaptivity of the human visual cognitive system and encodes areas of interest inefficiently and with limited fidelity. To achieve high-precision region-level multi-modal understanding, we propose a dynamic-resolution scheme that simulates the human visual cognitive system, as shown in the figure below.
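As a back-of-the-envelope illustration of this efficiency argument (the 224-pixel encoder size and the image/region sizes are assumptions for illustration, not numbers from the paper), compare how many encoder pixels a small region receives under whole-image encoding versus a cropped view:

```python
def fixed_resolution_pixels(img_wh, box_wh, enc_size=224):
    """Whole image resized to enc_size x enc_size; the region keeps
    only its proportional share of the encoder input."""
    sx, sy = enc_size / img_wh[0], enc_size / img_wh[1]
    return (box_wh[0] * sx) * (box_wh[1] * sy)

def cropped_view_pixels(enc_size=224):
    """The region is cropped and resized to fill the encoder input."""
    return float(enc_size * enc_size)

img_wh, box_wh = (1920, 1080), (96, 54)          # hypothetical image and region
print(fixed_resolution_pixels(img_wh, box_wh))   # ~125 encoder pixels cover the region
print(cropped_view_pixels())                     # 50176 encoder pixels cover the region
```

Under these assumed sizes, the cropped view devotes roughly 400x more encoder capacity to the region of interest than fixed-resolution whole-image encoding.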


Figure 1: Comparison of the traditional region-level multi-modal method (left) and the DynRefer method (right).

Method

1. Simulated dynamic-resolution image (multi-view construction).
Since mainstream pre-trained vision-language models (CLIP) only accept input at a single, uniform resolution, we simulate a dynamic-resolution image by constructing multiple uniform-resolution views: the image is high-resolution in the referred region and low-resolution elsewhere. The process is shown in Figure 2. The original image x is cropped and resized into multiple candidate views, where the crop box of each view is interpolated between the bounding box of the referred region, b_ref, and the box of the entire image, b_img:

b_i = t_i · b_img + (1 − t_i) · b_ref,  t_i ∈ [0, 1],

where t_i is the interpolation coefficient. During training, we randomly sample n views from the candidates to simulate the images produced by gaze and rapid eye movements; these n views correspond to interpolation coefficients t_1, …, t_n. We always retain the view that contains only the referred region (i.e. t = 0). Experiments show that this view preserves regional detail, which is crucial for all region-level multi-modal tasks.
Figure 2: DynRefer training (top) and inference (bottom).
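A minimal sketch of the view construction, assuming the crop box interpolates linearly between the reference-region box (t = 0) and the whole-image box (t = 1), with the region-only view always retained; function and variable names are illustrative, not from the paper's code:

```python
import random

def interp_box(b_ref, b_img, t):
    """Interpolate each box coordinate: t = 0 gives the reference-region
    box, t = 1 gives the whole-image box. Boxes are (x1, y1, x2, y2)."""
    return tuple((1 - t) * r + t * g for r, g in zip(b_ref, b_img))

def sample_views(b_ref, b_img, n, candidate_ts=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Always keep the region-only view (t = 0); randomly pick the other
    n - 1 coefficients, mimicking gaze plus rapid eye movements."""
    ts = [0.0] + random.sample([t for t in candidate_ts if t > 0.0], n - 1)
    return [(t, interp_box(b_ref, b_img, t)) for t in ts]

# A referred region inside a 200 x 150 image, expanded into three views.
views = sample_views(b_ref=(40, 30, 120, 90), b_img=(0, 0, 200, 150), n=3)
```

Each returned crop box would then be cut from the image and resized to the encoder's input resolution, giving one uniform-resolution view per coefficient.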

2. Stochastic Multi-view Embedding. The process is shown in Figure 3. The n sampled views are encoded into spatial features by a frozen CLIP encoder and then processed by an RoI-Align module to obtain region embeddings {f_i} (Figure 3, left). Because cropping, resizing, and RoI-Align each introduce spatial error, these region embeddings are not spatially aligned. Inspired by the deformable convolution operation, we propose an alignment module that reduces this bias by aligning each f_i to f_ref, where f_ref is the region embedding of the view that contains only the referred region. Each f_i is first concatenated with f_ref, and a convolutional layer predicts a 2D offset map from the concatenation; the spatial features of f_i are then resampled according to the offsets. Finally, the aligned region embeddings are concatenated along the channel dimension and fused by linear layers. The fused output is further compressed by a visual resampling module, i.e. a Q-Former, which extracts the region representation r of the referred region of the original image x (r in Figure 3).


Figure 3: DynRefer network structure.
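A minimal numpy sketch of the alignment step. The real module predicts offsets with a learned convolution and resamples bilinearly, as in deformable convolution; here the offset predictor is an arbitrary callable and resampling is nearest-neighbour for brevity, so treat this as an assumption-laden illustration:

```python
import numpy as np

def align_embedding(f_i, f_ref, offset_fn):
    """Align region embedding f_i (C, H, W) to the reference embedding
    f_ref by resampling f_i at positions shifted by a predicted 2D offset
    map. offset_fn maps the channel-concatenated pair to a (2, H, W) map."""
    C, H, W = f_i.shape
    offsets = offset_fn(np.concatenate([f_i, f_ref], axis=0))  # (2, H, W)
    aligned = np.empty_like(f_i)
    for y in range(H):
        for x in range(W):
            dy, dx = offsets[:, y, x]
            sy = int(np.clip(round(y + dy), 0, H - 1))  # clamp to the map
            sx = int(np.clip(round(x + dx), 0, W - 1))
            aligned[:, y, x] = f_i[:, sy, sx]           # nearest-neighbour pick
    return aligned
```

With a zero offset map the operation is the identity; a trained predictor would output offsets that compensate the spatial error introduced by cropping, resizing, and RoI-Align.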

3. Vision-language Alignment. The region representation r computed by the stochastic multi-view embedding module is decoded by three decoders, shown in Figure 3 (right), each supervised by one of three multi-modal tasks:

i) Image region tag generation. We employ a lightweight query-based recognition decoder for region tag generation, shown in Figure 3 (right). Tagging is performed by computing the confidence of each predefined tag, using the tag as the query and the region representation r as key and value. Tags parsed from the ground-truth captions supervise the recognition decoder.

ii) Region-text contrastive learning. Similar to the tag decoder, this decoder is also query-based; it computes similarity scores between captions and region features and is supervised with the SigLIP loss.

iii) Language modeling. We use a pre-trained large language model to convert the region representation r into a language description.
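The query-based tagging decoder can be sketched as single-head cross-attention with the tag embeddings as queries and the region representation tokens as keys and values, followed by a sigmoid confidence head. This is a hypothetical minimal version; the actual decoder is a trained lightweight module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tag_confidences(tag_queries, region_tokens, w_out):
    """tag_queries: (T, D), one embedding per predefined tag;
    region_tokens: (N, D), tokens of the region representation r;
    w_out: (D,), linear head. Returns one confidence in (0, 1) per tag."""
    scale = np.sqrt(tag_queries.shape[1])
    attn = softmax(tag_queries @ region_tokens.T / scale)  # (T, N) attention
    pooled = attn @ region_tokens                          # (T, D) evidence per tag
    logits = pooled @ w_out                                # (T,)
    return 1.0 / (1.0 + np.exp(-logits))                   # sigmoid confidence
```

Tags whose confidence exceeds a threshold would be emitted as region labels; training would push confidences toward the tag set parsed from the ground-truth captions.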


Figure 4: Performance of the dual-view (n = 2) DynRefer model on region-level multi-modal tasks under different interpolation coefficients t. View one is fixed as the region-only view; view two is either randomly selected or fixed.

4. Inference. The trained DynRefer model performs multi-modal tasks on images with dynamic resolution: by adjusting the interpolation coefficients t_i of the n sampled views, we obtain region representations with different dynamic-resolution characteristics. To evaluate these characteristics, we trained a dual-view (n = 2) DynRefer model and evaluated it on four multi-modal tasks. As the curves in Figure 4 show, attribute detection achieves better results with views that contain no contextual information (t close to 0), which is explained by the fact that this task requires detailed regional information. Region-level captioning and dense captioning, in contrast, require a context-rich view (larger t) to fully understand the referred region. Importantly, views with too much context (t close to 1) degrade performance on all tasks, because they introduce too much region-irrelevant information. When the task type is known, we can sample views suited to its characteristics. When it is unknown, we first construct a set of candidate views under different interpolation coefficients t and then sample n views from this set with a greedy search algorithm, whose objective is defined as:

max over v_1, …, v_n of  Σ_{i=1}^{n} (pHASH(x) ⊕ pHASH(v_i)) / t_i,

where t_i is the interpolation coefficient of the i-th view, v_i is the i-th view, pHASH(·) is the perceptual image hash function, and ⊕ is the XOR operation. To compare the information content of the views from a global perspective, pHASH(·) transforms each view from the spatial domain into the frequency domain and encodes it as a hash code; the XOR of two hash codes counts the bits in which they differ. The 1/t_i factor reduces the weight of context-rich views, so the search avoids introducing too much redundant information.
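A sketch of the greedy view search. A real implementation would use a DCT-based perceptual hash (e.g. `imagehash.phash`); here a simple average hash stands in so the snippet is self-contained, and both the 1/t weighting and the choice to seed the search with the region-only view are assumptions reconstructed from the description above:

```python
import numpy as np

def avg_hash(view, hash_size=8):
    """Average-hash stand-in for pHASH(.): downsample a grayscale view and
    threshold against its mean, yielding a flat binary code."""
    h, w = view.shape
    ys = np.arange(hash_size) * h // hash_size
    xs = np.arange(hash_size) * w // hash_size
    small = view[np.ix_(ys, xs)]
    return (small > small.mean()).astype(np.uint8).ravel()

def greedy_select(candidates, n):
    """candidates: list of (t, grayscale ndarray). Start from the
    region-only view (smallest t), then greedily add the view whose hash
    differs most (XOR popcount) from the views already chosen, with the
    contribution down-weighted by 1/t to penalise context-rich views."""
    chosen = [min(candidates, key=lambda c: c[0])]
    rest = [c for c in candidates if c is not chosen[0]]
    while len(chosen) < n and rest:
        best = max(rest, key=lambda c: sum(int(np.sum(avg_hash(c[1]) ^ avg_hash(v)))
                                           for _, v in chosen) / max(c[0], 1e-6))
        chosen.append(best)
        rest.remove(best)
    return chosen
```

The frequency-domain hash lets very differently sized crops be compared with a cheap bitwise distance, which is what makes a combinatorial view search affordable at inference time.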

Experiment

Region-level Captioning

[Table: region-level captioning results on RefCOCOg and VG]

In the region-level captioning task, DynRefer uses a smaller model (4.2B vs. 7B parameters) yet significantly surpasses many CVPR 2024 methods, including RegionGPT, GlaMM, Alpha-CLIP, and Osprey, in both METEOR and CIDEr on the RefCOCOg and VG datasets, demonstrating a large performance advantage.

Dense Captioning

[Table: dense captioning results on VG 1.2]

In the dense captioning task, DynRefer improves mAP by 7.1% over the previous SOTA method GRiT on the VG 1.2 dataset.

Open Vocabulary Attribute Detection

[Table: open-vocabulary attribute detection results]

In the open-vocabulary attribute detection task, DynRefer likewise achieves SOTA performance.

Open Vocabulary Region Recognition

[Table: open-vocabulary region recognition results]

In the region recognition task, DynRefer improves mAP by 15% and accuracy by 8.8% over RegionGPT (CVPR 2024), and achieves 15.7% higher mAP than ASM (ICLR 2024).

Ablation experiment

[Table: ablation results; line numbers below refer to its rows]

  • Lines 1-6: random dynamic multi-view outperforms fixed views.
  • Lines 6-10: selecting views by maximizing information outperforms random selection.
  • Lines 10-13: multi-task training learns better region representations.

Visualization

The qualitative inference results illustrate that DynRefer can output region captions, tags, attributes, and categories simultaneously with a single model.

[Figures: DynRefer qualitative inference results]
