Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage-AI-php.cn

Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage

王林

Apr 08, 2023 pm 09:21 PM

framemigrate

Introduction

Referring VOS (RVOS) is a newly emerging task. It aims to segment the objects referred to by the text from a video sequence based on the reference text. . Compared with semi-supervised video object segmentation, RVOS only relies on abstract language descriptions instead of pixel-level reference masks, providing a more convenient option for human-computer interaction and therefore has received widespread attention.

Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage

## Paper link: https://www.aaai.org/AAAI22Papers/AAAI-1100.LiD.pdf

The main purpose of this research is to solve two major challenges faced in existing RVOS tasks:

How to convert text information into , cross-modal fusion of picture information, so as to maintain the scale consistency between the two modalities and fully integrate the useful feature references provided by the text into the picture features;
How to abandon the two-stage strategy of existing methods (that is, first obtain rough results frame by frame at the picture level, then use the results as a reference, and obtain the final prediction through structural refinement of enhanced temporal information), and unify the entire RVOS task into a single-stage framework.

In this regard, this study proposes an end-to-end RVOS framework for cross-modal element migration - YOFO , its main contributions and innovations are:

Only one-stage reasoning is needed to directly obtain the segmentation of video targets using reference text information. As a result, the results obtained on two mainstream data sets - Ref-DAVIS2017 and Ref-Youtube-VOS surpassed all current two-stage methods;
proposed a meta-migration ( Meta-Transfer) module to enhance temporal information, thereby achieving more target-focused feature learning;
proposes a multi-scale cross-modal feature mining (Multi-Scale Cross-Modal Feature Mining) module, which can fully integrate useful features in language and pictures.

Implementation Strategy

The main process of the YOFO framework is as follows: input images and text first pass through the image encoder and language encoder to extract features, and then in multi-scale Cross-modal feature mining module for fusion. The fused bimodal features are simplified in the meta-transfer module that contains the memory library to eliminate redundant information in the language features. At the same time, temporal information can be preserved to enhance temporal correlation, and finally the segmentation results are obtained through a decoder.

Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage

Figure 1: Main process of YOFO framework.

Multi-scale cross-modal feature mining module: This module passes the Fusion of two modal features of different scales can maintain the consistency between the scale information conveyed by the image features and the language features. More importantly, it ensures that the language information will not be diluted and overwhelmed by the multi-scale image information during the fusion process.

Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage

Figure 2: Multi-scale cross-modal feature mining module.

##Meta Migration Module: A learning-to-learn strategy is adopted, and the process can be simply described as the following mapping function. The migration function is a convolution, then Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage is its convolution kernel parameter:

Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage

The optimization process can be expressed as the following objective function:

Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage

Among them, M represents A memory bank that can store historical information. W represents the weight of different positions and can give different attention to different positions in the feature. Y represents the bimodal features of each video frame stored in the memory bank. This optimization process maximizes the ability of the meta-transfer function to reconstruct bimodal features, and also enables the entire framework to be trained end-to-end.

##Training and testing: The loss function used in training is lovasz loss, and the training set is two video data sets Ref-DAVIS2017 , Ref-Youtube-VOS, and use the static data set Ref-COCO to perform random affine transformation to simulate video data as auxiliary training. The meta-migration process is performed during training and prediction, and the entire network runs at a speed of 10FPS on 1080ti.

Experimental results

The method used in the study has achieved excellent results on two mainstream RVOS data sets (Ref-DAVIS2017 and Ref-Youtube-VOS). The quantitative indicators and some visualization renderings are as follows:

Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage

## Figure 3: Quantitative indicators on two mainstream data sets.

Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage

Figure 4: Visualization on the VOS dataset.

Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage

Figure 5: Other visualization effects of YOFO.

#The study also conducted a series of ablation experiments to illustrate the effectiveness of the feature mining module (FM) and the meta-transfer module (MT).

Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage

Figure 6: Effectiveness of feature mining module (FM) and meta-transfer module (MT).

In addition, the study visualized the output features of the decoder with and without the MT module. It can be clearly seen that the MT module can correctly capture The content described in the language and filtering the interference noise.

Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage

Figure 7: Comparison of decoder output features before and after using the MT module. About the research team

This paper was jointly proposed by researchers from the Meitu Imaging Research Institute (MT Lab) and the Lu Huchuan team of Dalian University of Technology. Meitu Imaging Research Institute (MT Lab) is Meitu’s team dedicated to algorithm research, engineering development and productization in the fields of computer vision, machine learning, augmented reality, cloud computing and other fields. It provides the basis for Meitu’s existing and future products. It provides core algorithm support and promotes the development of Meitu products through cutting-edge technology. It is known as the "Technology Center of Meitu". It has participated in top international computer vision conferences such as CVPR, ICCV, and ECCV, and won more than ten championships and runner-ups.

The above is the detailed content of Based on cross-modal element transfer, the reference video object segmentation method of Meitu & Dalian University of Technology requires only a single stage. For more information, please follow other related articles on the PHP Chinese website!

Statement

This article is reproduced at:51CTO.COM. If there is any infringement, please contact admin@php.cn delete

You Must Build Workplace AI Behind A Veil Of IgnoranceApr 29, 2025 am 11:15 AM

In John Rawls' seminal 1971 book The Theory of Justice, he proposed a thought experiment that we should take as the core of today's AI design and use decision-making: the veil of ignorance. This philosophy provides a simple tool for understanding equity and also provides a blueprint for leaders to use this understanding to design and implement AI equitably. Imagine that you are making rules for a new society. But there is a premise: you don’t know in advance what role you will play in this society. You may end up being rich or poor, healthy or disabled, belonging to a majority or marginal minority. Operating under this "veil of ignorance" prevents rule makers from making decisions that benefit themselves. On the contrary, people will be more motivated to formulate public

Decisions, Decisions… Next Steps For Practical Applied AIApr 29, 2025 am 11:14 AM

Numerous companies specialize in robotic process automation (RPA), offering bots to automate repetitive tasks—UiPath, Automation Anywhere, Blue Prism, and others. Meanwhile, process mining, orchestration, and intelligent document processing speciali

The Agents Are Coming – More On What We Will Do Next To AI PartnersApr 29, 2025 am 11:13 AM

The future of AI is moving beyond simple word prediction and conversational simulation; AI agents are emerging, capable of independent action and task completion. This shift is already evident in tools like Anthropic's Claude. AI Agents: Research a

Why Empathy Is More Important Than Control For Leaders In An AI-Driven FutureApr 29, 2025 am 11:12 AM

Rapid technological advancements necessitate a forward-looking perspective on the future of work. What happens when AI transcends mere productivity enhancement and begins shaping our societal structures? Topher McDougal's upcoming book, Gaia Wakes:

AI For Product Classification: Can Machines Master Tax Law?Apr 29, 2025 am 11:11 AM

Product classification, often involving complex codes like "HS 8471.30" from systems such as the Harmonized System (HS), is crucial for international trade and domestic sales. These codes ensure correct tax application, impacting every inv

Could Data Center Demand Spark A Climate Tech Rebound?Apr 29, 2025 am 11:10 AM

The future of energy consumption in data centers and climate technology investment This article explores the surge in energy consumption in AI-driven data centers and its impact on climate change, and analyzes innovative solutions and policy recommendations to address this challenge. Challenges of energy demand: Large and ultra-large-scale data centers consume huge power, comparable to the sum of hundreds of thousands of ordinary North American families, and emerging AI ultra-large-scale centers consume dozens of times more power than this. In the first eight months of 2024, Microsoft, Meta, Google and Amazon have invested approximately US$125 billion in the construction and operation of AI data centers (JP Morgan, 2024) (Table 1). Growing energy demand is both a challenge and an opportunity. According to Canary Media, the looming electricity

AI And Hollywood's Next Golden AgeApr 29, 2025 am 11:09 AM

Generative AI is revolutionizing film and television production. Luma's Ray 2 model, as well as Runway's Gen-4, OpenAI's Sora, Google's Veo and other new models, are improving the quality of generated videos at an unprecedented speed. These models can easily create complex special effects and realistic scenes, even short video clips and camera-perceived motion effects have been achieved. While the manipulation and consistency of these tools still need to be improved, the speed of progress is amazing. Generative video is becoming an independent medium. Some models are good at animation production, while others are good at live-action images. It is worth noting that Adobe's Firefly and Moonvalley's Ma

Is ChatGPT Slowly Becoming AI's Biggest Yes-Man?Apr 29, 2025 am 11:08 AM

ChatGPT user experience declines: is it a model degradation or user expectations? Recently, a large number of ChatGPT paid users have complained about their performance degradation, which has attracted widespread attention. Users reported slower responses to models, shorter answers, lack of help, and even more hallucinations. Some users expressed dissatisfaction on social media, pointing out that ChatGPT has become “too flattering” and tends to verify user views rather than provide critical feedback. This not only affects the user experience, but also brings actual losses to corporate customers, such as reduced productivity and waste of computing resources. Evidence of performance degradation Many users have reported significant degradation in ChatGPT performance, especially in older models such as GPT-4 (which will soon be discontinued from service at the end of this month). this

See all articles