NTU and Shanghai AI Lab compiled 300+ papers: the latest review of visual segmentation based on Transformer is released

SAM (Segment Anything), as a foundational visual segmentation model, has attracted the attention of many researchers in just three months. If you want to systematically understand the technology behind SAM, keep pace with this highly competitive field, and be able to build your own segmentation model, then this Transformer-based segmentation survey is not to be missed! Recently, several researchers from Nanyang Technological University and the Shanghai Artificial Intelligence Laboratory wrote a survey on Transformer-based segmentation, systematically reviewing recent Transformer-based segmentation and detection models and covering work up to June of this year. The survey also includes the latest papers in related fields, a large number of experimental analyses and comparisons, and a number of promising directions for future research!

Visual segmentation aims to partition images, video frames, or point clouds into multiple segments or groups. This technology has many real-world applications, such as autonomous driving, image editing, robot perception, and medical analysis. Over the past decade, deep learning-based methods have made significant progress in this field. Recently, the Transformer, a neural network based on self-attention originally designed for natural language processing, has significantly surpassed previous convolutional or recurrent approaches on a variety of visual tasks. Specifically, Vision Transformers provide powerful, unified, and even simpler solutions for various segmentation tasks. This survey provides a comprehensive overview of Transformer-based visual segmentation, summarizing recent advances. It first reviews the background, including problem definitions, datasets, and prior convolutional methods. Next, it summarizes a meta-architecture that unifies all recent Transformer-based methods. Based on this meta-architecture, it examines various method designs, including modifications to the meta-architecture and related applications. The survey also introduces several related settings, including 3D point cloud segmentation, foundation model tuning, domain-adaptive segmentation, efficient segmentation, and medical segmentation. Furthermore, it compiles and re-evaluates these methods on several widely recognized datasets. Finally, it identifies open challenges in this field and proposes directions for future research. The authors will continue to track the latest Transformer-based segmentation and detection methods.


Project address: https://github.com/lxtGH/Awesome-Segmentation-With-Transformer

Paper address: https://arxiv.org/pdf/2304.09854.pdf

Research Motivation

  • The emergence of ViT and DETR has driven rapid progress in segmentation and detection. The top-ranked methods on almost every benchmark are now Transformer-based, which makes a systematic summary and comparison of the methods and technical characteristics of this direction necessary.
  • Recent large-model architectures are all based on the Transformer structure, including multi-modal models and segmentation foundation models (SAM), and various visual tasks are converging toward unified modeling.
  • Segmentation and detection have spawned many related downstream tasks, many of which are also solved with the Transformer structure.

Survey Highlights

  • Systematic and readable. This survey systematically reviews every segmentation task definition, along with related task definitions and evaluation metrics. It starts from convolutional methods and distills a meta-architecture based on ViT and DETR. Using this meta-architecture, it organizes and summarizes related methods and systematically reviews recent work. The technical roadmap is shown in Figure 1.
  • Detailed classification from a technical perspective. Compared with previous Transformer surveys, this article classifies methods in finer detail. It groups papers with similar ideas and compares their similarities and differences. For example, methods that modify the decoder side of the meta-architecture are divided into image-based cross-attention designs and video-based spatio-temporal cross-attention designs.
  • Comprehensive coverage of the research questions. This survey systematically reviews all directions of segmentation, including image, video, and point cloud segmentation tasks. It also covers related directions such as open-set segmentation and detection models, unsupervised segmentation, and weakly supervised segmentation.


Figure 1. Survey’s content roadmap


Figure 2. Summary of commonly used datasets and segmentation tasks

Summary and comparison of Transformer-based segmentation and detection methods


Figure 3. General meta-architecture framework

This article first summarizes a meta-architecture based on the DETR and MaskFormer frameworks. This model includes the following modules (a minimal sketch follows the list):

  • Backbone: Feature extractor, used to extract image features.
  • Neck: Construct multi-scale features to handle multi-scale objects.
  • Object Query: used to represent each entity in the scene, including foreground and background objects.
  • Decoder: used to progressively refine the object queries against the corresponding features.
  • End-to-End Training: The design based on Object Query can achieve end-to-end optimization.
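
To make this meta-architecture concrete, here is a minimal PyTorch sketch assuming a ResNet-50 backbone and a single-scale neck. The class name `MetaSegmenter` and all hyperparameters are illustrative assumptions, not the survey's reference implementation; real systems use multi-scale necks (e.g., FPN or deformable attention) and more elaborate heads.

```python
import torch
import torch.nn as nn
import torchvision


class MetaSegmenter(nn.Module):
    """Sketch of the DETR/MaskFormer-style meta-architecture:
    backbone -> neck -> object queries -> decoder -> prediction heads."""

    def __init__(self, num_queries=100, hidden_dim=256, num_classes=80):
        super().__init__()
        # Backbone: any feature extractor; ResNet-50 here for illustration.
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # Neck: a single 1x1 projection standing in for a multi-scale FPN.
        self.neck = nn.Conv2d(2048, hidden_dim, kernel_size=1)
        # Object queries: one learned embedding per potential entity.
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        # Decoder: stacked layers that progressively refine the queries.
        layer = nn.TransformerDecoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # Heads: class logits and a mask embedding per query.
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1: "no object"
        self.mask_head = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, images):                       # images: (B, 3, H, W)
        feats = self.neck(self.backbone(images))     # (B, C, h, w)
        memory = feats.flatten(2).transpose(1, 2)    # (B, h*w, C)
        queries = self.query_embed.weight.unsqueeze(0).expand(images.shape[0], -1, -1)
        queries = self.decoder(queries, memory)      # (B, Q, C)
        logits = self.class_head(queries)            # (B, Q, num_classes + 1)
        # Mask classification: dot product of mask embeddings and pixel features.
        masks = torch.einsum("bqc,bchw->bqhw", self.mask_head(queries), feats)
        return logits, masks
```

End-to-end training then amounts to Hungarian-matching the Q predictions to ground-truth entities and applying classification and mask losses, as in DETR and MaskFormer.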

Based on this meta-architecture, existing methods can be divided into five directions for optimization and adjustment according to the task, as shown in Figure 4; each direction contains several sub-directions.


Figure 4. Summary and comparison of Transformer-Based Segmentation methods

  • Better feature representation learning (Representation Learning). Stronger visual feature representations consistently lead to better segmentation results. This article divides related work into three aspects: better vision Transformer design, hybrid CNN/Transformer/MLP architectures, and self-supervised learning.
  • Method design on the decoder side (Interaction Design in Decoder). This chapter reviews new Transformer decoder designs, divided into two groups: improved cross-attention designs for image segmentation, and improved spatio-temporal cross-attention designs for video segmentation. The former focuses on designing a better decoder than the one in the original DETR (see the masked cross-attention sketch after this list). The latter extends query-based object detectors and segmenters to the video domain for video object detection (VOD), video instance segmentation (VIS), and video panoptic segmentation (VPS), focusing on modeling temporal consistency and correlation.
  • Optimizing Object Query. Compared with Faster R-CNN, DETR requires a much longer training schedule to converge. Given the key role of object queries, existing methods have studied how to speed up training and improve performance. This article divides the literature into two aspects: adding positional information and using additional supervision. Positional information provides cues for fast sampling of query features during training (see the anchor-query sketch below); additional supervision focuses on designing specific loss functions beyond DETR's defaults.
  • Using query objects to associate features and instances (Using Query For Association). Benefiting from the simplicity of query objects, multiple recent studies use them as an association tool for downstream tasks. There are two main usages: instance-level association and task-level association. The former applies the idea of instance discrimination to instance-level matching problems in videos, such as video segmentation and tracking (see the association sketch below); the latter uses query objects to bridge different subtasks for efficient multi-task learning.
  • Multi-modal conditional query generation (Conditional Query Generation). This chapter focuses on multi-modal segmentation tasks. Conditional queries are mainly used to handle cross-modal and cross-image feature matching. Depending on the task's input conditions, the decoder head uses different queries to obtain the corresponding segmentation masks. According to the source of the input, this article divides these works into two aspects: language features and image features. These methods fuse query objects with features of different modalities and achieve good results on multiple multi-modal segmentation tasks and few-shot segmentation.
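
As an example of the image cross-attention designs mentioned above, the sketch below shows masked cross-attention in the spirit of Mask2Former: each query attends only to pixels inside the mask it predicted at the previous decoder layer. The class name and the 0.5 threshold are illustrative assumptions, not the survey's reference implementation.

```python
import torch.nn as nn


class MaskedCrossAttentionLayer(nn.Module):
    """One decoder refinement step where queries cross-attend only to
    foreground pixels from the previous layer's mask prediction."""

    def __init__(self, hidden_dim=256, nhead=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, nhead, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, queries, pixel_feats, prev_mask_logits):
        # queries: (B, Q, C); pixel_feats: (B, h*w, C); prev_mask_logits: (B, Q, h*w)
        attn_mask = prev_mask_logits.sigmoid() < 0.5   # True = blocked pixel
        # If a query's predicted mask is empty, let it attend everywhere
        # instead, to avoid an all-blocked (NaN-producing) attention row.
        empty = attn_mask.all(dim=-1, keepdim=True)
        attn_mask = attn_mask & ~empty
        # MultiheadAttention expects a (B * nhead, Q, h*w) boolean mask.
        attn_mask = attn_mask.repeat_interleave(self.cross_attn.num_heads, dim=0)
        out, _ = self.cross_attn(queries, pixel_feats, pixel_feats, attn_mask=attn_mask)
        return self.norm(queries + out)
```

Restricting attention to predicted foreground focuses each query on its own entity, which Mask2Former-style methods report helps the decoder converge faster than plain DETR cross-attention.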
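
For the positional-information branch of query optimization, one common design gives each query an explicit anchor box whose sinusoidal encoding is added to its content embedding, in the spirit of anchor-/DAB-style DETR variants. The module name and encoding layout below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AnchorQueries(nn.Module):
    """Object queries with explicit positional priors: each query owns a
    learned anchor box (cx, cy, w, h) encoded sinusoidally into its embedding."""

    def __init__(self, num_queries=100, hidden_dim=256):
        super().__init__()
        self.content = nn.Embedding(num_queries, hidden_dim)
        self.anchors = nn.Embedding(num_queries, 4)  # box logits, sigmoid -> [0, 1]

    def positional_encoding(self, boxes, dim):
        # boxes: (Q, 4) -> (Q, dim); dim must be divisible by 8.
        freqs = torch.arange(dim // 8, device=boxes.device)
        scales = 10000 ** (freqs / (dim // 8))
        x = boxes.sigmoid().unsqueeze(-1) / scales    # (Q, 4, dim // 8)
        enc = torch.cat([x.sin(), x.cos()], dim=-1)   # (Q, 4, dim // 4)
        return enc.flatten(1)                         # (Q, dim)

    def forward(self):
        pos = self.positional_encoding(self.anchors.weight, self.content.embedding_dim)
        return self.content.weight + pos              # (Q, hidden_dim)
```

Because the anchors are learnable parameters, they can also be refined layer by layer in the decoder, which is one mechanism such methods use to speed up convergence.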
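
For instance-level association, a common pattern is to match the decoder's per-frame query embeddings across adjacent frames by similarity, so that the same query identity tracks the same instance. The sketch below, with an illustrative function name, uses cosine similarity plus Hungarian matching.

```python
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def associate_queries(prev_queries, curr_queries):
    """Link instances across two frames by matching query embeddings.
    prev_queries, curr_queries: (Q, C) decoder outputs of adjacent frames.
    Returns (prev_idx, curr_idx) index arrays pairing the same instance."""
    sim = F.normalize(prev_queries, dim=-1) @ F.normalize(curr_queries, dim=-1).T
    # Hungarian matching maximizes total similarity (minimizes its negation).
    prev_idx, curr_idx = linear_sum_assignment(-sim.detach().cpu().numpy())
    return prev_idx, curr_idx
```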

Figure 5 compares representative works across these five directions. For more detailed method descriptions and comparisons, please refer to the paper.


Figure 5. Summary and comparison of representative Transformer-based segmentation and detection methods

This article also explores several related fields:

  • Point cloud segmentation methods based on Transformer.
  • Tuning of vision and multi-modal large models.
  • Domain-related segmentation models, including domain transfer learning and domain generalization.
  • Efficient semantic segmentation, along with unsupervised and weakly supervised segmentation models.
  • Class-agnostic segmentation and tracking.
  • Medical image segmentation.


Figure 6. Summary and comparison of Transformer-based methods in related research fields

Comparison of experimental results of different methods


Figure 7. Benchmark experiments on semantic segmentation datasets


Figure 8. Benchmark experiments on panoptic segmentation datasets

Under the same experimental conditions, this article compares the results of several representative works on panoptic segmentation and semantic segmentation across multiple datasets. It finds that with the same training strategy and encoder, the performance gap between methods narrows.

In addition, the article compares recent Transformer-based segmentation methods on multiple datasets and tasks: semantic segmentation, instance segmentation, panoptic segmentation, and the corresponding video segmentation tasks.

Future Directions

Finally, the article analyzes possible future research directions; three are given here as examples.

  • A more general and unified segmentation model. Using the Transformer structure to unify different segmentation tasks is a clear trend. Recent research uses query-based Transformers to perform different segmentation tasks under one architecture. One possible research direction is to unify image and video segmentation tasks across various segmentation datasets through a single model. Such general models could achieve versatile and robust segmentation in diverse scenarios; for example, detecting and segmenting rare categories in various scenes helps robots make better decisions.
  • Segmentation model combined with visual reasoning. Visual reasoning requires the robot to understand the connections between objects in the scene, and this understanding plays a key role in motion planning. Previous research has explored using segmentation results as input to visual reasoning models for various applications such as object tracking and scene understanding. Joint segmentation and visual reasoning can be a promising direction, with mutually beneficial potential for both segmentation and relational classification. By incorporating visual reasoning into the segmentation process, researchers can leverage the power of reasoning to improve segmentation accuracy, while segmentation results can also provide better input for visual reasoning.
  • Continual learning for segmentation models. Existing segmentation methods are usually benchmarked on closed-world datasets with a predefined set of categories, i.e., the training and test samples are assumed to share the same, previously known categories and feature space. However, real-world scenarios are often open-world and non-stationary, and data from new categories may continually emerge; for example, unanticipated situations can suddenly arise in autonomous driving or medical diagnosis. There is a clear gap between the performance and capabilities of existing methods in closed-world and real-world scenarios. It is therefore desirable to gradually and continually incorporate new concepts into a segmentation model's existing knowledge base, enabling the model to engage in lifelong learning.

For more research directions, please refer to the original paper.
