


Google team launches new Transformer to optimize panoptic segmentation
Recently, the Google AI team, inspired by Transformer and DETR, proposed an end-to-end solution for panoptic segmentation with mask Transformers.
A mask Transformer is an extension of the Transformer architecture that is used to generate segmentation masks.
The solution uses a pixel path (a convolutional neural network or a vision Transformer) to extract pixel features, a memory path (Transformer decoder modules) to extract memory features, and a dual-path Transformer for the interaction between pixel features and memory features.
However, the dual-path Transformer, which relies on cross-attention, was originally designed for language tasks, where the input sequence consists of hundreds of words.
For vision tasks, especially segmentation, the input sequence consists of tens of thousands of pixels. The input is therefore not only much larger in scale, but each element is also a lower-level embedding than a word in a language task.
Panoptic segmentation is a computer vision problem that is now a core task in many applications.
It is often divided into two parts: semantic segmentation and instance segmentation.
Semantic segmentation assigns a semantic label, such as "person" or "sky", to every pixel in the image.
Instance segmentation identifies and segments only the countable objects in the image, such as "pedestrians" and "cars". The problem is then divided further into several subtasks.
Each subtask is processed individually, and additional modules are applied to merge the results of each subtask stage.
This process is not only complex, but also introduces many hand-designed priors, both when processing the subtasks and when merging the results of different subtasks.
In "CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation", published at CVPR 2022, the authors propose to reinterpret and redesign cross-attention from a clustering perspective (that is, grouping pixels with the same semantic label into the same group) so that it better suits vision tasks.
CMT-DeepLab builds on the previous state-of-the-art method MaX-DeepLab and uses pixel clustering to perform cross-attention, which yields denser and more plausible attention maps.
kMaX-DeepLab further redesigns cross-attention to behave more like the k-means clustering algorithm, with a simple change to the activation function.
Structural Overview
Rather than applying cross-attention to vision tasks without modification, the researchers reinterpret it from a clustering perspective.
Specifically, they note that the object queries of a mask Transformer can be thought of as cluster centers, whose purpose is to group pixels with the same semantic label.
The cross-attention process then resembles the k-means clustering algorithm, which iterates over two steps: (1) assigning pixels to cluster centers, where multiple pixels can be assigned to a single cluster center and some cluster centers may receive no pixels, and (2) updating each cluster center by averaging the pixels assigned to it (a cluster center with no assigned pixels is not updated).
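The two k-means steps described above can be sketched in a few lines of NumPy. This is a minimal illustration of plain k-means on toy 2-D points, not code from CMT-DeepLab or kMaX-DeepLab; the function name `kmeans_step` and the sample data are assumptions made for the example.

```python
import numpy as np

def kmeans_step(pixels, centers):
    """One plain k-means iteration (hypothetical helper for illustration)."""
    # Squared Euclidean distance between every pixel and every center: (N, K).
    dists = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    # Step 1: assign each pixel to its nearest cluster center.
    assignment = dists.argmin(axis=1)
    # Step 2: move each center to the mean of the pixels assigned to it.
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = pixels[assignment == k]
        if len(members) > 0:  # a center with no pixels is left unchanged
            new_centers[k] = members.mean(axis=0)
    return assignment, new_centers

# Toy data: four "pixels" with 2-D features and three cluster centers.
pixels = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = np.array([[1.0, 1.0], [9.0, 9.0], [100.0, 100.0]])
assignment, centers = kmeans_step(pixels, centers)
```

In this toy run the third center attracts no pixels, so it keeps its initial value, which mirrors the "not updated" case in step (2).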
In CMT-DeepLab and kMaX-DeepLab, cross-attention is reformulated from a clustering perspective as iterative cluster-assignment and cluster-update steps.
Inspired by the k-means clustering algorithm, CMT-DeepLab redesigns cross-attention so that the spatial-wise softmax (the softmax applied along the spatial resolution of the image), which in effect assigns cluster centers to pixels, is instead applied along the cluster centers.
kMaX-DeepLab further simplifies this softmax to a cluster-wise argmax (the argmax operation applied along the cluster centers).
The researchers note that the argmax operation is the same as the hard assignment used in the k-means clustering algorithm, in which each pixel is assigned to exactly one cluster.
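The difference between the three normalizations comes down to which axis of the affinity matrix is used. The sketch below assumes a toy matrix of affinity logits with one row per cluster center and one column per pixel; the values and variable names are illustrative only.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy affinity logits: K=2 cluster centers (rows) by N=4 pixels (columns).
logits = np.array([[2.0, 1.0, 0.0, -1.0],
                   [0.0, 3.0, 1.0,  2.0]])

# Spatial-wise softmax: normalize over pixels, so each row sums to 1.
spatial = softmax(logits, axis=1)

# Cluster-wise softmax: normalize over cluster centers, so each column sums to 1.
cluster_soft = softmax(logits, axis=0)

# Cluster-wise argmax: one-hot hard assignment of each pixel to a single center.
hard = np.zeros_like(logits)
hard[logits.argmax(axis=0), np.arange(logits.shape[1])] = 1.0
```

With the hard assignment, every column contains exactly one 1, which is the k-means-style rule that a pixel belongs to exactly one cluster.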
Reformulating the mask Transformer's cross-attention from a clustering perspective significantly improves segmentation performance and simplifies the complex mask Transformer pipeline, making it more interpretable.
First, an encoder-decoder structure is used to extract pixel features from the input image. The pixels are then grouped using a set of cluster centers, which are further updated based on the cluster assignments. Finally, the cluster assignment and update steps are performed iteratively, and the last assignment can be used directly as the segmentation prediction.
To convert a typical mask Transformer decoder (composed of cross-attention, multi-head self-attention, and a feed-forward network) into the k-means cross-attention proposed above, it is enough to replace the spatial-wise softmax with a cluster-wise argmax.
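The iterative assign-and-update loop can be sketched as follows. This is a heavily simplified illustration, not the DeepLab2 implementation: learned query, key, and value projections, normalization, self-attention, and the feed-forward network are all omitted, and the function name `kmeans_cross_attention` is hypothetical.

```python
import numpy as np

def kmeans_cross_attention(centers, pixel_feats):
    """One simplified k-means cross-attention update (illustrative sketch).

    centers:     (K, D) cluster centers, playing the role of object queries
    pixel_feats: (N, D) flattened pixel features
    """
    logits = centers @ pixel_feats.T                    # (K, N) affinities
    winners = logits.argmax(axis=0)                     # cluster-wise argmax per pixel
    assign = np.zeros_like(logits)
    assign[winners, np.arange(logits.shape[1])] = 1.0   # one-hot hard assignment
    updated = centers + assign @ pixel_feats            # residual update from assigned pixels
    return updated, winners

# Toy inputs: four pixels with 2-D features and two cluster centers.
pixel_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
centers = np.array([[0.6, 0.4], [0.4, 0.6]])

# A stack of decoder layers, reduced here to a plain loop.
for _ in range(3):
    centers, segmentation = kmeans_cross_attention(centers, pixel_feats)
```

After the loop, `segmentation` holds the last hard assignment, one cluster index per pixel, which is what the text describes as being usable directly as the segmentation prediction.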
The meta-architecture of kMaX-DeepLab consists of three components: a pixel encoder, an enhanced pixel decoder, and a kMaX decoder.
The pixel encoder can be any network backbone and is used to extract image features.
The enhanced pixel decoder includes Transformer encoders to enhance the pixel features and upsampling layers to generate higher-resolution features.
A series of kMaX decoders then transforms the cluster centers into (1) mask embedding vectors, which are multiplied with the pixel features to generate the predicted masks, and (2) class predictions for each mask.
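The final step, multiplying mask embeddings with pixel features, can be illustrated with random toy tensors. The shapes, the number of classes, and the simple argmax post-processing below are assumptions made for the sketch; a real model's inference pipeline is likely to include further filtering steps.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D, K = 4, 6, 8, 3                      # toy sizes, chosen only for illustration
pixel_feats = rng.normal(size=(H, W, D))     # stand-in for the pixel decoder output
mask_embed = rng.normal(size=(K, D))         # one mask embedding per cluster center
class_logits = rng.normal(size=(K, 5))       # per-mask scores over 5 hypothetical classes

# Multiply mask embeddings with pixel features: one logit map per mask.
mask_logits = np.einsum("hwd,kd->khw", pixel_feats, mask_embed)   # (K, H, W)

# Simplified post-processing: each pixel takes the mask with the highest
# response, and each mask takes its highest-scoring class.
mask_id = mask_logits.argmax(axis=0)         # (H, W) mask index per pixel
mask_class = class_logits.argmax(axis=1)     # (K,) class index per mask
semantic_map = mask_class[mask_id]           # (H, W) class label per pixel
```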
kMaX-DeepLab’s meta-architecture
Research results
Finally, the research team evaluated CMT-DeepLab and kMaX-DeepLab with the panoptic quality (PQ) metric on two of the most challenging panoptic segmentation datasets, COCO and Cityscapes, and compared them with MaX-DeepLab and other state-of-the-art methods.
CMT-DeepLab achieved a significant performance improvement, while kMaX-DeepLab not only simplified the modification but also improved the results further. It reached 58.0% PQ on the COCO val set, and 68.4% PQ, 44.0% mask average precision (mask AP), and 83.5% mean intersection over union (mIoU) on the Cityscapes validation set, without test-time augmentation or the use of external datasets.
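For readers unfamiliar with the metric, PQ for a single class is commonly defined as the sum of IoUs over matched segment pairs divided by |TP| + 0.5|FP| + 0.5|FN|, where a predicted and a ground-truth segment match when their IoU exceeds 0.5. The sketch below follows that standard definition; the function name and the toy numbers are assumptions, and this is not the evaluation code used in the study.

```python
def panoptic_quality(matched_ious, num_pred, num_gt):
    """PQ for one class (illustrative helper).

    matched_ious: IoU of each predicted/ground-truth pair with IoU > 0.5
    num_pred:     total number of predicted segments of this class
    num_gt:       total number of ground-truth segments of this class
    """
    tp = len(matched_ious)
    fp = num_pred - tp          # predictions without a match
    fn = num_gt - tp            # ground-truth segments without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom > 0 else 0.0

# Toy case: 3 predictions, 4 ground-truth segments, 2 matched pairs.
pq = panoptic_quality([0.9, 0.7], num_pred=3, num_gt=4)
```

The reported dataset-level PQ is the average of this per-class value over all classes.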
Because it is designed from a clustering perspective, kMaX-DeepLab not only performs better, but its attention maps can also be visualized more plausibly to understand how it works.
In the example below, kMaX-DeepLab iteratively performs cluster assignment and updates, gradually improving mask quality.
kMaX-DeepLab's attention map can be directly visualized as a panoptic segmentation, which makes the model's working mechanism easier to interpret.
Conclusion
This research demonstrates a way to better design mask Transformers for vision tasks.
With simple modifications, CMT-DeepLab and kMaX-DeepLab reformulate cross-attention to make it more like a clustering algorithm.
As a result, the proposed models achieve state-of-the-art performance on the COCO and Cityscapes datasets.
The research team hopes that the open-source release of kMaX-DeepLab in the DeepLab2 library will contribute to future research on designing vision-specific Transformer architectures.