CLIP as RNN accepted to CVPR: segmenting countless concepts without training | Oxford University & Google Research

PHPz

Jun 09, 2024 pm 12:53 PM

By calling CLIP in a loop, the method can segment countless concepts without any additional training.

It handles any phrase, including movie characters, landmarks, brands, and general categories.

This new result from a joint team at Oxford University and Google Research has been accepted to CVPR 2024, and the code has been open-sourced.

The team proposes a new technique called CLIP as RNN (CaR for short), which addresses several key problems in open-vocabulary image segmentation:

No training data required: Traditional methods need extensive mask annotations or image-text datasets for fine-tuning, whereas CaR works without any additional training data.
Open-vocabulary limitations: Once pre-trained vision-language models (VLMs) are fine-tuned, their ability to handle an open vocabulary is restricted. CaR preserves the VLM's broad vocabulary space.
Text queries for concepts absent from the image: Without fine-tuning, VLMs struggle to produce accurate segmentations when a query refers to a concept that is not in the image. CaR refines its results step by step through an iterative process, improving segmentation quality.

Inspired by RNNs: calling CLIP in a loop

To understand how CaR works, it helps to first review the recurrent neural network (RNN).

An RNN introduces a hidden state, which acts like a "memory" that stores information from past time steps. Every time step shares the same set of weights, which lets the network model sequence data well.
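The hidden-state idea can be illustrated with a minimal scalar sketch. This is not code from the CaR paper; the function names `rnn_step` and `run_rnn` and the weight values are assumptions chosen only to show that the same weights are reused at every step and that the state carries information forward.

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0, b=0.0):
    """One recurrent step: mix the previous hidden state with the current input.

    The weights w_h, w_x, b are illustrative constants, not learned values.
    """
    return math.tanh(w_h * h_prev + w_x * x + b)

def run_rnn(inputs, h0=0.0):
    """Apply the same step function (hence the same weights) at every time step."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(h, x)
        states.append(h)
    return states

# The input is non-zero only at the first step, yet later states stay non-zero:
# the hidden state "remembers" what it saw earlier.
states = run_rnn([1.0, 0.0, 0.0])
```

In this toy run the second and third inputs are zero, so any non-zero value in the later states comes purely from the carried-over hidden state.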

Inspired by RNNs, CaR is also designed as a recurrent framework consisting of two parts:

Mask proposal generator: generates a mask for each text query with the help of CLIP.
Mask classifier: another CLIP model then scores how well each generated mask matches its text query. Queries with a low matching score are eliminated.

As the iterations continue, the set of text queries becomes more accurate and the masks improve in quality.

Finally, when the query set no longer changes, the final segmentation result is output.
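The loop described above can be sketched roughly as follows. This is a simplified illustration, not the team's implementation: `car_loop`, `propose`, `score`, `threshold`, and `max_iters` are hypothetical names, and the two CLIP-based components are replaced by toy stand-in functions so the control flow can be run on its own.

```python
def car_loop(image, queries, propose, score, threshold=0.5, max_iters=10):
    """Repeat propose -> classify -> filter until the query set stops changing.

    propose(image, queries) -> dict mapping each query to a mask (hypothetical)
    score(image, mask, query) -> float matching score (hypothetical)
    """
    current = list(queries)
    masks = {}
    for _ in range(max_iters):
        masks = propose(image, current)  # one mask proposal per remaining query
        kept = [q for q in current if score(image, masks[q], q) >= threshold]
        if kept == current:              # query set unchanged: converged
            break
        current = kept                   # drop low-scoring queries and repeat
    return current, {q: masks[q] for q in current}

# Toy stand-ins for the two frozen CLIP components (assumptions for the demo).
def toy_propose(image, queries):
    return {q: f"mask<{q}>" for q in queries}

def toy_score(image, mask, query):
    return 0.9 if query in image["contents"] else 0.1

image = {"contents": {"cat", "grass"}}
final_queries, final_masks = car_loop(
    image, ["cat", "grass", "dog"], toy_propose, toy_score
)
```

In this toy run, "dog" is dropped in the first pass because its score falls below the threshold, and the second pass leaves the query set unchanged, so the loop stops. The same functions are reused in every pass, which mirrors the shared weights of an RNN.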

This recurrent framework is designed to retain as much of the "knowledge" from CLIP's pre-training as possible.

CLIP sees a huge number of concepts during pre-training, covering everything from celebrities and landmarks to anime characters. Fine-tuning on a segmentation dataset is bound to shrink that vocabulary significantly.

For example, the "Segment Anything" model (SAM) can recognize a bottle of Coca-Cola but fails to recognize a bottle of Pepsi-Cola.

However, using CLIP directly for segmentation does not give satisfactory results.

This is because CLIP's pre-training objective was not designed for dense prediction. In particular, when a text query refers to something that is not in the image, CLIP can easily generate incorrect masks.

CaR addresses this problem through RNN-style iteration. By repeatedly evaluating and filtering the queries while refining the masks, it ultimately achieves high-quality open-vocabulary segmentation.

Finally, let's follow the team's own explanation and look at the details of the CaR framework.

CaR technical details

Recurrent framework: CaR adopts a novel recurrent framework that iteratively refines the correspondence between text queries and the image.
Two-stage segmenter: It consists of a mask proposal generator and a mask classifier, both built on the pre-trained CLIP model, whose weights remain unchanged throughout the iterations.
Mask proposal generation: Grad-CAM is used to generate mask proposals from the similarity scores between image and text features.
Visual prompts: Visual prompts such as red circles and background blur are applied to focus the model's attention on specific regions of the image.
Threshold function: A similarity threshold is set so that only mask proposals that align closely with the text query are kept.
Post-processing: Masks are refined with a dense conditional random field (CRF) and, optionally, the SAM model.
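The threshold step in the list above can be illustrated with a small sketch. It assumes each query already has a text embedding and an embedding of the visually prompted image, and it scores their alignment with plain cosine similarity. The exact scoring used in the paper may differ, and the names `cosine`, `filter_by_threshold`, and the toy 2-D vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_by_threshold(mask_embeddings, text_embeddings, threshold):
    """Keep only queries whose masked-image embedding aligns with the text embedding."""
    kept = {}
    for query, text_vec in text_embeddings.items():
        score = cosine(mask_embeddings[query], text_vec)
        if score >= threshold:
            kept[query] = score
    return kept

# Toy embeddings (assumptions): the "cat" mask aligns with its text,
# while the "dog" mask points in a nearly unrelated direction.
text_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
mask_embeddings = {"cat": [0.9, 0.1], "dog": [1.0, 0.2]}
kept = filter_by_threshold(mask_embeddings, text_embeddings, threshold=0.5)
```

Here only "cat" survives the filter. A higher threshold removes more queries per pass, while a lower one keeps more uncertain proposals for the next iteration.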

With these techniques, CaR achieves significant performance gains on multiple standard datasets, surpassing previous zero-shot methods and remaining competitive with models fine-tuned on large amounts of data. According to the results reported by the team, although CaR requires no additional training or fine-tuning, it outperforms previous methods fine-tuned on additional data on eight different zero-shot semantic segmentation metrics.

The authors also tested CaR on zero-shot referring segmentation, where it again outperformed previous zero-shot methods.

To sum up, CaR (CLIP as RNN) is a novel recurrent framework that performs zero-shot semantic and referring image segmentation without additional training data. It significantly improves segmentation quality by preserving the broad vocabulary space of pre-trained vision-language models and by iteratively refining the alignment between text queries and mask proposals.

CaR's advantages are its ability to handle complex text queries without fine-tuning and its extensibility to video, which marks a breakthrough for open-vocabulary image segmentation.

Paper link: https://arxiv.org/abs/2312.07661
Project homepage: https://torrvision.com/clip_as_rnn/
