Chinese team upends CV! SEEM segments everything and goes viral, segmenting "Everything Everywhere All at Once" with one click
The arrival of Meta’s “Segment Anything” model (SAM) led many people to exclaim that CV no longer exists.
Based on this model, many netizens have done further work, such as Grounded SAM.
By combining Stable Diffusion, Whisper, and ChatGPT, you can turn a dog into a monkey through voice.
And now it is not just voice: you can segment everything everywhere all at once through multi-modal prompts.
How does it work in practice?
Click the mouse to directly select the content to segment.
Or simply say a word out loud.
Scribble across the image and a complete meme sticker is cut out.
You can even segment videos.
The latest research on SEEM was jointly completed by scholars from the University of Wisconsin-Madison, Microsoft Research and other institutions.
SEEM makes it easy to segment images with different kinds of prompts: visual prompts (points, marks, boxes, scribbles, and image segments) as well as language prompts (text and audio).
Paper address: https://arxiv.org/pdf/2304.06718.pdf
The interesting thing about the title of this paper is that it is very similar to the name of an American science fiction movie "Everything Everywhere All at Once" released in 2022.
NVIDIA scientist Jim Fan said that the Oscar for best paper title goes to "Segment Everything Everywhere All at Once".
He added that a unified, versatile task specification interface is key to scaling up large foundation models, and that multimodal prompts are the way of the future.
After reading the paper, netizens said that CV is now starting to embrace large models too, and asked what future is left for graduate students.
Inspired by the development of prompt-based universal interfaces for LLMs, the researchers proposed SEEM.
As shown in the figure, the SEEM model can perform any open-set segmentation task without prompts, such as semantic segmentation, instance segmentation, and panoptic segmentation.
Additionally, it supports any combination of visual, textual, and referring-region prompts, allowing for versatile and interactive referring segmentation.
In terms of model architecture, SEEM adopts a generic encoder-decoder architecture. What makes it unique is the complex interaction between queries and prompts.
Image features and prompts are encoded into a joint visual-semantic space by their corresponding encoders or samplers.
The learnable queries are randomly initialized. The SEEM decoder takes the learnable queries, image features, and text prompts as input, and outputs class and mask embeddings that are used for mask and semantic prediction.
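As a rough illustration of that last step, the sketch below shows one common way query-based decoders turn mask embeddings into masks: take the dot product between each query's mask embedding and every pixel feature, then apply a sigmoid and a threshold. This is an assumption about the general technique, not SEEM's actual code. The function name `decode_masks`, the two-dimensional features, and all the numbers are made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decode_masks(mask_embeddings, pixel_features, threshold=0.5):
    """Return one boolean mask per query (hypothetical helper, not SEEM's API)."""
    masks = []
    for query in mask_embeddings:
        # Score each pixel by the dot product between the query's mask
        # embedding and that pixel's feature vector.
        logits = [sum(q * p for q, p in zip(query, pixel)) for pixel in pixel_features]
        masks.append([sigmoid(logit) > threshold for logit in logits])
    return masks


# Toy "image" of four pixels with 2-D features: two pixels per object.
pixel_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Made-up mask embeddings, as if the decoder had produced them for two queries.
mask_embeddings = [[4.0, -4.0], [-4.0, 4.0]]

masks = decode_masks(mask_embeddings, pixel_features)
print(masks)
```

In this toy setup each query claims the two pixels whose features align with its embedding. A real model would learn both the pixel features and the embeddings.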
It is worth mentioning that the SEEM model supports multiple rounds of interaction. Each round consists of a human loop and a model loop.
In the human loop, a person receives the mask output of the previous iteration and gives positive or negative feedback for the next round of decoding through visual prompts. In the model loop, the model receives and updates memory prompts for future predictions.
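The control flow of these two loops can be sketched with a toy example. Everything here is hypothetical and only mirrors the structure described above. The "image" is a one-dimensional row of pixels, the "model" labels each pixel by its nearest remembered click, and the "human" clicks the first wrong pixel. SEEM's real memory prompts are learned embeddings, not a list of clicks.

```python
def predict_mask(memory, width):
    """Toy model: label each pixel with the label of the nearest remembered click."""
    mask = []
    for x in range(width):
        if not memory:
            mask.append(False)  # nothing remembered yet: predict background
            continue
        # Ties in distance are broken in favour of the click with the smaller index.
        nearest = min(memory, key=lambda click: (abs(click[0] - x), click[0]))
        mask.append(nearest[1])
    return mask


def human_feedback(mask, target):
    """Toy human loop: click the first wrong pixel (True = positive, False = negative)."""
    for x, (got, want) in enumerate(zip(mask, target)):
        if got != want:
            return (x, want)
    return None  # the user is satisfied


def interact(target, max_rounds=10):
    memory = []  # stands in for the memory prompts kept across rounds
    mask = predict_mask(memory, len(target))
    rounds = 0
    while rounds < max_rounds:
        click = human_feedback(mask, target)  # human loop
        if click is None:
            break
        memory.append(click)                      # model loop: update memory
        mask = predict_mask(memory, len(target))  # and decode again
        rounds += 1
    return mask, memory, rounds


target = [False, False, True, True, True, False, False, False]
mask, memory, rounds = interact(target)
print(rounds, mask == target)
```

The point of the sketch is that the history of feedback is kept and reused, so each new prediction depends on all earlier rounds rather than only the latest click.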
With SEEM, given an image of the Optimus Prime truck, you can segment Optimus Prime in any target image.
It can also generate a mask from text entered by the user, giving one-click segmentation.
In addition, with a simple click or scribble on a reference image, SEEM can segment objects with similar semantics in the target image.
SEEM also understands spatial relationships very well. After the zebra at the left of the top row is scribbled on, the leftmost zebra in the target image is segmented as well.
SEEM can also turn a referring image into video masks. It requires no training on video data and can still segment videos perfectly.
For datasets and settings, SEEM was trained on three kinds of data: panoptic segmentation, referring segmentation, and interactive segmentation.
Interactive segmentation
On interactive segmentation, researchers compared SEEM with state-of-the-art interactive segmentation models.
As a general-purpose model, SEEM achieves performance comparable to RITM, SimpleClick, and others. It also performs very close to SAM, even though SAM was trained on 50 times more segmentation data.
Notably, unlike existing interactive models, SEEM is the first to support not only classic segmentation tasks but also a wide range of multi-modal inputs, including text, points, scribbles, bounding boxes, and images, providing powerful compositional capabilities.
Generic segmentation
With a single set of parameters pre-trained across all segmentation tasks, the researchers can directly evaluate its performance on generic segmentation datasets.
SEEM achieves fairly good panoptic, instance, and semantic segmentation performance.
The researchers set four goals for SEEM:
1. Versatility: a versatile prompt engine handles different types of prompts, including points, boxes, scribbles, masks, text, and referring regions of another image;
2. Compositionality: by learning a joint visual-semantic space, visual and textual prompts can be combined on the fly for query reasoning;
3. Interactivity: learnable memory prompts retain the dialogue history through mask-guided cross-attention;
4. Semantic awareness: open-vocabulary segmentation is achieved by using a text encoder to encode text queries and mask labels.
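The fourth goal can be illustrated with a minimal sketch of open-vocabulary labeling: a mask's class embedding is compared with text embeddings of candidate label names, and the closest one wins. The embeddings below are hand-written stand-ins, since no real text encoder is called, and the helper name `classify` is hypothetical. Treat this as an assumption about the general idea, not SEEM's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm


def classify(class_embedding, label_embeddings):
    """Pick the label whose text embedding is closest to the mask's class embedding."""
    scores = {name: cosine(class_embedding, emb) for name, emb in label_embeddings.items()}
    return max(scores, key=scores.get), scores


# Made-up text embeddings; a real system would get these from a text encoder.
label_embeddings = {
    "zebra": [0.9, 0.1, 0.0],
    "truck": [0.0, 0.2, 0.9],
    "dog": [0.1, 0.9, 0.1],
}
# Made-up class embedding for one predicted mask.
query_class_embedding = [0.8, 0.2, 0.1]

label, scores = classify(query_class_embedding, label_embeddings)
print(label)
```

Because the vocabulary is just a dictionary passed in at inference time, labels can be added or swapped without retraining. That is the sense in which such a scheme is "open vocabulary".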
The difference between SEEM and SAM
The SAM model proposed by Meta lets you specify a point, a bounding box, or a sentence within a unified prompt-encoder framework and segment objects with one click. SAM is broadly versatile: it has zero-shot transfer ability, which is enough to cover a wide range of use cases. Without additional training, it can be used out of the box in new image domains, whether underwater photos or cell microscopy.
The researchers compared SEEM and SAM in terms of interaction and semantic capabilities across three segmentation tasks (edge detection, open-set segmentation, and interactive segmentation).
Open-set segmentation, for example, requires a high level of semantics but no interaction.
Compared with SAM, SEEM covers a wider range of interactions and semantic levels.
SAM only supports limited interaction types, such as points and bounding boxes, and ignores high semantic tasks because it does not output semantic labels itself.
For SEEM, the researchers point out two highlights:
First, SEEM has a unified prompt encoder that encodes all visual and language prompts into a joint representation space. As a result, SEEM supports more general usage and can potentially be extended to custom prompts.
Second, SEEM works well at going from text to mask and outputs semantic-aware predictions.
The first author of the paper is Xueyan Zou.
She is currently a doctoral student in the Department of Computer Science at the University of Wisconsin-Madison, under the supervision of Professor Yong Jae Lee.
Prior to this, Zou spent three years at the University of California, Davis, under the same advisor, and worked closely with Dr. Fanyi Xiao.
She received her bachelor's degree from Hong Kong Baptist University, supervised by Professor PC Yuen and Professor Chu Xiaowen.
Jianwei Yang
Yang is a senior researcher in the deep learning group of Microsoft Research in Redmond, supervised by Dr. Jianfeng Gao.
Yang’s research mainly focuses on computer vision, vision and language, and machine learning. He focuses on different levels of structured visual understanding and how they can be further exploited for intelligent interaction with humans through language and environmental embodiment.
Before joining Microsoft in March 2020, Yang received his PhD in Computer Science from Georgia Tech’s School of Interactive Computing, where his advisor was Professor Devi Parikh. He also worked closely with Professor Dhruv Batra.
Jianfeng Gao
Jianfeng Gao is a Distinguished Scientist and Vice President at Microsoft Research, an IEEE member, and an ACM Distinguished Member. He currently leads the deep learning group. The group's mission is to advance the state of the art in deep learning and its applications in natural language and image understanding, and to make progress in conversational models and methods.
His research mainly covers neural language models for natural language understanding and generation, neuro-symbolic computing, vision-language grounding and understanding, conversational AI, and more.
From 2014 to 2018, Gao served as Partner Research Manager for Business AI in Microsoft AI and Research and at the Deep Learning Technology Center (DLTC) at Microsoft Research, Redmond.
From 2006 to 2014, Gao was a Principal Researcher in the Natural Language Processing group.
Yong Jae Lee
Lee is an associate professor in the Department of Computer Sciences at the University of Wisconsin-Madison. Before joining UW-Madison in the fall of 2021, he spent a year as a visiting faculty member in AI at Cruise, and before that he served as an assistant and then associate professor at the University of California, Davis for 6 years.
He also spent a year as a postdoctoral researcher at the Robotics Institute at Carnegie Mellon University.
He received his PhD from the University of Texas at Austin in May 2012, advised by Kristen Grauman, and his bachelor's degree from the University of Illinois at Urbana-Champaign in May 2006.
He also worked as a summer intern at Microsoft Research with Larry Zitnick and Michael Cohen.
Currently, Lee’s research focuses on computer vision and machine learning. Lee is particularly interested in creating powerful visual recognition systems that can understand visual data with minimal human supervision.
SEEM now has a public demo: https://huggingface.co/spaces/xdecoder/SEEM
Come and try it out.