
Prompt to cut out pictures with one click! Meta releases the first foundation model for image segmentation, creating a new paradigm for CV

王林 · 2023-04-07 15:00

Meta AI has just released the Segment Anything Model (SAM), the first foundation model for image segmentation.

SAM can segment any object in a photo or video with one click, and can transfer to other tasks zero-shot.


Overall, SAM follows the foundation-model recipe:

1. A simple yet scalable architecture that can handle multimodal prompts: text, key points, and bounding boxes.

2. An intuitive annotation process, tightly coupled with the model design.

3. A data flywheel that lets the model bootstrap itself onto vast numbers of unlabeled images.

And it is no exaggeration to say that SAM has learned a general concept of what an "object" is: this holds even for unknown objects, unfamiliar scenes (such as underwater or under a microscope), and blurry images.

In addition, SAM generalizes to new tasks and new domains, so practitioners no longer need to fine-tune a model themselves.

Paper address: https://ai.facebook.com/research/publications/segment-anything/

Most impressively, Meta implements a completely new CV paradigm: within a unified framework, a prompt encoder accepts a point, a bounding box, or a sentence, and the model segments the target object directly with one click.

Commenting on this, Tencent AI algorithm expert Jin Tian said: "The prompt paradigm from NLP is starting to extend into CV, and this time it may completely change CV's traditional approach to prediction. Now you really can use one model to segment any object, and interactively at that!"

NVIDIA AI scientist Jim Fan went further: this is computer vision's "GPT-3 moment"!

So, does CV really not exist anymore?

SAM: "Cut out" all objects in any image with one click

Segment Anything is the first foundation model dedicated to image segmentation.

Segmentation, which means identifying which image pixels belong to an object, has always been a core task of computer vision.

However, creating an accurate segmentation model for a specific task usually requires highly specialized work by experts: the process demands AI training infrastructure and large volumes of carefully annotated in-domain data, so the barrier to entry is extremely high.

To solve this problem, Meta proposed SAM, a foundation model for image segmentation. This promptable model, trained on diverse data, can not only adapt to a variety of tasks but can also be operated much the way prompts are used with NLP models.

The SAM model grasps the concept of "what is an object" and can generate a mask for any object in any image or video, even objects it has not seen during training.

SAM is so versatile that it covers a wide range of use cases and works out of the box in new imaging domains without additional training, whether underwater photos or cell microscopy. In other words, SAM is already capable of zero-shot transfer.

In its blog post, Meta said it expects SAM to be used in any application that needs to find and segment objects in images.

SAM can become part of a larger AI system to develop a more general multi-modal understanding of the world, for example, understanding the visual and textual content of web pages.

In the field of AR/VR, SAM can select objects based on the user’s line of sight and then “upgrade” the objects to 3D.

For content creators, SAM can extract image regions for collages or for video editing.

SAM can also locate and track animals or objects in videos, which is helpful for natural science and astronomy research.


General segmentation method

In the past, there were two methods to solve the segmentation problem.

One is interactive segmentation, which can segment objects of any category but requires a person to iteratively refine the mask.

The second is automatic segmentation, which can segment specific, predefined object categories but requires a large number of manually labeled objects for training (for example, thousands of examples just to segment cats).

In short, neither of these two methods can provide a universal, fully automatic segmentation method.

SAM can be seen as a generalization of these two methods: it handles both interactive and automatic segmentation with ease.

On the model's promptable interface, a wide range of segmentation tasks can be completed by simply designing the correct prompts (clicks, boxes, text, etc.) for the model.
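To make the idea of a promptable interface concrete, here is a toy sketch (not Meta's actual implementation; all names and dimensions are hypothetical) of how heterogeneous prompts such as clicks, boxes, and text can be encoded into a single embedding space that one decoder can consume:

```python
import numpy as np

EMBED_DIM = 8  # toy dimensionality; SAM's real encoders are far larger

def encode_prompt(prompt):
    """Encode heterogeneous prompts (point, box, text) into one shared space."""
    kind, value = prompt
    if kind == "point":                       # (x, y) click coordinates
        x, y = value
        vec = np.array([x, y] + [0.0] * (EMBED_DIM - 2))
    elif kind == "box":                       # (x0, y0, x1, y1) bounding box
        vec = np.array(list(value) + [0.0] * (EMBED_DIM - 4))
    elif kind == "text":                      # free-form string, hashed per character
        vec = np.zeros(EMBED_DIM)
        for i, ch in enumerate(value):
            vec[i % EMBED_DIM] += ord(ch) / 255.0
    else:
        raise ValueError(f"unknown prompt type: {kind}")
    return vec

# Every prompt type lands in the same space, so a single downstream
# decoder can treat them uniformly.
prompts = [("point", (120, 80)), ("box", (10, 10, 200, 150)), ("text", "the cat")]
embeddings = [encode_prompt(p) for p in prompts]
```

The point of the sketch is the design choice, not the math: once all prompt types share one embedding space, adding a new prompt modality only requires a new encoder branch, not a new model.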

Additionally, SAM is trained on a diverse, high-quality dataset containing over 1 billion masks, allowing it to generalize to new objects and images beyond what it observed during training. As a result, practitioners no longer need to collect their own segmentation data to fine-tune a model for their use case.

This flexibility to generalize to new tasks and new domains is a first in the field of image segmentation.

(1) SAM lets users segment an object with a single click, or by interactively clicking multiple points; it also accepts bounding-box prompts.

(2) When the object to segment is ambiguous, SAM can output multiple valid masks, an essential capability for solving real-world segmentation problems.

(3) SAM can automatically discover and mask all objects in an image.

(4) After precomputing the image embedding, SAM can generate a segmentation mask for any prompt in real time, letting users interact with the model live.

How it works

The trained SAM returns a valid segmentation mask for any prompt. A prompt can be foreground/background points, a rough box or mask, free-form text, or in general any information indicating what should be segmented in the image.

The requirement to output a valid mask simply means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt could mean either the shirt or the person wearing it), the output should be a reasonable mask for one of those objects.
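That ambiguity-handling behavior can be sketched as a decoder that returns several candidate masks together with confidence scores, leaving the caller to pick one or keep them all. This is a toy stand-in with fabricated shapes and scores, not SAM's real decoder:

```python
import numpy as np

def predict_masks(image_embedding, prompt, multimask_output=True):
    """Toy ambiguity-aware decoder: an ambiguous prompt yields several
    candidate masks, each paired with a confidence score."""
    h, w = 4, 4                              # toy mask resolution
    rng = np.random.default_rng(0)
    n = 3 if multimask_output else 1         # SAM returns 3 masks by default
    masks = rng.random((n, h, w)) > 0.5      # placeholder binary masks
    scores = np.sort(rng.random(n))[::-1]    # descending predicted quality
    return masks, scores

# An ambiguous click (e.g. on a shirt) produces multiple valid answers;
# downstream code can keep the highest-scoring one.
masks, scores = predict_masks(None, ("point", (2, 2)))
best = masks[int(np.argmax(scores))]
```

Returning all candidates rather than forcing a single answer is what lets an interactive UI offer the user a choice when a click alone cannot disambiguate.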



The researchers observed that the pre-training task and interactive data collection impose specific constraints on the model design.

In particular, the model needs to run in real time on a CPU in a web browser, so that annotators can interact with SAM efficiently in real time while labeling.

While runtime constraints mean there is a trade-off between quality and runtime, the researchers found that in practice, simple designs can achieve good results.

SAM's image encoder produces a one-time embedding for the image, while a lightweight prompt encoder converts any prompt into an embedding vector on the fly. These two sources of information are then combined in a lightweight decoder that predicts segmentation masks.

Once the image embedding has been computed, SAM can respond to any prompt in a web browser with a segmentation mask in just 50 milliseconds.
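The performance trick described above is amortization: pay for the heavy image encoder once, then answer every subsequent prompt with only the cheap decoder. A minimal sketch of that split (toy functions, not SAM's real components):

```python
import time
import numpy as np

def encode_image(image):
    """Heavy step: run once per image (stands in for SAM's large image encoder)."""
    time.sleep(0.05)                    # simulate an expensive forward pass
    return image.mean(axis=(0, 1))      # toy "embedding" of the image

def decode(embedding, prompt):
    """Light step: runs per prompt (stands in for SAM's lightweight decoder)."""
    x, y = prompt
    return float(embedding.sum() + x + y)  # placeholder "mask" value

image = np.zeros((64, 64, 3))
embedding = encode_image(image)                          # pay the cost once...
results = [decode(embedding, (x, x)) for x in range(100)]  # ...then answer many prompts cheaply
```

The same structure explains why the decoder can run in a browser: only the small half of the model needs to execute per interaction.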



The latest SAM model was trained for 68 hours on 256 A100 GPUs.



Project demonstration

Multiple input prompts

By using prompts to specify what should be segmented in an image, a variety of segmentation tasks can be performed without any additional training.



Use interactive points and boxes as prompts



Automatically segment all elements in the image


Generate multiple valid masks for ambiguous prompts

Promptable design

SAM can accept input prompts from other systems.

For example, it can select an object based on the user's gaze, as reported by an AR/VR headset. Developing AI that understands the real world in this way paves the way for Meta's future metaverse ambitions.



Alternatively, bounding-box prompts supplied by an object detector enable text-to-object segmentation.

Scalable output

The output mask can be used as input to other AI systems.

For example, an object's mask can be tracked across a video, lifted to 3D in image-editing applications, or used for creative tasks such as collages.



Zero-shot generalization

SAM has learned a general idea of what an object is; this understanding enables zero-shot generalization to unfamiliar objects and images without any additional training.



Hands-on with the demo

Select Hover & Click: clicking Add Mask places a green dot, clicking Remove Area places a red dot, and the apple-eating Huahua is instantly outlined.

Prompt to cut out pictures with one click! Meta releases the first basic image segmentation model in history, creating a new paradigm for CV

With the Box tool, simply drag out a box and recognition completes immediately.

Prompt to cut out pictures with one click! Meta releases the first basic image segmentation model in history, creating a new paradigm for CV

After clicking Everything, all objects the system recognizes are extracted at once.

Choose Cut-Outs, and in seconds you get the cut-out pieces, like a stack of triangular dumplings.


SA-1B dataset: 11 million images, 1.1 billion masks

Alongside the new model, Meta also released SA-1B, the largest segmentation dataset to date.

This dataset consists of 11 million diverse, high-resolution, privacy-preserving images, and 1.1 billion high-quality segmentation masks.

The overall characteristics of the data set are as follows:

· Total number of images: 11 million

· Total number of masks: 1.1 billion

· Average masks per image: 100

· Average image resolution: 1500 × 2250 pixels

Note: neither images nor mask annotations carry class labels
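The headline numbers above are internally consistent, as a quick arithmetic check shows:

```python
# Sanity-check the SA-1B statistics quoted above.
images = 11_000_000            # total images
masks_per_image = 100          # reported average masks per image
total_masks = images * masks_per_image

assert total_masks == 1_100_000_000  # matches the quoted 1.1 billion masks
```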

Meta specifically emphasizes that the data were collected through its data engine, and that all masks were generated fully automatically by SAM.

With the SAM model, collecting new segmentation masks is faster than ever, and interactively annotating a mask only takes about 14 seconds.

The per-mask annotation process is only 2 times slower than annotating bounding boxes. Using the fastest annotation interface, annotating bounding boxes takes about 7 seconds.

Compared with previous large-scale segmentation data collection efforts, collecting masks with SAM is 6.5 times faster than COCO's fully manual polygon-based mask annotation, and 2 times faster than the previous largest data annotation effort (which was also model-assisted).



However, interactively annotated masks alone are not enough to create a dataset of over 1 billion masks. Therefore, Meta built a data engine to create the SA-1B dataset.

This data engine has three "gears":

1. Model-assisted annotation

2. A mix of fully automatic and assisted annotation, which increases the diversity of the collected masks

3. Fully automatic mask creation, which enables the dataset to scale
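The three gears above form a flywheel: the model proposes masks, humans refine them (early gears), the refined data would retrain the model, and eventually annotation becomes fully automatic. A toy sketch of that loop, with entirely hypothetical stand-in functions:

```python
def model_segment(img):
    """Stand-in for the model proposing masks for an image."""
    return {f"mask_for_{img}"}

def human_refine(masks):
    """Stand-in for an annotator correcting the model's proposals."""
    return masks | {"human_refined"}

def flywheel(images, rounds=3):
    """Each round the model proposes masks; in early rounds (gears 1-2) a
    human refines them, in the final round (gear 3) annotation is fully
    automatic. Retraining between rounds is omitted for brevity."""
    dataset = []
    for r in range(rounds):
        for img in images:
            proposals = model_segment(img)
            if r < rounds - 1:            # gears 1-2: human in the loop
                proposals = human_refine(proposals)
            dataset.append(proposals)     # gear 3: model output used as-is
    return dataset

data = flywheel(["img_a", "img_b"], rounds=3)
```

The design point is the gradual handoff: human effort is spent only while the model is still weak, and removed once its proposals are good enough to use directly.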

The final dataset includes over 1.1 billion segmentation masks collected from approximately 11 million licensed, privacy-preserving images.

SA-1B has 400 times more masks than any existing segmentation dataset. Human evaluation studies confirm that the masks are high-quality and diverse, in some cases even qualitatively comparable to masks from smaller, fully manually annotated datasets.



The images in SA-1B were obtained from photo providers in multiple countries spanning different geographic regions and income levels.

While some geographic areas are still underrepresented, SA-1B has more images and better overall representation across all regions than previous segmentation datasets.

Finally, Meta says it hopes this data can form the basis of new datasets that include additional annotations, such as textual descriptions associated with each mask.

Team led by "RBG"

Ross Girshick



Ross Girshick (often called "RBG") is a research scientist at Facebook AI Research (FAIR), where he works on computer vision and machine learning.

In 2012, Ross Girshick received his PhD in Computer Science from the University of Chicago under the supervision of Pedro Felzenszwalb.

Before joining FAIR, Ross was a researcher at Microsoft Research and a postdoc at the University of California, Berkeley, where his mentors were Jitendra Malik and Trevor Darrell.

He received the 2017 PAMI Young Researcher Award and the 2017 and 2021 PAMI Mark Everingham Awards in recognition of his contributions to open source software.

As is well known, Ross and Kaiming He jointly developed the R-CNN approach to object detection, and their Mask R-CNN paper won the Best Paper Award at ICCV 2017.

Netizen: CV really doesn’t exist anymore

Meta's creation of this segmentation foundation model for CV has many netizens exclaiming, "Now CV really doesn't exist anymore."


Meta scientist Justin Johnson said: "To me, Segment Anything's data engine and ChatGPT's RLHF represent a new era in large-scale artificial intelligence. Rather than learning everything from noisy web data, it is better to cleverly combine human annotation with big data to unlock new capabilities. Supervised learning is back!"



The only regret is that the SAM release was led mainly by Ross Girshick, with Kaiming He absent.



Commentator "matrix Mingzi" said that this work further proves that multimodality is the future of CV, and that pure CV has no tomorrow.


Statement: This article is reproduced from 51cto.com.