Unified visual AI capabilities! Automated image detection and segmentation plus controllable text-to-image generation, from a Chinese team

王林, 2023-04-12 17:31:17

This article is reprinted with the authorization of AI New Media Qubit (public account ID: QbitAI). Please contact the source for reprinting.

The AI community is now racing on sheer speed.

Meta's SAM launched only a few days ago, and programmers in China have already stacked a round of buffs on top of it, combining the major visual AI functions of object detection, segmentation, and generation in a single project!

For example, based on Stable Diffusion and SAM, you can seamlessly replace the chair in the photo with a sofa:

Changing clothes and hair color is just as easy:

As soon as the project was released, many people exclaimed: that was fast!

Another commenter said: now there are new wedding photos of Yui Aragaki and me.

The effects above come from Grounded-SAM. The project has received 1.8k stars on GitHub.

Put simply, it is a zero-shot vision application: given only an input image, it detects and segments the objects in it automatically.

This research comes from IDEA Research Institute (Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute), whose founder and chairman is Shen Xiangyang.

No additional training required

Grounded SAM is mainly composed of two models: Grounding DINO and SAM.

SAM (Segment Anything) is a zero-shot segmentation model that Meta launched just 4 days ago.

It can generate masks for any object in an image or video, including objects and image types it never saw during training.

SAM is trained to return a valid mask for any prompt: even when the prompt is ambiguous or could refer to several objects, the output should be a reasonable mask for at least one of them. This promptable segmentation task is used to pretrain the model, and general downstream segmentation tasks are then solved through prompting.

The model consists of an image encoder, a prompt encoder, and a fast mask decoder. Once the image embedding has been computed, SAM can produce a segmentation for any prompt within 50 milliseconds in a web browser.
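
The division between a heavy image encoder and a light, prompt-driven decoder suggests a usage pattern: embed the image once, then query it with many prompts. The sketch below follows the `set_image` / `predict` interface of the `SamPredictor` class in Meta's `segment_anything` package. The helper `best_masks_for_points` and the `FakePredictor` stand-in are hypothetical, added only so the example runs without a model checkpoint.

```python
class FakePredictor:
    """Hypothetical stand-in that mimics the SamPredictor interface."""

    def __init__(self):
        self.encoder_calls = 0

    def set_image(self, image):
        # In SAM this step runs the heavy image encoder and caches the embedding.
        self.encoder_calls += 1
        self.image = image

    def predict(self, point_coords=None, point_labels=None, multimask_output=True):
        # SAM returns (masks, quality scores, low-res logits). With
        # multimask_output=True it proposes several masks for an ambiguous prompt.
        x, y = point_coords[0]
        masks = [f"mask-small@({x},{y})", f"mask-medium@({x},{y})", f"mask-large@({x},{y})"]
        scores = [0.71, 0.93, 0.85]  # made-up values for illustration
        return masks, scores, None


def best_masks_for_points(predictor, image, point_prompts):
    """Embed the image once, then decode one mask per point prompt."""
    predictor.set_image(image)  # expensive step, done a single time
    best = []
    for coords, labels in point_prompts:
        masks, scores, _ = predictor.predict(
            point_coords=coords, point_labels=labels, multimask_output=True
        )
        # Keep the candidate that the model itself rates highest.
        top = max(range(len(scores)), key=lambda i: scores[i])
        best.append(masks[top])
    return best


predictor = FakePredictor()
prompts = [([(10, 20)], [1]), ([(200, 150)], [1])]  # label 1 marks a foreground point
result = best_masks_for_points(predictor, image="photo", point_prompts=prompts)
print(result, predictor.encoder_calls)
```

With the real `SamPredictor`, the image would be an HxWx3 uint8 NumPy array and the point coordinates and labels would be NumPy arrays as well; the control flow stays the same.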

Grounding DINO is earlier work from the same research team.

It is a zero-shot detection model that generates object boxes and labels from text descriptions.

Combining the two, you can locate any object in a picture through a text description and then use SAM's powerful segmentation capability to produce a fine-grained mask for it.
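
One practical detail when chaining a detector into a box-prompted segmenter is the box format. As an assumption not stated in this article: Grounding DINO emits boxes as normalized (center x, center y, width, height), whereas SAM's box prompt takes absolute pixel corners (x0, y0, x1, y1). A minimal sketch of that conversion, using the hypothetical helper name `to_pixel_xyxy`:

```python
def to_pixel_xyxy(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * image_width
    y0 = (cy - h / 2) * image_height
    x1 = (cx + w / 2) * image_width
    y1 = (cy + h / 2) * image_height
    # Clamp to the image so a box that overshoots the border stays valid.
    x0, x1 = max(0.0, x0), min(float(image_width), x1)
    y0, y1 = max(0.0, y0), min(float(image_height), y1)
    return (x0, y0, x1, y1)


# A detection centered in a 640x480 image, covering half of each dimension.
pixel_box = to_pixel_xyxy((0.5, 0.5, 0.5, 0.5), image_width=640, image_height=480)
print(pixel_box)
```

The clamping step is a design choice of this sketch rather than something the project is known to do; it simply keeps the prompt inside the image.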

On top of these abilities, the team also added Stable Diffusion, which provides the controllable image generation shown at the beginning.

It is worth mentioning that Stable Diffusion could already do something similar: erase the image elements you want to replace and enter a text prompt.

With Grounded SAM, the manual selection step can be skipped, and the edit is controlled directly through a text description.
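
The role the mask plays in such a replacement can be illustrated without any model. The toy sketch below is not how Stable Diffusion inpaints (that happens through a diffusion process); it only shows the contract a mask imposes: positions inside the mask take the generated content, positions outside keep the original. All names and the string "pixels" are hypothetical.

```python
def composite_with_mask(original, generated, mask):
    """Take generated pixels where mask is 1 and original pixels where it is 0."""
    return [
        [gen if m else orig for orig, gen, m in zip(orig_row, gen_row, mask_row)]
        for orig_row, gen_row, mask_row in zip(original, generated, mask)
    ]


original = [["wall", "chair", "chair"],
            ["floor", "chair", "floor"]]
generated = [["sofa", "sofa", "sofa"],
             ["sofa", "sofa", "sofa"]]
# Mask as a text-prompted segmenter might return it for the prompt "chair".
mask = [[0, 1, 1],
        [0, 1, 0]]

edited = composite_with_mask(original, generated, mask)
for row in edited:
    print(row)
```

Because the mask comes from a text prompt rather than from hand-drawn strokes, the only manual input left is the description of what to replace and what to put in its place.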

In addition, combined with BLIP (Bootstrapping Language-Image Pre-training), it can generate an image caption, extract tags from the caption, and then produce object boxes and masks for those tags.
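
The hand-off from caption to detector can be sketched as follows. This is a deliberately naive illustration: the stop-word filter stands in for real tag extraction, and the lowercase, " . "-separated prompt ending in a period is an assumption about the text-prompt convention commonly used with Grounding DINO. The function names and the stop-word list are hypothetical.

```python
# Hypothetical, tiny stop-word list; a real system would use proper tagging.
STOP_WORDS = {"a", "an", "the", "on", "in", "of", "with", "and", "is", "are",
              "at", "to", "next", "sitting", "standing"}


def extract_tags(caption):
    """Naively keep non-stop-words, in order and without duplicates."""
    tags = []
    for word in caption.lower().replace(",", " ").replace(".", " ").split():
        if word not in STOP_WORDS and word not in tags:
            tags.append(word)
    return tags


def build_detection_prompt(tags):
    """Join tags into one text prompt for a text-conditioned detector."""
    if not tags:
        return ""
    return " . ".join(tags) + " ."


caption = "A dog sitting on a bench next to a bench and a bicycle."
tags = extract_tags(caption)
prompt = build_detection_prompt(tags)
print(tags)
print(prompt)
```

Each tag then plays the role that a hand-written text description plays in the basic pipeline, so labeling can run without a human typing prompts.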

More interesting features are currently under development.

One example is editing people in an image: changing clothes, hair color, skin color, and so on.

Usage instructions are also given on GitHub. The project requires Python 3.8 or above, PyTorch 1.7 or above, and torchvision 0.8 or above, along with the related dependencies. See the GitHub project page for details.
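
Before installing, it can help to check an environment against these minimums. The sketch below uses only the standard library; the helper names are hypothetical, and the torch and torchvision version strings are sample values rather than output from a real installation.

```python
import re
import sys

# Minimum versions as stated in the article.
MINIMUMS = {"python": (3, 8), "torch": (1, 7), "torchvision": (0, 8)}


def parse_version(text):
    """Turn '1.13.1+cu117' into (1, 13, 1), ignoring any local suffix."""
    numeric = re.match(r"\d+(?:\.\d+)*", text)
    if numeric is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in numeric.group().split("."))


def meets_minimum(name, version_text):
    """Return True if version_text satisfies the minimum recorded for name."""
    return parse_version(version_text) >= MINIMUMS[name]


python_version = ".".join(str(n) for n in sys.version_info[:3])
print("python", python_version, meets_minimum("python", python_version))
print("torch 1.13.1+cu117", meets_minimum("torch", "1.13.1+cu117"))
print("torchvision 0.7.0", meets_minimum("torchvision", "0.7.0"))
```

In a real environment the strings would come from `torch.__version__` and `torchvision.__version__`. Comparing parsed tuples rather than raw strings matters because "1.13" sorts before "1.7" as text.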

The research team comes from the IDEA Research Institute (Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute).

Public information shows that the institute is an international innovative research institution focused on artificial intelligence, the digital economy industry, and cutting-edge technology. Dr. Shen Xiangyang, former chief scientist of Microsoft Research Asia and former global executive vice president of Microsoft, serves as its founder and chairman.

One More Thing

For future work on Grounded SAM, the team lists several directions:

  • Automatically generating images to build new datasets
  • Stronger foundation models with segmentation pre-training
  • Collaboration with (Chat-)GPT
  • A full pipeline that automatically labels images with boxes and masks and can generate new images

It is worth mentioning that many members of the project team are active answerers on AI topics on Zhihu. They have also answered questions about Grounded SAM there, and interested readers can leave a message to ask.

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it removed.