# Meta releases open-source multi-purpose large model, moving one step closer to visual unification
After open-sourcing SAM, the model that "segments everything", Meta is going further and further down the road toward a "visual foundation model".
This time, they have open-sourced a set of models called DINOv2. These models produce high-performance visual representations that can be used, without fine-tuning, for downstream tasks such as classification, segmentation, image retrieval, and depth estimation.
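As a concrete illustration, here is a minimal sketch of how such frozen features might be extracted and reused. It assumes the models are exposed through `torch.hub` under the `facebookresearch/dinov2` repository (the variant name `dinov2_vitb14` and the returned embedding size are assumptions; check the official repository for the exact entry points).

```python
import torch
from torchvision import transforms
from PIL import Image

# Load a pretrained backbone; the weights stay frozen, no fine-tuning.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),          # 224 is divisible by the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    features = backbone(img)             # global image embedding, e.g. (1, 768) for ViT-B/14

# The frozen embedding can then feed a linear classifier, a k-NN index for
# retrieval, or a lightweight head for segmentation or depth estimation.
print(features.shape)
```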
## This set of models has the following characteristics:
Learning task-agnostic pre-trained representations has become the standard in natural language processing. These features can be used "as is" (without fine-tuning) and perform significantly better on downstream tasks than those produced by task-specific models. This success is driven by pre-training on large amounts of raw text with auxiliary objectives, such as language modeling or word vectors, that require no supervision.
As this paradigm shift takes hold in NLP, similar "foundation" models are expected to emerge in computer vision. These models should generate visual features that work "out of the box" on any task, whether at the image level (e.g., image classification) or the pixel level (e.g., segmentation).
Most promising efforts toward such foundation models focus on text-guided pre-training, i.e., using a form of textual supervision to guide the training of features. However, this form of text-guided pre-training limits the information about an image that can be retained, since captions only approximate the rich information in images, and finer, pixel-level information may not be captured by this kind of supervision. Furthermore, these image encoders require aligned text-image corpora and therefore do not offer the flexibility of their text counterparts, i.e., they cannot learn from raw data alone.
An alternative to text-guided pre-training is self-supervised learning, where features are learned from images alone. These methods are conceptually closer to pretext tasks such as language modeling, and can capture information at both the image and pixel level. However, despite their potential to learn general-purpose features, most of the gains in self-supervised learning have been obtained when pre-training on the small curated ImageNet-1k dataset. Some researchers have tried to extend these methods beyond ImageNet-1k, but they focused on unfiltered datasets, which typically leads to a significant drop in performance. This is due to a lack of control over data quality and diversity, which are critical to producing good results.
In this work, the researchers explore whether self-supervised learning can produce general-purpose visual features when pre-trained on a large amount of curated data. They revisit existing discriminative self-supervised methods that learn features at both the image and patch level, such as iBOT, and reconsider some of their design choices in the context of larger datasets. Most of their technical contributions are aimed at stabilizing and accelerating discriminative self-supervised learning when scaling model and data size. These improvements make the method roughly 2x faster and reduce memory use to about one-third of that of comparable discriminative self-supervised methods, allowing longer training runs and larger batch sizes.
For the pre-training data, they built an automated pipeline to filter and rebalance a dataset drawn from a large collection of unfiltered images. This is inspired by pipelines used in NLP, where data similarity is used instead of external metadata and no manual annotation is required. A major difficulty when working with images is rebalancing concepts and avoiding overfitting to a few dominant modes. In this work, a naive clustering approach works well for this problem (see the sketch below), and the researchers assembled a small but diverse corpus of 142M images to validate their method.
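The sketch below illustrates, under stated assumptions, what such naive clustering-based rebalancing could look like: dominant visual concepts form large clusters of similar embeddings, and capping the number of images kept per cluster limits their overrepresentation. This is not the authors' exact pipeline; the function name, cluster count, and cap are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def rebalance(embeddings: np.ndarray, max_per_cluster: int,
              n_clusters: int = 1000, seed: int = 0) -> np.ndarray:
    """Return indices of a rebalanced subset of an image collection."""
    # Cluster image embeddings; frequent concepts end up in large clusters.
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init="auto")
    labels = km.fit_predict(embeddings)

    rng = np.random.default_rng(seed)
    kept = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        # Cap each cluster so no single concept dominates the dataset.
        if len(members) > max_per_cluster:
            members = rng.choice(members, size=max_per_cluster, replace=False)
        kept.extend(members.tolist())
    return np.array(sorted(kept))
```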
Finally, the researchers provide a range of pre-trained vision models, called DINOv2, trained on their data with different Vision Transformer (ViT) architectures. They released all of the models, along with code to retrain DINOv2 on any data. At scale, they validated the quality of DINOv2 on a variety of computer vision benchmarks at both the image and pixel level, as shown in Figure 2, and conclude that self-supervised pre-training alone is a good candidate for learning transferable frozen features, competitive with the best publicly available weakly supervised models.
The researchers assembled their curated LVD-142M dataset by retrieving, from a large pool of unfiltered data, images that are close to those in several curated datasets. In the paper, they describe the main components of the data pipeline, including the curated/uncurated data sources, the image deduplication step, and the retrieval system. The entire pipeline requires no metadata or text and works directly on images, as shown in Figure 3. The reader is referred to Appendix A for further details on the methodology.
Figure 3: Overview of the data processing pipeline. Images from curated and uncurated data sources are first mapped to embeddings. Uncurated images are then deduplicated before being matched to the curated images. The resulting combination augments the initial dataset through a self-supervised retrieval system.
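A toy sketch of the embed → deduplicate → retrieve steps in Figure 3, using plain cosine similarity over precomputed embeddings. The real pipeline operates on billions of images with an approximate nearest-neighbour index (e.g. Faiss); the thresholds and function names here are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def deduplicate(uncurated: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Greedily drop uncurated embeddings that are near-duplicates of kept ones."""
    emb = l2_normalize(uncurated)
    kept: list[int] = []
    for i, e in enumerate(emb):
        if not kept or (emb[kept] @ e).max() < threshold:
            kept.append(i)
    return np.array(kept)

def retrieve(curated: np.ndarray, uncurated: np.ndarray, k: int = 4) -> np.ndarray:
    """For each curated image, pull the k most similar uncurated images."""
    sims = l2_normalize(curated) @ l2_normalize(uncurated).T
    return np.argsort(-sims, axis=1)[:, :k]   # indices into the uncurated pool
```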
The researchers learned their features with a discriminative self-supervised method that can be seen as a combination of the DINO and iBOT losses with the centering of SwAV. They also added a regularizer to spread features and a short high-resolution training phase.
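To make the image-level discriminative objective concrete, here is a minimal sketch of a DINO-style loss: a student network is trained to match the output distribution of an exponential-moving-average teacher on a different augmentation of the same image, with the teacher logits centred to avoid collapse. This shows the original DINO-style softmax centering; DINOv2 replaces it with SwAV-style centering, and the patch-level iBOT loss is analogous but applied to masked patch tokens. Function names and temperatures are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    # Teacher distribution: centred, sharpened, and detached (no gradient).
    t = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    # Student distribution: cross-entropy against the teacher targets.
    s = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(t * s).sum(dim=-1).mean()

@torch.no_grad()
def update_center(center, teacher_logits, momentum=0.9):
    # Exponential moving average of teacher outputs, used as the centre.
    batch_center = teacher_logits.mean(dim=0)
    return center * momentum + batch_center * (1 - momentum)
```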
They considered several improvements to train the model at a larger scale. The models are trained on A100 GPUs using PyTorch 2.0, and the released code can also be used with the pretrained models for feature extraction. Model details are given in Table 17 of the appendix. On the same hardware, the DINOv2 code uses only one-third of the memory and runs twice as fast as the iBOT implementation.
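As a hedged illustration of the kind of memory and speed optimisation such a setup relies on (not a reproduction of the DINOv2 training code), the sketch below uses PyTorch 2.0's fused scaled-dot-product attention, which dispatches to a flash or memory-efficient kernel when available and cuts activation memory compared with a naive attention implementation on an A100.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 12):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, tokens, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).view(B, N, 3, self.num_heads, D // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, heads, N, head_dim)
        # Fused (flash / memory-efficient) attention kernel when available.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

# Typical use inside a training step, with bfloat16 autocast for further savings:
# with torch.autocast("cuda", dtype=torch.bfloat16):
#     out = model(images)
```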
In this section, the researchers present an empirical evaluation of the new model on a range of image understanding tasks. They evaluate both global and local image representations, covering category- and instance-level recognition, semantic segmentation, monocular depth prediction, and action recognition.
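For image classification these evaluations follow the standard frozen-feature (linear probe) protocol: the backbone is kept fixed and only a linear classifier is trained on its output embeddings. The sketch below shows that protocol in outline; the hub variant name, embedding size (384 for ViT-S/14), and the `train_loader` are placeholders, not the authors' exact recipe.

```python
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()                                   # frozen backbone, no gradients
for p in backbone.parameters():
    p.requires_grad_(False)

linear_head = nn.Linear(384, 1000)                # embedding dim -> number of classes
optimizer = torch.optim.SGD(linear_head.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for images, labels in train_loader:               # placeholder DataLoader
    with torch.no_grad():
        feats = backbone(images)                  # frozen embeddings
    loss = criterion(linear_head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```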
## ImageNet Classification
## Other Image and Video Classification Benchmarks
## Instance Recognition
## Dense Recognition Tasks
## Qualitative Results