In the post-Sora era, how do CV practitioners choose models? Convolution or ViT, supervised learning or CLIP paradigm

ImageNet accuracy was once the main indicator for evaluating model performance, but in today's computer vision field this metric increasingly appears incomplete.

As computer vision models have become more complex, the variety of available models has increased significantly, from ConvNets to Vision Transformers. Training methods have also evolved to self-supervised learning and image-text pair training like CLIP, and are no longer limited to supervised training on ImageNet.

Although ImageNet accuracy is an important indicator, it is not sufficient to fully evaluate a model's performance. Different architectures, training methods, and datasets can cause models to behave differently across tasks, so relying solely on ImageNet to judge models has limitations. When a model overfits the ImageNet dataset and its accuracy saturates, its generalization ability on other tasks may be overlooked. Multiple factors therefore need to be considered to evaluate a model's performance and applicability.

Although CLIP's ImageNet accuracy is similar to ResNet, its visual encoder is more robust and transferable. This prompted researchers to explore the unique advantages of CLIP that were not apparent when considering only ImageNet metrics. This highlights the importance of analyzing other properties to help discover useful models.

Beyond this, traditional benchmarks cannot fully assess a model's ability to handle real-world visual challenges, such as varied camera angles, lighting conditions, or occlusions. Models trained on datasets such as ImageNet often struggle to maintain their performance in practical applications, where conditions and scenarios are more diverse.

These questions have brought new confusion to practitioners in the field: How to measure a visual model? And how to choose a visual model that suits your needs?

In a recent paper, researchers from MBZUAI and Meta conducted an in-depth discussion on this issue.



  • Paper title: ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy
  • Paper link: https://arxiv.org/pdf/2311.09215.pdf

The research focuses on model behavior beyond ImageNet accuracy, analyzing the performance of major models in the field of computer vision, namely ConvNeXt and Vision Transformer (ViT), each trained under both the supervised and CLIP paradigms.

The selected models have a similar number of parameters and almost the same ImageNet-1K accuracy under each training paradigm, ensuring a fair comparison. The researchers deeply explored a series of model characteristics, such as prediction error types, generalization ability, invariance of learned representations, and calibration, focusing on properties the models exhibit without additional training or fine-tuning, in the hope of providing a direct reference for practitioners working with pretrained models.


In the analysis, the researchers found significant differences in model behavior across architectures and training paradigms. For example, models trained under the CLIP paradigm produced fewer classification errors, relative to their ImageNet accuracy, than those trained with supervision on ImageNet. However, supervised models are better calibrated and generally perform better on ImageNet robustness benchmarks. ConvNeXt has an advantage on synthetic data but is more texture-biased than ViT. Meanwhile, supervised ConvNeXt performs well on many benchmarks, with transferability comparable to the CLIP models.

It can be seen that various models demonstrate their own advantages in unique ways, and these advantages cannot be captured by a single indicator. The researchers emphasize that more detailed evaluation metrics are needed to accurately select models in specific contexts and to create new ImageNet-agnostic benchmarks.

Based on these observations, Meta AI chief scientist Yann LeCun retweeted and praised the study.


Model selection

For the supervised models, the researchers used the pre-trained DeiT3-Base/16, which has the same architecture as ViT-Base/16 but an improved training recipe, along with ConvNeXt-Base. For the CLIP models, they used the ViT-Base/16 and ConvNeXt-Base visual encoders from OpenCLIP.

Please note that the performance of these models differs slightly from the original OpenAI models. All model checkpoints can be found on the GitHub project homepage. See Table 1 for detailed model comparison:


The researcher gave a detailed explanation of the model selection process:

1. Since the researchers use pre-trained models, they cannot control the quantity and quality of the data samples seen during training.

2. To analyze ConvNets and Transformers, many previous studies have compared ResNet and ViT. This comparison is generally unfavorable to ConvNets, as ViT is typically trained with more advanced recipes and achieves higher ImageNet accuracy. ViT also has some architectural design elements, such as LayerNorm, that were not incorporated into ResNet when it was invented years ago. Therefore, for a more balanced evaluation, the researchers compared ViT with ConvNeXt, a modern representative of ConvNets that performs on par with Transformers and shares many of their design choices.

3. In terms of training paradigm, the researchers compared the supervised and CLIP modes. Supervised models continue to deliver state-of-the-art performance in computer vision. CLIP models, on the other hand, perform well in generalization and transferability and provide representations that connect vision and language.

4. Since the self-supervised model showed similar behavior to the supervised model in preliminary testing, it was not included in the results. This may be due to the fact that they ended up being supervised fine-tuned on ImageNet-1K, which affects the study of many features.

Next, let’s take a look at how researchers analyzed different attributes.

Analysis

Model error

ImageNet-X is a dataset extending ImageNet-1K with detailed human annotations for 16 factors of variation, enabling in-depth analysis of model errors in image classification. It uses an error ratio metric (lower is better) to quantify a model's performance on specific factors relative to its overall accuracy, allowing a nuanced analysis of model errors. Results on ImageNet-X show:

1. Relative to their ImageNet accuracy, CLIP models make fewer errors than supervised models.

2. All models are mainly affected by complex factors such as occlusion.

3. Texture is the most challenging factor for all models.
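The error ratio metric described above is simple to state: a model's error rate on images exhibiting a given factor, divided by its overall error rate. A minimal sketch (the function name and numbers are illustrative, not from the ImageNet-X reference implementation):

```python
def error_ratio(factor_error_rate: float, overall_error_rate: float) -> float:
    """ImageNet-X-style error ratio: the model's error rate on images showing a
    given factor (e.g. occlusion), divided by its overall error rate.
    Values above 1.0 mean the factor is harder than average; lower is better."""
    if overall_error_rate == 0:
        raise ValueError("overall error rate must be non-zero")
    return factor_error_rate / overall_error_rate

# Illustrative numbers: 30% error on occluded images vs. 20% error overall
print(error_ratio(0.30, 0.20))  # ≈ 1.5, i.e. occlusion is a hard factor
```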


Shape/Texture Bias

Shape-texture bias measures whether a model relies on fragile texture shortcuts rather than high-level shape cues. It can be studied using cue-conflict images that combine the shape of one category with the texture of another, revealing to what extent a model's decisions are based on shape versus texture. The researchers evaluated shape-texture bias on the cue-conflict dataset and found that CLIP models have a smaller texture bias than supervised models, and that ViT models have a higher shape bias than ConvNets.
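On cue-conflict images, a prediction counts toward shape bias when it matches the shape label, and toward texture bias when it matches the texture label; the usual shape-bias score is the shape fraction of all such decided cases. A minimal sketch (the labels below are made up for illustration):

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of cue-conflict decisions driven by shape rather than texture.
    Only predictions matching either the shape or the texture label count."""
    shape_hits = sum(p == s for p, s in zip(predictions, shape_labels))
    texture_hits = sum(p == t for p, t in zip(predictions, texture_labels))
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# Toy cue-conflict batch: each image pairs one shape class with another texture class
preds    = ["cat", "elephant", "cat", "dog"]
shapes   = ["cat", "cat",      "cat", "bird"]
textures = ["dog", "elephant", "dog", "dog"]
print(shape_bias(preds, shapes, textures))  # 0.5: 2 shape hits vs. 2 texture hits
```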


Model Calibration

Calibration quantifies the consistency between a model's prediction confidence and its actual accuracy, and can be assessed through metrics such as expected calibration error (ECE) and visualization tools such as reliability plots and confidence histograms. Calibration was evaluated on ImageNet-1K and ImageNet-R, binning predictions into 15 confidence levels. During the experiments, the researchers observed the following:


1. CLIP models are overconfident, while supervised models are slightly underconfident.

2. Supervised ConvNeXt is better calibrated than supervised ViT.
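The ECE metric with 15 bins, as used above, can be sketched in plain Python (an approximation for intuition, not the paper's evaluation code):

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin predictions by confidence, then average the per-bin gap
    |accuracy - mean confidence|, weighted by bin size. Lower is better."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [(c, ok) for c, ok in zip(confidences, correct) if lo < c <= hi]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        accuracy = sum(ok for _, ok in in_bin) / len(in_bin)
        ece += len(in_bin) / n * abs(accuracy - avg_conf)
    return ece

# Well calibrated: 80% confidence with 80% accuracy -> ECE near 0
print(expected_calibration_error([0.8] * 5, [True] * 4 + [False]))
# Overconfident (CLIP-like): 90% confidence but 50% accuracy -> ECE ≈ 0.4
print(expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5))
```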

Robustness and transferability

A model's robustness and transferability are crucial for adapting to distribution shifts and new tasks. The researchers evaluated robustness using various ImageNet variants and found that, although the average performance of the ViT and ConvNeXt models was comparable, supervised models generally outperformed CLIP on robustness, except on ImageNet-R and ImageNet-Sketch. In terms of transferability, evaluated on the VTAB benchmark using 19 datasets, supervised ConvNeXt outperforms ViT and is almost on par with the CLIP models.


Synthetic data

Synthetic datasets such as PUG-ImageNet, which can precisely control factors such as camera angle and texture, are a promising research direction, so the researchers analyzed model performance on synthetic data. PUG-ImageNet contains photorealistic ImageNet images with systematic variation in factors such as pose and lighting, and performance is measured as absolute top-1 accuracy. The researchers report results for the different factors in PUG-ImageNet and find that ConvNeXt outperforms ViT on almost all of them, showing that ConvNeXt surpasses ViT on synthetic data. The gap is smaller for the CLIP models, whose accuracy is lower than that of the supervised models, possibly related to their lower accuracy on the original ImageNet.
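Since PUG-ImageNet varies one factor at a time, per-factor results reduce to grouping predictions by factor and computing absolute top-1 accuracy within each group; a sketch with made-up records (not actual PUG-ImageNet data):

```python
from collections import defaultdict

def per_factor_top1(records):
    """records: iterable of (factor, predicted_label, true_label) triples.
    Returns absolute top-1 accuracy per controlled factor, PUG-ImageNet style."""
    totals, hits = defaultdict(int), defaultdict(int)
    for factor, pred, true in records:
        totals[factor] += 1
        hits[factor] += int(pred == true)
    return {f: hits[f] / totals[f] for f in totals}

# Hypothetical evaluation records grouped by the factor being varied
recs = [("pose", "dog", "dog"), ("pose", "cat", "dog"),
        ("lighting", "dog", "dog"), ("lighting", "dog", "dog")]
print(per_factor_top1(recs))  # {'pose': 0.5, 'lighting': 1.0}
```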


Transformation invariance

Transformation invariance refers to a model's ability to produce consistent representations under semantics-preserving input transformations such as scaling or shifting. This property lets a model generalize well across different but semantically similar inputs. The methods used include resizing images for scale invariance, shifting crops for position invariance, and adjusting the resolution of ViT models via interpolated positional embeddings.

They evaluate scale, shift, and resolution invariance on ImageNet-1K by varying the crop scale, crop position, and image resolution. Among supervised models, ConvNeXt outperforms ViT. Overall, the models are more robust to scale and resolution transformations than to shifts. For applications requiring high robustness to scaling, shifting, and resolution changes, the results suggest that supervised ConvNeXt may be the best choice.
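One simple way to quantify such invariance is to compare a model's embedding of an image before and after a transform, for example via cosine similarity. The encoder below is a toy stand-in (a real setup would use a pretrained backbone, which is an assumption outside this sketch):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def invariance_score(embed_fn, image, transform):
    """Similarity of representations before/after a transform; 1.0 = fully invariant."""
    return cosine_similarity(embed_fn(image), embed_fn(transform(image)))

# Toy 'encoder': per-row pixel sums (hypothetical stand-in for a vision model)
def toy_embed(img):
    return [sum(row) for row in img]

def hflip(img):
    return [row[::-1] for row in img]

img = [[1.0, 2.0], [3.0, 4.0]]
print(invariance_score(toy_embed, img, hflip))  # close to 1.0: row sums survive a flip
```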


Summary

Overall, each model has its own unique advantages. This suggests that model selection should depend on the target use case, since standard performance metrics may overlook nuances critical to a specific task. Furthermore, many existing benchmarks are derived from ImageNet, which also biases the evaluation. Developing new benchmarks with different data distributions is crucial for evaluating models in settings more representative of the real world.

The following is a summary of the conclusions of this article:

ConvNet and Transformer

1. Supervised ConvNeXt outperforms supervised ViT on many benchmarks: it is better calibrated, more invariant to data transformations, and exhibits better transferability and robustness.

2. ConvNeXt performs better than ViT on synthetic data.

3. ViT has a greater shape bias.

Supervised and CLIP

1. Although CLIP models are superior in transferability, supervised ConvNeXt performs competitively on this task. This demonstrates the potential of supervised models.

2. Supervised models perform better on robustness benchmarks, probably because these benchmarks are all ImageNet variants.

3. Relative to their ImageNet accuracy, CLIP models have a greater shape bias and make fewer classification errors.

Statement: This article is reproduced from 51CTO.COM. In case of infringement, please contact admin@php.cn for removal.