LeCun praises it: Meta's evaluation of ConvNet and Transformer, which one is stronger?

How should you choose a vision model for your specific needs?

How do ConvNet/ViT and supervised/CLIP models compare with each other on metrics other than ImageNet accuracy?

The latest research, published by researchers from MBZUAI and Meta, comprehensively compares common vision models on such "non-standard" metrics.

Paper address: https://arxiv.org/pdf/2311.09215.pdf

LeCun praised the study highly, calling it excellent. The study compares similarly sized ConvNeXt and ViT architectures, providing a comprehensive comparison of their properties when trained in supervised mode and with the CLIP method.

Beyond ImageNet accuracy

The computer vision model landscape is becoming increasingly diverse and complex.

From early ConvNets to the more recent Vision Transformers, the range of available models keeps expanding.

Similarly, the training paradigm has evolved from supervised training on ImageNet to self-supervised learning and image-text pair training like CLIP.

While this explosion of options marks progress, it poses a major challenge for practitioners: how do they choose the model that is right for their goals?

ImageNet accuracy has long been the main metric for evaluating model performance. Since sparking the deep learning revolution, it has driven significant advances in the field of artificial intelligence.

However, it cannot measure the nuances of models resulting from different architectures, training paradigms, and data.

If judged solely by ImageNet accuracy, models with very different properties may look similar (Figure 1). This limitation becomes more apparent as models begin to overfit ImageNet-specific features and accuracy saturates.

To bridge this gap, the researchers conducted an in-depth exploration of model behavior beyond ImageNet accuracy.

To study the impact of architecture and training objectives on model performance, the researchers compared Vision Transformer (ViT) and ConvNeXt, two modern architectures with comparable ImageNet-1K validation accuracy and computational requirements.

In addition, the study compared supervised models, represented by DeiT3-Base/16 and ConvNeXt-Base, with CLIP models, represented by the corresponding visual encoders from OpenCLIP.

Analysis of results

The researchers' analysis is designed to evaluate model behavior without requiring any further training or fine-tuning.

This approach is particularly important for practitioners with limited computing resources, as they often rely on pre-trained models.

In the specific analysis, although the authors recognize the value of downstream tasks such as object detection, the focus is on behavioral properties that provide insight with minimal computational requirements and that matter for real-world applications.

Model Error

ImageNet-X is a dataset that extends ImageNet-1K with detailed human annotations for 16 factors of variation, enabling in-depth analysis of model errors in image classification.

It uses error ratios (lower is better) to quantify a model's performance on specific factors relative to its overall accuracy, allowing for a nuanced analysis of model errors. The results on ImageNet-X show:

1. Relative to its ImageNet accuracy, the CLIP model makes fewer errors than the supervised model.

2. All models are mainly affected by complex factors such as occlusion.

3. Texture is the most challenging factor for all models.
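
As a rough illustration of the error-ratio idea used above, here is a minimal Python sketch. It assumes you already have per-image correctness flags and per-factor annotations; the function name and inputs are placeholders rather than code from the paper, and the exact normalization used by ImageNet-X may differ in detail.

```python
import numpy as np

def factor_error_ratio(is_correct, has_factor):
    """Error ratio for one annotated factor: the error rate on images
    annotated with that factor, divided by the overall error rate.
    Lower is better; values above 1 mean the factor is harder than average."""
    is_correct = np.asarray(is_correct, dtype=bool)
    has_factor = np.asarray(has_factor, dtype=bool)
    overall_error = 1.0 - is_correct.mean()
    factor_error = 1.0 - is_correct[has_factor].mean()
    return factor_error / overall_error
```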

Shape/Texture Bias

Shape/texture bias checks whether a model relies on texture shortcuts rather than high-level shape cues.

This bias can be studied using cue-conflict images, which combine the shape of one category with the texture of another.

This approach helps to understand to what extent the model's decisions are based on shape compared to texture.

The researchers evaluated the shape-texture bias on the cue conflict dataset and found that the texture bias of the CLIP model was smaller than that of the supervised model, while the shape bias of the ViT model was higher than that of ConvNets.
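
As a sketch of how shape bias is typically computed on cue-conflict images (following the standard protocol popularized by Geirhos et al.; the names below are illustrative placeholders, not the paper's code), only predictions that match either the shape class or the texture class are counted, and shape bias is the fraction of those decisions made according to shape:

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of cue-conflict decisions made according to shape rather than texture."""
    shape_hits = texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:        # the model followed the shape cue
            shape_hits += 1
        elif pred == texture:    # the model followed the texture cue
            texture_hits += 1
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")
```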

Model Calibration

Calibration quantifies whether a model's prediction confidence is consistent with its actual accuracy.

This can be assessed through metrics such as expected calibration error (ECE), as well as visualization tools such as reliability plots and confidence histograms.
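
For reference, a minimal sketch of how ECE can be computed from per-sample top-1 confidences and correctness flags (illustrative only, not the paper's implementation; the default of 15 bins matches the setup described below):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: weighted average gap between mean confidence and accuracy
    over equally spaced confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```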

The researchers evaluated calibration on ImageNet-1K and ImageNet-R, binning predictions into 15 confidence levels. In the experiments, the following points were observed:

- The CLIP models have high confidence, while the supervised models are slightly underconfident.

- Supervised ConvNeXt is better calibrated than supervised ViT.

Robustness and transferability

A model's robustness and transferability are key to adapting to changes in data distribution and to new tasks.

The researchers evaluated robustness using different ImageNet variants and found that while the ViT and ConvNeXt models have similar average performance, the supervised models generally outperform the CLIP models in terms of robustness, except on ImageNet-R and ImageNet-Sketch.

In terms of transferability, evaluated on 19 datasets using the VTAB benchmark, supervised ConvNeXt outperforms supervised ViT and is almost on par with the CLIP models.

Synthetic data

Synthetic datasets like PUG-ImageNet, which can precisely control factors such as camera angle and texture, have become a promising research avenue, so the researchers also analyzed the models' performance on synthetic data.

PUG-ImageNet contains photorealistic ImageNet images with systematic variations in factors such as lighting, with performance measured as absolute top-1 accuracy.
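
As a sketch of the per-factor accuracy measurement (the record format below is an assumed placeholder, not PUG-ImageNet's actual data format):

```python
from collections import defaultdict

def per_factor_accuracy(records):
    """Top-1 accuracy grouped by the factor varied in each synthetic image.
    `records` is an iterable of (factor_name, predicted_label, true_label) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for factor, pred, true in records:
        totals[factor] += 1
        hits[factor] += int(pred == true)
    return {factor: hits[factor] / totals[factor] for factor in totals}
```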

The researchers provided results for different factors in PUG-ImageNet and found that ConvNeXt outperformed ViT in almost all factors.

This shows that ConvNeXt outperforms ViT on synthetic data. The gap is smaller for the CLIP models, whose accuracy is lower than that of the supervised models; this may be related to their lower accuracy on the original ImageNet.

Feature invariance

Feature invariance refers to a model's ability to produce consistent representations that are unaffected by semantics-preserving input transformations such as scaling or shifting.

This feature enables the model to generalize well across different but semantically similar inputs.

The researchers' approach includes resizing images to test scale invariance, moving crops to test position invariance, and adjusting the resolution of the ViT models using interpolated positional embeddings.
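
One simple way to probe such invariance is to compare a model's features before and after a transform, as in the PyTorch sketch below. This is illustrative only: the paper's protocol measures accuracy under the transformed inputs and handles ViT positional-embedding interpolation explicitly, and `model` and `transform` here are placeholders.

```python
import torch
import torch.nn.functional as F

def invariance_score(model, images, transform):
    """Mean cosine similarity between features of original and transformed images.
    A value near 1 means the representation is largely invariant to the transform."""
    model.eval()
    with torch.no_grad():
        feats = model(images)               # features of the original batch
        feats_t = model(transform(images))  # features of the transformed batch
    return F.cosine_similarity(feats, feats_t, dim=-1).mean().item()
```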

In supervised training, ConvNeXt outperforms ViT.

Overall, the models are more robust to scale/resolution transformations than to shifts. For applications that require high robustness to scaling, displacement, and resolution, the results suggest that supervised ConvNeXt may be the best choice.

The researchers found that each model has its own unique advantages.

This suggests that the choice of model should depend on the target use case, as standard performance metrics may overlook mission-critical nuances.

Furthermore, many existing benchmarks are derived from ImageNet, which biases the evaluation. Developing new benchmarks with different data distributions is crucial to evaluate models in a more realistically representative context.

ConvNet vs Transformer

- In many benchmarks, supervised ConvNeXt performs better than supervised ViT: it is better calibrated, more invariant to data transformations, and exhibits better transferability and robustness.

- ConvNeXt outperforms ViT on synthetic data.

- ViT has a higher shape bias.

Supervised vs CLIP

- Although the CLIP models are better in terms of transferability, supervised ConvNeXt shows competitive performance on this task. This demonstrates the potential of supervised models.

- Supervised models are better on robustness benchmarks, probably because these benchmarks are ImageNet variants.

- The CLIP models have a higher shape bias and make fewer classification errors relative to their ImageNet accuracy.
