First there was OpenAI’s GPT-4o, then Google’s string of flagship releases: advanced multimodal large models have been exploding one after another.
Practitioners elsewhere were stunned and began asking how to catch up with these super models.
In this paper from HuggingFace and Sorbonne University in France, the researchers summarize the key lessons from building large visual models and point out a way forward for developers.
These lessons cover model architecture selection, training methods, training data, and more. The authors distilled them through extensive controlled comparisons, and the core points include:
- If you want to build a good large visual model, the choice of architecture matters a great deal.
- The language model has a greater impact on overall performance than the visual module.
- A staged pre-training strategy is more conducive to building up the model's capabilities.
- Training data should include multiple types, with attention paid to the balance between them.
It can be said that, relying on these lessons, HuggingFace was able to create Idefics2, a SOTA visual model at its scale.
Idefics2 is based on Mistral-7B, has about 8B parameters in total, and can accurately recognize handwritten text.
Professionals have reviewed it favorably, calling it a useful survey for visual model developers, while also cautioning readers not to treat it as a cure-all.
Of course, some people joke that architectures and data are all just fluff, and that having GPUs is what really matters.
There is some truth to that. But joking aside, let's take a look at what lessons HuggingFace has brought us.
Lessons from SOTA model development practice
The lessons in the HuggingFace paper come from the development of the visual model Idefics2.
Compared with its predecessor Idefics1 and with Flamingo, the former SOTA at the same scale, Idefics2 performs well on multiple datasets, even surpassing larger 13B models.
At the same time, compared with MM1, which slightly outperforms Idefics2 on the COCO dataset, Idefics2 consumes significantly fewer tokens per image.
From the actual development of Idefics2, the lessons HuggingFace offers cover at least the following aspects:
- Backbone and architecture selection
- Training methods and strategies
- Data diversity and processing strategies
Language models have a greater impact on overall performance
Current large visual models are mainly built as a language model plus a visual encoder. The authors evaluated the impact of each of the two on overall performance separately.
The results show that the quality of the language model matters more than that of the visual module.
With the same total number of parameters, using a better language model (for example, replacing Llama-7B with Mistral-7B) significantly improves the performance of large visual models on downstream tasks.
The improvement from upgrading the visual encoder is comparatively limited, so when a trade-off has to be made, the best choice is to prioritize a stronger language model.
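To make the "language model + visual encoder" recipe concrete, here is a minimal PyTorch sketch of how such a model is typically wired together. The class and attribute names are illustrative, not Idefics2's actual implementation:

```python
import torch
import torch.nn as nn

class SimpleVLM(nn.Module):
    """Minimal vision-language model: vision encoder -> projection -> LLM.
    Swapping `language_model` (say, Llama-7B for Mistral-7B) changes only
    one component, which is exactly the kind of ablation described above."""

    def __init__(self, vision_encoder, language_model, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a SigLIP vision tower
        self.language_model = language_model   # e.g. Mistral-7B
        # Project image features into the LLM's embedding space.
        self.projector = nn.Linear(vision_dim, text_dim)

    def forward(self, pixel_values, text_embeds):
        img_feats = self.vision_encoder(pixel_values)   # (B, N_img, vision_dim)
        img_tokens = self.projector(img_feats)          # (B, N_img, text_dim)
        # Fully autoregressive style: image tokens are prepended to the
        # text embeddings and the LLM attends over the whole sequence.
        inputs = torch.cat([img_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```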
Of course, this does not mean that upgrading the visual encoder has no effect; if conditions permit, choosing a better visual encoder can also bring certain performance improvements.
In addition, take care to match the choice to the downstream task: for text recognition tasks, use a visual encoder that supports variable resolutions; if the task requires high inference speed, choose a lighter-weight model.
In practical applications, inference speed and memory usage are also factors to weigh. The SigLIP-SO400M chosen for Idefics2 strikes a good balance between performance and efficiency.
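For reference, SigLIP checkpoints are available through the transformers library. A minimal loading sketch (the checkpoint ID follows the Hugging Face Hub's naming; verify it before use):

```python
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

# SigLIP-SO400M vision tower, the encoder family Idefics2 builds on.
ckpt = "google/siglip-so400m-patch14-384"
processor = SiglipImageProcessor.from_pretrained(ckpt)
encoder = SiglipVisionModel.from_pretrained(ckpt)

image = Image.open("example.jpg")  # any local image
inputs = processor(images=image, return_tensors="pt")
features = encoder(**inputs).last_hidden_state  # (1, num_patches, hidden_dim)
```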
Select the architecture type according to your needs
Regarding architecture choice, the paper discusses the two common options: fully autoregressive and cross-attention.
The fully autoregressive architecture generates each output autoregressively, taking the dependencies of the entire sequence into account; the cross-attention architecture lets the model dynamically focus on different parts of one modality while processing another, enabling more flexible interaction between modalities.
In their experiments, the authors found that which architecture performs better depends on whether the pre-trained backbone is frozen.
(Simply put, if the pre-trained backbone participates in the formal training process it is unfrozen; if it does not, it is frozen.)
If the backbone is not frozen, the fully autoregressive architecture performs better; if it is frozen, the cross-attention architecture does.
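The difference between the two is easiest to see in code. A simplified sketch of both wiring patterns (dimensions and layer layout are illustrative, not the paper's exact designs):

```python
import torch
import torch.nn as nn

# --- Fully autoregressive: image tokens join the text sequence ---
def fully_autoregressive(img_tokens, txt_tokens, llm):
    # (B, N_img, d) and (B, N_txt, d) become one sequence that the
    # LLM attends over end to end.
    return llm(torch.cat([img_tokens, txt_tokens], dim=1))

# --- Cross-attention: text queries attend to image features ---
class CrossAttentionBlock(nn.Module):
    def __init__(self, d=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, txt_tokens, img_tokens):
        # Text acts as the query; image features are keys/values, so each
        # text token can focus on different parts of the image.
        out, _ = self.attn(txt_tokens, img_tokens, img_tokens)
        return txt_tokens + out  # residual connection
```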
As for whether the backbone should be frozen, that depends on what the developers need most.
Under resource constraints, if you need high performance and are sensitive to latency, freezing is more appropriate; if you want the model to have greater flexibility and adaptability, choose unfrozen training.
For Idefics2 specifically, the team chose not to freeze the backbone, and accordingly adopted the fully autoregressive architecture.
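Freezing or unfreezing is mechanically simple; the decision, as above, is about the trade-off. A sketch (the attribute names come from the SimpleVLM sketch earlier, not from Idefics2's real module names):

```python
def set_backbone_trainable(model, trainable: bool) -> None:
    """Freeze (False) or unfreeze (True) a VLM's pre-trained backbones."""
    for module in (model.vision_encoder, model.language_model):
        for param in module.parameters():
            param.requires_grad = trainable

# set_backbone_trainable(model, False)  # frozen: cheaper, pairs with cross-attention
# set_backbone_trainable(model, True)   # unfrozen: pairs with fully autoregressive
```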
Experience in the training phase
Choosing the right architecture is important, but the training process is just as essential. During Idefics2's training, the authors summarized these lessons for our reference:
First, adopt a staged pre-training strategy overall: use lower-resolution images in the initial stage, then introduce higher-resolution PDF documents. This approach gradually builds up the model's multiple capabilities.
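A staged schedule can be expressed as a simple config. The stage names, resolutions, and data mixes below are illustrative assumptions, not the paper's exact values:

```python
# Illustrative staged pre-training schedule: low-resolution images first,
# higher-resolution PDF documents introduced later.
PRETRAINING_STAGES = [
    {"name": "stage1_low_res",  "image_size": 384,
     "data": ["web_docs", "image_text_pairs"]},
    {"name": "stage2_high_res", "image_size": 980,
     "data": ["web_docs", "image_text_pairs", "pdf_ocr"]},
]

for stage in PRETRAINING_STAGES:
    print(f"{stage['name']}: {stage['image_size']}px, sources={stage['data']}")
```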
Second, use learned pooling instead of feeding image features directly into the language model. This significantly reduces the number of image tokens, markedly improves training and inference efficiency, and also brings performance gains.
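A learned-pooling module is essentially a small set of trainable query vectors that cross-attend to the image patch features. A simplified sketch (Idefics2 pools to 64 visual tokens; the layer layout here is reduced to the bare mechanism):

```python
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    """Compress hundreds of image patch tokens down to `n_queries` tokens
    by letting learned queries cross-attend to the patch features."""

    def __init__(self, dim, n_queries=64, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, img_feats):                        # (B, N_patches, dim)
        q = self.queries.unsqueeze(0).expand(img_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, img_feats, img_feats)   # queries attend to patches
        return pooled                                    # (B, n_queries, dim)
```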
Third, data augmentation: one method is to split an image into multiple sub-images and feed them to the model during training. This trades computing time at inference for stronger performance, and is especially effective in tasks such as text recognition, though not all images need to be treated this way.
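The splitting itself is straightforward. A minimal sketch with PIL (the 2x2 grid is an assumption; the grid actually used in training may differ):

```python
from PIL import Image

def split_into_subimages(image: Image.Image, rows: int = 2, cols: int = 2):
    """Split an image into a grid of crops, keeping the original as a
    global view alongside the local tiles."""
    w, h = image.size
    tile_w, tile_h = w // cols, h // rows
    tiles = [
        image.crop((c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h))
        for r in range(rows) for c in range(cols)
    ]
    return [image] + tiles  # e.g. 1 global view + 4 local crops
```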
Fourth, using more diverse data and tasks in the instruction fine-tuning phase can improve the generalization and robustness of the model.
In addition, to stabilize training, when the pre-trained single-modality backbones participate in training (i.e. are not frozen), the authors also use LoRA to adapt the pre-trained parameters.
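With the peft library, applying LoRA to a backbone takes a few lines. The rank, alpha, and target modules below are common defaults, not the paper's exact hyperparameters:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed example backbone; Idefics2's language side starts from Mistral-7B.
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                     # low-rank dimension (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```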
Data diversity and processing strategies
In addition to the training process itself, the selected data will also have a significant impact on the performance of the model.
Starting from the collection stage, take care to select multiple types of data. For example, the data used by Idefics2 spans three categories: interleaved image-text documents (such as web pages), image-text pairs (such as image captions), and PDF documents with OCR annotations.
The proportions of the various data types should also be balanced appropriately according to actual needs, rather than simply split into equal parts.
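In code, a weighted mixture just means sampling sources with unequal probabilities. The weights below are purely illustrative, not Idefics2's actual proportions:

```python
import random

DATA_MIXTURE = {              # tune these to the task, don't split evenly
    "interleaved_web_docs": 0.5,
    "image_text_pairs": 0.3,
    "pdf_ocr": 0.2,
}

def sample_source() -> str:
    """Pick the data source for the next training example."""
    sources, weights = zip(*DATA_MIXTURE.items())
    return random.choices(sources, weights=weights, k=1)[0]
```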
As for the data scale, the more the better if conditions permit. Of course, attention should be paid to filtering out low-quality data.
Of course, collection is only the step that obtains the training data; to train the model well, a certain amount of processing is required.
Use different preprocessing and augmentation strategies for different types of data: OCR data, for example, requires higher-resolution images, while other data can use lower resolutions.
Note that images should keep their original aspect ratio and resolution during processing, which greatly reduces the computational overhead of training and inference while improving the model's adaptability.
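An aspect-ratio-preserving resize is the key primitive here. A minimal sketch (the per-type maximum sides are assumptions for illustration):

```python
from PIL import Image

def resize_keep_aspect(image: Image.Image, max_side: int) -> Image.Image:
    """Shrink so the longest side is at most `max_side`, preserving the
    original aspect ratio instead of squashing to a fixed square."""
    w, h = image.size
    scale = max_side / max(w, h)
    if scale >= 1.0:
        return image  # already small enough; avoid upscaling
    return image.resize((int(w * scale), int(h * scale)),
                        Image.Resampling.BICUBIC)

# e.g. keep OCR-heavy data sharper than generic web images:
# ocr_img = resize_keep_aspect(img, 980)
# web_img = resize_keep_aspect(img, 384)
```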
If these lessons have inspired you, you can read the original paper for more details, and you are welcome to share your own development experience in the comments.
Paper address: https://www.php.cn/link/52c8b8d56837155b4870fc2658b676f0