
HuggingFace teaches you how to make a SOTA visual model

王林 | 2024-06-05

First came OpenAI's GPT-4o, then Google's string of heavy hitters; advanced multimodal large models have been landing one after another.

Other practitioners have been left scrambling, wondering how they can ever catch up with these super models.

In a new paper, HuggingFace and Sorbonne University in France summarize the key lessons of building large vision-language models and point out a path for developers.


These lessons cover model architecture choice, training methods, training data and more, distilled by the authors from extensive comparisons. The core points include:

  • Architecture choice matters a great deal when building a large vision-language model.
  • The language model affects overall performance more than the vision module does.
  • A staged pre-training strategy is better at building up model capabilities.
  • Training data should mix multiple types, with attention paid to the balance among them.

It was by relying on these lessons that HF built Idefics2, a model that is SOTA among vision-language models of its scale.

Idefics2 is built on Mistral-7B, has about 8B parameters in total, and can accurately read handwritten text.
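For readers who want to try it themselves, here is a minimal inference sketch using the standard transformers AutoClasses. It follows the pattern of the public Idefics2 release, but the image URL is a placeholder and details such as the chat-template keys may vary across library versions.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")

# Placeholder URL: substitute any image of handwriting you want to test.
image = Image.open(requests.get("https://example.com/handwriting.jpg", stream=True).raw)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What does the handwritten note say?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```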


Practitioners have praised it as a solid survey report that is genuinely useful for vision-language model developers, while also reminding readers not to treat it as a cure-all.


Of course, some joke that architectures and data are all just fluff, and that what really matters is having GPUs.


There is some truth to that, but joking aside, let's look at the lessons HuggingFace has to offer.

Lessons drawn from SOTA model development

The lessons in the HuggingFace paper come from the development of the vision-language model Idefics2.

Compared with its predecessor Idefics1 and with Flamingo, the former SOTA at the same scale, Idefics2 performs well across multiple datasets, even surpassing larger 13B models.

Meanwhile, compared with MM1, which edges it out on the COCO dataset, Idefics2 consumes significantly fewer tokens per image.


From the hands-on development of Idefics2, the lessons HuggingFace offers cover at least the following aspects:

  • Backbone and architecture selection
  • Training methods and strategies
  • Data diversity and processing strategies

Language models have a greater impact on overall performance

Today's large vision-language models are mostly built as a language model plus a vision encoder, and the authors evaluated each component's impact on overall performance separately.

The results show that the quality of the language model matters more than that of the vision encoder.

With the same number of parameters, a better language model (for example, swapping Llama-7B for Mistral-7B) significantly improves a vision-language model's performance on downstream tasks.

Upgrading the vision encoder brings comparatively limited gains, so when a trade-off must be made, prioritize a stronger language model.


Of course, this does not mean upgrading the vision encoder is useless; if resources permit, a better vision encoder still brings some performance improvement.

The choice should also match the downstream task: for text recognition, use a vision encoder that supports variable resolutions; if the task demands fast inference, pick a lighter-weight model.

In practice, inference speed and memory footprint also need to be weighed; the SigLIP-SO400M encoder chosen for Idefics2 strikes a good balance between performance and efficiency.
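As a point of reference, the two backbones named in the paper are both available on the Hub and can be loaded directly; this sketch only fetches them, and deliberately omits the wiring (projection, pooling, and so on) that turns them into a single vision-language model.

```python
from transformers import AutoModelForCausalLM, SiglipVisionModel

# Public checkpoints corresponding to the backbones the paper builds on.
language_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
vision_encoder = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")

# Their hidden sizes differ, which is why a learned connector is needed in between.
print(language_model.config.hidden_size, vision_encoder.config.hidden_size)
```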

Select the architecture type according to your needs

On architecture, the paper discusses the two common options: fully autoregressive and cross-attention.

A fully autoregressive architecture generates each output autoregressively, taking the dependencies of the whole sequence into account;

the cross-attention architecture lets the model, while processing one modality, dynamically attend to different parts of the other, enabling more flexible interaction between the modalities.
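A toy PyTorch sketch of the two integration styles may help make the distinction concrete. The shapes and module names here are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

hidden = 512
image_tokens = torch.randn(1, 64, hidden)   # pooled visual features
text_embeds  = torch.randn(1, 32, hidden)   # token embeddings from the LM

# Fully autoregressive: project visual features into the LM embedding space and
# prepend them to the text sequence; the decoder then attends over everything.
project = nn.Linear(hidden, hidden)
fused_sequence = torch.cat([project(image_tokens), text_embeds], dim=1)  # (1, 96, hidden)

# Cross-attention: keep the text sequence as-is and let inserted cross-attention
# layers query the visual features from inside the language-model blocks.
cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
fused_text, _ = cross_attn(query=text_embeds, key=image_tokens, value=image_tokens)

print(fused_sequence.shape, fused_text.shape)
```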

In practice, the authors found that which architecture performs better depends on whether the pre-trained backbones are frozen.

(Simply put, a pre-trained backbone that keeps updating during the main training run is non-frozen; one that does not is frozen.)

With unfrozen backbones, the fully autoregressive architecture performs better; with frozen ones, cross-attention wins.


Whether to freeze the backbone depends on what the developer cares about most.

With limited resources, if you need high performance and are sensitive to latency, freezing is the better fit;

if you want greater flexibility and adaptability, choose non-frozen training.

For Idefics2 specifically, the team chose not to freeze the backbones and accordingly adopted the fully autoregressive architecture.
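In code, the frozen versus non-frozen choice is just a toggle on which parameters receive gradients. The sketch below assumes a composed model with `.vision_model` and `.language_model` submodules; those attribute names are illustrative.

```python
def set_backbone_trainable(model, trainable: bool):
    """Freeze or unfreeze the pretrained unimodal backbones of a composed VLM."""
    for module in (model.vision_model, model.language_model):
        for p in module.parameters():
            p.requires_grad = trainable

# Frozen backbones: only the connector / pooling layers are updated, which the
# paper's ablations pair best with a cross-attention architecture.
# set_backbone_trainable(vlm, trainable=False)

# Unfrozen backbones: the whole stack adapts, which favours the fully
# autoregressive architecture but costs more memory and risks instability.
# set_backbone_trainable(vlm, trainable=True)
```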


Lessons from the training phase

Choosing the right architecture matters, but the training process is just as essential. From training Idefics2, the authors distilled the following lessons:

First, adopt a staged pre-training strategy overall: use lower-resolution images in the early stage, then introduce higher-resolution PDF documents. This gradually builds up the model's different capabilities.
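A staged recipe can be expressed as a simple schedule that the training loop iterates over. The stage names, resolution caps, and dataset mixes below are placeholders, not the paper's exact recipe.

```python
# Illustrative two-stage pre-training schedule: cheap low-resolution data first,
# then high-resolution OCR/PDF documents added in the second stage.
PRETRAINING_STAGES = [
    {"name": "stage1_low_res",  "max_image_side": 384, "datasets": ["web_docs", "image_text_pairs"]},
    {"name": "stage2_high_res", "max_image_side": 980, "datasets": ["web_docs", "image_text_pairs", "ocr_pdfs"]},
]

for stage in PRETRAINING_STAGES:
    print(f"{stage['name']}: cap longest image side at {stage['max_image_side']}px, "
          f"mix = {stage['datasets']}")
    # Build the dataloader with this resolution cap and run the training loop here.
```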

Second, use learned pooling instead of feeding image features into the language model directly. This sharply reduces the number of image tokens, improves training and inference efficiency, and also yields a performance gain.
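The idea behind learned pooling is that a small set of learned query vectors cross-attends to the patch features, so the language model only ever sees a fixed, small number of visual tokens per image. A minimal module in that spirit, with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    """Learned queries cross-attend to patch features, compressing them to num_queries tokens."""
    def __init__(self, dim=1152, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_features):            # (batch, num_patches, dim)
        batch = patch_features.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.attn(q, patch_features, patch_features)
        return self.norm(pooled)                  # (batch, num_queries, dim)

pooler = LearnedPooling()
print(pooler(torch.randn(2, 729, 1152)).shape)    # -> torch.Size([2, 64, 1152])
```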

Third, data augmentation: one option is to split an image into multiple sub-images and feed them to the model during training. This trades compute for stronger performance at inference time, and is especially effective for tasks like text recognition, though not every image needs this treatment.
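A sketch of that splitting step, where the grid size and the choice to keep the full image alongside the crops are both illustrative:

```python
from PIL import Image

def split_into_subimages(image: Image.Image, rows: int = 2, cols: int = 2):
    """Cut an image into a rows x cols grid of crops, plus the original image."""
    width, height = image.size
    tile_w, tile_h = width // cols, height // rows
    crops = [
        image.crop((c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h))
        for r in range(rows)
        for c in range(cols)
    ]
    return crops + [image]

# tiles = split_into_subimages(Image.open("page.png"))  # 4 crops + the full image
```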

Fourth, using more diverse data and tasks during instruction fine-tuning improves the model's generalization and robustness.

In addition, to stabilize training when the pre-trained unimodal backbones are not frozen, the authors use LoRA to adapt the pre-trained parameters.
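With the peft library, attaching LoRA adapters looks roughly like the following. The rank, alpha, and target module names are illustrative and depend on the backbone's architecture.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)

# vlm = get_peft_model(vlm, lora_config)   # wraps the model; only LoRA weights train
# vlm.print_trainable_parameters()
```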

Data diversity and processing strategies

Beyond the training process itself, the data you select also has a significant impact on model performance.

Right from the collection stage, care should be taken to include multiple types of data. The data used by Idefics2, for example, falls into three categories: interleaved image-text documents (such as web pages), image-text pairs (such as image captions), and PDF documents with OCR annotations.

The proportions of the different data types should be balanced according to actual needs rather than simply split into equal shares.
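In practice this usually means sampling each training example from a weighted mixture rather than round-robin over equal shares. The category names below match the three listed above; the weights are placeholders to tune for your own run.

```python
import random

sources = {
    "interleaved_web_docs": 0.5,
    "image_text_pairs":     0.3,
    "ocr_pdf_documents":    0.2,
}

def sample_source():
    """Pick the data source for the next training example according to the mixing weights."""
    names, weights = zip(*sources.items())
    return random.choices(names, weights=weights, k=1)[0]

print([sample_source() for _ in range(5)])
```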

As for scale, more is better if resources permit, though low-quality data should of course be filtered out.

Collection, of course, is only the first step; training the model well also requires some processing.

Use different preprocessing and augmentation strategies for different data types: OCR data needs higher-resolution images, while other data can use lower resolutions.

Note that images should keep their original aspect ratio and resolution during processing; this greatly reduces the computational cost of training and inference while improving the model's adaptability.
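Concretely, that means capping the longest side instead of forcing a square resize, as in this small sketch (the 980-pixel cap is illustrative):

```python
from PIL import Image

def resize_keep_aspect(image: Image.Image, max_side: int = 980) -> Image.Image:
    """Shrink an image so its longest side fits max_side, preserving aspect ratio; never upscale."""
    width, height = image.size
    scale = min(1.0, max_side / max(width, height))
    if scale < 1.0:
        image = image.resize((round(width * scale), round(height * scale)))
    return image
```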

If these lessons resonate with you, the original paper has more details, and you are welcome to share your own development experience in the comments.

Paper address: https://www.php.cn/link/52c8b8d56837155b4870fc2658b676f0

