
Too complete! Apple launches new visual model 4M-21, capable of 21 modes

WBOY · Original · 2024-06-25 17:17:19

Current multimodal and multitask foundation models, such as **4M** or **UnifiedIO**, show promising results. However, their out-of-the-box ability to accept different inputs and perform different tasks is limited by the (usually small) number of modalities and tasks they are trained on.

Building on this, researchers from the Ecole Polytechnique Fédérale de Lausanne (EPFL) and Apple have jointly developed a single **any-to-any** model trained on dozens of highly diverse modalities, co-trained on large-scale multimodal datasets and text corpora.

A key step in the training process is discrete **tokenization** of the various modalities, whether they are structured data such as image-like neural network **feature maps**, vectors, instance segmentations, or human poses, or data that can be represented as text.


  • Paper address: https://arxiv.org/pdf/2406.09406

  • Paper homepage: https://4m.epfl.ch/

  • Paper title: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

This study shows that a single model can be trained to complete at least **three times** as many tasks/**modalities** as existing models, without losing performance. In addition, the work achieves finer-grained and more controllable multimodal data generation.

This research builds on the multimodal masked pre-training scheme and improves model capabilities by training on dozens of highly diverse modalities. By encoding each modality with a modality-specific discrete tokenizer, the study is able to train a single unified model across all of them.
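The masked pre-training idea can be sketched as follows: all modalities are tokenized, then a random subset of tokens across modalities serves as input and another random subset as the prediction target. This is a minimal illustrative sketch, not the paper's implementation; the function and budget names are assumptions.

```python
import random

def sample_mask(tokens_per_modality, input_budget, target_budget, seed=0):
    """Split all (modality, position, token) triples into a random input set
    and a disjoint target set, as in multimodal masked pre-training.

    tokens_per_modality: dict mapping modality name -> list of token ids.
    A model would then be trained to predict the target tokens from the inputs.
    """
    rng = random.Random(seed)
    all_tokens = [(m, i, t)
                  for m, toks in tokens_per_modality.items()
                  for i, t in enumerate(toks)]
    rng.shuffle(all_tokens)
    inputs = all_tokens[:input_budget]
    targets = all_tokens[input_budget:input_budget + target_budget]
    return inputs, targets

# Toy token sequences for three modalities.
tokens = {"rgb": [5, 9, 2, 7], "depth": [1, 4], "caption": [12, 3, 8]}
inp, tgt = sample_mask(tokens, input_budget=4, target_budget=3)
```

Because input and target sets are sampled fresh each step, the same architecture learns every modality both as a conditioning signal and as a prediction target.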

Simply put, this research extends the capabilities of existing models in several key dimensions:

  • Modalities: from the 7 modalities of the best existing any-to-any model to 21 different modalities, enabling cross-modal retrieval, controllable generation, and strong out-of-the-box performance. This is the first time a single vision model can solve dozens of different tasks in an any-to-any manner without compromising performance and without traditional multi-task learning.

  • Diversity: Add support for more structured data, such as human poses, SAM instances, metadata, and more.

  • Tokenization: discrete tokenization of different modalities is studied with modality-specific approaches, e.g. for global image embeddings, human poses, and semantic instances.

  • Scaling: the model is scaled to 3B parameters and the dataset to 0.5B samples.

  • Co-training: vision and language are trained jointly at the same time.

Method Introduction

This study uses the 4M pre-training scheme (also from EPFL and Apple, released last year), which has proven to be a general method that extends effectively to new modalities.

Specifically, the paper keeps the architecture and the multimodal masked-training objective unchanged; by scaling up the model and datasets, increasing the type and number of modalities involved in training, and training jointly across multiple datasets, it improves the performance and adaptability of the model.

The modalities fall into the following categories: RGB, geometry, semantics, edges, feature maps, metadata, and text, as shown in the figure below.


Tokenization

Tokenization converts the different modalities and tasks into sequences of discrete tokens, thereby unifying their representation spaces. The researchers use different tokenization methods to discretize modalities with different characteristics, as shown in Figure 3. In summary, three tokenizer families are used: a ViT tokenizer, an MLP tokenizer, and a text tokenizer.
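The routing of modalities to tokenizer families can be sketched as a simple lookup. The grouping below is an assumption for illustration: the exact modality names and assignments are not spelled out here, only the three tokenizer families are.

```python
def pick_tokenizer(modality: str) -> str:
    """Route a modality to one of the three tokenizer families.

    The specific modality-to-family mapping below is illustrative, based on
    the general description (image-like data -> ViT tokenizer, vectors and
    poses -> MLP tokenizer, text-representable data -> text tokenizer).
    """
    image_like = {"rgb", "depth", "normals", "edges", "feature_map", "sam_instances"}
    vector_like = {"global_embedding", "human_pose", "color_palette"}
    text_like = {"caption", "metadata", "bounding_boxes"}

    if modality in image_like:
        return "vit"   # ViT-based tokenizer for spatially structured data
    if modality in vector_like:
        return "mlp"   # MLP tokenizer for vectors and poses
    if modality in text_like:
        return "text"  # text tokenizer for sequence-representable data
    raise ValueError(f"unknown modality: {modality}")
```

Once every modality maps to discrete tokens, the same Transformer can consume and emit any of them interchangeably.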


In terms of architecture, the paper adopts the Transformer-based 4M encoder-decoder architecture and adds extra modality embeddings to accommodate the new modalities.

Experimental results

Next, the paper demonstrates the multimodal capabilities of 4M-21.

Multi-modal generation

Through iterative token decoding, 4M-21 can predict any training modality. As shown in Figure 2, all modalities can be generated in a consistent manner from any given input modality.
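Iterative decoding means the target modality's tokens are not emitted all at once: a few tokens are predicted per step, and each step conditions on everything decoded so far. A minimal sketch, with a toy stand-in for the model's predictor (all names here are assumptions):

```python
def iterative_decode(n_target_tokens, predict, steps=4):
    """Decode target-modality tokens over several steps, each step
    conditioning on the tokens fixed in earlier steps.

    `predict` stands in for the model: (position, decoded_so_far) -> token id.
    """
    decoded = {}
    per_step = max(1, n_target_tokens // steps)
    positions = list(range(n_target_tokens))
    while positions:
        batch, positions = positions[:per_step], positions[per_step:]
        ctx = dict(decoded)          # condition only on previous steps
        for pos in batch:
            decoded[pos] = predict(pos, ctx)
    return [decoded[i] for i in range(n_target_tokens)]

# Toy predictor: token = position + number of tokens already decoded.
out = iterative_decode(8, lambda pos, ctx: pos + len(ctx), steps=4)
```

The decoded token sequence is then fed through the target modality's detokenizer to produce the final image, pose, or text.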

Furthermore, since the model can conditionally or unconditionally generate any training modality from any subset of the other modalities, it supports several forms of fine-grained multimodal generation, as shown in Figure 4, such as multimodal editing. 4M-21 also demonstrates improved text understanding, both on T5-XXL embeddings and on regular captions, enabling geometrically and semantically grounded generation (Figure 4, top right).


Multi-modal retrieval

As shown in Figure 5, 4M-21 unlocks retrieval capabilities not possible with the original DINOv2 and ImageBind models, such as retrieving RGB images or other modalities using any other modality as the query. In addition, 4M-21 can combine multiple modalities to predict a global embedding, giving finer control over retrieval, as shown on the right of the figure.
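Cross-modal retrieval of this kind reduces to nearest-neighbor search over global embeddings. The sketch below illustrates the idea with cosine similarity and averaging as a simple stand-in for the model's predicted joint embedding; the 2-D vectors and function names are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_embedding, gallery):
    """Return gallery keys ranked by similarity to the query embedding."""
    return sorted(gallery,
                  key=lambda k: cosine(query_embedding, gallery[k]),
                  reverse=True)

def combine(*embeddings):
    """Fuse several modality embeddings into one query by averaging
    (a simple stand-in for the model predicting a joint global embedding)."""
    return [sum(vals) / len(vals) for vals in zip(*embeddings)]

# Toy gallery of image embeddings; the query fuses two modality embeddings.
gallery = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [0.7, 0.7]}
ranking = retrieve(combine([1.0, 0.0], [0.0, 1.0]), gallery)
```

Because every modality maps into the same embedding space, the query can be an image, a caption, a pose, or any fused combination of them.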


Out of the box

4M-21 can perform a range of common vision tasks out of the box, as shown in Figure 6.


Table 1 evaluates surface normal and depth estimation on DIODE, semantic and instance segmentation on COCO, 3D human pose estimation on 3DPW, and more.


Transfer experiment

In addition, the authors trained models of three different sizes: B, L, and XL. Their encoders were then transferred to downstream tasks and evaluated in single-modality (RGB) and multimodality (RGB + depth) settings. All transfer experiments discard the decoder and instead train a task-specific head. The results are shown in Table 2:
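The transfer setup described above, keeping the pretrained encoder, dropping the decoder, and attaching a fresh task head, can be sketched like this (the class and the toy stand-ins are illustrative assumptions, not the paper's code):

```python
class TransferModel:
    """Illustrative transfer setup: the pretrained encoder is reused as-is,
    the decoder is discarded, and a new task-specific head is trained on
    the downstream task."""

    def __init__(self, encoder, head):
        self.encoder = encoder  # pretrained (e.g. from 4M-21), usually fine-tuned
        self.head = head        # randomly initialized, trained on the new task

    def __call__(self, x):
        return self.head(self.encoder(x))

# Toy stand-ins: the "encoder" doubles its inputs, the "head" sums them
# into a single scalar prediction.
model = TransferModel(lambda xs: [2 * v for v in xs], sum)
```

In the multimodal (RGB + depth) setting, the encoder would simply receive the tokens of both input modalities, with no architectural change.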


Finally, the paper performs multimodal transfer on NYUv2 and Hypersim semantic segmentation and on 3D object detection on ARKitScenes. As shown in Table 3, 4M-21 takes full advantage of the optional depth input and significantly improves over the baselines.

