


Impressively comprehensive: Apple launches new vision model 4M-21, supporting 21 modalities
Current multimodal and multitask foundation models, such as **4M** and **UnifiedIO**, show promising results. However, their out-of-the-box ability to accept different inputs and perform different tasks is limited by the (usually small) number of modalities and tasks they are trained on.
Building on this, researchers from the École Polytechnique Fédérale de Lausanne (EPFL) and Apple have jointly developed a single any-to-any model trained on dozens of highly diverse modalities, with collaborative training on large-scale multimodal datasets and text corpora.
A key step in the training process is discrete **tokenization** of the various modalities, whether they are structured data such as image-like neural network **feature maps**, vectors, instance segmentations, or human poses, or data that can be represented as text.
Paper address: https://arxiv.org/pdf/2406.09406
Project homepage: https://4m.epfl.ch/
Paper title: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
This study shows that a single model can be trained to handle at least **three times** as many tasks/**modalities** as existing models, without losing performance. The work also achieves finer-grained and more controllable multimodal data generation.
The research builds on the 4M multimodal masked pre-training scheme and improves model capabilities by training on dozens of highly diverse modalities. By encoding them with modality-specific discrete tokenizers, a single unified model can be trained across all of these modalities.
Simply put, this research extends the capabilities of existing models in several key dimensions:
Modalities: from the 7 modalities of the best existing any-to-any model to 21 different modalities, enabling cross-modal retrieval, controllable generation, and strong out-of-the-box performance. This is the first time a single vision model can solve dozens of different tasks in an any-to-any manner without compromising performance and without any traditional multi-task learning.
Diversity: support for more structured data, such as human poses, SAM instances, metadata, and more.
Tokenization: discrete tokenization of different modalities using modality-specific methods, such as global image embeddings, human poses, and semantic instances.
Scaling: model size expanded to 3B parameters and the dataset to 0.5B samples.
Collaborative training: joint training on vision and language simultaneously.
Method Introduction
This study uses the 4M pre-training scheme (also from EPFL and Apple, released last year), which has proven to be a general method that scales effectively to many modalities.
Specifically, the architecture and the multimodal masked-training objective are kept unchanged; performance and adaptability are improved by scaling up the model and datasets, increasing the number and variety of modalities involved in training, and training jointly on multiple datasets.
Modalities are divided into the following categories: RGB, geometry, semantics, edges, feature maps, metadata, and text, as shown in the figure below.
Tokenization
Tokenization converts the different modalities and tasks into sequences of discrete tokens, thereby unifying their representation spaces. The researchers use different tokenization approaches to discretize modalities with different characteristics, as shown in Figure 3. In summary, three kinds of tokenizers are used: a ViT tokenizer, an MLP tokenizer, and a text tokenizer.
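As a rough illustration of what modality-specific discrete tokenization means, the sketch below quantizes continuous vectors against a codebook (the role an MLP-style VQ tokenizer plays for poses or embeddings) and maps text to integer ids. All class names, shapes, and the codebook are hypothetical stand-ins; the actual 4M-21 tokenizers are learned VQ-VAE-style networks.

```python
import numpy as np

class VQTokenizer:
    """Quantizes continuous vectors against a codebook via
    nearest-neighbour lookup (a stand-in for a learned VQ tokenizer)."""
    def __init__(self, codebook: np.ndarray):
        self.codebook = codebook  # shape: (num_codes, dim)

    def encode(self, x: np.ndarray) -> np.ndarray:
        # x: (n, dim) -> (n,) discrete token ids
        dists = ((x[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)

    def decode(self, ids: np.ndarray) -> np.ndarray:
        # token ids -> their codebook vectors (lossy reconstruction)
        return self.codebook[ids]

class TextTokenizer:
    """Minimal word-level text tokenizer; a subword tokenizer
    such as WordPiece is used in practice."""
    def __init__(self):
        self.vocab = {}

    def encode(self, text: str) -> list:
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))            # 16 codes, 4-dim latents
pose_tok = VQTokenizer(codebook)               # e.g. for human-pose vectors
ids = pose_tok.encode(rng.normal(size=(3, 4))) # 3 pose vectors -> 3 token ids
```

Once every modality is reduced to integer token ids like this, a single sequence model can treat them all uniformly.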
In terms of architecture, this work adopts the 4M Transformer-based encoder-decoder architecture and adds additional modality embeddings to accommodate the new modalities.
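The idea of modality embeddings can be sketched as follows: each token's content embedding gets a learned per-modality vector added to it, so the shared Transformer can tell which stream a token came from. The dimensions, vocabulary size, and the `embed` helper are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

dim, vocab = 8, 32
rng = np.random.default_rng(1)
tok_emb = rng.normal(size=(vocab, dim))   # shared token-content embeddings
mod_emb = {                                # one learned vector per modality
    "rgb": rng.normal(size=dim),
    "depth": rng.normal(size=dim),
}

def embed(token_ids: np.ndarray, modality: str) -> np.ndarray:
    # content embedding + modality embedding, broadcast over the sequence
    return tok_emb[token_ids] + mod_emb[modality]

x = embed(np.array([3, 7]), "depth")  # a 2-token depth sequence
```

Adding a new modality then only requires a new tokenizer and a new row in `mod_emb`, leaving the shared Transformer untouched.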
Experimental results
Next, the paper demonstrates the multi-modal capabilities of 4M-21.
Multi-modal generation
Through iterative decoding of tokens, 4M-21 can be used to predict any training modality. As shown in Figure 2, all modalities can be generated in a consistent manner from a given input modality.
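The iterative decoding loop can be illustrated roughly as follows (a MaskGIT-style sketch, not the paper's exact schedule): start from a fully masked token sequence for the target modality and, at each step, keep only the highest-confidence predictions for still-masked positions. The `stub_model` below is a random stand-in for the real network.

```python
import numpy as np

MASK = -1
rng = np.random.default_rng(2)

def stub_model(tokens: np.ndarray):
    # Stand-in for 4M-21: returns a predicted token id and a
    # confidence score for every position in the sequence.
    n = len(tokens)
    return rng.integers(0, 16, n), rng.random(n)

def iterative_decode(seq_len: int = 8, steps: int = 4) -> np.ndarray:
    tokens = np.full(seq_len, MASK)
    for _ in range(steps):
        preds, conf = stub_model(tokens)
        # only masked positions compete for being filled this step
        conf = np.where(tokens == MASK, conf, -np.inf)
        k = int(np.ceil(seq_len / steps))
        for i in np.argsort(conf)[::-1][:k]:
            tokens[i] = preds[i]
    return tokens

out = iterative_decode()  # fully decoded token sequence
```

Because decoding is conditioned on whatever input tokens are present, the same loop serves any input-to-output modality pairing.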
Furthermore, since the model can conditionally and unconditionally generate any training modality from any subset of the other modalities, it supports several ways to perform fine-grained multimodal generation, such as multimodal editing, as shown in Figure 4. 4M-21 also demonstrates improved text understanding, both on T5-XXL embeddings and on regular captions, enabling geometrically and semantically grounded generation (Figure 4, top right).
Multi-modal retrieval
As shown in Figure 5, 4M-21 unlocks retrieval capabilities that are not possible with the original DINOv2 and ImageBind models, such as retrieving RGB images or other modalities by using any other modality as the query. In addition, 4M-21 can combine multiple modalities to predict a global embedding for better control over retrieval, as shown in the image on the right.
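The retrieval setup can be sketched as follows: the model predicts a global embedding (e.g. a DINOv2-style vector) from whatever query modality is available, and retrieval is then a nearest-neighbour search over gallery embeddings by cosine similarity. The embeddings below are random stand-ins for model outputs.

```python
import numpy as np

rng = np.random.default_rng(3)

# Gallery of 100 pre-computed global embeddings, L2-normalized so the
# dot product equals cosine similarity.
gallery = rng.normal(size=(100, 32))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

def retrieve(query_emb: np.ndarray, k: int = 5) -> np.ndarray:
    # query_emb would come from 4M-21 given any query modality
    # (depth, caption, pose, ...); here it is just a vector.
    q = query_emb / np.linalg.norm(query_emb)
    scores = gallery @ q                    # cosine similarity to gallery
    return np.argsort(scores)[::-1][:k]     # indices of top-k matches

idx = retrieve(gallery[42])  # a query identical to gallery item 42
```

Since any modality (or combination of modalities) maps into the same embedding space, the same search works for every query type.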
Out of the box
4M-21 is capable of performing a range of common vision tasks out of the box, as shown in Figure 6.
Table 1 evaluates DIODE surface normal and depth estimation, COCO semantic and instance segmentation, 3DPW 3D human pose estimation, etc.
Transfer experiment
In addition, models of three different sizes were trained: B, L, and XL. Their encoders are then transferred to downstream tasks and evaluated in single-modality (RGB) and multi-modality (RGB + depth) settings. All transfer experiments discard the decoder and instead train a task-specific head. The results are shown in Table 2:
Finally, multimodal transfer is performed on NYUv2 and Hypersim semantic segmentation and on 3D object detection on ARKitScenes. As shown in Table 3, 4M-21 takes full advantage of the optional depth input and significantly improves on the baselines.