
Apple's large model MM1 enters the fray: 30 billion parameters, multimodal, an MoE architecture, and more than half of the authors are Chinese

王林 · 2024-03-15 14:43

Since the start of this year, Apple has noticeably increased its emphasis on, and investment in, generative artificial intelligence (GenAI). At its recent shareholders meeting, Apple CEO Tim Cook said the company plans to make significant progress in GenAI this year. Apple has also announced that it is abandoning its decade-long car-making project, with some members of the car team shifting to GenAI work.

Through these moves, Apple has signaled its determination to strengthen its GenAI efforts. Multimodal GenAI technologies and products are currently drawing intense attention, OpenAI's Sora in particular, and Apple naturally hopes to make a breakthrough in this area.

In the research paper "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training", Apple discloses its work on multimodal pre-training and introduces MM1, a family of multimodal LLMs with up to 30B parameters.


Paper address: https://arxiv.org/pdf/2403.09611.pdf

During the study, the team examined in depth how critical different architectural components and data choices are. Through careful selection of image encoders, vision-language connectors, and various kinds of pre-training data, they distilled several important design guidelines. Specifically, the main contributions of the study are as follows.

First, the researchers ran small-scale ablation experiments on model architecture decisions and pre-training data selection and found several interesting trends. In terms of modeling, design aspects rank in importance as follows: image resolution, then the visual encoder's loss and capacity, and then the visual encoder's pre-training data.

Second, the researchers used three different types of pre-training data: image captions, interleaved image-text documents, and text-only data. They found that interleaved and text-only data matter most for few-shot and text-only performance, while caption data matters most for zero-shot performance. These trends persist after supervised fine-tuning (SFT), indicating that the performance and modeling decisions established during pre-training carry over after fine-tuning.

Finally, the researchers built MM1, a multimodal model family with up to 30 billion parameters (the other sizes are 3 billion and 7 billion) that includes both dense models and mixture-of-experts (MoE) variants. MM1 not only reaches SOTA on pre-training metrics but also remains competitive after supervised fine-tuning on a range of existing multimodal benchmarks.

The pre-trained MM1 performs strongly on captioning and question-answering tasks in few-shot settings, outperforming Emu2, Flamingo, and IDEFICS. After supervised fine-tuning, MM1 also remains highly competitive across 12 multimodal benchmarks.

Thanks to large-scale multimodal pre-training, MM1 performs well at in-context prediction, multi-image reasoning, and chain-of-thought reasoning. MM1 likewise demonstrates strong few-shot learning after instruction tuning.


Method Overview: The Secret to Building MM1

Building a high-performance MLLM (multimodal large language model) is a highly empirical endeavor. Although the high-level architecture design and training procedure are clear, the concrete implementation choices are not always obvious. In this work, the researchers describe in detail the ablations they ran to build high-performance models. They explored three main directions of design decisions:

  • Architecture: The researchers looked at different pre-trained image encoders and explored various ways of connecting them to the LLM.
  • Data: The researchers considered different types of data and their relative mixing weights.
  • Training procedure: The researchers explored how to train the MLLM, including hyperparameters and which parts of the model are trained at which stage.

Ablation settings

Since training a large MLLM consumes a lot of resources, the researchers used a simplified ablation setup. The base ablation configuration is as follows:

  • Image encoder: a ViT-L/14 model trained with a CLIP loss on DFN-5B and VeCap-300M; image size 336×336.
  • Vision-language connector: C-Abstractor with 144 image tokens.
  • Pre-training data: a mix of captioned images (45%), interleaved image-text documents (45%), and text-only data (10%).
  • Language model: a 1.2B-parameter decoder-only Transformer language model.

To evaluate the different design decisions, the researchers measured zero-shot and few-shot (4-shot and 8-shot) performance on a variety of captioning and VQA tasks: COCO Captioning, NoCaps, TextCaps, VQAv2, TextVQA, VizWiz, GQA, and OK-VQA.
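For concreteness, the base setup above can be summarized as a single configuration object. The sketch below is our own restatement, not Apple's code; the field names are illustrative.

```python
from dataclasses import dataclass, field

# Our own compact restatement of the ablation setup described above.
@dataclass
class AblationConfig:
    image_encoder: str = "ViT-L/14 (CLIP loss, DFN-5B + VeCap-300M)"
    image_size: int = 336
    vl_connector: str = "C-Abstractor"
    num_image_tokens: int = 144
    data_mix: dict = field(default_factory=lambda: {
        "captioned_images": 0.45,
        "interleaved_image_text": 0.45,
        "text_only": 0.10,
    })
    llm: str = "1.2B decoder-only Transformer"
    eval_tasks: tuple = ("COCO Captioning", "NoCaps", "TextCaps", "VQAv2",
                         "TextVQA", "VizWiz", "GQA", "OK-VQA")
    eval_shots: tuple = (0, 4, 8)

print(AblationConfig())
```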

Model Architecture Ablation Experiment

The researchers analyzed the components that enable the LLM to process visual data. Specifically, they studied (1) how best to pre-train a visual encoder, and (2) how to bridge visual features into the LLM's space (see Figure 3, left).


  • Image encoder pre-training. Here the researchers mainly ablated the importance of image resolution and of the image encoder's pre-training objective. Note that, unlike the other ablation experiments, a 2.9B LLM (instead of 1.2B) was used to ensure enough capacity for some of the larger image encoders.
  • Encoder lesson: Image resolution has the greatest impact, followed by model size and training data composition. As Table 1 shows, increasing the image resolution from 224 to 336 improves all metrics for all architectures by roughly 3%. Going from ViT-L to ViT-H doubles the parameter count, but the performance gain is modest, typically under 1%. Finally, adding VeCap-300M, a synthetic caption dataset, improves few-shot performance by more than 1% (a quick token-count calculation follows this list).
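A rough calculation (ours, not the paper's) helps explain why resolution has such a large effect: a ViT-L/14 encoder emits one token per 14×14 patch, so the patch-token count grows quadratically with resolution.

```python
# Patch-token count of a ViT with square inputs and square patches.
# ViT-L/14 uses 14x14 patches, so 224px -> 256 tokens and 336px -> 576 tokens
# (the C-Abstractor later compresses these to 144 visual tokens).
def vit_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    return (image_size // patch_size) ** 2

print(vit_patch_tokens(224))  # 256
print(vit_patch_tokens(336))  # 576
```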


  • Vision-language connector and image resolution. The goal of this component is to transform visual representations into the LLM's space. Since the image encoder is a ViT, its output is either a single embedding or a grid of embeddings corresponding to patches of the input image. The spatial arrangement of image tokens therefore needs to be converted into the LLM's sequential arrangement, and the image token representations themselves must be mapped into the word embedding space (a generic sketch of such a connector follows this list).
  • VL connector lesson: The number of visual tokens and the image resolution matter most, while the type of VL connector has little impact. As Figure 4 shows, zero-shot and few-shot performance both increase as the number of visual tokens and/or the image resolution increases.
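To make the grid-to-sequence step concrete, here is a minimal, generic connector sketch in PyTorch. It is not C-Abstractor itself; it simply pools the ViT patch grid down to 144 tokens and projects them into the LLM embedding dimension. The dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleVLConnector(nn.Module):
    """Generic vision-language connector sketch (not C-Abstractor): pool a grid of
    ViT patch embeddings to a fixed number of visual tokens, then project them
    into the LLM's word-embedding space."""
    def __init__(self, vit_dim: int = 1024, llm_dim: int = 2048, num_tokens: int = 144):
        super().__init__()
        side = int(num_tokens ** 0.5)            # 144 tokens -> 12x12 grid
        self.pool = nn.AdaptiveAvgPool2d(side)   # spatial pooling over the patch grid
        self.proj = nn.Linear(vit_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, vit_dim), e.g. 576 patches for a 336px ViT-L/14
        b, n, d = patch_embeds.shape
        side_in = int(n ** 0.5)
        x = patch_embeds.transpose(1, 2).reshape(b, d, side_in, side_in)
        x = self.pool(x)                          # (b, d, 12, 12)
        x = x.flatten(2).transpose(1, 2)          # (b, 144, d)
        return self.proj(x)                       # (b, 144, llm_dim)

# Example: SimpleVLConnector()(torch.randn(2, 576, 1024)).shape -> (2, 144, 2048)
```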


Pre-training data ablation experiment

Generally, model training is divided into two stages: pre-training and instruction tuning. The former uses web-scale data, while the latter uses task-specific curated data. The following focuses on the pre-training stage and details the researchers' data choices (Figure 3, right).

Two types of data are commonly used to train MLLMs: caption data, consisting of paired images and text descriptions, and interleaved image-text documents from the web. Table 2 lists the full set of datasets:



  • Data lesson 1: Interleaved data helps improve few-shot and text-only performance, while caption data improves zero-shot performance. Figure 5a shows the results for different combinations of interleaved and caption data.
  • Data lesson 2: Text-only data helps improve few-shot and text-only performance. As Figure 5b shows, combining text-only data with caption data improves few-shot performance.
  • Data lesson 3: Carefully blending image and text data yields optimal multimodal performance while retaining strong text performance. Figure 5c tries several mixing ratios between image data (caption and interleaved) and text-only data (a small sampling sketch follows this list).
  • Data lesson 4: Synthetic data helps with few-shot learning. As Figure 5d shows, synthetic data yields a notable improvement in few-shot performance, with absolute gains of 2.4% and 4%, respectively.
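Here is a minimal sketch (ours, not the paper's pipeline) of how such a mixing ratio might be applied when sampling pre-training documents; the source names are placeholders rather than the paper's dataset identifiers.

```python
import random

# Illustrative only: sample pre-training sources according to a 45% / 45% / 10% mix.
MIX = {"captioned_images": 0.45, "interleaved_image_text": 0.45, "text_only": 0.10}

def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its mixing weight."""
    r = rng.random()
    cumulative = 0.0
    for name, weight in MIX.items():
        cumulative += weight
        if r < cumulative:
            return name
    return name  # guard against floating-point rounding

rng = random.Random(0)
counts = {name: 0 for name in MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 4500 / 4500 / 1000
```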


Final model and training method

Drawing on the preceding ablation results, the researchers settled on the final recipe for MM1's multimodal pre-training:

  • Image encoder: Given the importance of image resolution, the researchers used a ViT-H model at 378×378 px, pre-trained with the CLIP objective on DFN-5B;
  • Vision-language connector: Since the number of visual tokens matters most, the researchers used a VL connector with 144 tokens. The specific architecture seems to matter little, so they chose C-Abstractor;
  • Data: To maintain both zero-shot and few-shot performance, the researchers used the following carefully combined mix: 45% interleaved image-text documents, 45% image-text caption documents, and 10% text-only documents.

To improve performance, the researchers scaled the LLM to 3B, 7B, and 30B parameters. All models were pre-trained fully unfrozen with a batch size of 512 sequences, a sequence length of 4096, at most 16 images per sequence, and a resolution of 378×378. All models were trained with the AXLearn framework.
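As a back-of-the-envelope check (our arithmetic, not a figure from the paper), these settings fit comfortably within a packed sequence: 16 images × 144 visual tokens per image = 2,304 visual tokens, leaving roughly 1,800 of the 4,096 tokens in each sequence for text.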

They performed a grid search over learning rates at small scale (9M, 85M, 302M, and 1.2B parameters) and used linear regression in log space to extrapolate from the smaller models to the larger ones (see Figure 6). The result is a formula that predicts the optimal peak learning rate η from the number of (non-embedding) parameters N; in log space the fit takes the form η = exp(a·ln N + b), with the fitted coefficients reported in the paper.
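The sketch below illustrates this log-space extrapolation procedure in general terms. It is our own code, not Apple's; the small-scale learning-rate values are made-up placeholders, and only the functional form of the fit follows from the text.

```python
import numpy as np

# Illustrative sketch of the scaling procedure described above: fit a line in
# log-log space between model size and the best peak learning rate found at
# small scale, then extrapolate to larger models.
params = np.array([9e6, 85e6, 302e6, 1.2e9])    # non-embedding parameter counts
best_lr = np.array([1.5e-3, 9e-4, 6e-4, 4e-4])  # hypothetical grid-search winners

slope, intercept = np.polyfit(np.log(params), np.log(best_lr), deg=1)

def predict_peak_lr(n_params: float) -> float:
    """Extrapolated peak learning rate eta for a model with n_params parameters."""
    return float(np.exp(slope * np.log(n_params) + intercept))

for n in (3e9, 7e9, 30e9):
    print(f"{n:.0e} params -> predicted peak lr {predict_peak_lr(n):.2e}")
```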

Scaling via mixture-of-experts (MoE). In their experiments, the researchers further explored scaling the dense model by adding more experts to the FFN layers of the language model.

Converting a dense model to MoE simply replaces the dense language decoder with an MoE language decoder. To train the MoE models, the researchers used the same training hyperparameters and the same training settings as the dense backbone, including the training data and training tokens.
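As an illustration of the general idea, the sketch below shows a generic top-k routed MoE feed-forward layer that could stand in for a Transformer block's dense FFN. It is not Apple's implementation; the expert count, top-k value, and routing details are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TopKMoEFFN(nn.Module):
    """Generic sparse MoE feed-forward sketch: a router picks the top-k experts
    per token and mixes their outputs with softmax-normalized gate weights."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 32, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        gate_logits = self.router(x)                            # (b, s, num_experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Dense loop over experts for clarity; real implementations dispatch tokens sparsely.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (indices[..., slot] == e).unsqueeze(-1)  # tokens routed to expert e
                out = out + mask * weights[..., slot:slot + 1] * expert(x)
        return out

# Converting a dense decoder to MoE then amounts to swapping each block's dense FFN
# for a layer like this one.
```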

For the multimodal pre-training results, the researchers evaluated the pre-trained models on captioning and VQA tasks with appropriate prompts. Table 3 reports the zero-shot and few-shot evaluations:


Supervised fine-tuning results

Finally, the researchers describe supervised fine-tuning (SFT) experiments trained on top of the pre-trained models.

Following LLaVA-1.5 and LLaVA-NeXT, they collected about 1 million SFT samples from various datasets. Since higher image resolution intuitively leads to better performance, the researchers also adopted an SFT approach extended to high resolution.
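The paper's exact high-resolution recipe is not reproduced here; below is a generic sketch, in the spirit of LLaVA-NeXT-style preprocessing, of splitting a large image into sub-images plus a downscaled overview so each crop can be fed to a fixed-resolution encoder. The grid size and resolutions are illustrative assumptions, not values from the paper.

```python
from PIL import Image

def split_for_high_res(img: Image.Image, grid: int = 2, crop_size: int = 378):
    """Illustrative high-resolution preprocessing (assumed, not the paper's recipe):
    return a downscaled overview plus grid x grid crops at the encoder resolution.
    Aspect-ratio handling is omitted for brevity."""
    overview = img.resize((crop_size, crop_size))
    big = img.resize((crop_size * grid, crop_size * grid))
    crops = [
        big.crop((c * crop_size, r * crop_size, (c + 1) * crop_size, (r + 1) * crop_size))
        for r in range(grid)
        for c in range(grid)
    ]
    return [overview] + crops  # 1 overview + 4 crops, each 378x378
```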

The results of supervised fine-tuning are as follows:

Table 4 shows the comparison with SOTA models; "-Chat" denotes the MM1 models after supervised fine-tuning.

First, on average, MM1-3B-Chat and MM1-7B-Chat outperform all listed models of the same size, performing particularly well on VQAv2, TextVQA, ScienceQA, MMBench, and the more recent benchmarks (MMMU and MathVista).

Second, the researchers explored two MoE models: 3B-MoE (64 experts) and 6B-MoE (32 experts). Apple's MoE models achieve better performance than their dense counterparts on almost all benchmarks, pointing to significant potential for further MoE scaling.

Third, at the 30B scale, MM1-30B-Chat outperforms Emu2-Chat-37B and CogVLM-30B on TextVQA, SEED, and MMMU. MM1 also achieves competitive overall performance compared with LLaVA-NeXT.

However, LLaVA-NeXT supports neither multi-image inference nor few-shot prompting, because each image is represented by 2,880 tokens fed into the LLM, whereas MM1 uses only 720 tokens in total per image. This limits certain applications involving multiple images.



Impact of image resolution: Figure 7b shows how the input image resolution affects average performance on the SFT evaluation metrics.

Impact of pre-training: Figure 7c shows that model performance continues to improve as the amount of pre-training data increases.


For more research details, please refer to the original paper.

