The birth of Cambrian-1: Xie Saining and Yann LeCun's team release the most powerful open-source multimodal LLM
Just as eyes gave early animals new capabilities, Cambrian-1 from Yann LeCun's team is designed to give AI strong visual representation learning capabilities.
Philosophers have long explored this question: does understanding the meaning of language need to be grounded in the senses? Although they disagree, one thing is clear: solid and effective sensory grounding can at least help.
For example, scientists generally believe that the emergence of vision during the Cambrian explosion was a key step in the evolution of early animals; it not only helped animals find food and avoid predators, but also drove the evolution of the animals themselves. In fact, most knowledge in humans (and nearly all animals) is acquired through sensory experiences that interact with the physical world, such as sight, hearing, touch, taste, and smell. These sensory experiences are the basis for our understanding of the world around us and are key to helping us act and make decisions.
These ideas are not only useful for exploring philosophical concepts, but also have practical value, especially now that the development of multimodal large language models (MLLMs) has brought visual representation learning and language understanding to the core of practical applications. Language models exhibit very strong scaling behavior, and recent advances in multimodal learning have largely benefited from bigger and better LLMs.
On the other hand, design choices for visual components are still not fully explored, and exploration in this area is somewhat disconnected from research on visual representation learning. This is mainly because research in this area is very difficult: MLLM involves complex training and evaluation processes, and there are many design choices to consider.
Recently, the team of Xie Saining and Yann LeCun at New York University explored vision-centric MLLMs to fill this gap, and built the Cambrian-1 series of models based on the results of this exploration. (The paper has three co-first authors: Shengbang Tong, Ellis Brown, and Penghao Wu.)
Paper title: Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper address: https://arxiv.org/pdf/2406.16860
Website: https://cambrian-mllm.github.io
Code: https://github.com/cambrian-mllm/cambrian
Model: https://huggingface.co/nyu-visionx/
Data: https://huggingface.co/datasets/nyu-visionx/Cambrian-10M
CV-Bench: https://huggingface.co/datasets/nyu-visionx/CV-Bench
Evaluation: https://github.com/cambrian-mllm/cambrian
Specifically, they use MLLM instruction fine-tuning as an evaluation protocol for a variety of visual representations, as shown in Figure 1.
The team said: “The motivation for our study stems from two potential problems in current multimodal learning research: 1) excessive and premature reliance on language, which can act as a shortcut that compensates for shortcomings in learning effective visual representations; 2) existing benchmarks may not provide sufficient guidance for real-world scenarios, where visual grounding is crucial for robust multimodal understanding.”
These concerns are not unfounded: researchers have already begun to notice that visual grounding is becoming a bottleneck in applying MLLMs to some challenging real-world applications.
Looked at from another angle, traditional visual representation learning evaluation protocols have become saturated and fail to reflect the diverse perceptual difficulties found in real-world distributions. Using language in the form of visual question answering (VQA), on the other hand, provides a flexible and robust evaluation protocol.
The goal of this study by Xie Saining and Yann LeCun’s team is to explore this new protocol design and gain new insights from it to guide future visual representation development. Furthermore, to better evaluate visual representations in this comprehensive setting, they also developed a vision-centric MLLM benchmark CV-Bench by converting traditional vision benchmarks into VQA format.
Cambrian-1 is built on five key pillars, each of which provides important insights into the design of MLLM:
Visual representation: The team explored a number of different visual encoders and their combinations;
Connector design: They designed a new type of connector that is dynamic and spatially aware, which integrates visual features with the LLM while reducing the number of tokens.
Instruction fine-tuning data: They compiled high-quality visual instruction fine-tuning data based on public data sources, which particularly emphasized the importance of distribution balance.
Instruction Fine-Tuning Recipes: They discuss strategies and practical measures for instruction fine-tuning.
Benchmark evaluation: They analyzed existing MLLM benchmarks and intuitively divided them into 4 groups, and then proposed a new vision-centric benchmark CV-Bench.
Building on these pillars, the team built the Cambrian-1 series of models, which lead on multiple benchmarks and are particularly good at vision-centric tasks. The team also released the study’s model weights, open source code, data sets, and detailed plans for model training and evaluation.
Multimodal LLM basics
Key components of MLLM research include large language models, visual encoders, multimodal connectors, data assembly processes, instruction fine-tuning strategies, evaluation and benchmarking. Please refer to the original paper for specific instructions and related research.
Evaluating visual representation through MLLM
The visual encoders currently used in MLLMs are mostly CLIP models, because CLIP is already pre-aligned with language and is easy to adapt to the LLM token space. However, strong language priors can be a double-edged sword: they can compensate for shortcomings in learning effective visual representations, but they can also curtail the insights gained from extensive research on visual representation learning.
The team systematically evaluated the impact of various visual encoder choices (see Figure 2) on the multi-modal capabilities of MLLM.
They also advocate using MLLM evaluation as a robust framework for evaluating visual representation methods, one that more faithfully reflects the diverse perceptual challenges of real-world scenarios and can thus better guide the development of visual representations. Below we briefly introduce the research process and findings; for more details, please refer to the original paper.
Analysis Benchmark
Based on 23 different visual backbone networks, the team trained MLLMs using a two-stage instruction fine-tuning process: first training the connector on 1.2M adapter data from ShareGPT-4V, and then fine-tuning the connector and the LLM together on 737K instruction-tuning data.
By comparing the performance of the model with or without visual input (see Figure 3), the team made the following findings:
Finding 1: Most benchmarks fail to accurately measure vision-centric capabilities, and the few benchmarks that can contain only a very small number of samples.
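Conceptually, the analysis behind this finding amounts to measuring how much each benchmark score actually depends on seeing the image. The sketch below illustrates the idea; the `evaluate` callable is a hypothetical stand-in for the team's evaluation harness, not their code.

```python
# Sketch of the vision-dependence check behind Finding 1 (hypothetical helper).
# For each benchmark, compare accuracy with the image provided vs. withheld;
# a small gap suggests the benchmark can largely be solved from language priors alone.

from typing import Callable, Dict, List


def vision_dependence(
    benchmarks: Dict[str, List[dict]],              # benchmark name -> list of QA samples
    evaluate: Callable[[List[dict], bool], float],  # (samples, use_image) -> accuracy (assumed)
) -> Dict[str, float]:
    gaps = {}
    for name, samples in benchmarks.items():
        acc_with_vision = evaluate(samples, True)       # normal multimodal evaluation
        acc_without_vision = evaluate(samples, False)   # image withheld
        gaps[name] = acc_with_vision - acc_without_vision
    return gaps

# Benchmarks with a near-zero gap are barely measuring vision-centric capability.
```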
Cambrian Vision-Centric Benchmark (CV-Bench)
To address the limitations of existing vision-centric benchmarks, the team proposed CV-Bench. It contains 2638 human-inspected samples, which is far more than other vision-centric MLLM benchmarks - 3.5x more than RealWorldQA and 8.8x more than MMVP.
As shown in Figure 4 and Table 1, CV-Bench can evaluate 2D understanding ability through spatial relationships and target counts, and can evaluate 3D understanding ability through depth order and relative distance.
Finding 2: Existing vision benchmarks can be effectively adapted for VQA tasks, enabling evaluation of vision-centric MLLM capabilities.
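As an illustration of what such an adaptation can look like, here is a minimal sketch that turns a detection-style annotation into a multiple-choice counting question in the spirit of CV-Bench; the field names and option format are assumptions, not the benchmark's actual schema.

```python
# Sketch: converting an object-detection style annotation into a multiple-choice
# counting question, in the spirit of CV-Bench (field names are illustrative).

def make_counting_question(image_id: str, objects: list[dict], category: str) -> dict:
    count = sum(1 for obj in objects if obj["category"] == category)
    # Build a small set of distractor options around the true count.
    options = sorted({count, max(0, count - 1), count + 1, count + 2})
    letters = ["A", "B", "C", "D"]
    choices = {letters[i]: str(v) for i, v in enumerate(options)}
    answer = next(k for k, v in choices.items() if v == str(count))
    return {
        "image": image_id,
        "question": f"How many {category}s are in the image? "
                    + " ".join(f"({k}) {v}" for k, v in choices.items()),
        "answer": answer,
    }

# Toy example:
sample = make_counting_question(
    "000123.jpg",
    [{"category": "chair"}, {"category": "chair"}, {"category": "table"}],
    "chair",
)
print(sample["question"], "->", sample["answer"])
```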
Instruction fine-tuning scheme
An MLLM starts from a pretrained LLM and a pretrained visual backbone network, which are then connected through a connector such as an MLP projector. The team explored different instruction fine-tuning schemes through extensive experiments and made the following findings.
Regarding the choice between single-stage training and dual-stage training, the team found:
Finding 3: Dual-stage training is beneficial; using more adapter data can further improve the results.
In terms of whether to freeze the visual encoder, the team found:
Finding 4: There are many benefits to not freezing the visual encoder. Language-supervised models are always beneficial; SSL models are especially beneficial on vision-centric benchmarks.
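To make the two-stage recipe and the freezing choices concrete, here is a minimal PyTorch-style sketch under assumed module names (`vision_encoder`, `connector`, `llm`); the learning rates are illustrative, not the paper's exact hyperparameters.

```python
# Sketch of the two-stage fine-tuning recipe. Stage 1 trains only the connector on
# adapter data; stage 2 fine-tunes connector and LLM together, optionally unfreezing
# the vision encoder as suggested by Finding 4. Module names and LRs are illustrative.

import torch


def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable


def configure_stage(vision_encoder, connector, llm, stage: int, unfreeze_vision: bool = False):
    if stage == 1:
        # Stage 1: align visual features to the LLM token space; only the connector learns.
        set_trainable(vision_encoder, False)
        set_trainable(connector, True)
        set_trainable(llm, False)
    else:
        # Stage 2: joint instruction fine-tuning.
        set_trainable(vision_encoder, unfreeze_vision)
        set_trainable(connector, True)
        set_trainable(llm, True)
    trainable = [p for m in (vision_encoder, connector, llm)
                 for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-3 if stage == 1 else 2e-5)
```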
Using MLLM as a visual representation evaluator
The team studied the use of MLLM to evaluate visual representations. The results are shown in Figure 6. The findings are as follows:
Finding 5: High-resolution encoders significantly improve performance on chart-related and vision-centric benchmarks, and convolutional network-based architectures are well suited to such tasks.
They also studied whether, with continued MLLM fine-tuning, self-supervised models can reach performance similar to language-supervised models. The results are shown in Figure 7.
Finding 6: Language supervision has strong advantages, but with enough data and appropriate fine-tuning, the performance gap can be reduced through SSL methods.
Combining multiple visual encoders
The team also explored the possibility of combining multiple visual encoders to build a more powerful MLLM, and the results are shown in Table 3.
Finding 7: Combining multiple visual encoders (including visual SSL models) improves MLLM performance on a variety of different benchmarks, especially for vision-centric tasks.
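The simple way to combine encoders implied here, and which SVA (next section) is designed to improve on, is to interpolate each encoder's feature map to a common grid and concatenate along the channel dimension. A minimal sketch, with illustrative dimensions:

```python
# Sketch of straightforward multi-encoder fusion: resize every encoder's feature map
# to a shared spatial grid, concatenate along channels, and flatten into a token
# sequence. The interpolation step is the information loss that SVA avoids.

import torch
import torch.nn.functional as F


def concat_vision_features(feature_maps: list[torch.Tensor], target_hw: int = 24) -> torch.Tensor:
    """feature_maps: list of (B, C_i, H_i, W_i) tensors from different encoders."""
    resized = [
        F.interpolate(f, size=(target_hw, target_hw), mode="bilinear", align_corners=False)
        for f in feature_maps
    ]
    fused = torch.cat(resized, dim=1)        # (B, sum(C_i), H, W)
    return fused.flatten(2).transpose(1, 2)  # (B, H*W, sum(C_i)) token sequence

# Example with two dummy encoders of different resolution and width:
tokens = concat_vision_features([torch.randn(1, 1024, 24, 24), torch.randn(1, 1536, 32, 32)])
print(tokens.shape)  # torch.Size([1, 576, 2560])
```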
Spatial Vision Aggregator (SVA): A new design of connectors
To effectively aggregate features from multiple visual encoders and avoid the information loss introduced by interpolation, they use a set of learnable latent queries that interact with multiple visual features through cross-attention layers.
Specifically, the new approach integrates two new vision-centric design principles:
Introducing spatial inductive bias by explicitly defining the aggregation space for each token in the query.
Aggregating visual features multiple times across LLM layers allows the model to repeatedly access and integrate necessary visual information.
This new construction method can flexibly adapt to multiple visual encoders with different feature resolutions, while preserving the spatial structure of the visual data during aggregation and integration with LLM.
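A minimal sketch of the core mechanism, learnable queries cross-attending to the projected features of several encoders, is shown below. It omits the local-window restriction and the repeated aggregation across LLM layers, and the dimensions are illustrative; it is not the authors' implementation.

```python
# Sketch of the core SVA idea: a grid of learnable queries cross-attends to the
# feature maps of several visual encoders and produces a fixed number of visual
# tokens for the LLM. Dimensions and hyperparameters are illustrative.

import torch
import torch.nn as nn


class SimpleSpatialAggregator(nn.Module):
    def __init__(self, d_model: int, encoder_dims: list[int], num_query_tokens: int = 576):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_query_tokens, d_model) * 0.02)
        # Project each encoder's features into a shared model dimension.
        self.projections = nn.ModuleList([nn.Linear(c, d_model) for c in encoder_dims])
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, feature_maps: list[torch.Tensor]) -> torch.Tensor:
        """feature_maps: list of (B, N_i, C_i) token sequences, one per encoder."""
        batch = feature_maps[0].shape[0]
        keys = torch.cat(
            [proj(f) for proj, f in zip(self.projections, feature_maps)], dim=1
        )                                                  # (B, sum(N_i), d_model)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        tokens, _ = self.cross_attn(queries, keys, keys)   # (B, num_query_tokens, d_model)
        return tokens


# Example with two dummy encoders (576 and 1024 tokens of width 1024 and 1536):
agg = SimpleSpatialAggregator(d_model=4096, encoder_dims=[1024, 1536])
out = agg([torch.randn(2, 576, 1024), torch.randn(2, 1024, 1536)])
print(out.shape)  # torch.Size([2, 576, 4096])
```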
Using a combination of the best vision models from the previous section and a Vicuna-1.5-7B base LLM, the team demonstrated the utility of the SVA module.
Table 4 shows that SVA outperforms the two baseline techniques it was compared against on all benchmark categories, with large improvements on the OCR and chart category, which requires understanding high-resolution features.
Going a step further, they conducted ablation experiments based on the combination of OpenAI CLIP ViT-L/14@336 + OpenCLIP ConvNeXt-L@1024. The results are shown in Table 5.
Finding 8: Spatial inductive bias and deep interaction between the LLM and visual features help to better aggregate and condense visual features.
Instruction fine-tuning data for training MLLM
Data collection
Collect instruction fine-tuning data from existing data sources:
The team used both multimodal benchmarks and datasets involving visual interaction data (for example, visual question answering (VQA) and OCR data), and also collected a small amount of high-quality language-only instruction-following data. They separated the data into different categories: general conversation, OCR, counting, code, math, science, and language-only data. Figure 9 shows the data sources.
Targeted Internet data collection engine: As shown in Figure 9, the distribution of data is unbalanced.
To create large-scale, reliable, high-quality knowledge-based instruction fine-tuning data, the team proposed a data engine. The engine can pick a target domain and subdomain (such as physics) and then use an LLM like GPT-4 to identify topics (such as Newton's laws). It then searches reliable information sources such as Wikipedia for each topic. The team found that the image-text pairs extracted from Wikipedia were of high quality.
After that, the team used a parser to extract image-description pairs, then fed the description text to an LLM such as GPT-3.5 to generate instruction-style question-answer pairs about the image through carefully designed prompts. These question-answer pairs and images form their VQA dataset.
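A rough sketch of this pipeline is shown below. The `llm`, `search_wikipedia`, and `parse_image_caption_pairs` callables are hypothetical placeholders for the GPT-4/GPT-3.5 calls and the Wikipedia parser the team describes; the prompts are illustrative.

```python
# Sketch of the targeted data engine: enumerate topics for a field, pull pages from a
# reliable source, extract image-caption pairs, and turn each caption into an
# instruction-style Q&A pair. All helpers here are hypothetical placeholders.

def build_vqa_data(field, llm, search_wikipedia, parse_image_caption_pairs):
    # 1) Have a strong LLM (e.g. GPT-4) enumerate topics for the chosen field.
    topics = llm(f"List important topics in {field}, one per line.").splitlines()

    dataset = []
    for topic in topics:
        # 2) Fetch pages from a reliable source and extract image-caption pairs.
        for page in search_wikipedia(topic):
            for image, caption in parse_image_caption_pairs(page):
                # 3) Ask an LLM (e.g. GPT-3.5) for a question-answer pair about the image.
                qa = llm(
                    "Given this image description, write one question that can be "
                    f"answered from the image, followed by the answer.\n\nDescription: {caption}"
                )
                dataset.append({"image": image, "topic": topic, "qa": qa})
    return dataset
```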
Cambrian-10M: They created a large instruction fine-tuning data pool and named it Cambrian-10M, which contains approximately 9784k data points. Figure 9 shows its composition.
Data reorganization
In order to improve data balance and adjust data proportion (see Figures 10 and 11), the team reorganized Cambrian-10M.
The result is a smaller but higher-quality dataset, Cambrian-7M. Tables 6 and 7 illustrate the benefit of reorganizing the instruction data: although Cambrian-7M contains fewer samples, it yields better performance.
Alleviating the "answering machine phenomenon" through system prompts
They also studied the so-called answering machine phenomenon: a well-trained MLLM may be good at handling VQA benchmarks, yet lack basic conversational ability and default to short, stilted responses. The reason is that the responses required by benchmark questions are often limited to a single option or word, unlike more general, realistic use cases. Similar phenomena have been observed in other LLM studies.
They speculate that the cause of this problem is that the instruction fine-tuning data contains too many short-response VQA tasks, which can lead to catastrophic forgetting in LLM.
To solve this problem, the team integrated additional system prompts during training. For example, for questions whose response is a single word or phrase, they append something like "Use a single word or phrase to answer this question" to the prompt. They found that such system prompts significantly improve the model's conversational capabilities while maintaining its benchmark performance. Figure 12 gives an example.
In addition, system prompts can also improve reasoning ability by encouraging the model to use chains of thought.
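A minimal sketch of how such a prompt can be attached during data construction is shown below; the threshold for deciding what counts as a "short" answer is an assumption for illustration.

```python
# Sketch of the system-prompt fix for the "answering machine" effect: short-answer
# benchmark samples get an explicit formatting hint, so terse answers become tied to
# that instruction rather than being the model's default style.

SHORT_ANSWER_HINT = "Use a single word or phrase to answer this question."


def add_response_format_prompt(sample: dict) -> dict:
    """sample: {"question": str, "answer": str} from an instruction-tuning source."""
    if len(sample["answer"].split()) <= 3:  # heuristic for short-response VQA data
        sample = dict(sample)
        sample["question"] = f'{sample["question"]}\n{SHORT_ANSWER_HINT}'
    return sample


print(add_response_format_prompt({"question": "What color is the car?", "answer": "Red"}))
```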
Best performance yet
Finally, using the insights gained during the exploratory study, the team trained a new family of MLLM models: Cambrian-1. They trained the models using LLM backbone networks of different sizes: LLaMA-3-Instruct-8B, Vicuna-1.5-13B, Hermes-2-Yi-34B.
Their vision component combines four models through the Spatial Vision Aggregator (SVA): OpenAI CLIP ViT-L/14@336, SigLIP ViT-SO400M/14@384, OpenCLIP ConvNeXt-XXL@1024, and DINOv2 ViT-L/14@518. They pre-trained the connector on 2.5M adapter data and then fine-tuned it on the Cambrian-7M data mix.
Table 8 and Figure 13 give the model evaluation results.
As you can see, Cambrian-1 surpasses open-source models such as LLaVA-NeXT and Mini-Gemini. Thanks to SVA, Cambrian-1 also handles tasks requiring high-resolution image processing very well, even though it uses only 576 image tokens, about 1/5 of the number of tokens used by LLaVA-NeXT and Mini-Gemini.
Cambrian-1 also achieves comparable performance to the best proprietary models such as GPT-4V, Gemini-Pro and MM-1 on multiple benchmarks.
Figure 14 gives some examples, and you can see that although Cambrian-1 only uses 576 tokens, it can effectively pay attention to the details in the image.
In addition, it can be seen from the naming of Cambrian-1 that this is an ambitious team. Let us look forward to the next generation upgrade of this series of models.