
BLIP-2 and InstructBLIP are firmly in the top three! Twelve major models, sixteen lists, comprehensive evaluation of "multimodal large language model"


Multimodal Large Language Models (MLLMs) rely on the rich knowledge reserves and powerful reasoning and generalization capabilities of LLMs to solve multimodal problems. Some impressive abilities have already emerged, such as writing text based on an image or generating code from a picture.

However, these examples alone can hardly reflect the full performance of MLLMs, and a comprehensive evaluation of MLLMs is still lacking.

To this end, Tencent Youtu Lab and Xiamen University have, for the first time, conducted a comprehensive quantitative evaluation of 12 existing open-source MLLMs on the newly built evaluation benchmark MME and published 16 lists, including two overall lists for perception and cognition and 14 sub-lists:


Paper link: https://arxiv.org/pdf/2306.13394.pdf

Project link: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation

Existing quantitative evaluation methods for MLLMs fall mainly into three categories, each with limitations that make it difficult to fully reflect model performance.

The first type of method evaluates models on traditional public datasets, such as Image Caption and Visual Question Answering (VQA) datasets.

However, on the one hand, these traditional datasets may fail to reflect the new capabilities of MLLMs; on the other hand, since training sets in the large-model era are no longer unified, it is difficult to guarantee that these evaluation datasets have not already been seen during the training of other MLLMs.

The second type collects new data for open evaluation, but this data is either not publicly available [1] or too small in quantity (only 50 images) [2].

The third type focuses on a specific aspect of MLLMs, such as object hallucination [3] or adversarial robustness [4], and therefore cannot provide a comprehensive evaluation.

A comprehensive evaluation benchmark is urgently needed to keep pace with the rapid development of MLLMs. The researchers believe that a universal comprehensive evaluation benchmark should have the following characteristics:

(1) It should cover as broad a scope as possible, including both perceptual and cognitive abilities. The former refers to recognizing objects, including their existence, quantity, position, and color; the latter refers to integrating perceptual information with the knowledge in the LLM to perform more complex reasoning. The former is the foundation of the latter.

(2) Data or annotations should avoid using existing public data sets as much as possible to reduce the risk of data leakage.

(3) Instructions should be as concise as possible and consistent with human cognitive habits. Different instruction designs may greatly affect model output, so all models should be evaluated under unified, concise instructions to ensure fairness. A good MLLM should be able to generalize to such concise instructions rather than relying on prompt engineering.

(4) The output of the MLLM under these concise instructions should be intuitive and easy to quantify. The open-ended answers of MLLMs pose great challenges to quantitative statistics. Existing methods tend to use GPT or manual scoring, but these may suffer from inaccuracy and subjectivity.


Figure 1. Example from the MME evaluation benchmark. Each image corresponds to two questions, whose answers are Yes [Y] and No [N] respectively. The question plus "Please answer yes or no" together form the instruction.

Based on the above considerations, a new MLLM evaluation benchmark, MME, was constructed, which possesses all four of the above characteristics:

1. MME assesses perceptual and cognitive abilities simultaneously. In addition to OCR, the perception tasks include coarse-grained and fine-grained object recognition: the former covers the existence, quantity, position, and color of objects; the latter covers movie posters, celebrities, scenes, landmarks, and artwork. The cognition tasks include commonsense reasoning, numerical calculation, text translation, and code reasoning. There are 14 subtasks in total, as shown in Figure 1.

2. All instruction-answer pairs in MME are constructed manually. For the few public datasets that are used, only their images are taken, without relying on their original annotations. The researchers also collected data through manual photography and image generation whenever possible.

3. MME instructions are designed to be as concise as possible to avoid the influence of prompt engineering on model output. The researchers reiterate that a good MLLM should generalize to such concise and commonly used instructions, which is fair to all models. The instructions for each subtask are shown in Figure 1.

4. Thanks to the instruction design "Please answer yes or no", quantitative statistics can be performed easily based on the "Yes" or "No" output by the model, which ensures both accuracy and objectivity (a minimal scoring sketch follows this list). It is worth noting that the researchers also tried designing multiple-choice instructions, but found that current MLLMs still struggle to follow such more complex instructions.
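To make this concrete, below is a minimal Python sketch of how such instruction construction and yes/no judging could be implemented. It is not the authors' released evaluation code: the fixed suffix and the rule that an answer must begin with "Yes" or "No" follow the article's description, while the function names, example question, and overall structure are assumptions made for illustration.

```python
def build_instruction(question: str) -> str:
    """Form the full instruction by appending the fixed suffix described above."""
    return f"{question} Please answer yes or no."


def judge(model_output: str, ground_truth: str) -> bool:
    """Count an answer as correct only if it begins with 'yes' or 'no'
    and that prefix matches the ground-truth label."""
    text = model_output.strip().lower()
    truth = ground_truth.strip().lower()
    if text.startswith("yes"):
        return truth == "yes"
    if text.startswith("no"):
        return truth == "no"
    # Free-form answers that ignore the instruction are marked incorrect.
    return False


# Hypothetical example, not taken from the benchmark itself.
print(build_instruction("Is there a dog in this image?"))
print(judge("Yes, there is a dog.", "Yes"))   # True
print(judge("The image shows a cat.", "No"))  # False: no yes/no prefix
```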

The researchers evaluated a total of 12 advanced MLLM models: BLIP-2 [5], LLaVA [6], MiniGPT-4 [7], mPLUG-Owl [2], LLaMA-Adapter-v2 [8], Otter [9], Multimodal-GPT [10], InstructBLIP [11], VisualGLM-6B [12], PandaGPT [13], ImageBind-LLM [14], and LaVIN [15].

There are three statistical metrics: Accuracy, Accuracy+, and Score. For each subtask, Accuracy is computed over questions, Accuracy+ is computed over images (both questions for an image must be answered correctly), and Score is the sum of Accuracy and Accuracy+.

The perception total score is the sum of the scores of the 10 perception subtasks, and the cognition total score is the sum of the scores of the 4 cognition subtasks. See the project link for details.
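As a rough illustration of the scoring rules just described, here is a minimal Python sketch. The formulas (Accuracy over questions, Accuracy+ over images, Score as their sum, with a 200-point maximum per subtask) follow the text above; the data layout and function names are assumed for illustration and are not the authors' code.

```python
from typing import Dict, List, Tuple

# Per image of a subtask: whether each of its two questions was answered correctly.
ImageResult = Tuple[bool, bool]


def subtask_score(results: List[ImageResult]) -> Dict[str, float]:
    """Accuracy is computed over questions; Accuracy+ over images
    (an image counts only if both of its questions are correct);
    Score is their sum, so the maximum per subtask is 200."""
    n_questions = 2 * len(results)
    n_correct_questions = sum(q1 + q2 for q1, q2 in results)
    n_fully_correct_images = sum(1 for q1, q2 in results if q1 and q2)
    accuracy = 100.0 * n_correct_questions / n_questions
    accuracy_plus = 100.0 * n_fully_correct_images / len(results)
    return {"accuracy": accuracy,
            "accuracy_plus": accuracy_plus,
            "score": accuracy + accuracy_plus}


def total_score(per_subtask: List[Dict[str, float]]) -> float:
    """The perception (or cognition) total is simply the sum of its subtask scores."""
    return sum(s["score"] for s in per_subtask)
```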

The test comparison of 12 models on 14 sub-tasks is shown in Figure 2:


Figure 2. Comparison of the 12 models on the 14 subtasks. The full score for each subtask is 200 points.

A total of 16 lists have been released, including the overall lists for the perception and cognition categories and the lists for the 14 subtasks. The two overall lists are shown in Figures 3 and 4. Notably, BLIP-2 and InstructBLIP remain in the top three in both lists.


Figure 3. Overall list of perception tasks


Figure 4. Overall list of cognitive tasks


Figure 5. All lists

In addition, the researchers summarized some common problems exposed by MLLMs in the experiments, as shown in Figure 6, hoping to provide guidance for subsequent model optimization.


Figure 6. Common problems exposed by MLLMs. [Y]/[N] means the ground-truth answer is Yes/No. [R] is the answer generated by the MLLM.

The first problem is not following instructions.

Although a very concise instruction design is used, some MLLMs still answer questions freely rather than following the instructions.

As shown in the first row of Figure 6, the instruction states "Please answer yes or no", yet the MLLM gives only a declarative answer. If "Yes" or "No" does not appear at the beginning of the answer, the answer is judged as incorrect. A good MLLM, especially after instruction fine-tuning, should be able to generalize to such simple instructions.

The second problem is the lack of perception.

As shown in the second row of Figure 6, the MLLM misidentifies the number of bananas in the first image and misreads the numbers in the second image, resulting in incorrect answers. The researchers also noticed that perception performance is easily affected by changes in the instruction: two instructions for the same image that differ by only one word led to completely different perception results.

The third problem is the lack of reasoning ability.

As shown in the third row of Figure 6, the red text shows that the MLLM already knows the first image is not an office space, yet it still gives the incorrect answer "Yes".

Similarly, for the second image, the MLLM has computed the correct arithmetic result but ultimately still gives the wrong answer. Adding a chain-of-thought prompt such as "Let's think step by step" may yield better results; more in-depth research in this area is expected.

The fourth problem is object hallucination following the instruction. As shown in the fourth row of Figure 6, when the instruction mentions an object that does not exist in the image, the MLLM imagines that the object exists and ultimately answers "Yes".

Always answering "Yes" results in an Accuracy close to 50% and an Accuracy+ close to 0. This demonstrates the importance of suppressing object hallucination and also calls for further thinking about the reliability of the answers generated by MLLMs.
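The arithmetic behind this observation can be checked with a toy example (assumed data, not taken from the benchmark): since each image pairs one question whose answer is Yes with one whose answer is No, a model that always replies "Yes" gets exactly half of the questions right but never both questions of an image.

```python
# Toy check: each image pairs a "Yes" question with a "No" question (as in Figure 1).
images = [("Yes", "No")] * 100        # 100 hypothetical images
model_answer = "Yes"                  # the always-"Yes" responder

per_image = [(model_answer == gt1, model_answer == gt2) for gt1, gt2 in images]
accuracy = 100 * sum(c1 + c2 for c1, c2 in per_image) / (2 * len(per_image))
accuracy_plus = 100 * sum(1 for c1, c2 in per_image if c1 and c2) / len(per_image)
print(accuracy, accuracy_plus)        # 50.0 0.0
```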

