
BLIP-2 and InstructBLIP are firmly in the top three! Twelve major models, sixteen lists, comprehensive evaluation of "multimodal large language model"


Multimodal Large Language Models (MLLMs) rely on the rich knowledge reserves and powerful reasoning and generalization capabilities of LLMs to solve multimodal problems. Some impressive abilities have already emerged, such as writing text based on an image or generating code from a picture.

However, these examples alone can hardly reflect the full performance of MLLMs, and a comprehensive evaluation of MLLMs is still lacking.

To this end, Tencent Youtu Lab and Xiamen University have, for the first time, conducted a comprehensive quantitative evaluation of 12 existing open-source MLLMs on the newly built evaluation benchmark MME and published 16 lists, including two overall lists for perception and cognition and 14 sub-lists:


Paper link: https://arxiv.org/pdf/2306.13394.pdf

Project link: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation

Existing quantitative evaluation methods for MLLMs fall mainly into three categories, each with limitations that make it difficult to fully reflect model performance.

The first type of method evaluates models on traditional public datasets, such as Image Caption and Visual Question Answering (VQA) datasets.

However, on the one hand, these traditional datasets may fail to reflect the new capabilities of MLLMs; on the other hand, since training sets in the large-model era are no longer unified, it is difficult to guarantee that these evaluation datasets have not already been seen during the training of other MLLMs.

The second type collects new data for open evaluation, but this data is either not publicly available [1] or too small in quantity (only 50 images) [2].

The third type focuses on a specific aspect of MLLMs, such as object hallucination [3] or adversarial robustness [4], and therefore cannot provide a comprehensive evaluation.

A comprehensive evaluation benchmark is urgently needed to keep pace with the rapid development of MLLMs. The researchers believe that a universal comprehensive evaluation benchmark should have the following characteristics:

(1) It should cover as broad a scope as possible, including both perceptual and cognitive abilities. The former refers to recognizing objects, including their existence, quantity, position, and color; the latter refers to integrating perceptual information with the knowledge in the LLM to perform more complex reasoning. The former is the foundation of the latter.

(2) Data or annotations should avoid using existing public data sets as much as possible to reduce the risk of data leakage.

(3) Instructions should be as concise as possible and consistent with human cognitive habits. Different instruction designs may greatly affect model output, so all models should be evaluated under unified, concise instructions to ensure fairness. A good MLLM should be able to generalize to such concise instructions rather than relying on prompt engineering.

(4) The output of the MLLM under these concise instructions should be intuitive and easy to quantify. The open-ended answers of MLLMs pose great challenges to quantitative statistics. Existing methods tend to use GPT or manual scoring, but these may suffer from inaccuracy and subjectivity.


Figure 1. Example from the MME evaluation benchmark. Each image corresponds to two questions, whose answers are Yes [Y] and No [N] respectively. The question plus "Please answer yes or no" together form the instruction.

Based on the above considerations, a new MLLM evaluation benchmark, MME, was constructed, which possesses all four of the above characteristics:

1. MME assesses perceptual and cognitive abilities simultaneously. In addition to OCR, the perception tasks include coarse-grained and fine-grained object recognition: the former covers the existence, quantity, position, and color of objects; the latter covers movie posters, celebrities, scenes, landmarks, and artwork. The cognition tasks include commonsense reasoning, numerical calculation, text translation, and code reasoning. There are 14 subtasks in total, as shown in Figure 1.

2. All instruction-answer pairs in MME are constructed manually. For the few public datasets that are used, only their images are taken, without relying on their original annotations. The researchers also collected data through manual photography and image generation whenever possible.

3. MME instructions are designed to be as concise as possible to avoid the influence of prompt engineering on model output. The researchers reiterate that a good MLLM should generalize to such concise and commonly used instructions, which is fair to all models. The instructions for each subtask are shown in Figure 1.

4. Thanks to the instruction design "Please answer yes or no", quantitative statistics can be performed easily based on the "Yes" or "No" output by the model, which ensures both accuracy and objectivity (a minimal scoring sketch follows this list). It is worth noting that the researchers also tried designing multiple-choice instructions, but found that current MLLMs still struggle to follow such more complex instructions.
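To make this concrete, below is a minimal Python sketch of how such instruction construction and yes/no judging could be implemented. It is not the authors' released evaluation code: the fixed suffix and the rule that an answer must begin with "Yes" or "No" follow the article's description, while the function names, example question, and overall structure are assumptions made for illustration.

```python
def build_instruction(question: str) -> str:
    """Form the full instruction by appending the fixed suffix described above."""
    return f"{question} Please answer yes or no."


def judge(model_output: str, ground_truth: str) -> bool:
    """Count an answer as correct only if it begins with 'yes' or 'no'
    and that prefix matches the ground-truth label."""
    text = model_output.strip().lower()
    truth = ground_truth.strip().lower()
    if text.startswith("yes"):
        return truth == "yes"
    if text.startswith("no"):
        return truth == "no"
    # Free-form answers that ignore the instruction are marked incorrect.
    return False


# Hypothetical example, not taken from the benchmark itself.
print(build_instruction("Is there a dog in this image?"))
print(judge("Yes, there is a dog.", "Yes"))   # True
print(judge("The image shows a cat.", "No"))  # False: no yes/no prefix
```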

The researchers evaluated a total of 12 advanced MLLM models: BLIP-2 [5], LLaVA [6], MiniGPT-4 [7], mPLUG-Owl [2], LLaMA-Adapter-v2 [8], Otter [9], Multimodal-GPT [10], InstructBLIP [11], VisualGLM-6B [12], PandaGPT [13], ImageBind-LLM [14], and LaVIN [15].

There are three statistical metrics: Accuracy, Accuracy+, and Score. For each subtask, Accuracy is computed over questions, Accuracy+ is computed over images (both questions for an image must be answered correctly), and Score is the sum of Accuracy and Accuracy+.

The perception total score is the sum of the scores of the 10 perception subtasks, and the cognition total score is the sum of the scores of the 4 cognition subtasks. See the project link for details.
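As a rough illustration of the scoring rules just described, here is a minimal Python sketch. The formulas (Accuracy over questions, Accuracy+ over images, Score as their sum, with a 200-point maximum per subtask) follow the text above; the data layout and function names are assumed for illustration and are not the authors' code.

```python
from typing import Dict, List, Tuple

# Per image of a subtask: whether each of its two questions was answered correctly.
ImageResult = Tuple[bool, bool]


def subtask_score(results: List[ImageResult]) -> Dict[str, float]:
    """Accuracy is computed over questions; Accuracy+ over images
    (an image counts only if both of its questions are correct);
    Score is their sum, so the maximum per subtask is 200."""
    n_questions = 2 * len(results)
    n_correct_questions = sum(q1 + q2 for q1, q2 in results)
    n_fully_correct_images = sum(1 for q1, q2 in results if q1 and q2)
    accuracy = 100.0 * n_correct_questions / n_questions
    accuracy_plus = 100.0 * n_fully_correct_images / len(results)
    return {"accuracy": accuracy,
            "accuracy_plus": accuracy_plus,
            "score": accuracy + accuracy_plus}


def total_score(per_subtask: List[Dict[str, float]]) -> float:
    """The perception (or cognition) total is simply the sum of its subtask scores."""
    return sum(s["score"] for s in per_subtask)
```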

The test comparison of 12 models on 14 sub-tasks is shown in Figure 2:


Figure 2. Comparison of the 12 models on the 14 subtasks. The full score for each subtask is 200 points.

A total of 16 lists have been released, including the overall lists for the perception and cognition categories and the lists for the 14 subtasks. The two overall lists are shown in Figures 3 and 4. Notably, BLIP-2 and InstructBLIP remain in the top three in both lists.


Figure 3. Overall list of perception tasks


Figure 4. Overall list of cognitive tasks


Figure 5. All lists

In addition, the researchers summarized some common problems exposed by MLLMs in the experiments, as shown in Figure 6, hoping to provide guidance for subsequent model optimization.


Figure 6. Common problems exposed by MLLMs. [Y]/[N] means the ground-truth answer is Yes/No. [R] is the answer generated by the MLLM.

The first problem is not following instructions.

Although a very concise instruction design is used, some MLLMs still answer questions freely rather than following the instructions.

As shown in the first row of Figure 6, the instruction states "Please answer yes or no", yet the MLLM gives only a declarative answer. If "Yes" or "No" does not appear at the beginning of the answer, the answer is judged as incorrect. A good MLLM, especially after instruction fine-tuning, should be able to generalize to such simple instructions.

The second problem is the lack of perception.

As shown in the second row of Figure 6, the MLLM misidentifies the number of bananas in the first image and misreads the numbers in the second image, resulting in incorrect answers. The researchers also noticed that perception performance is easily affected by changes in the instruction: two instructions for the same image that differ by only one word led to completely different perception results.

The third problem is the lack of reasoning ability.

As shown in the third row of Figure 6, the red text shows that the MLLM already knows the first image is not an office space, yet it still gives the incorrect answer "Yes".

Similarly, for the second image, the MLLM has computed the correct arithmetic result but ultimately still gives the wrong answer. Adding a chain-of-thought prompt such as "Let's think step by step" may yield better results; more in-depth research in this area is expected.

The fourth problem is object hallucination following the instruction. As shown in the fourth row of Figure 6, when the instruction mentions an object that does not exist in the image, the MLLM imagines that the object exists and ultimately answers "Yes".

Always answering "Yes" results in an Accuracy close to 50% and an Accuracy+ close to 0. This demonstrates the importance of suppressing object hallucination and also calls for further thinking about the reliability of the answers generated by MLLMs.
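The arithmetic behind this observation can be checked with a toy example (assumed data, not taken from the benchmark): since each image pairs one question whose answer is Yes with one whose answer is No, a model that always replies "Yes" gets exactly half of the questions right but never both questions of an image.

```python
# Toy check: each image pairs a "Yes" question with a "No" question (as in Figure 1).
images = [("Yes", "No")] * 100        # 100 hypothetical images
model_answer = "Yes"                  # the always-"Yes" responder

per_image = [(model_answer == gt1, model_answer == gt2) for gt1, gt2 in images]
accuracy = 100 * sum(c1 + c2 for c1, c2 in per_image) / (2 * len(per_image))
accuracy_plus = 100 * sum(1 for c1, c2 in per_image if c1 and c2) / len(per_image)
print(accuracy, accuracy_plus)        # 50.0 0.0
```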

