


Multimodal Large Language Models (MLLMs) rely on the rich knowledge reserves and the powerful reasoning and generalization capabilities of LLMs to solve multimodal problems. Some impressive abilities have already emerged, such as writing stories from images and writing code from images.
However, these examples alone can hardly reflect the full performance of MLLMs, and a comprehensive evaluation of MLLMs is still lacking.
To this end, Tencent Youtu Lab and Xiamen University conducted the first comprehensive quantitative evaluation of 12 existing open-source MLLMs on a newly built evaluation benchmark, MME, and published 16 leaderboards, including two overall leaderboards for perception and cognition and 14 sub-leaderboards:
Paper link: https://arxiv.org/pdf/2306.13394.pdf
Project link: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation
Existing quantitative evaluation methods for MLLMs fall mainly into three categories, but each has limitations that make it difficult to fully reflect MLLM performance.
The first category evaluates on traditional public datasets, such as Image Caption and Visual Question Answering (VQA) datasets.
However, these traditional datasets may fail to reflect the new capabilities of MLLMs; moreover, since training sets in the era of large models are no longer unified, it is hard to guarantee that these evaluation datasets have not already been seen during the training of other MLLMs.
The second approach collects new data for open evaluation, but this data is either not publicly available [1] or too small in quantity (only 50 images) [2].
The third approach focuses on one specific aspect of MLLMs, such as object hallucination [3] or adversarial robustness [4], and thus cannot serve as a comprehensive evaluation.
A comprehensive evaluation benchmark that matches the rapid development of MLLMs is urgently needed. The researchers argue that a universal comprehensive evaluation benchmark should have the following characteristics:
(1) It should cover as broad a scope as possible, including both perceptual and cognitive abilities. The former refers to recognizing objects, including their existence, quantity, position and color; the latter refers to integrating perceptual information with the knowledge in the LLM to perform more complex reasoning. The former is the foundation of the latter.
(2) Data or annotations should avoid relying on existing public datasets as much as possible, to reduce the risk of data leakage.
(3) Instructions should be as concise as possible and consistent with human cognitive habits. Although different instruction designs can greatly affect model output, all models are evaluated under the same concise instructions to ensure fairness. A good MLLM should be able to generalize to such concise instructions rather than depend on prompt engineering.
(4) The output of an MLLM under these concise instructions should be intuitive and easy to quantify. Open-ended MLLM answers pose great challenges for quantitative statistics; existing methods tend to use GPT-based or manual scoring, which can suffer from inaccuracy and subjectivity.
Figure 1. Examples from the MME evaluation benchmark. Each image corresponds to two questions, whose answers are Yes [Y] and No [N] respectively. The question plus "Please answer yes or no" together form an instruction.
For these reasons, the researchers constructed a new MLLM evaluation benchmark, MME, which possesses all four of the above characteristics:
1. MME assesses perceptual and cognitive abilities simultaneously. In addition to OCR, perception includes coarse-grained and fine-grained object recognition: the former covers the existence, quantity, position and color of objects, while the latter covers movie posters, celebrities, scenes, landmarks and artwork. Cognition includes commonsense reasoning, numerical calculation, text translation and code reasoning. The total number of subtasks reaches 14, as shown in Figure 1.
2. All instruction-answer pairs in MME are constructed manually. For the few public datasets that are used, only their images are taken, without relying on their original annotations. The researchers also collected data through manual photography and image generation wherever possible.
3. MME instructions are designed to be as concise as possible to avoid the impact of prompt engineering on model output. The researchers reiterate that a good MLLM should generalize to such concise and frequently used instructions, which is fair to all models. The instruction for each subtask is shown in Figure 1.
4. Thanks to the instruction suffix "Please answer yes or no", quantitative statistics can be performed easily based on whether the model outputs "Yes" or "No", which ensures both accuracy and objectivity. Notably, the researchers also tried designing multiple-choice instructions, but found that current MLLMs still struggle to follow such more complex instructions.
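To make this design concrete, the following minimal Python sketch shows how the two instruction-answer pairs for one image could be assembled. The field names, file path and questions are hypothetical; the official MME data format may differ.

```python
# Sketch of MME-style instruction construction. Field names and paths are
# hypothetical; the official benchmark data may be organized differently.
SUFFIX = "Please answer yes or no."

def build_pairs(image_path: str, positive_q: str, negative_q: str):
    """Return the two (image, instruction, ground_truth) tuples for one image:
    one question whose answer is Yes and one whose answer is No."""
    return [
        (image_path, f"{positive_q} {SUFFIX}", "Yes"),
        (image_path, f"{negative_q} {SUFFIX}", "No"),
    ]

pairs = build_pairs(
    "images/000001.jpg",                       # hypothetical path
    "Is there a banana in this image?",        # ground truth: Yes
    "Are there three bananas in this image?",  # ground truth: No
)
```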
The researchers evaluated a total of 12 advanced MLLM models, including BLIP-2 [5], LLaVA [6], MiniGPT-4 [7], mPLUG-Owl [2], LLaMA-Adapter-v2 [8], Otter [9], Multimodal-GPT [10], InstructBLIP [11], VisualGLM-6B [12], PandaGPT [13], ImageBind-LLM [14] and LaVIN [15].
There are three statistical metrics: Accuracy, Accuracy+ and Score. For each task, Accuracy is computed per question, Accuracy+ is computed per image (both questions for an image must be answered correctly), and Score is the sum of Accuracy and Accuracy+.
The total perception score is the sum of the scores of the 10 perception subtasks, and the total cognition score is the sum of the scores of the 4 cognition subtasks. See the project link for details.
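The following Python sketch illustrates how these statistics can be computed from per-question results. It is our own illustration, not the official MME evaluation script:

```python
from collections import defaultdict

def subtask_score(records):
    """records: list of (image_id, is_correct) pairs, two questions per image.
    Returns (accuracy, accuracy_plus, score)."""
    per_image = defaultdict(list)
    for image_id, is_correct in records:
        per_image[image_id].append(is_correct)

    # Accuracy: fraction of individual questions answered correctly.
    answers = [c for v in per_image.values() for c in v]
    accuracy = 100.0 * sum(answers) / len(answers)
    # Accuracy+: an image counts only if BOTH of its questions are correct.
    accuracy_plus = 100.0 * sum(all(v) for v in per_image.values()) / len(per_image)
    return accuracy, accuracy_plus, accuracy + accuracy_plus
```

For example, a model that always answers "Yes" on two images (each pairing one Yes question with one No question) yields `subtask_score([("img1", True), ("img1", False), ("img2", True), ("img2", False)])` = (50.0, 0.0, 50.0), which matches the hallucination behavior discussed later. Since Accuracy and Accuracy+ are each at most 100, the full score per subtask is 200; the perception total is therefore at most 2000 and the cognition total at most 800.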
The comparison of the 12 models across the 14 subtasks is shown in Figure 2:
Figure 2. Comparison of the 12 models on the 14 subtasks. The full score for each subtask is 200 points.
A total of 16 leaderboards were released, including the overall leaderboards for the perception and cognition categories and the leaderboards for the 14 subtasks. The two overall leaderboards are shown in Figures 3 and 4 respectively. Notably, BLIP-2 and InstructBLIP rank in the top three on both.
Figure 3. Overall leaderboard for perception tasks
Figure 4. Overall leaderboard for cognition tasks
Figure 5. All leaderboards
In addition, the researchers summarized some common problems exposed by MLLMs in the experiments, as shown in Figure 6, hoping to provide guidance for subsequent model optimization.
Figure 6. Common problems exposed by MLLMs. [Y]/[N] means the ground-truth answer is Yes/No. [R] is the answer generated by the MLLM.
The first problem is not following instructions.
Although a very concise instruction design is used, some MLLMs still answer questions freely instead of following the instructions.
As shown in the first row of Figure 6, the instruction states "Please answer yes or no", but the MLLM gives only a declarative answer. If "Yes" or "No" does not appear at the beginning of the answer, the answer is judged incorrect. A good MLLM, especially one fine-tuned on instructions, should be able to generalize to such simple instructions.
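As an illustration of this judging rule, here is a minimal Python sketch; the official evaluation script may implement the check differently:

```python
def judge(model_output: str, ground_truth: str) -> bool:
    """Mark an answer correct only if it begins with "Yes" or "No"
    and that leading token matches the ground truth."""
    head = model_output.strip().lower()
    if head.startswith("yes"):
        predicted = "Yes"
    elif head.startswith("no"):
        predicted = "No"
    else:
        # Free-form answers that ignore the instruction count as wrong.
        return False
    return predicted == ground_truth

assert judge("Yes, there are two bananas.", "Yes")
assert not judge("The image shows a kitchen.", "No")
```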
The second problem is the lack of perception.
As shown in the second row of Figure 6, the MLLM misidentifies the number of bananas in the first image and misreads the characters in the second image, resulting in wrong answers. The researchers also noticed that perceptual performance is easily affected by changes in the instruction: two instructions for the same image that differ by only one word can lead to completely different perceptual results.
The third problem is the lack of reasoning ability.
As shown in the third row of Figure 6, the red text indicates that the MLLM already knows the first image does not show an office space, yet it still gives the incorrect answer "Yes".
Similarly, for the second image, the MLLM computes the correct arithmetic result but ultimately gives the wrong answer. Adding a chain-of-thought prompt, such as "Let's think step by step", may bring better results; more in-depth research in this area is expected.
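For example, a chain-of-thought variant of an MME-style instruction might look like the sketch below. This is a hypothetical modification for illustration; the benchmark itself uses only the concise form.

```python
# Hypothetical chain-of-thought variant of an MME instruction; MME itself
# uses only the concise "Please answer yes or no." suffix.
question = "Is the arithmetic in the image correct?"
concise_instruction = f"{question} Please answer yes or no."
cot_instruction = f"{question} Let's think step by step, then answer yes or no."
```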
The fourth problem is object hallucination following instructions. As shown in the fourth row of Figure 6, when the instruction mentions an object that does not exist in the image, the MLLM imagines that the object exists and ultimately answers "Yes".
Always answering "Yes" in this way results in an Accuracy close to 50% and an Accuracy+ close to 0, because each image pairs one Yes question with one No question, so an always-Yes model gets exactly one of the two right and almost never both. This demonstrates the importance of suppressing object hallucination and also calls for further reflection on the reliability of MLLM-generated answers.