AIxiv is the column through which this site publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions, covering top laboratories at major universities and companies around the world, and has effectively promoted academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
The author of this article, Zhang Tianyu, is a doctoral student at Mila (the Quebec AI Institute) in Canada, supervised by Turing Award winner Professor Yoshua Bengio. His doctoral work focuses on multimodal learning, GFlowNets, multi-agent reinforcement learning, and applications of AI to climate change. He has published papers at top machine-learning conferences such as ICML, ICLR, and ICASSP; his representative work is "Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation" (CLAP).
To reach the ultimate goal of artificial general intelligence (AGI), a model must first be able to complete tasks that humans find easy. To that end, one key guideline for large-model development is making machines think and reason like humans; technologies such as the attention mechanism and Chain-of-Thought were inspired by this. However, many people may not realize that cognitive tasks which are very simple for humans are often accompanied by very complex reasoning processes. As an example, try to fill in the occluded text in the image below:
(Correct answer: Machine learning researchers from around the world are excited about the new GPU. Its cutting-edge features make large-scale experiments more efficient and cheaper, even though it is as big as a stove.) For most native Chinese speakers this task is not difficult, and you can likely get the answer within a few seconds. Yet inferring the complete text from its exposed portion requires a very complex reasoning process: contemporary neuroscience research shows that recovering partially occluded objects involves substantial engagement of the prefrontal cortex, the region responsible for high-level decision-making. Current vision-language models (VLMs) can already perform object recognition and text recognition very accurately. But when the occluded part is text, when the model's optical character recognition (OCR) fails, and when the only key information is a few pixels of the occluded text, can a model simulate the human reasoning process and complete this task? To answer this, the team of Turing Award winner Yoshua Bengio proposed a new visual question-answering task: Visual Caption Restoration (VCR). Let us use this task to probe the reasoning capabilities of vision-language models: how far are current VLMs from human-level cognition?
- Paper title: VCR: Visual Caption Restoration
- Paper link: arxiv.org/abs/2406.06462
- Code repository: github.com/tianyu-z/VCR (includes the evaluation code for models and the data-generation code for pre-training)
- Hugging Face link: huggingface.co/vcr-org
To support the VCR task, the researchers built a pipeline that generates VCR synthetic images from image-text pairs. In this pipeline, the visibility of the text in the image can be adjusted by controlling the size of the white rectangles that cover the text, thereby controlling the difficulty of the task. Using this data-generation pipeline, the researchers built the VCR-wiki dataset from Wikipedia's main-image and introduction pairs. Both languages (English and Chinese) have two difficulty levels, "Easy" and "Hard" (a minimal generation sketch follows the dataset description below). Among them:
- "Easy" difficulty VCR task can make the OCR model invalid ;
- "Difficulty" VCR task only retain 1-2 top and bottom for each occluded text The height of pixels, but still allows users of the corresponding language to complete the task.
For each language and difficulty, the test set and the validation set each contain 5,000 samples, and the remaining samples form the training set.
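The official transform code for this pipeline is in the GitHub repository listed above. As the promised sketch, the following is only a rough illustration of the idea (not the authors' implementation), using Pillow to stack a caption under an image and cover the middle band of selected words with white rectangles whose height controls the difficulty:

```python
# Minimal illustrative sketch of the VCR image-generation idea (not the official code).
# Assumes Pillow is installed; the masking policy and sizes are arbitrary choices.
from PIL import Image, ImageDraw, ImageFont

def make_vcr_image(image_path, caption, mask_ratio=0.6, font_size=20):
    """Stack the caption under the image, then cover the middle band of every
    other word with a white rectangle. A larger mask_ratio hides more of each
    word; near 1.0 only thin strips at the top and bottom remain visible."""
    img = Image.open(image_path).convert("RGB")
    try:
        font = ImageFont.truetype("DejaVuSans.ttf", font_size)  # any local TTF works
    except OSError:
        font = ImageFont.load_default()

    text_h = font_size + 10
    canvas = Image.new("RGB", (img.width, img.height + text_h), "white")
    canvas.paste(img, (0, 0))
    draw = ImageDraw.Draw(canvas)

    x, y = 5, img.height + 5
    for i, word in enumerate(caption.split()):
        w = draw.textlength(word + " ", font=font)
        draw.text((x, y), word, fill="black", font=font)
        if i % 2 == 1:  # example policy: occlude every other word
            band = int(font_size * mask_ratio)
            top = y + (font_size - band) // 2
            draw.rectangle([x, top, x + w, top + band], fill="white")
        x += w
    return canvas

# Example: make_vcr_image("photo.jpg", "researchers are excited about the new GPU").save("vcr.png")
```

Pushing `mask_ratio` toward 1.0 approaches the "Hard" setting, where only thin strips at the top and bottom of each word remain visible.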
The example at the beginning of the article poses only a small challenge to humans; it does not show the ceiling of human performance on this task, nor the thinking and skills humans use when solving it. A sample of the "Hard" difficulty VCR task is shown below. Readers may want to concentrate and try to fill in the occluded text themselves.
(Correct answer: The Almagest is a treatise on mathematics and astronomy compiled by Ptolemy in ancient Greece around 140 AD, which proposed the complex paths of motion of the stars and planets. Until the Middle Ages and the early Renaissance, the geocentric model proposed in the book was adopted by the Islamic world and Europe...)
How do humans restore partially occluded text?
There is a concept in education and cognitive science called meta-cognition: monitoring one's own thinking processes. When designing AI, we humans, as teachers, can use observations of our own thinking processes as a reference to help the model, as the student, learn more efficiently. Thinking about "how humans complete the VCR task" can therefore be instructive for model design.
The figure below shows one of the author's solution strategies for the VCR task, as a reference. It may look like many steps, but in essence it is just repeatedly gathering information from different regions and then cross-checking it to increase confidence in the answer.
When first looking at the image, one only has a vague guess; as one keeps reading the image and obtaining new information, the guess is gradually verified. Even after reading, while filling in the blanks, one keeps comparing different pieces of information to confirm the answer. When a hypothesis is inconsistent with other information, it is discarded and a new hypothesis is tried.
How good are humans at the VCR task? The figure below shows the accuracy of native speakers or fluent users of English/Chinese under the easy/hard settings of each language:
If errors involving dates, place names, and personal names are counted, the average human accuracy on Chinese easy difficulty is about 98.58%, and on Chinese hard difficulty about 91.84%. Excluding errors caused by dates, place names, and personal names, humans are close to a perfect score on Chinese easy difficulty, and accuracy on Chinese hard difficulty reaches 96.63%. In short, the VCR task is very easy for humans. The author then tested an "all-star lineup": Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4o, GPT-4 Turbo, Qwen-VL-Max, Reka Core, and some of the best-performing open-source models available today. The following figure shows each model's performance on the easy difficulty of VCR-wiki Chinese:
The metric in the red box is the accuracy with which the model restores the occluded text when both the image (VI) and the text embedded in the image (TEI) are provided as context. The metric in the blue box is the accuracy when only the embedded text (TEI) is provided as context, without the image (VI). (A simplified sketch of how such an accuracy metric can be computed follows the observations below.)
- The vast majority of models currently cannot do this task;
- The vast majority of models do not make good use of the image information: their accuracy does not improve when the image (VI) is added.
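As promised above, here is a simplified sketch of how such a restoration-accuracy metric can be computed. It is illustrative only: the official evaluation script is in the GitHub repository, and the paper's exact matching and normalization rules may differ.

```python
# Simplified sketch of exact-match accuracy over restored spans
# (illustrative only; the official evaluation code is in the VCR repository).
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_match_accuracy(predictions, references):
    """predictions/references: lists of restored spans, one string per occluded blank."""
    assert len(predictions) == len(references)
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references) if references else 0.0

# Example usage:
# exact_match_accuracy(["the new GPU"], ["the new GPU"])  # -> 1.0
```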
On the Chinese Hard difficulty, the models run into even greater trouble. The best performer is GPT-4o, but its accuracy is only 2.2%. Apart from CogVLM2-Chinese and Qwen-VL-Max, most models have accuracy close to 0%. Evidently, in hard mode the existing models can rarely answer correctly at all, let alone approach human performance.

English VCR evaluation results

The author also tested the best current open-source and closed-source vision-language models on the English VCR-wiki. Before showing the test results, please take a look at two examples of the English VCR-wiki task:
(Correct answer: Since the United States Post Office issued its first stamp in 1847, over 4,000 stamps have been issued and over 800 people featured. Many of these people...)

English Hard difficulty example:
(Correct answer: Lincoln is the luxury vehicle division of American automobile manufacturer Ford. Marketed among the top luxury vehicle brands in the United States, for...)

The English VCR-wiki test results presented in the paper are as follows:
Looking at the results overall, the models perform better on English than on Chinese in both the easy and hard modes. This is inconsistent with the common intuition that "because of the special modular structure of Chinese characters, partially occluded Chinese text should be easier to complete." Perhaps this is because English enjoys a larger advantage over Chinese in data volume and data quality during pre-training.

Among the models tested, GPT-4o performs best among closed-source models, and CogVLM2 performs best among open-source models. One interesting phenomenon is that adding the image clearly helps CogVLM2 (a 20.3% improvement in hard mode), whereas for GPT-4o the result actually drops; a similar phenomenon appears in the Chinese tests. The author attributes this to model architecture; for details, readers are encouraged to consult the CogVLM papers and code.

In addition, closed-source models generally achieve better results than open-source models, which may be attributable to better training strategies or larger parameter counts. Even so, all models still struggle under the "Hard" setting. Open-source models can partially handle the "Easy" setting, but under the Hard setting most of them cannot complete this task that is very simple for humans.

VQA

Evaluating VQA is very challenging because there is no unique standard answer. Traditional VQA methods focus mainly on direct queries about visible elements in the image and do not involve the complex relationship between the text embedded in the image and the overall image context.
In some VQA benchmarks where text makes up a large share of the information in the image, the model's visual module may not even need to be aligned with the language module to do well. The pipeline in such cases is: the image is fed to an OCR vision module, the OCR module outputs the characters found in the image, and that text is fed as context to the language module. The VQA task thus degenerates into a QA task that needs no image information: the vision-language alignment ability that was supposed to be compared across VLMs is ignored, while OCR ability is emphasized.
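As a rough illustration of this degenerate pipeline (not code from the paper), the sketch below reduces the vision side to OCR and lets a language model answer from the extracted text alone; `query_language_model` is a hypothetical stand-in for whatever language-model API is available:

```python
# Illustrative sketch of the degenerate "OCR then QA" pipeline (not from the paper).
# Assumes pytesseract and Pillow are installed; `query_language_model` is a
# hypothetical stand-in for whatever language-model API is available.
from PIL import Image
import pytesseract

def query_language_model(prompt: str) -> str:
    raise NotImplementedError("Placeholder: call your LLM of choice here.")

def ocr_then_qa(image_path: str, question: str) -> str:
    # The vision side is reduced to character extraction; the pixels never reach the LLM.
    extracted_text = pytesseract.image_to_string(Image.open(image_path))
    prompt = f"Context (OCR output):\n{extracted_text}\n\nQuestion: {question}\nAnswer:"
    return query_language_model(prompt)

# Once the embedded text is occluded (as in VCR), image_to_string returns fragments
# or nothing, and this pipeline has no way to recover the missing words.
```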
OCR

An optical character recognition (OCR) task typically takes an image containing complete characters as input and outputs a string representing those characters, without needing to consider the overall context of the image. Models pre-trained on OCR can extract embedded text from input images even when that text is incomplete or blurry. However, as the text becomes more blurred or more heavily occluded, OCR methods become of limited use.

In summary: VQA has no standard answer, so judging the quality of a model's responses remains an open problem; OCR does not need context to be completed, so it cannot test whether a model has truly learned to use contextual information.

Visual Caption Restoration (VCR)
Visual Caption Restoration (VCR) builds a bridge between VQA and OCR. It requires the model to perform precise alignment between visual and textual information, in sharp contrast to OCR's straightforward text-extraction task. In OCR, the main focus is recognizing the visible characters, without understanding their contextual relevance within the image's narrative. By contrast, VCR requires the model to jointly exploit the available partial pixel-level text cues and the visual context to accurately reconstruct the occluded content. This tests not only the model's ability to process embedded text and visual elements, but also its ability to maintain internal consistency, resembling the human cognitive process of understanding and responding through contextual and visual cues.
- Unlike VQA, each question in the VCR task has a unique answer, so evaluation can be done by accuracy, making the evaluation metric more clear-cut.
- By adjusting the masking ratio of the text, the difficulty of the task can be controlled, providing a rich testing environment.

Like OCR, the VCR task can also serve as a training task for VLMs. The authors have open-sourced the transform code, which can generate VCR task images from any given image-text pair.
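For readers who want to try the benchmark directly, the datasets are hosted under the vcr-org organization on Hugging Face. The snippet below is a minimal loading sketch with the `datasets` library; the dataset identifier shown is an assumption for illustration, so check the organization page for the actual configuration names:

```python
# Minimal sketch of loading a VCR-wiki split with the Hugging Face `datasets` library.
# The repository name below is assumed for illustration; see huggingface.co/vcr-org
# for the actual English/Chinese, easy/hard configurations.
from datasets import load_dataset

ds = load_dataset("vcr-org/VCR-wiki-en-easy-test")  # assumed identifier
split_name = list(ds.keys())[0]      # use whichever split the repository exposes
example = ds[split_name][0]
print(example.keys())                # inspect the provided image and caption fields
```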
The Visual Caption Restoration (VCR) task proposed in this paper uses a seemingly simple caption-restoration task to cleverly expose the gap in reasoning ability between existing vision-language models and humans on high-level cognitive tasks. The authors hope this task can inspire more effective VLM training, evaluation, and inference methods, further closing the gap between multimodal models and human cognition.