


The Bengio team proposes a new multi-modal benchmark, targeting the weaknesses of Claude 3.5 and GPT-4o

AIxiv is the column in which this site publishes academic and technical content. Over the past few years it has carried more than 2,000 reports covering top laboratories at major universities and companies around the world, helping to promote academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
The author of this article, Zhang Tianyu, studied at the Mila Artificial Intelligence Institute in Canada under Professor Yoshua Bengio, a Turing Award winner. His doctoral work has focused mainly on multi-modal learning, GFlowNet, multi-agent reinforcement learning, and applications of AI to climate change. He has published papers at top machine learning conferences such as ICML, ICLR, and ICASSP. His representative work is Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation (CLAP).
Paper title: VCR: Visual Caption Restoration
Paper link: arxiv.org/abs/2406.06462
Code repository: github.com/tianyu-z/VCR (includes the code for model evaluation and for generating pre-training data)
Hugging Face link: huggingface.co/vcr-org
The "easy" VCR setting is already enough to defeat OCR models. The "hard" VCR setting retains only 1-2 pixels of height at the top and bottom of each occluded line of text, yet still allows readers of the corresponding language to complete the task.
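The "hard" setting described above can be illustrated with a toy sketch. This is not the authors' transform code: the function name `occlude_text_band`, its parameters, and the nested-list image representation are all assumptions made for illustration. It covers the rows of one text line while leaving `keep_px` rows visible at the top and bottom.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values (0-255)


def occlude_text_band(image: Image, top: int, bottom: int,
                      keep_px: int = 2, fill: int = 255) -> Image:
    """Return a copy of `image` with rows [top, bottom) covered by `fill`,
    except for `keep_px` rows left visible at each edge of the band.

    Hypothetical helper: the real VCR pipeline may differ in every detail.
    """
    if keep_px < 0:
        raise ValueError("keep_px must be non-negative")
    if not 0 <= top <= bottom <= len(image):
        raise ValueError("band must lie inside the image")
    out = [row[:] for row in image]  # copy so the input is left untouched
    # Cover only the rows strictly inside the spared top and bottom strips.
    for y in range(top + keep_px, bottom - keep_px):
        out[y] = [fill] * len(out[y])
    return out


# Toy 8x4 "text line": every pixel is dark (0) before occlusion.
line = [[0] * 4 for _ in range(8)]
hard = occlude_text_band(line, top=0, bottom=8, keep_px=2)
visible_rows = [y for y, row in enumerate(hard) if any(p == 0 for p in row)]
print(visible_rows)
```

With `keep_px=2` on an 8-row band, only the two top and two bottom rows keep their original pixels. A larger `keep_px` leaves more of each glyph visible and so makes the task easier.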
How do humans complete partially obscured text?
[...] and then verifying it repeatedly.
Human evaluation results
The vast majority of models are currently unable to handle this task. Most models also fail to make good use of the image information: their accuracy does not improve when the visual image (VI) is provided.
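Models pre-trained on OCR can extract text embedded in an input image, even when that text is incomplete or blurred. However, [...] Visual Caption Restoration (VCR) builds a bridge between [...] and OCR.
The distinctive challenge of VCR is that it requires the model to align visual and textual information precisely, in sharp contrast to the simple text-extraction task of OCR. OCR is mainly concerned with recognizing visible characters and does not need to understand their contextual relevance within the image's narrative. VCR, by contrast, requires the model to use the available partial pixel-level text hints together with the visual context to accurately reconstruct the occluded content. This tests not only the model's ability to process embedded text and visual elements, but also its ability to maintain internal consistency, much like the cognitive process by which humans understand and respond through context and visual cues.
Unlike VQA, a VCR question has a unique answer, so evaluation can be carried out by accuracy, which makes the metric more clear-cut. The difficulty of the task can be controlled by adjusting the proportion of text that is masked, which provides a rich test environment.
Like OCR, the VCR task can also serve as a training task for VLMs. The authors have open-sourced the transform code, which can generate a VCR task image for any given image-text pair.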
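Because each VCR question has a single correct answer, scoring can be as simple as an exact-match accuracy. The sketch below is a minimal illustration under stated assumptions: the function name `exact_match_accuracy` is hypothetical, and stripping surrounding whitespace before comparison is a choice made here, not something the article specifies.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str],
                         references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference exactly.

    Assumption: leading and trailing whitespace is ignored; everything
    else (case, punctuation) must match character for character.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0  # convention chosen here for an empty evaluation set
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Made-up restored text spans, for illustration only.
refs = ["the quick brown", "over the lazy", "dog"]
preds = ["the quick brown", "over the lazy ", "cat"]
score = exact_match_accuracy(preds, refs)
print(f"exact match: {score:.2f}")
```

A stricter or looser comparison (for example, case-insensitive matching) is a one-line change in the comparison, which is part of why a unique-answer task is easier to score than open-ended VQA.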



