
The multimodal model evaluation framework lmms-eval is released! Comprehensive coverage, low cost, zero pollution

The AIxiv column is where this site publishes academic and technical content. Over the past few years, the column has published more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work that you want to share, please feel free to contribute or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

As research on large models deepens, extending them to more modalities has become a hot topic in academia and industry. Recently released closed-source models such as GPT-4o and Claude 3.5 already have strong image understanding capabilities, and open-source models such as LLaVA-NeXT, MiniCPM, and InternVL are steadily closing the gap with them.

In this era of inflated claims, with "80,000 jin per mu" harvests and "one SoTA every 10 days", a multimodal evaluation framework that is easy to use, transparent in its standards, and reproducible has become increasingly important, and building one is not easy.

To address these problems, researchers from Nanyang Technological University's LMMs-Lab jointly open sourced LMMs-Eval, an evaluation framework designed specifically for large multimodal models (LMMs) that provides a one-stop, efficient solution for evaluating them.


  • Code repository: https://github.com/EvolvingLMMs-Lab/lmms-eval

  • Official homepage: https://lmms-lab.github.io/

  • Paper: https://arxiv.org/abs/2407.12772

  • Leaderboard: https://huggingface.co/spaces/lmms-lab/LiveBench

Since its release in March 2024, the LMMs-Eval framework has received collaborative contributions from the open source community, companies, and universities. It now has 1.1K stars on GitHub and more than 30 contributors, covers more than 80 datasets and more than 10 models, and continues to grow.


Standardized assessment framework

In order to provide a standardized assessment platform, LMMs-Eval includes the following features:

  1. Unified interface: LMMs-Eval improves and extends the text evaluation framework lm-evaluation-harness. By defining a unified interface for models, datasets, and evaluation metrics, it makes it easy for users to add new multimodal models and datasets.

  2. One-Click Launch: LMMs-Eval hosts over 80 (and growing) datasets on HuggingFace, carefully transformed from the original sources, including all variants, versions, and splits. Users do not need to make any preparations. With just one command, multiple data sets and models will be automatically downloaded and tested, and the results will be available in a few minutes.

  3. Transparent and reproducible: LMMs-Eval has a built-in unified logging tool. Each question answered by the model and whether it is correct or not will be recorded, ensuring reproducibility and transparency. It also facilitates comparison of the advantages and disadvantages of different models.
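
To make the first and third points concrete, here is a minimal sketch of what a unified model/task interface with per-sample logging could look like. It is not the LMMs-Eval API: the names `Model`, `Task`, `evaluate`, `exact_match`, and the toy `AlwaysYes` model are all hypothetical and exist only to illustrate the idea.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict, List


class Model(ABC):
    """Hypothetical unified model interface: question and image reference in, text out."""

    @abstractmethod
    def generate(self, question: str, image: str) -> str:
        ...


@dataclass
class Task:
    """Hypothetical task: a list of samples plus a per-sample metric."""
    name: str
    samples: List[Dict[str, str]]        # each sample has "question", "image", "answer"
    metric: Callable[[str, str], bool]   # (prediction, reference) -> correct?


def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive exact match, ignoring surrounding whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


def evaluate(model: Model, task: Task) -> Dict[str, object]:
    """Run every sample and log each prediction together with its correctness."""
    log = []
    for sample in task.samples:
        prediction = model.generate(sample["question"], sample["image"])
        log.append({
            "question": sample["question"],
            "prediction": prediction,
            "reference": sample["answer"],
            "correct": task.metric(prediction, sample["answer"]),
        })
    accuracy = sum(entry["correct"] for entry in log) / len(log)
    return {"task": task.name, "accuracy": accuracy, "samples": log}


class AlwaysYes(Model):
    """Toy model used only to exercise the interface."""

    def generate(self, question: str, image: str) -> str:
        return "yes"


toy_task = Task(
    name="toy_yes_no",
    samples=[
        {"question": "Is there a cat?", "image": "cat.png", "answer": "yes"},
        {"question": "Is there a dog?", "image": "cat.png", "answer": "no"},
    ],
    metric=exact_match,
)
result = evaluate(AlwaysYes(), toy_task)
```

Because every model implements the same `generate` method and every task carries its own metric, adding a new model or dataset in this sketch means writing one small class or one `Task` instance, and the per-sample log makes it possible to inspect exactly which questions a model got wrong.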

The vision of LMMs-Eval is that future multimodal models will no longer need their own data processing, inference, and submission code. In today's environment, where multimodal test sets are highly concentrated, writing such code for every model is unrealistic, and the resulting scores are hard to compare directly with those of other models. By adopting LMMs-Eval, model trainers can focus on improving and optimizing the model itself rather than spending time on evaluation and on aligning results.

The "Impossible Triangle" of Evaluation

The ultimate goal of LMMs-Eval is to find a way to evaluate LMMs that offers (1) wide coverage, (2) low cost, and (3) zero data leakage. However, even with LMMs-Eval, the author team found it difficult, or even impossible, to achieve all three at once.

As shown in the figure below, once they expanded the evaluation suite to more than 50 datasets, running a comprehensive evaluation across all of them became very time-consuming. These benchmarks are also susceptible to contamination during training. In response, the LMMs-Eval team proposed LMMs-Eval-Lite, which combines wide coverage with low cost, and designed LiveBench, which combines low cost with zero data leakage.


LMMs-Eval-Lite: Wide coverage lightweight evaluation


When evaluating large models, the sheer number of parameters and test tasks often drives up evaluation time and cost sharply, so people tend to evaluate on smaller datasets or on a handful of specific datasets. Such limited evaluation, however, often leaves gaps in the understanding of a model's capabilities. To balance evaluation diversity against evaluation cost, LMMs-Eval launched LMMs-Eval-Lite.

LMMs-Eval-Lite aims to build a streamlined benchmark set that gives fast, useful signals during model development and avoids the bloat of today's test suites. If a subset of an existing test set can be found on which the absolute scores and relative rankings of models stay similar to those on the full set, then pruning the dataset can be considered safe.
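
The pruning criterion above can be checked numerically. The sketch below is one possible check, not the procedure from the paper: it reports the largest absolute score gap between the full set and the subset, and the fraction of model pairs whose ordering is the same on both. The function name `compare_full_and_lite` and all scores are made-up illustrations.

```python
from itertools import combinations
from typing import Dict, Tuple


def compare_full_and_lite(full: Dict[str, float], lite: Dict[str, float]) -> Tuple[float, float]:
    """Return (largest absolute score gap, fraction of model pairs ranked the same way)."""
    models = sorted(full)
    max_gap = max(abs(full[m] - lite[m]) for m in models)
    pairs = list(combinations(models, 2))
    # A pair agrees when the sign of the score difference matches on both sets.
    agreeing = sum((full[a] - full[b]) * (lite[a] - lite[b]) > 0 for a, b in pairs)
    return max_gap, agreeing / len(pairs)


# Illustrative numbers only; these are not measured results.
full_scores = {"model_a": 62.0, "model_b": 55.5, "model_c": 48.0}
lite_scores = {"model_a": 60.5, "model_b": 56.0, "model_c": 47.0}
max_gap, rank_agreement = compare_full_and_lite(full_scores, lite_scores)
```

A small `max_gap` together with a `rank_agreement` close to 1.0 would suggest that the subset preserves both the absolute scores and the relative ranking; what thresholds count as "similar enough" is a judgment call that this sketch leaves open.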

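To find the salient points in a dataset, LMMs-Eval first uses CLIP and BGE models to convert the multimodal evaluation datasets into vector embeddings, then applies a k-center greedy clustering method to pick out the key data points. In testing, these smaller datasets showed evaluation capability similar to that of the full sets.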
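
The selection step can be sketched as a greedy k-center pass over embedding vectors. This is a minimal illustration under stated assumptions, not the LMMs-Eval implementation: the embeddings here are hand-written 2-D points rather than CLIP or BGE outputs, the distance is Euclidean, and the function name `k_center_greedy` and its `first` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def k_center_greedy(points: Sequence[Sequence[float]], k: int, first: int = 0) -> List[int]:
    """Pick k indices; each new pick is the point farthest from all picks so far."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        candidate = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(candidate)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[candidate]))
    return selected


# Toy stand-ins for embedding vectors: three loose groups of points.
embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0],
    [10.0, 0.0],
]
subset = k_center_greedy(embeddings, k=3)
```

On this toy input the three selected indices land in three different groups, which is the point of the method: the retained samples spread out to cover the embedding space instead of piling up in its densest region.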

LMMs-Eval then used the same method to build Lite versions covering more datasets, designed to help people save evaluation costs during development and judge model performance quickly.


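LiveBench: Dynamic testing of LMMs

Traditional benchmarks focus on static evaluation with fixed questions and answers. As multimodal research advances, open-source models often outscore commercial models such as GPT-4V in benchmark comparisons yet fall short in real user experience. Dynamic, user-driven evaluations such as Chatbot Arena and WildVision are increasingly popular for model evaluation, but they require collecting thousands of user preferences, which makes evaluation very costly.

The core idea of LiveBench is to evaluate model performance on a continuously updated dataset, achieving zero contamination while keeping costs low. The author team collects evaluation data from the web and built a pipeline that automatically gathers the latest global information from websites such as news outlets and community forums. To ensure that the information is timely and reliable, the team selected sources from more than 60 news outlets, including CNN, BBC, Japan's Asahi Shimbun, and China's Xinhua News Agency, as well as forums such as Reddit. The specific steps are as follows: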

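  1. Capture screenshots of the homepage and remove advertisements and non-news elements.

  2. Design question-and-answer sets using the strongest multimodal models currently available, such as GPT4-V, Claude-3-Opus, and Gemini-1.5-Pro. To ensure accuracy and relevance, the questions are reviewed and revised by a different model.

  3. The final Q&A set is reviewed manually. About 500 questions are collected each month, and 100 to 300 are kept as the final LiveBench question set.

  4. Use the scoring criteria of LLaVA-Wilder and Vibe-Eval: a judge model scores each response against the provided reference answer, on a scale of [1, 10]. The default judge model is GPT-4o, with Claude-3-Opus and Gemini 1.5 Pro as alternatives. The final reported result is the score converted to an accuracy metric ranging from 0 to 100.

  5. Going forward, multimodal models can also be viewed on a dynamically updated leaderboard, which is refreshed every month with the latest evaluation data and the latest evaluation results.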
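
The score conversion in step 4 can be illustrated with a short sketch. The article does not give the exact formula, so the mapping used here (mean judge score multiplied by 10) is an assumption, and the function name `to_percentage` and the sample scores are hypothetical.

```python
from statistics import mean
from typing import Iterable


def to_percentage(judge_scores: Iterable[int]) -> float:
    """Average per-question judge scores in [1, 10] and rescale to a 0-100 range.

    Assumption: a plain linear rescale (mean * 10); the real conversion may differ.
    """
    scores = list(judge_scores)
    if not scores:
        raise ValueError("at least one judge score is required")
    for score in scores:
        if not 1 <= score <= 10:
            raise ValueError(f"judge score out of range [1, 10]: {score}")
    return mean(scores) * 10.0


# Illustrative judge scores for four questions; not real LiveBench data.
livebench_score = to_percentage([7, 9, 4, 10])
```

Validating the range before averaging keeps a malformed judge response, for example a score of 0 or 11, from silently skewing the reported number.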
