DAMO Academy's mPLUG-Owl debuts: a modular multi-modal large model, catching up with GPT-4's multi-modal capabilities

Pure-text large models are flourishing, and work on multimodal large models has begun to emerge. GPT-4, ostensibly the strongest, has the multimodal ability to read images, but this capability has not yet been opened to the public, so the research community has begun to research and open source in this direction. Shortly after the advent of MiniGPT-4 and LLaVA, Alibaba DAMO Academy launched mPLUG-Owl, a large multimodal model built on a modular design.

mPLUG-Owl is the latest work in Alibaba DAMO Academy's mPLUG series. It continues the series' modular training approach and upgrades an LLM into a large multimodal model. Earlier works in the series, E2E-VLP, mPLUG, and mPLUG-2, were accepted at ACL 2021, EMNLP 2022, and ICML 2023 respectively; among them, mPLUG topped the VQA leaderboard with superhuman results.

The subject of this article is mPLUG-Owl. This work not only demonstrates excellent multimodal capabilities through a large number of examples, but also proposes OwlEval, the first comprehensive test set for vision-related instruction understanding. Existing models, including LLaVA, MiniGPT-4, BLIP-2, and the system-based MM-REACT, were compared through manual evaluation. The results show that mPLUG-Owl exhibits stronger multimodal capabilities, with outstanding performance in instruction understanding, multi-turn dialogue, and knowledge reasoning.


Paper link: https://arxiv.org/abs/2304.14178

Code link: https://github.com/X-PLUG/mPLUG-Owl

ModelScope experience address:

https://modelscope.cn/studios/damo/mPLUG-Owl/summary

HuggingFace experience address:

https://huggingface.co/spaces/MAGAer13/mPLUG-Owl

Multi-modal capability demonstration

We compare mPLUG-Owl with existing work to get a feel for its multimodal capabilities. It is worth mentioning that the test samples evaluated in this work are largely drawn from existing work, avoiding cherry-picking.

Figure 6 below shows mPLUG-Owl's strong multi-turn dialogue capabilities.


Figure 7 shows that mPLUG-Owl has strong reasoning capabilities.


Figure 9 shows some examples of joke explanations.


In this work, beyond evaluation and comparison, the research team also observed that mPLUG-Owl initially exhibits some unexpected capabilities, such as multi-image correlation, multilingual understanding, text recognition, and document understanding.

As shown in Figure 10, although no multi-image correlation data was used during training, mPLUG-Owl demonstrates a certain multi-image correlation capability.


As shown in Figure 11, although mPLUG-Owl uses only English data during training, it exhibits interesting multilingual capabilities. This may be because the language model in mPLUG-Owl is LLaMA, which gives rise to this phenomenon.


Although mPLUG-Owl was not trained on annotated document data, it still demonstrates a certain degree of text recognition and document understanding; test results are shown in Figure 12.


Method introduction

The overall architecture of mPLUG-Owl proposed in this work is shown in Figure 2.


Model structure: It consists of a visual foundation module (the open-source ViT-L), a visual abstractor module, and a pre-trained language model (LLaMA-7B). The visual abstractor module summarizes long, fine-grained image features into a small number of learnable tokens, thereby modeling visual information efficiently. The generated visual tokens are fed into the language model together with the text query to generate the corresponding response.
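The idea behind the visual abstractor can be sketched as a single cross-attention layer in which a small set of learnable query tokens attends over the ViT patch features. This is a minimal illustration only, not the released implementation (the real module stacks several such layers with feed-forward blocks), and the dimensions below are illustrative.

```python
import torch
import torch.nn as nn

class VisualAbstractor(nn.Module):
    """Compress a long sequence of patch features into a few learnable tokens.

    Hypothetical sketch of the mechanism described above; details (depth,
    feed-forward blocks, dimensions) differ in the actual mPLUG-Owl module.
    """
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # A small, fixed number of learnable query tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, dim), e.g. 257 ViT-L tokens per image.
        b = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries attend over all patch features, pooling them into num_queries tokens.
        out, _ = self.attn(q, patch_feats, patch_feats)
        return self.norm(out)  # (batch, num_queries, dim)

feats = torch.randn(2, 257, 1024)          # a batch of ViT-L patch features
visual_tokens = VisualAbstractor()(feats)  # (2, 64, 1024): 64 tokens per image
```

However many patches the encoder emits, the language model only ever sees the fixed number of query tokens, which keeps the sequence fed to the LLM short.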

Model training: a two-stage training scheme is adopted.

Stage 1: the main purpose is to first learn the alignment between the visual and language modalities. Unlike previous work, mPLUG-Owl argues that freezing the visual foundation module limits the model's ability to associate visual knowledge with textual knowledge. Therefore, mPLUG-Owl freezes only the LLM parameters in the first stage and uses LAION-400M, COYO-700M, CC, and MSCOCO to train the visual foundation module and the visual abstractor module.
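The stage-1 parameter split can be sketched as follows: freeze every LLM parameter and collect only the vision-side parameters for the optimizer. The helper name and the tiny stand-in modules are hypothetical; they merely illustrate the freezing pattern described above.

```python
import torch.nn as nn

def configure_stage1(vision_encoder: nn.Module,
                     abstractor: nn.Module,
                     llm: nn.Module) -> list:
    """Freeze the LLM; return the parameters that stage 1 actually trains."""
    for p in llm.parameters():
        p.requires_grad = False          # language model stays fixed
    for module in (vision_encoder, abstractor):
        for p in module.parameters():
            p.requires_grad = True       # vision side is trained
    return [p for m in (vision_encoder, abstractor) for p in m.parameters()]

# Tiny stand-ins for the real modules (ViT-L, visual abstractor, LLaMA-7B).
vit, abstractor, llama = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)
trainable = configure_stage1(vit, abstractor, llama)
assert all(not p.requires_grad for p in llama.parameters())
```

An optimizer built over `trainable` would then update only the visual foundation and abstractor modules, which is the inversion of the usual recipe that freezes the vision encoder instead.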

Stage 2: Continuing the finding in mPLUG and mPLUG-2 that mixed training on different modalities benefits both, mPLUG-Owl also uses pure-text instruction data (52k from Alpaca + 90k from Vicuna + 50k from Baize) and multimodal instruction data (150k from LLaVA) in the second-stage instruction fine-tuning. Through detailed ablation experiments, the authors verify the benefits that introducing pure-text instruction fine-tuning brings to instruction understanding and other aspects. In the second stage, the parameters of the visual foundation module, the visual abstractor module, and the original LLM are frozen; following LoRA, only an adapter structure with a small number of parameters is introduced into the LLM for instruction fine-tuning.
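The LoRA-style adapter mentioned above wraps a frozen linear layer with a trainable low-rank update, so only a small number of parameters change during fine-tuning. This is a generic sketch of the LoRA idea, not mPLUG-Owl's exact adapter; rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + s * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False      # original LLM weights stay frozen
        # Low-rank factors: A is small-random, B is zero, so the adapter
        # starts as an exact no-op and learns a delta during fine-tuning.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(16, 16))
y = layer(torch.randn(2, 16))
```

Because `B` is initialized to zero, the wrapped layer behaves exactly like the frozen base layer at the start of stage 2, and fine-tuning only ever touches the two small factor matrices.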

Experimental results

SOTA comparison

To compare the multimodal capabilities of different models, this work builds OwlEval, a multimodal instruction evaluation set. Since no suitable automated metrics currently exist, the models' responses were evaluated manually, following Self-Instruct. The scoring rubric is: A = "correct and satisfactory"; B = "some imperfections, but acceptable"; C = "understood the instruction but the response contains obvious errors"; D = "completely irrelevant or incorrect response".
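A rubric like this is straightforward to tabulate. The sketch below, using made-up ratings rather than the paper's actual data, shows how per-model grade counts might be aggregated for comparison.

```python
from collections import Counter

ORDER = "ABCD"  # A is best, D is worst

def summarize(ratings: list) -> dict:
    """Count how many responses received each grade, in rubric order."""
    counts = Counter(ratings)
    return {grade: counts.get(grade, 0) for grade in ORDER}

# Hypothetical ratings for two models on the same set of instructions.
ratings = {
    "mPLUG-Owl": ["A", "A", "B", "A", "C", "B"],
    "MiniGPT-4": ["B", "C", "B", "A", "D", "C"],
}

for model, rs in ratings.items():
    print(model, summarize(rs))
```

Comparing the grade distributions directly (rather than a single averaged score) preserves the ordinal nature of the A-D rubric.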

The comparison results are shown in Figure 3 below. The experiments show that mPLUG-Owl outperforms the existing OpenFlamingo, BLIP-2, LLaVA, and MiniGPT-4 on vision-related instruction-response tasks.


Multi-dimensional ability comparison

Multimodal instruction-response tasks involve a variety of abilities, such as instruction understanding, visual understanding, text understanding on images, and reasoning. To probe the models' capabilities at a fine-grained level, this article further defines six main abilities in multimodal scenarios and manually annotates each OwlEval test instruction with the abilities it requires and the abilities each model's response actually reflects.

The results are shown in Table 6 below. In this part of the experiments, the authors not only conducted ablation experiments on mPLUG-Owl to verify the effectiveness of the training strategy and the multimodal instruction fine-tuning data, but also compared against MiniGPT-4, the best-performing baseline in the previous experiment; the results show that mPLUG-Owl surpasses MiniGPT-4 across all abilities.



Statement
This article is reproduced from 51CTO.COM. In case of infringement, please contact admin@php.cn for removal.