With just 200 selected data samples, the same model fine-tuned surpasses MiniGPT-4

GPT-4 has demonstrated its extraordinary ability to generate detailed and accurate image descriptions, marking the arrival of a new era of language and visual processing.

As a result, multimodal large language models (MLLMs) similar to GPT-4 have recently emerged and become a hot research field. The core idea of this line of research is to use a powerful LLM as a cognitive framework for performing multimodal tasks. The unexpectedly strong performance of MLLMs not only surpasses traditional methods, but also makes them a potential path toward artificial general intelligence.

To create a useful MLLM, one trains on large-scale paired image-text data and vision-language fine-tuning data on top of a frozen LLM (such as LLaMA or Vicuna) and a visual representation (such as CLIP or BLIP-2).

The training of an MLLM is usually divided into two stages: pre-training and fine-tuning. The purpose of pre-training is to allow the MLLM to acquire a large amount of knowledge, while fine-tuning teaches the model to better understand human intentions and generate accurate responses.

To enhance an MLLM's ability to understand vision-language input and follow instructions, a powerful fine-tuning technique called instruction tuning has recently emerged. This technique helps align models with human preferences so that they produce the desired results under a variety of different instructions. A particularly constructive direction for instruction tuning is to introduce image captioning, visual question answering (VQA), and visual reasoning datasets in the fine-tuning stage. Previous techniques such as InstructBLIP and Otter have used a series of vision-language datasets for visual instruction tuning and have achieved promising results.

However, it has been observed that commonly used multimodal instruction-tuning datasets contain a large number of low-quality instances whose responses are incorrect or irrelevant. Such data is misleading and can negatively impact model performance.

This problem prompted researchers to explore whether robust performance can be achieved using small amounts of high-quality instruction-following data.

Some recent studies have obtained encouraging results, indicating that this direction has potential. For example, Zhou et al. proposed LIMA, a language model fine-tuned on high-quality data carefully selected by human experts. That study showed that large language models can achieve satisfactory results even with a limited amount of high-quality instruction-following data. The researchers concluded: less is more when it comes to alignment. However, there has been no clear guideline on how to select suitable high-quality datasets for fine-tuning multimodal language models.

A research team from the Qingyuan Research Institute of Shanghai Jiao Tong University and Lehigh University has filled this gap by proposing a robust and effective data selector. The selector automatically identifies and filters low-quality vision-language data, ensuring that the most relevant and informative samples are used for model training.

Paper address: https://arxiv.org/abs/2308.12067

The researchers say the focus of this study is to explore the effectiveness of a small but high-quality set of instruction-tuning data for fine-tuning multimodal large language models. In addition, the paper introduces several new metrics specifically designed to evaluate the quality of multimodal instruction data. After performing spectral clustering on the images, the data selector computes, for each piece of vision-language data, a weighted score that combines the CLIP score, GPT score, reward score, and answer length.
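The weighted combination described above can be sketched as follows. This is a minimal illustration, not the authors' code: the weight values, the normalization of answer length, and the function name are all illustrative assumptions.

```python
# Sketch of a weighted quality score combining the per-sample metrics mentioned
# in the text. Weights and the length normalization are assumptions for
# illustration, not the values used in the paper.

def quality_score(clip_score, gpt_score, reward_score, answer_length,
                  weights=(0.25, 0.25, 0.25, 0.25), max_length=512):
    """Return a single weighted quality score for one vision-language sample."""
    # Normalize answer length into [0, 1] so all four metrics share a scale.
    length_score = min(answer_length / max_length, 1.0)
    metrics = (clip_score, gpt_score, reward_score, length_score)
    return sum(w * m for w, m in zip(weights, metrics))

# Example: a sample with strong CLIP/GPT scores and a medium-length answer.
score = quality_score(clip_score=0.8, gpt_score=0.9,
                      reward_score=0.7, answer_length=128)
```

Samples can then be ranked by this score within each cluster, so that quality and diversity are balanced during selection.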

By applying this selector to the 3,400 raw samples originally used to fine-tune MiniGPT-4, the researchers found that most of the data suffered from quality problems. The selector yielded a much smaller curated subset of just 200 samples, only 6% of the original dataset. They then fine-tuned on this subset with the same training configuration as MiniGPT-4, obtaining a new model: InstructionGPT-4.

The researchers call this an exciting finding because it shows that in vision-language instruction tuning, data quality matters more than quantity. This shift toward emphasizing data quality offers a new, more effective paradigm for improving MLLM fine-tuning.

The researchers conducted rigorous experiments, evaluating the fine-tuned MLLM on seven diverse and complex open-domain multimodal datasets, including Flickr-30k, ScienceQA, and VSR. They compared the inference performance on various multimodal tasks of models fine-tuned with different dataset selection methods (using the data selector, randomly sampling the dataset, and using the complete dataset). The results demonstrate the superiority of InstructionGPT-4.

It should also be noted that the evaluator used in the study is GPT-4. Specifically, the researchers used prompts to turn GPT-4 into an evaluator, which compares, on the LLaVA-Bench test set, the responses of InstructionGPT-4 and the original MiniGPT-4.

They found that although InstructionGPT-4 was fine-tuned on only 6% of the instruction-following data originally used by MiniGPT-4, its responses were equal to or better than MiniGPT-4's 73% of the time.

The main contributions of this paper include:

  • By selecting 200 (approximately 6%) high-quality instruction-following samples to train InstructionGPT-4, the researchers show that multimodal large language models can achieve better alignment with less instruction data.
  • The paper proposes a data selector that uses simple and interpretable principles to select high-quality multimodal instruction-following data for fine-tuning. The approach aims for both effectiveness and portability in evaluating and curating data subsets.
  • Experiments show that this simple technique handles different tasks well. Compared with the original MiniGPT-4, InstructionGPT-4, fine-tuned on only the 6% of data that passed filtering, achieves better performance on a variety of tasks.

Method

The goal of this research is to propose a simple and portable data selector that can automatically select a subset from the original fine-tuning dataset. To this end, the researchers defined a selection principle focused on the diversity and quality of multimodal datasets. A brief introduction follows.

Selection Principle

To effectively train an MLLM, selecting useful multimodal instruction data is critical. The researchers propose two key principles for selecting optimal instruction data: diversity and quality. For diversity, their approach is to cluster image embeddings in order to separate the data into different groups. For quality, they adopt several key metrics for efficiently evaluating multimodal data.

Data Selector

Given a vision-language instruction dataset and a pre-trained MLLM (such as MiniGPT-4 or LLaVA), the ultimate goal of the data selector is to identify a subset for fine-tuning that improves the pre-trained MLLM.

To select this subset and ensure its diversity, the researchers first use a clustering algorithm to divide the original dataset into multiple categories.
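The diversity step can be sketched with a toy clustering of image embeddings. The paper applies spectral clustering; the pure-Python k-means below is substituted only to keep the example dependency-free, and the data is invented for illustration.

```python
# Dependency-free k-means sketch of the diversity step: cluster image
# embeddings so samples can later be drawn from every group. The paper uses
# spectral clustering; k-means stands in here purely for illustration.
import random

def kmeans(embeddings, k, iters=20, seed=0):
    """Cluster embedding vectors into k groups; returns a label per embedding."""
    rng = random.Random(seed)
    centers = rng.sample(embeddings, k)
    labels = [0] * len(embeddings)
    for _ in range(iters):
        # Assign each embedding to its nearest center (squared Euclidean distance).
        for i, e in enumerate(embeddings):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(e, centers[c])))
        # Recompute each center as the mean of its assigned embeddings.
        for c in range(k):
            members = [embeddings[i] for i in range(len(embeddings)) if labels[i] == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Two well-separated groups of toy "image embeddings" fall into two clusters.
labels = kmeans([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]], k=2)
```

Grouping the data this way ensures that the final subset is not dominated by one visually homogeneous slice of the dataset.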

To ensure the quality of the selected multimodal instruction data, the researchers developed a set of evaluation metrics, shown in Table 1 below.

Table 2 shows the weight of each different score when calculating the final score.

Algorithm 1 shows the entire workflow of the data selector.
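The overall workflow can be sketched as: cluster for diversity, score for quality, then keep the top-scoring samples from every cluster until the budget (here, 200 samples) is reached. The function name and the equal per-cluster quota below are illustrative assumptions, not the paper's exact procedure.

```python
# Simplified sketch of the selector workflow: given cluster labels and quality
# scores per sample, keep the best samples from each cluster up to a budget.
# The equal per-cluster quota is an assumption for illustration.

def select_subset(samples, labels, scores, budget):
    """samples: list of items; labels: cluster id per item; scores: quality per item."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    # Take an equal quota of top-scoring samples from each cluster.
    quota = max(1, budget // len(clusters))
    chosen = []
    for members in clusters.values():
        members.sort(key=lambda i: scores[i], reverse=True)
        chosen.extend(members[:quota])
    return [samples[i] for i in chosen[:budget]]

# Four samples in two clusters; a budget of 2 keeps the best sample per cluster.
subset = select_subset(["a", "b", "c", "d"], labels=[0, 0, 1, 1],
                       scores=[0.9, 0.2, 0.4, 0.8], budget=2)
# subset == ["a", "d"]
```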
Experiment

The datasets used in the experimental evaluation are shown in Table 3 below.
Benchmark Score

Table 4 compares the performance of the MiniGPT-4 baseline model, MiniGPT-4 fine-tuned with randomly sampled data, and InstructionGPT-4 fine-tuned with data chosen by the data selector.

It can be observed that InstructionGPT-4 achieves the best average performance. Specifically, InstructionGPT-4 outperforms the baseline model by 2.12% on ScienceQA, and by 2.49% and 4.19% on OKVQA and VCR-OC respectively.

Furthermore, InstructionGPT-4 outperforms models trained with random samples on all tasks except VSR. By evaluating and comparing these models across a range of tasks, one can discern their respective capabilities and confirm that the newly proposed data selector effectively identifies high-quality data.

Such comprehensive analysis shows that judicious data selection can improve the model's zero-shot performance across a variety of different tasks.

GPT-4 Assessment

LLMs have an inherent positional bias (on this, see the article "Are language models quietly being lazy? New research: when the context is too long, the model skips the middle"). The researchers therefore took measures to address this problem. Specifically, they evaluated with both response orderings, placing the response generated by InstructionGPT-4 either before or after the response generated by MiniGPT-4. To establish clear evaluation criteria, they adopted a "Win-Tie-Lose" framework:

1) Win: InstructionGPT-4 wins in both orderings, or wins once and ties once;
2) Tie: InstructionGPT-4 and MiniGPT-4 tie in both orderings, or each wins once;
3) Lose: InstructionGPT-4 loses in both orderings, or loses once and ties once.
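The rule above can be sketched as a small scoring function: each question is judged twice (with the two models' answers swapped to offset positional bias), and the two per-order outcomes are folded into one verdict for InstructionGPT-4.

```python
# Sketch of the "Win-Tie-Lose" rule: combine the two per-ordering outcomes
# for InstructionGPT-4 into a single verdict.

def verdict(outcome_a, outcome_b):
    """outcome_a/outcome_b: 'win', 'tie' or 'lose' for InstructionGPT-4
    under the two response orderings; returns the combined verdict."""
    points = {"win": 1, "tie": 0, "lose": -1}
    total = points[outcome_a] + points[outcome_b]
    if total > 0:       # two wins, or one win and one tie
        return "win"
    if total < 0:       # two losses, or one loss and one tie
        return "lose"
    return "tie"        # two ties, or one win and one loss

# A win in one ordering and a tie in the other counts as an overall win.
result = verdict("win", "tie")  # → "win"
```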

Figure 1 shows the results of this evaluation method.

Out of 60 questions, InstructionGPT-4 won 29, lost 16, and tied the remaining 15. This demonstrates that InstructionGPT-4 is significantly better than MiniGPT-4 in response quality.

Ablation study

Table 5 presents the results of the ablation experiments, from which the importance of the clustering algorithm and of each evaluation score can be seen.

Demo

To gain a deeper understanding of InstructionGPT-4's ability to understand visual input and generate reasonable responses, the researchers also conducted a comparative evaluation of the image understanding and dialogue capabilities of InstructionGPT-4 and MiniGPT-4. The analysis is based on a striking example involving describing and further understanding an image; the results are shown in Table 6.

InstructionGPT-4 is better at providing comprehensive image descriptions and identifying interesting aspects of an image. Compared with MiniGPT-4, InstructionGPT-4 is more capable of recognizing text present in images; here, InstructionGPT-4 correctly points out the phrase in the image: "Monday, just Monday."

See the original paper for more details.
