LLaVA-1.6, which catches up with Gemini Pro and improves reasoning and OCR capabilities, is too powerful-AI-php.cn

Home

Technology peripherals

LLaVA-1.6, which catches up with Gemini Pro and improves reasoning and OCR capabilities, is too powerful

PHPz

Feb 01, 2024 pm 04:51 PM

Modeltrain

In April last year, researchers from the University of Wisconsin-Madison, Microsoft Research, and Columbia University jointly released LLaVA (Large Language and Vision Assistant). Although LLaVA is only trained with a small multi-modal instruction data set, it shows very similar inference results to GPT-4 on some samples. Then in October, they launched LLaVA-1.5, which refreshed the SOTA in 11 benchmarks with simple modifications to the original LLaVA. The results of this upgrade are very exciting, bringing new breakthroughs to the field of multi-modal AI assistants.

The research team announced the launch of LLaVA-1.6 version, which has made major performance improvements in reasoning, OCR and world knowledge. This version of LLaVA-1.6 even outperforms the Gemini Pro in multiple benchmarks.

赶超Gemini Pro，提升推理、OCR能力的LLaVA-1.6太强了

##demo address: https://llava.hliu.cc/
Project address: https://github.com/haotian-liu/LLaVA

Compared with LLaVA-1.5, LLaVA-1.6 has the following improvements:

Increases the input image resolution by 4 times, supports three aspect ratios, up to Up to 672x672, 336x1344, 1344x336 resolution. This enables LLaVA-1.6 to capture more visual details.
LLaVA-1.6 gains better visual reasoning and OCR capabilities through improved visual instructions to adjust data mixing.
Better visual dialogue, more scenarios, covering different applications. LLaVA-1.6 has mastered more world knowledge and has better logical reasoning ability.
Use SGLang for efficient deployment and inference.

赶超Gemini Pro，提升推理、OCR能力的LLaVA-1.6太强了

## Source: https://twitter.com/imhaotian/status/1752621754273472927

LLaVA-1.6 is fine-tuned and optimized based on LLaVA-1.5. It retains the simple design and efficient data processing capabilities of LLaVA-1.5, and continues to use less than 1M visual instruction tuning samples. By using 32 A100 graphics cards, the largest 34B model was trained in approximately 1 day. In addition, LLaVA-1.6 utilizes 1.3 million data samples, and its calculation/training data cost is only 100-1000 times that of other methods. These improvements make LLaVA-1.6 a more efficient and cost-effective version.

赶超Gemini Pro，提升推理、OCR能力的LLaVA-1.6太强了

LLaVA-1.6 achieves SOTA performance compared to open source LMMs like CogVLM or Yi-VL. Compared to commercial products, LLaVA-1.6 is comparable to Gemini Pro and better than Qwen-VL-Plus in selected benchmarks.

赶超Gemini Pro，提升推理、OCR能力的LLaVA-1.6太强了

It is worth mentioning that LLaVA-1.6 demonstrates strong zero-shot Chinese capabilities. It achieves SOTA performance on the multi-modal benchmark MMBench-CN.

Method Improvement

Dynamic High Resolution

Research Team The LLaVA-1.6 model was designed at high resolution to maintain its data efficiency. When provided with high-resolution images and detail-preserving representations, the model's ability to perceive complex details in images improves significantly. It reduces model hallucination when faced with low-resolution images, i.e. guessing the imagined visual content.

赶超Gemini Pro，提升推理、OCR能力的LLaVA-1.6太强了

Data Mixing

High quality user command data. The study’s definition of high-quality visual instruction following data depends on two main criteria: first, the diversity of task instructions, ensuring that the broad range of user intentions that may be encountered in real-life scenarios is adequately represented, especially It is during the model deployment phase. Second, prioritization of responses is critical, aiming to solicit favorable user feedback.

Therefore, the study considered two data sources:

Existing GPT-V data (LAION-GPT-V and ShareGPT -4V);

In order to further promote better visual dialogue in more scenarios, the research team collected a small 15K visual instruction tuning data set covering different applications, carefully filtering samples that may have privacy issues or may be harmful, And use GPT-4V to generate the response.

Multimodal document/chart data. (1) Remove TextCap from the training data because the research team realized that TextCap uses the same training image set as TextVQA. This allowed the research team to better understand the model's zero-shot OCR capabilities when evaluating TextVQA. In order to maintain and further improve the OCR capabilities of the model, this study replaced TextCap with DocVQA and SynDog-EN. (2) With Qwen-VL-7B-Chat, this study further adds ChartQA, DVQA, and AI2D for better understanding of plots and charts.

The research team also stated that in addition to Vicuna-1.5 (7B and 13B), it is also considering using more LLM solutions, including Mistral-7B and Nous-Hermes-2-Yi-34B, to Enable LLaVA to support a wider range of users and more scenarios.

赶超Gemini Pro，提升推理、OCR能力的LLaVA-1.6太强了

The above is the detailed content of LLaVA-1.6, which catches up with Gemini Pro and improves reasoning and OCR capabilities, is too powerful. For more information, please follow other related articles on the PHP Chinese website!

Statement

This article is reproduced at:51CTO.COM. If there is any infringement, please contact admin@php.cn delete

Let's Dance: Structured Movement To Fine-Tune Our Human Neural NetsApr 27, 2025 am 11:09 AM

Scientists have extensively studied human and simpler neural networks (like those in C. elegans) to understand their functionality. However, a crucial question arises: how do we adapt our own neural networks to work effectively alongside novel AI s

New Google Leak Reveals Subscription Changes For Gemini AIApr 27, 2025 am 11:08 AM

Google's Gemini Advanced: New Subscription Tiers on the Horizon Currently, accessing Gemini Advanced requires a $19.99/month Google One AI Premium plan. However, an Android Authority report hints at upcoming changes. Code within the latest Google P

How Data Analytics Acceleration Is Solving AI's Hidden BottleneckApr 27, 2025 am 11:07 AM

Despite the hype surrounding advanced AI capabilities, a significant challenge lurks within enterprise AI deployments: data processing bottlenecks. While CEOs celebrate AI advancements, engineers grapple with slow query times, overloaded pipelines, a

MarkItDown MCP Can Convert Any Document into Markdowns!Apr 27, 2025 am 09:47 AM

Handling documents is no longer just about opening files in your AI projects, it’s about transforming chaos into clarity. Docs such as PDFs, PowerPoints, and Word flood our workflows in every shape and size. Retrieving structured

How to Use Google ADK for Building Agents? - Analytics VidhyaApr 27, 2025 am 09:42 AM

Harness the power of Google's Agent Development Kit (ADK) to create intelligent agents with real-world capabilities! This tutorial guides you through building conversational agents using ADK, supporting various language models like Gemini and GPT. W

Use of SLM over LLM for Effective Problem Solving - Analytics VidhyaApr 27, 2025 am 09:27 AM

summary: Small Language Model (SLM) is designed for efficiency. They are better than the Large Language Model (LLM) in resource-deficient, real-time and privacy-sensitive environments. Best for focus-based tasks, especially where domain specificity, controllability, and interpretability are more important than general knowledge or creativity. SLMs are not a replacement for LLMs, but they are ideal when precision, speed and cost-effectiveness are critical. Technology helps us achieve more with fewer resources. It has always been a promoter, not a driver. From the steam engine era to the Internet bubble era, the power of technology lies in the extent to which it helps us solve problems. Artificial intelligence (AI) and more recently generative AI are no exception

How to Use Google Gemini Models for Computer Vision Tasks? - Analytics VidhyaApr 27, 2025 am 09:26 AM

Harness the Power of Google Gemini for Computer Vision: A Comprehensive Guide Google Gemini, a leading AI chatbot, extends its capabilities beyond conversation to encompass powerful computer vision functionalities. This guide details how to utilize

Gemini 2.0 Flash vs o4-mini: Can Google Do Better Than OpenAI?Apr 27, 2025 am 09:20 AM

The AI landscape of 2025 is electrifying with the arrival of Google's Gemini 2.0 Flash and OpenAI's o4-mini. These cutting-edge models, launched weeks apart, boast comparable advanced features and impressive benchmark scores. This in-depth compariso

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Assassin's Creed Shadows: Seashell Riddle Solution

1 months agoByDDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

3 weeks agoByDDD

Where to find the Crane Control Keycard in Atomfall

1 months agoByDDD

How to fix KB5055523 fails to install in Windows 11?

2 weeks agoByDDD

InZoi: How To Apply To School And University

3 weeks agoByDDD

Hot Tools

SublimeText3 English version

Recommended: Win version, supports code prompts!

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.