Significantly surpassing SFT, the secret behind o1/DeepSeek-R1 can also be used in multimodal large models

Barbara Streisand

Mar 12, 2025 pm 01:03 PM

Researchers from Shanghai Jiao Tong University, Shanghai AI Lab and the Chinese University of Hong Kong have released Visual-RFT (Visual Reinforcement Fine-Tuning), an open source project that needs only a small amount of data to significantly improve the performance of large vision-language models (LVLMs). Visual-RFT combines DeepSeek-R1's rule-based reinforcement learning approach with OpenAI's reinforcement fine-tuning (RFT) paradigm, extending this approach from the text domain to the visual domain.

By designing rule-based rewards for tasks such as fine-grained visual classification and object detection, Visual-RFT removes the restriction that has kept the DeepSeek-R1 method confined to text, mathematical reasoning and similar domains, and offers a new way to train LVLMs.
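As a rough illustration of how rule-based rewards can drive reinforcement learning, the sketch below normalizes the rewards of several sampled responses to the same prompt within their group, in the style of the GRPO algorithm used by DeepSeek-R1. The article does not say which optimizer Visual-RFT uses, so treating it as GRPO-style is an assumption here, and the function name `group_relative_advantages`, the `eps` parameter and the use of the population standard deviation are all hypothetical choices.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rule-based rewards within one group of sampled responses.

    Assumption: GRPO-style normalization (reward minus group mean, divided by
    group standard deviation). This is a sketch, not the project's own code.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every response earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers, two of which earned the full reward
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that score above the group average get a positive advantage and are reinforced; those below the average are discouraged. When all responses earn the same reward, every advantage is zero and the group contributes no learning signal.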

Advantages of Visual-RFT:

Compared with traditional visual instruction fine-tuning (SFT) methods, Visual-RFT has the following significant advantages:

Few-shot learning ability: effective fine-tuning is possible with only 10 to 1,000 training samples.
Stronger generalization: when data is limited, it outperforms SFT.

The researchers evaluated Visual-RFT on multiple visual perception tasks (detection, classification, grounding, etc.). The results show that Visual-RFT delivers significant performance gains and transfers its capabilities readily, even in open-vocabulary and few-shot settings.

The researchers designed a verifiable reward for each type of task: an IoU-based (intersection over union) reward for detection and grounding tasks, and a reward based on classification correctness for classification tasks.
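The two kinds of verifiable reward can be sketched roughly as follows. This is a minimal illustration, not the project's actual reward code: the function names, the `(x1, y1, x2, y2)` box format, the averaging of best-match IoU over predicted boxes, and the case-insensitive label comparison are all assumptions made for the example.

```python
from typing import Sequence

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_reward(pred_boxes: Sequence[Box], gt_boxes: Sequence[Box]) -> float:
    """Hypothetical IoU reward: mean over predictions of the best IoU with any ground-truth box."""
    if not pred_boxes or not gt_boxes:
        return 0.0
    best = [max(iou(p, g) for g in gt_boxes) for p in pred_boxes]
    return sum(best) / len(best)


def classification_reward(pred_label: str, gt_label: str) -> float:
    """Hypothetical correctness reward: 1.0 for a matching label, else 0.0."""
    return 1.0 if pred_label.strip().lower() == gt_label.strip().lower() else 0.0
```

Both rewards can be computed by a fixed rule from the model output and the annotation, with no learned reward model involved, which is what makes them verifiable.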

In the reasoning grounding task, Visual-RFT shows strong visual reasoning ability, for example accurately locating in an image the waterproof goggles that an athlete needs to wear.

Experimental results:

Experiments based on the Qwen2-VL 2B/7B models show that Visual-RFT outperforms SFT on open-vocabulary object detection, few-shot detection, fine-grained classification and reasoning grounding tasks. Even for detecting a specific anime character (such as Slime), Visual-RFT needs only a small amount of data.

Open source information:

The Visual-RFT project is open source and includes the training code, the evaluation code and the data.

Project address: https://www.php.cn/link/ec56522bc9c2e15be17d11962eeec453
