Significantly surpassing SFT, the secret behind o1/DeepSeek-R1 can also be used in multimodal large models

Barbara Streisand
2025-03-12

Researchers from Shanghai Jiao Tong University, Shanghai AI Lab and the Chinese University of Hong Kong have released Visual-RFT (Visual Reinforcement Fine-Tuning), an open-source project that needs only a small amount of data to significantly improve the performance of large vision-language models (LVLMs). Visual-RFT combines DeepSeek-R1's rule-based reinforcement learning approach with OpenAI's reinforcement fine-tuning (RFT) paradigm, successfully extending this approach from the text domain to the visual domain.
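DeepSeek-R1's rule-based reinforcement learning samples a group of candidate responses per prompt, scores each with a verifiable rule-based reward, and normalizes the rewards within the group to obtain advantages (GRPO-style), so no learned value model is needed. Below is a minimal sketch of that normalization step only, assuming group-wise mean/std normalization; the function and variable names are illustrative and not taken from the Visual-RFT code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each sampled response's verifiable
    reward against the mean and std of its own group (responses to the
    same prompt).

    rewards: shape (num_prompts, group_size), e.g. IoU for detection
             or 0/1 correctness for classification.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each
rewards = torch.tensor([[0.9, 0.2, 0.7, 0.1],
                        [1.0, 1.0, 0.0, 0.0]])
print(group_relative_advantages(rewards))
```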

By designing corresponding rule-based rewards for tasks such as fine-grained visual classification and object detection, Visual-RFT overcomes the limitation that the DeepSeek-R1 approach was confined to text, mathematical reasoning and similar domains, providing a new route for LVLM training.

Advantages of Visual-RFT:

Compared with traditional visual instruction fine-tuning (SFT), Visual-RFT offers the following significant advantages:

  • Few-shot learning ability: effective fine-tuning with only 10 to 1,000 training samples.
  • Stronger generalization: outperforms SFT in scenarios with limited data.

The researchers validated Visual-RFT on multiple visual perception tasks (detection, classification, grounding, etc.). The results show that Visual-RFT delivers significant performance improvements and transfers capabilities easily, even under open-vocabulary and few-shot settings.

The researchers designed corresponding verifiable rewards for the different tasks: IoU-based rewards for detection and grounding, and correctness-based rewards for classification.
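A minimal sketch of what such verifiable rewards could look like (the exact formulas and any additional format rewards used by Visual-RFT may differ; all names below are illustrative):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_box, gt_box):
    """IoU-based reward for detection/grounding: higher overlap, higher reward."""
    return iou(pred_box, gt_box)

def classification_reward(pred_label: str, gt_label: str) -> float:
    """Correctness-based reward for classification: 1 if the predicted
    class matches the ground truth, else 0."""
    return 1.0 if pred_label.strip().lower() == gt_label.strip().lower() else 0.0

# Example usage
print(detection_reward((10, 10, 50, 50), (12, 8, 48, 52)))        # ~0.83
print(classification_reward("Golden Retriever", "golden retriever"))  # 1.0
```

Because both rewards are computed by fixed rules against ground truth, no separately trained reward model is required.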

On reasoning grounding tasks, Visual-RFT demonstrates strong visual reasoning ability, for example accurately locating the goggles an athlete would need to wear in an image.

Experimental results:

Experiments based on the Qwen2-VL 2B/7B models show that Visual-RFT outperforms SFT on open-vocabulary object detection, few-shot detection, fine-grained classification and reasoning grounding tasks. Even detecting a specific anime character (such as Slime) requires only a small amount of data with Visual-RFT.
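For context, a minimal Hugging Face Transformers sketch of running the Qwen2-VL 2B base model on a grounding-style prompt could look like the following. This is plain inference code, not the Visual-RFT training pipeline; the image path and prompt are hypothetical.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Hypothetical image and grounding-style prompt
image = Image.open("athlete.jpg")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Locate the goggles in the image and output a bounding box."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens before decoding the model's answer
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```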

Open source information:

The Visual-RFT project is open source and includes the training and evaluation code as well as the data.

Project address: https://www.php.cn/link/ec56522bc9c2e15be17d11962eeec453
