
Yan Shuicheng takes charge and lays out the ultimate form of the 'universal visual multimodal large model': unified understanding/generation/segmentation/editing

WBOY
2024-04-25

Recently, Professor Yan Shuicheng's team jointly released and open-sourced Vitron, a universal pixel-level visual multimodal large language model.


[Figure: overview of Vitron's support for the four major vision task groups and its key advantages]

Project homepage & Demo: https://www.php.cn/link/d8a3b2dde3181c8257e2e45efbd1e8ae

Paper link: https://www.php.cn/link/0ec5ba872f1179835987f9028c4cc4df

Open source code: https://www.php.cn/link/26d6e896db39edc7d7bdd357d6984c95
This is a heavyweight, general-purpose visual multimodal large model that supports a series of visual tasks from visual understanding to visual generation, from low level to high level. It tackles the image/video segmentation problem that has long plagued the large language model industry, and provides a pixel-level general visual multimodal large model that comprehensively unifies the understanding, generation, segmentation, and editing of static images and dynamic video content. This lays the foundation for the ultimate form of the next-generation general visual large model and marks another big step toward artificial general intelligence (AGI) for large models.

As a unified pixel-level visual multimodal large language model, Vitron provides comprehensive support for visual tasks from low level to high level: it can handle complex visual tasks and understand and generate image and video content, offering powerful visual understanding and task-execution capabilities. At the same time, Vitron supports continuous operation with users, enabling flexible human-computer interaction and demonstrating great potential toward a more unified visual multimodal general model.

Vitron's paper, code, and demo have all been made public. The unique advantages it demonstrates in comprehensiveness, technical innovation, human-computer interaction, and application potential not only advance the development of multimodal large models, but also point to a new direction for future research on visual large models.

The development of visual large language models (LLMs) has made gratifying progress. The community increasingly believes that building more general and powerful multimodal large models (MLLMs) is a necessary path toward artificial general intelligence (AGI). However, there remain key challenges on the road to a multimodal generalist model. For example, much of the existing work does not achieve fine-grained, pixel-level visual understanding, lacks unified support for both images and videos, or provides insufficient support for the various visual tasks, falling far short of a universal large model. To fill this gap, the team recently jointly released and open-sourced Vitron, a universal pixel-level visual multimodal large language model. Vitron supports a series of visual tasks from visual understanding to visual generation, from low level to high level, including comprehensive understanding, generation, segmentation, and editing of static images and dynamic video content.

The figure above comprehensively depicts Vitron's functional support for the four major vision-related task groups, as well as its key advantages. Vitron also supports continuous operation with users, enabling flexible human-computer interaction. This project demonstrates the great potential of a more unified visual multimodal general model, laying the foundation for the ultimate form of the next generation of general visual large models. Vitron's paper, code, and demo are all now public.


The unified ultimate multi-modal large language model

In recent years, large language models (LLMs) have demonstrated unprecedentedly powerful capabilities and have gradually been validated as a technical route toward AGI. Multimodal large language models (MLLMs), which extend pure language-based LLMs with modules capable of visual perception, are developing rapidly across many communities. Many MLLMs with strong image understanding have been built, such as BLIP-2, LLaVA, and MiniGPT-4, while MLLMs focused on video understanding have also been released, such as VideoChat, Video-LLaMA, and Video-LLaVA.

Subsequently, researchers have mainly tried to further expand the capabilities of MLLMs along two dimensions. On the one hand, they have sought to deepen MLLMs' visual understanding, moving from coarse instance-level understanding to pixel-level fine-grained understanding of images in order to achieve regional grounding capabilities, as in GLaMM, PixelLM, NExT-Chat, and MiniGPT-v2.

On the other hand, researchers have tried to expand the range of visual functions that MLLMs can support. Some work has begun to study how MLLMs can not only understand input visual signals but also generate visual content as output. For example, MLLMs such as GILL and Emu can flexibly generate image content, while GPT4Video and NExT-GPT realize video generation.

At present, the artificial intelligence community has gradually reached a consensus that visual MLLMs will inevitably develop toward greater unification and stronger capabilities. However, despite the numerous MLLMs developed by the community, a clear gap still exists.

1. Almost all existing visual LLMs treat images and videos as different entities and either support only images or only videos.

Researchers argue that vision should encompass both static images and dynamic videos: both are core components of the visual world and are even interchangeable in most scenarios. It is therefore necessary to build a unified MLLM framework that supports both the image and video modalities.

2. Current MLLMs' support for visual functions is still insufficient.

Most models are only capable of understanding, or at most of generating images or videos. Researchers believe that future MLLMs should be general large language models that cover a wider range of visual tasks and operations, achieving unified support for all vision-related tasks and "one for all" capability. This is crucial for practical applications, especially in visual creation, which often involves a series of iterative and interactive operations.

For example, a user often starts with text, turning an idea into visual content through text-to-image generation; the initial idea is then refined with more detail through further fine-grained image editing; next, dynamic content is created by generating a video from the image; and finally, several rounds of iterative interaction, such as video editing, polish the creation.

[Table: capability comparison of representative existing visual MLLMs]

The table above briefly summarizes the capabilities of existing visual MLLMs (only a representative subset of models is listed, and coverage is incomplete). To bridge these gaps, the team proposes Vitron, a universal pixel-level visual MLLM.

Vitron system architecture: three key modules

The overall framework of Vitron is shown in the figure below. Vitron adopts an architecture similar to existing MLLMs, consisting of three key parts: 1) a front-end visual & language encoding module, 2) a central LLM for understanding and text generation, and 3) a back-end module for user response and module invocation that performs visual control.

[Figure: overall framework of Vitron]

Front-end module: visual-language encoding

In order to perceive image and video modal signals and support fine-grained user visual input, Vitron integrates image encoders, video encoders, and region box/sketch encoders.
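As a rough illustration of this front end, the sketch below encodes images, pre-extracted video-frame features, and region boxes, then projects each modality into the LLM's embedding space. The encoder backbones, projector shapes, and class/parameter names here are illustrative assumptions, not Vitron's actual components.

```python
import torch
import torch.nn as nn

class VisionLanguageFrontEnd(nn.Module):
    """Toy multi-encoder front end: one encoder per visual modality plus a
    per-modality projector into the LLM token embedding space."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Placeholder backbones; a real system would plug in pretrained
        # image/video encoders (e.g., CLIP-style ViTs) instead.
        self.image_encoder = nn.Linear(3 * 224 * 224, vis_dim)
        self.video_encoder = nn.GRU(vis_dim, vis_dim, batch_first=True)
        self.region_encoder = nn.Linear(4, vis_dim)  # box = (x1, y1, x2, y2)
        self.proj = nn.ModuleDict({
            "image": nn.Linear(vis_dim, llm_dim),
            "video": nn.Linear(vis_dim, llm_dim),
            "region": nn.Linear(vis_dim, llm_dim),
        })

    def forward(self, images=None, frame_feats=None, boxes=None):
        tokens = []
        if images is not None:                       # images: (B, 3, 224, 224)
            tokens.append(self.proj["image"](self.image_encoder(images.flatten(1))))
        if frame_feats is not None:                  # frame_feats: (B, T, vis_dim)
            _, h = self.video_encoder(frame_feats)
            tokens.append(self.proj["video"](h[-1]))
        if boxes is not None:                        # boxes: (B, 4)
            tokens.append(self.proj["region"](self.region_encoder(boxes)))
        # Stack into (B, num_visual_tokens, llm_dim) to prepend to text embeddings.
        return torch.stack(tokens, dim=1)
```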

Central module: core LLM

Vitron uses Vicuna (7B, v1.5) to enable understanding, reasoning, decision-making, and multi-turn user interaction.

Back-end module: user response and module invocation

With text as the central calling medium, Vitron integrates several powerful off-the-shelf state-of-the-art (SoTA) image and video processing modules to decode and execute a series of visual terminal tasks from low level to high level. By adopting a text-centric module-integration and calling approach, Vitron not only achieves system unification but also ensures alignment efficiency and system scalability.
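The text-centric calling strategy can be pictured as a simple registry that maps the module name emitted by the LLM to an off-the-shelf tool and forwards the invocation command and optional region. The module names and function signatures below are hypothetical stand-ins for illustration, not Vitron's actual API.

```python
from typing import Callable, Dict, List, Optional


class BackendDispatcher:
    """Toy text-centric dispatcher: the LLM names a module in plain text,
    and the dispatcher routes the call to a registered back-end tool."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[..., object]] = {}

    def register(self, module_name: str, fn: Callable[..., object]) -> None:
        self._registry[module_name] = fn

    def dispatch(self, module_name: str, command: str,
                 region: Optional[List[int]] = None) -> object:
        if module_name not in self._registry:
            raise KeyError(f"no backend module registered as '{module_name}'")
        return self._registry[module_name](command=command, region=region)


# Usage: the registered callables would wrap real segmentation/generation/
# editing tools; here they are placeholders that just echo their inputs.
dispatcher = BackendDispatcher()
dispatcher.register("segmentation", lambda command, region: f"mask for box {region}")
dispatcher.register("image_generation", lambda command, region: f"image of: {command}")
print(dispatcher.dispatch("image_generation", command="a cat surfing at sunset"))
```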


Three stages of Vitron model training

Based on the above architecture, Vitron is trained and fine-tuned to give it powerful visual understanding and task execution capabilities. Model training mainly includes three different stages.

Step 1: Overall vision-language alignment learning. The input visual and language features are mapped into a unified feature space, enabling the system to effectively understand the incoming multimodal signals. This is coarse-grained vision-language alignment learning that lets the system process incoming visual signals effectively as a whole. The researchers trained on existing image-caption pairs (CC3M), video-caption pairs (WebVid), and region-caption pairs (RefCOCO).
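A minimal sketch of what one such coarse-grained alignment step could look like is shown below, assuming the common recipe of keeping the pretrained encoders and LLM frozen and training only the modality projectors against a caption-prediction loss, and assuming a HuggingFace-style causal LM interface. Whether Vitron freezes exactly these parts is not stated in this article.

```python
import torch


def alignment_step(front_end, llm, projector_optimizer, batch):
    """One coarse-grained alignment step: only the modality projectors in
    `front_end` are assumed to carry trainable parameters here."""
    visual_tokens = front_end(images=batch["images"])                  # (B, Nv, D)
    caption_embeds = llm.get_input_embeddings()(batch["caption_ids"])  # (B, Nt, D)
    inputs_embeds = torch.cat([visual_tokens, caption_embeds], dim=1)

    # Supervise only the caption positions; -100 masks the visual positions
    # out of the language-modeling loss.
    ignore = torch.full(visual_tokens.shape[:2], -100,
                        dtype=torch.long, device=batch["caption_ids"].device)
    labels = torch.cat([ignore, batch["caption_ids"]], dim=1)

    loss = llm(inputs_embeds=inputs_embeds, labels=labels).loss
    projector_optimizer.zero_grad()
    loss.backward()
    projector_optimizer.step()
    return loss.item()
```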

Step 2: Fine-grained spatiotemporal visual grounding instruction tuning. The system relies on external modules to perform the various pixel-level visual tasks, but the LLM itself has not undergone any fine-grained visual training, which prevents the system from achieving true pixel-level visual understanding. To this end, the researchers propose fine-grained spatiotemporal visual grounding instruction tuning, whose core idea is to enable the LLM to locate fine-grained spatial regions in images and specific temporal segments in videos.
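For illustration only, such grounding instruction data might look like the samples below; the tag conventions, file names, and field names are hypothetical, since the article does not spell out Vitron's actual data format.

```python
# Hypothetical spatial-grounding sample: the model must answer with pixel
# coordinates for a region in the image.
spatial_sample = {
    "image": "street_scene.jpg",
    "instruction": "Where is the red car in the image?",
    "response": "The red car is at <box>[120, 56, 340, 210]</box>.",
}

# Hypothetical temporal-grounding sample: the model must answer with a time
# span inside the video.
temporal_sample = {
    "video": "backyard_clip.mp4",
    "instruction": "During which seconds does the dog jump over the fence?",
    "response": "The jump happens between <time>3.2</time> and <time>4.1</time> seconds.",
}
```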

Step 3: Output-side instruction tuning for module invocation. The second stage of training gives the LLM and the front-end encoders pixel-level visual understanding. This final step, instruction tuning for module invocation, aims to equip the system with the ability to execute commands precisely, allowing the LLM to generate appropriate and correct invocation text. Since different terminal vision tasks may require different invocation commands, the researchers propose standardizing the LLM's response output into a structured text format (a minimal parsing sketch follows the list below), which includes:

1) User response output, which directly replies to the user's input.

2) Module name, indicating the function or task to be performed.

3) Invocation command, the meta-instruction that triggers the task module.

4) Region (optional output), which specifies the fine-grained visual features required by certain tasks, such as video tracking or visual editing, where the back-end modules need this information. For regions, the LLM outputs coordinate-based bounding boxes based on its pixel-level understanding.
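The sketch below shows one way such a four-field structured response could be parsed before dispatching to the back end. The tag-based serialization is an assumption made here for illustration; the article does not specify Vitron's exact output syntax.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StructuredResponse:
    user_reply: str               # 1) text shown directly to the user
    module: str                   # 2) back-end module to invoke
    command: str                  # 3) meta-instruction for that module
    region: Optional[List[int]]   # 4) optional bounding box [x1, y1, x2, y2]


def parse_response(text: str) -> StructuredResponse:
    def field(tag: str) -> Optional[str]:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.S)
        return m.group(1).strip() if m else None

    region_str = field("region")
    region = [int(v) for v in region_str.split(",")] if region_str else None
    return StructuredResponse(
        user_reply=field("reply") or "",
        module=field("module") or "",
        command=field("command") or "",
        region=region,
    )


# Example: a hypothetical LLM output asking the back end to track an object.
demo = ("<reply>Sure, tracking the dog for you.</reply>"
        "<module>video_tracking</module>"
        "<command>track the selected object across frames</command>"
        "<region>45, 60, 210, 330</region>")
print(parse_response(demo))
```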


Evaluation experiments

The researchers conducted extensive experimental evaluations of Vitron on 22 common benchmark datasets covering 12 image/video vision tasks. Vitron demonstrates strong capabilities across the four major visual task groups (segmentation, understanding, content generation, and editing), while also offering flexible human-computer interaction. Some representative qualitative comparison results are shown below:

Vision Segmentation


Results of referring image segmentation.

Fine-grained Vision Understanding


Results of referring expression comprehension.


Results on video QA.

Vision Generation


Text-to-image generation / text-to-video generation / image-to-video generation

Vision Editing


Image editing results

Please refer to the paper for more detailed experimental content and details.

Future Direction Outlook

Overall, this work demonstrates the great potential of developing a unified visual multimodal general large model, laying the groundwork for the ultimate form of next-generation visual large models and taking a first step in that direction. Although the Vitron system proposed by the team shows strong general capabilities, it still has limitations. Below, the researchers list several directions that could be further explored in the future.

System architecture

The Vitron system still uses a semi-joint, semi-agent approach to call external tools. Although this call-based method facilitates the expansion and replacement of potential modules, it also means that the back-end modules of this pipeline structure do not participate in the joint learning of the front-end and LLM core modules.

This limitation is not conducive to end-to-end learning of the whole system, and it means that the performance ceiling on different visual tasks is bounded by the back-end modules. Future work should integrate the various vision task modules into a single unified unit. Achieving unified understanding and output of images and videos, while supporting generation and editing through a single generative paradigm, remains a challenge. A promising approach at present is to combine modularization with tokenization to improve the system's unification across different inputs, outputs, and tasks.

User interactivity

Compared with previous models that focus on a single vision task (e.g., Stable Diffusion and SEEM), Vitron aims to promote in-depth interaction between the LLM and users, similar to OpenAI's DALL-E series, Midjourney, and other industry systems. Achieving optimal user interactivity is one of the core goals of this work.

Vitron leverages an existing language-based LLM, combined with appropriate instruction tuning, to achieve a certain level of interactivity. For example, the system can respond flexibly to whatever message a user inputs and produce the corresponding visual operation results, without requiring the user's input to exactly match the conditions expected by the back-end modules. However, there is still much room to improve interactivity. For example, drawing inspiration from the closed-source Midjourney system, no matter what decision the LLM makes at each step, the system should actively provide feedback to users to ensure that its actions and decisions remain consistent with user intent.

Modal capabilities

Currently, Vitron integrates a 7B Vicuna model, which may impose certain limits on its ability to understand language, images, and videos. A future direction is to develop a comprehensive end-to-end system, for instance by scaling up the model to achieve a more thorough and comprehensive understanding of vision. In addition, efforts should be made to enable the LLM to fully unify its understanding of the image and video modalities.


Statement: This article is reproduced from 51cto.com.