Graphical language: Kuaishou and Peking University's multi-modal large model is comparable to DALL-E 3
Current large language models such as GPT and LLaMA have made significant progress in natural language processing and can understand and generate complex text. But can these powerful understanding and generation capabilities be extended to multimodal data? This idea is gradually becoming a reality. LaVIT, the latest multi-modal large model, was developed by Kuaishou and Peking University. By combining image and video data, it enables the model to understand massive amounts of multimedia content and assists in the creation of illustrated content. LaVIT matters for both understanding and creating multimedia content: it not only identifies objects, scenes and emotions in images and videos, but also generates natural-language descriptions related to them. In this way, multi-modal data can be used more effectively to create more vivid and interesting illustrated content. LaVIT is an important step for large language models into the multi-modal field and is expected to bring more possibilities to the processing and creation of multimedia content, promoting further development in natural language processing and computer vision.
LaVIT is a new general-purpose multimodal foundation model that can understand and generate visual content in the same way a language model handles text. It adopts a training approach similar to that of large language models, autoregressively predicting the next image or text token. Once trained, LaVIT can serve as a general multimodal interface that performs multimodal understanding and generation tasks without further fine-tuning. For example, LaVIT can do the following:
LaVIT is a powerful text-to-image generation model that can produce high-quality, highly aesthetic images in multiple aspect ratios from a given text prompt. Its image generation capability is comparable to state-of-the-art models such as Parti, SDXL and DALL-E 3. What makes it distinctive is its ability to generate diverse images while maintaining high quality and aesthetics; whether in portrait or landscape orientation, LaVIT produces satisfying compositions. By combining advanced techniques with high-quality training data, LaVIT offers users an outstanding text-to-image generation experience.
In LaVIT, both images and text are represented as discretized tokens. It can therefore leverage multi-modal prompts for image generation, including text-only, image + text, and image + image combinations. This multi-modal generation requires no fine-tuning: the system generates the corresponding images directly from the prompts.
LaVIT can also read images and understand their semantics, generating relevant descriptions for an input image and answering questions about it.
The model structure of LaVIT is shown in the figure below. The entire optimization process consists of two stages:
Figure: The overall architecture of the LaVIT model
Phase 1: Dynamic Visual Tokenizer
To understand and generate visual content like natural language, LaVIT introduces a carefully designed visual tokenizer that converts visual content (continuous signals) into a text-like sequence of tokens, just like a foreign language that the LLM can read. The authors argue that to achieve unified visual and language modeling, the visual tokenizer should have two key characteristics.
The figure below shows the structure of the visual tokenizer proposed in LaVIT:
Figure: (a) Dynamic visual tokenizer (b) Token combiner
The dynamic visual tokenizer consists of a token selector and a token combiner. As shown in the figure, the token selector chooses the most informative image patches, while the token combiner compresses the information of the uninformative patches into the retained tokens, merging away the redundant ones. The entire dynamic visual tokenizer is trained by maximizing the semantic reconstruction of the input image.
Token selector
The token selector takes the N patch-level features of an image as input. Its goal is to evaluate the importance of each image patch and select the most informative ones to fully represent the semantics of the whole image. To do this, a lightweight module consisting of several MLP layers predicts a distribution π; sampling from π produces a binary decision mask that indicates whether the corresponding image patch is kept.
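The selection step can be pictured with a short PyTorch sketch. The module layout, hidden width, and the Gumbel-Softmax straight-through relaxation below are assumptions for illustration only; the text above just specifies a small MLP that predicts the distribution π and a binary mask sampled from it:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSelector(nn.Module):
    """Scores each image patch and samples a binary keep/drop mask (illustrative sketch)."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        # Lightweight MLP that predicts the keep/drop distribution pi for every patch.
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 2),   # logits over (drop, keep)
        )

    def forward(self, patches: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # patches: (B, N, dim) patch-level features -> mask: (B, N) with 1 = keep.
        logits = self.mlp(patches)
        # Gumbel-Softmax with the straight-through trick keeps sampling differentiable;
        # this particular relaxation is an assumption, not stated in the article.
        mask = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1]
        return mask
```

Summing the mask gives a different number of retained tokens per image, which is what makes the tokenizer "dynamic".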
Token combiner
The token combiner divides the N image patches into two groups according to the generated decision mask: retained patches X_r and discarded patches X_d. Rather than simply dropping X_d, the token combiner preserves the detailed semantics of the input image as far as possible. It consists of L stacked blocks, each containing a causal self-attention layer, a cross-attention layer and a feed-forward layer. In the causal self-attention layer, each token in X_r attends only to its preceding tokens, to stay consistent with the form of text tokens in the LLM; this strategy performs better than bidirectional self-attention. The cross-attention layer takes the retained tokens X_r as queries and merges the tokens in X_d into them based on semantic similarity.
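One such block could look like the following PyTorch sketch; the pre-norm arrangement, head count and feed-forward width are assumed, since the text only names the three sub-layers and their roles:

```python
import torch
import torch.nn as nn

class TokenCombinerBlock(nn.Module):
    """One of the L stacked blocks in the token combiner (illustrative sketch)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x_r: torch.Tensor, x_d: torch.Tensor) -> torch.Tensor:
        # Causal self-attention: each retained token attends only to earlier ones,
        # matching the left-to-right form of text tokens in the LLM.
        n = x_r.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x_r.device), diagonal=1)
        h = self.norm1(x_r)
        x_r = x_r + self.self_attn(h, h, h, attn_mask=causal)[0]
        # Cross-attention: retained tokens (queries) absorb the discarded tokens X_d
        # (keys/values) weighted by semantic similarity, so their detail is not lost.
        x_r = x_r + self.cross_attn(self.norm2(x_r), x_d, x_d)[0]
        # Position-wise feed-forward layer.
        return x_r + self.ffn(self.norm3(x_r))
```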
Phase 2: Unified generative pre-training
The visual tokens produced by the visual tokenizer are concatenated with text tokens to form multi-modal sequences that serve as training input. To distinguish the two modalities, the authors insert special tokens at the beginning and end of each image token sequence, [IMG] and [/IMG], which mark the start and end of visual content. To be able to generate both text and images, LaVIT uses two image-text orderings: [image, text] and [text, image].
For these multi-modal input sequences, LaVIT uses a unified autoregressive objective, directly maximizing the likelihood of each multi-modal sequence during pre-training. This complete unification of representation space and training method helps the LLM better learn multi-modal interaction and alignment. After pre-training, LaVIT can perceive images and understand and generate them just like text.
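A minimal sketch of how such a training sequence and objective might be assembled is shown below. The special-token ids, the `model` call signature and the helper names are hypothetical; only the [IMG]/[/IMG] wrapping, the two orderings and the next-token likelihood come from the description above:

```python
import torch
import torch.nn.functional as F

# Hypothetical ids for the special markers; the real vocabulary layout is not given.
IMG, IMG_END = 32001, 32002

def build_sequence(visual_tokens: list[int], text_tokens: list[int], image_first: bool = True) -> list[int]:
    """Wrap visual tokens in [IMG] ... [/IMG] and concatenate them with the text tokens."""
    image_part = [IMG] + visual_tokens + [IMG_END]
    # Both [image, text] and [text, image] orderings appear during pre-training.
    return image_part + text_tokens if image_first else text_tokens + image_part

def autoregressive_loss(model, ids: torch.Tensor) -> torch.Tensor:
    """Unified next-token objective over a mixed image/text sequence of shape (B, T)."""
    logits = model(ids[:, :-1])                       # assumed to return (B, T-1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten to (B*(T-1), vocab)
        ids[:, 1:].reshape(-1),                       # shifted targets: predict token t from tokens < t
    )
```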
Zero-shot multimodal understanding
LaVIT achieves leading performance on zero-shot multi-modal understanding tasks such as image captioning (NoCaps, Flickr30k) and visual question answering (VQAv2, OKVQA, GQA, VizWiz).
Table 1: Zero-shot evaluation on multi-modal understanding tasks
Zero-shot multimodal generation
Because the proposed visual tokenizer represents images as discretized tokens, LaVIT can synthesize images by autoregressively generating visual tokens just as it generates text. The authors quantitatively evaluated the model's zero-shot text-conditioned image synthesis; the comparison is shown in Table 2.
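Conceptually, the synthesis loop looks like the sketch below, where `model`, `text_tokenizer` and `visual_decoder` are placeholder names rather than the released API, and the special-token ids are the same hypothetical ones used in the pre-training sketch above:

```python
import torch

# Reusing the hypothetical special-token ids from the earlier sketch.
IMG, IMG_END = 32001, 32002

@torch.no_grad()
def text_to_image(model, text_tokenizer, visual_decoder, prompt: str, max_visual_tokens: int = 256):
    """Hedged sketch of zero-shot text-to-image synthesis with a LaVIT-style model."""
    ids = list(text_tokenizer(prompt)) + [IMG]        # condition on the text, then open the image span
    visual_tokens = []
    for _ in range(max_visual_tokens):
        logits = model(torch.tensor([ids]))[0, -1]    # distribution over the next token
        nxt = int(torch.argmax(logits))               # greedy here; sampling also works
        if nxt == IMG_END:                            # the model decides when the image is complete
            break
        visual_tokens.append(nxt)
        ids.append(nxt)
    return visual_decoder(visual_tokens)              # map discrete visual tokens back to pixels
```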
Table 2: Zero-shot text-to-image generation performance of different models
As the table shows, LaVIT outperforms all other multi-modal language models. Compared with Emu, LaVIT achieves further improvements with a smaller LLM, demonstrating excellent visual-language alignment. Furthermore, LaVIT matches the performance of the state-of-the-art text-to-image specialist Parti while using less training data.
Multi-modal prompt image generation
LaVIT seamlessly accepts combinations of multiple modalities as prompts and generates corresponding images without any fine-tuning. The generated images accurately reflect the style and semantics of the given multi-modal prompt, and LaVIT can modify the original input image according to the multi-modal prompts provided. Traditional image generation models such as Stable Diffusion cannot achieve this without additional fine-tuning on downstream data.
Example of multi-modal image generation results
Qualitative analysis
As shown in the figure below, LaVIT's dynamic tokenizer selects the most informative image patches based on the image content, and the learned codebook produces visual codes with high-level semantics.
Visualization of dynamic visual tokenizer (left) and learned codebook (right)
The emergence of LaVIT provides an innovative paradigm for handling multi-modal tasks. By using a dynamic visual tokenizer to represent vision and language in a unified discrete token space, it inherits the successful autoregressive generative learning paradigm of LLMs. Optimized under a unified generation objective, LaVIT can treat images as a foreign language, understanding and generating them like text. The success of this approach offers new inspiration for future multi-modal research, using the powerful reasoning capabilities of LLMs to open new possibilities for smarter, more comprehensive multi-modal understanding and generation.