Microsoft's multimodal ChatGPT is coming? 1.6 billion parameters to handle tasks such as visual question answering and IQ tests
In the field of NLP, large language models (LLMs) have successfully served as a general-purpose interface for a variety of natural language tasks. As long as the input and output can be converted into text, the LLM-based interface can be adapted to a task. For example, summarization takes a document as input and outputs a summary, so we can feed the input document to the language model and generate the summary.
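To make this text-in/text-out pattern concrete, here is a minimal sketch of wrapping summarization as plain-text prompting of a generic language model; the checkpoint and prompt wording are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of the "tasks as text in, text out" interface described above.
# The checkpoint and prompt are placeholders; a stronger instruction-tuned LLM
# would be a more realistic choice in practice.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

document = "Large language models can act as general-purpose interfaces once a task is cast as text."
prompt = f"Summarize the following document in one sentence.\n\nDocument: {document}\n\nSummary:"
result = generator(prompt, max_new_tokens=40)[0]["generated_text"]
print(result)
```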
Despite the successful application of LLMs in NLP tasks, researchers still struggle to use them natively for multimodal data such as images and audio. As a fundamental component of intelligence, multimodal perception is a necessary condition for achieving artificial general intelligence, both for knowledge acquisition and for dealing with the real world. More importantly, unlocking multimodal input can greatly expand the application of language models in high-value fields such as multimodal machine learning, document intelligence, and robotics.
Therefore, in the paper "Language Is Not All You Need: Aligning Perception with Language Models", the Microsoft team introduced KOSMOS-1, a multimodal large language model (MLLM) that can perceive general modalities, follow instructions (i.e., zero-shot learning), and learn in context (i.e., few-shot learning). The research goal is to align perception with LLMs so that the model can see and talk. The researchers trained KOSMOS-1 from scratch following the approach of MetaLM (see the paper "Language Models are General-Purpose Interfaces").
As shown in Figure 1 below, the researchers use a Transformer-based language model as the general interface and connect perception modules to it. They trained the model on a web-scale multimodal corpus that includes text data, arbitrarily interleaved images and text, and image-caption pairs. In addition, the researchers calibrated the model's cross-modal instruction-following ability by transferring language-only data.
Finally, the KOSMOS-1 model natively supports language, perception-language, and vision tasks in zero-shot and few-shot learning settings, as shown in Table 1 below.
The researchers show some generated examples in Figures 2 and 3 below. In addition to various natural language tasks, the KOSMOS-1 model natively handles a wide range of perception-intensive tasks, such as visual dialogue, visual explanation, visual question answering, image captioning, simple math equations, OCR, and zero-shot image classification with descriptions. They also built an IQ test benchmark based on Raven's Progressive Matrices (RPM) to assess the MLLM's non-verbal reasoning ability.
These examples demonstrate that native support for multimodal perception opens up new opportunities to apply LLMs to new tasks. In addition, compared with LLMs, the MLLM achieves better commonsense reasoning performance, indicating that cross-modal transfer facilitates knowledge acquisition.
Since the KOSMOS-1 model has only 1.6 billion parameters, some netizens expressed the hope of running this multimodal large model on their own computers.
As shown in Figure 1, KOSMOS-1 is a multimodal language model that can perceive general modalities, follow instructions, and learn in context to generate output. Specifically, the backbone of KOSMOS-1 is a Transformer-based causal language model. Besides text, other modalities can also be embedded and fed into the model; as the figure shows, in addition to language there are embeddings for vision, speech, and so on. The Transformer decoder serves as a general interface for multimodal inputs. Once trained, KOSMOS-1 can be evaluated on both language tasks and multimodal tasks in zero-shot and few-shot settings.
The Transformer decoder perceives modalities in a unified way: the input is flattened into a sequence decorated with special tokens. For example, <s> marks the beginning of a sequence and </s> marks the end, while special tokens such as <image> and </image> mark the beginning and end of encoded image embeddings.
The embedding module encodes text tokens and other input modalities into vector representations. For input tokens, the study uses a lookup table to map them into embeddings. For continuous-signal modalities (e.g., images and audio), the inputs can also be represented as discrete codes and then treated like text tokens.
The resulting sequence of input embeddings is then fed to the Transformer-based decoder, and the causal model processes the sequence autoregressively to predict the next token. In summary, the MLLM framework can flexibly handle various data types, as long as the inputs can be represented as vectors.
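A minimal sketch of this general-interface pattern is shown below. It is not the released KOSMOS-1 code: the dimensions are illustrative, and the image embeddings are simply concatenated after the text instead of being interleaved and wrapped with special tokens. The point is only that text tokens go through an embedding lookup, other modalities are projected into the same space, and a causal Transformer decoder predicts the next token over the flattened sequence.

```python
# Toy sketch (not the released KOSMOS-1 code) of a decoder-only Transformer
# used as a "general interface" over a flattened multimodal sequence.
import torch
import torch.nn as nn

class ToyMultimodalDecoder(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=4, img_feat_dim=1024):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)      # lookup table for text tokens
        self.img_proj = nn.Linear(img_feat_dim, d_model)        # map image features into the same space
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)   # decoder-only = encoder stack + causal mask
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_feats):
        # Flatten modalities into one sequence: [text embeddings] + [projected image embeddings]
        x = torch.cat([self.tok_embed(text_ids), self.img_proj(image_feats)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, mask=causal)                         # autoregressive (causal) processing
        return self.lm_head(h)                                   # next-token logits

model = ToyMultimodalDecoder()
logits = model(torch.randint(0, 32000, (1, 16)), torch.randn(1, 4, 1024))
print(logits.shape)  # torch.Size([1, 20, 32000])
```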
First, the training datasets. They include text corpora, image-caption pairs, and interleaved image-text data. Specifically, the text corpora include The Pile and Common Crawl (CC); the image-caption pairs include English LAION-2B, LAION-400M, COYO-700M, and Conceptual Captions; and the interleaved image-text multimodal data come from Common Crawl snapshots.
With the data in place, next come the training settings. The MLLM component has 24 layers, a hidden dimension of 2048, an FFN dimension of 8192, 32 attention heads, and about 1.3B parameters. To help the model converge, image representations are obtained from a pre-trained CLIP ViT-L/14 model with 1024 feature dimensions, and images are preprocessed to 224 × 224 resolution during training. In addition, all CLIP parameters except those of the last layer are frozen during training. The total number of parameters of KOSMOS-1 is approximately 1.6B.
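As a rough sketch of how this image path could be wired up with Hugging Face Transformers (this is assumed wiring, not the official training code, and treating the final ViT block as the "last layer" is an interpretation):

```python
# Sketch of the image-embedding path described above: a pre-trained CLIP ViT-L/14
# vision tower, images preprocessed to 224x224, all parameters frozen except
# those of the final transformer block.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")        # 1024-d hidden size
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")  # 224x224 preprocessing

for name, p in vision.named_parameters():
    # Keep only the last of the 24 ViT-L/14 encoder blocks trainable (assumption).
    p.requires_grad = "encoder.layers.23" in name

pixels = processor(images=Image.new("RGB", (640, 480)), return_tensors="pt")["pixel_values"]
with torch.no_grad():
    feats = vision(pixel_values=pixels).last_hidden_state  # (1, 257, 1024): CLS + 16x16 patches
print(feats.shape)
```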
The study conducted a rich series of experiments to evaluate KOSMOS-1: language tasks (language understanding, language generation, OCR-free text classification); cross-modal transfer (commonsense reasoning); non-verbal reasoning (IQ test); perception-language tasks (image captioning, visual question answering, web question answering); and vision tasks (zero-shot image classification, zero-shot image classification with descriptions).
Image captioning. The following table shows the zero-shot performance of different models on COCO and Flickr30k. Compared with the other models, KOSMOS-1 achieves remarkable results, and it performs well even though it has far fewer parameters than Flamingo.
The following table shows the few-shot performance comparison:
Visual question answering. KOSMOS-1 achieves higher accuracy and robustness than the Flamingo-3B and Flamingo-9B models:
The following table shows the few-shot performance comparison:
IQ test. Raven's Progressive Matrices is one of the most common tests for assessing nonverbal reasoning. Figure 4 shows an example.
Table 6 shows the evaluation results on the IQ test dataset. KOSMOS-1 is able to perceive abstract conceptual patterns in a nonverbal context and then reason out the following element among multiple candidates. To our knowledge, this is the first time a model has performed such a zero-shot Raven IQ test.
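The exact zero-shot scoring protocol is described in the paper; purely as a hypothetical illustration of how such a multiple-choice matrix can be turned into a likelihood comparison with a causal multimodal LM (the prompt wording, the `model` call, and the "Yes" scoring rule below are assumptions), one could score each candidate and take the argmax:

```python
# Illustrative only: score each candidate completion by the model's probability
# of an affirmative next token, then pick the highest-scoring candidate.
import torch

@torch.no_grad()
def pick_candidate(model, context_images, candidate_images, yes_token_id):
    """Hypothetical helper: `model(images)` is assumed to return next-token
    logits of shape (seq_len, vocab_size) for the flattened image sequence."""
    scores = []
    for cand in candidate_images:
        logits = model(context_images + [cand])    # complete the matrix with this candidate
        probs = torch.softmax(logits[-1], dim=-1)  # distribution over the next token
        scores.append(probs[yes_token_id].item())
    return int(torch.tensor(scores).argmax())
```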
Web question answering. Web QA aims to find answers to questions from web pages, which requires the model to understand both the semantics and the structure of the text. The results are as follows:
Multimodal chain-of-thought prompting. Inspired by chain-of-thought prompting, the paper runs an experiment along these lines. As shown in Figure 5, the perception-language task is decomposed into two steps: in the first stage, given an image, a prompt guides the model to generate a rationale; the rationale is then fed to the model together with a task-aware prompt to produce the final result.
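A minimal sketch of this two-stage prompting, assuming a hypothetical `generate(image, prompt)` wrapper around a multimodal LM (the prompt strings are illustrative, not the paper's exact wording):

```python
# Two-stage multimodal chain-of-thought: first elicit a rationale from the
# image, then answer the question conditioned on that rationale.
def chain_of_thought_answer(generate, image, question):
    # Stage 1: prompt the model to describe/reason about the image (the rationale).
    rationale = generate(image, "Describe this picture in detail:")
    # Stage 2: feed the rationale plus a task-aware prompt to get the final answer.
    return generate(image, f"{rationale}\nQuestion: {question}\nAnswer:")
```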
As can be seen from Table 9, multimodal chain-of-thought prompting scores 72.9 points, 5.8 points higher than the standard prompt. For more experimental details, please refer to the original paper.