


In NLP, large language models (LLMs) have successfully served as a general-purpose interface for a wide range of natural language tasks. As long as a task's input and output can be expressed as text, an LLM-based interface can be adapted to it. For example, summarization takes documents as input and produces a summary as output, so we can feed documents into a summarization language model and have it generate the summary.
Despite the success of LLMs on NLP tasks, researchers still struggle to apply them natively to multimodal data such as images and audio. As a fundamental component of intelligence, multimodal perception is a necessary condition for achieving artificial general intelligence, both for acquiring knowledge and for dealing with the real world. More importantly, unlocking multimodal input can greatly expand the application of language models in high-value fields such as multimodal machine learning, document intelligence, and robotics.
Therefore, in the paper "Language Is Not All You Need: Aligning Perception with Language Models", a Microsoft team introduced KOSMOS-1, a multimodal large language model (MLLM) that can perceive general modalities, follow instructions (i.e., zero-shot learning), and learn in context (i.e., few-shot learning). The research goal is to align perception with LLMs, so that the model can see and talk. The researchers trained KOSMOS-1 from scratch following the approach of MetaLM (see the paper "Language Models are General-Purpose Interfaces").
- Paper address: https://arxiv.org/pdf/2302.14045.pdf
- Project address: https://github.com/microsoft/unilm
As shown in Figure 1 below, the researchers use a Transformer-based language model as the general-purpose interface and connect perception modules to it. They trained the model on a web-scale multimodal corpus that includes text data, arbitrarily interleaved images and text, and image-caption pairs. In addition, the researchers calibrated the model's instruction-following ability across modalities by transferring language-only instruction data.
As a result, the KOSMOS-1 model natively supports language, perception-language, and vision tasks in both zero-shot and few-shot settings, as shown in Table 1 below.
The researchers show some generated examples in Figures 2 and 3 below. In addition to various natural language tasks, the KOSMOS-1 model natively handles a wide range of perception-intensive tasks, such as visual dialogue, visual explanation, visual question answering, image captioning, simple mathematical equations, OCR, and zero-shot image classification with descriptions. They also established an IQ-test benchmark based on Raven's Progressive Matrices (RPM) to assess the nonverbal reasoning ability of MLLMs.
These examples demonstrate that native support for multimodal perception opens up new opportunities to apply LLMs to new tasks. Moreover, compared with an LLM, the MLLM achieves better commonsense reasoning performance, indicating that cross-modal transfer facilitates knowledge acquisition.
Since the KOSMOS-1 model has only 1.6 billion parameters, some netizens expressed hope that they could run this multimodal model on their own computers.
KOSMOS-1: A multimodal large language model
As shown in Figure 1, KOSMOS-1 is a multimodal language model that can perceive general modalities, follow instructions, and learn in context and generate output. Specifically, the backbone of KOSMOS-1 is a Transformer-based causal language model. In addition to text, other modalities can be embedded and fed into the model; as the figure shows, besides language there are embeddings for vision, speech, and so on. The Transformer decoder serves as a general-purpose interface for multimodal input. Once trained, KOSMOS-1 can be evaluated on both language tasks and multimodal tasks in zero-shot and few-shot settings.
The Transformer decoder perceives all modalities in a unified way: the input information is flattened into a sequence decorated with special tokens. For example, `<s>` marks the beginning of the sequence and `</s>` marks its end, while special tokens such as `<image>` and `</image>` delimit embedded image content within the sequence.
The embedding module encodes text tokens and other input modalities into vector representations. For input text tokens, the study uses a lookup table to map them into embeddings. For continuous-signal modalities (e.g., images and audio), the inputs can also be represented as discrete codes and then treated like text tokens.
The resulting input embedding sequence is then fed into the Transformer-based decoder, and the causal model processes the sequence autoregressively to produce the next token. In summary, the MLLM framework can flexibly handle various data types, as long as the inputs can be represented as vectors.
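To make this flattening concrete, here is a minimal PyTorch sketch of how interleaved text and image segments might be turned into one embedding sequence for the causal decoder. The special-token ids, dimensions, and module names are illustrative assumptions, not KOSMOS-1's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical special-token ids (not the actual KOSMOS-1 vocabulary).
BOS, EOS, IMG_BEG, IMG_END = 0, 1, 2, 3

class MultimodalEmbedder(nn.Module):
    """Flattens interleaved text/image segments into one embedding sequence."""

    def __init__(self, vocab_size=64000, d_model=2048, d_image=1024):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # lookup table for text tokens
        self.img_proj = nn.Linear(d_image, d_model)         # project image features to d_model

    def forward(self, segments):
        """segments: list of ("text", LongTensor[ids]) or ("image", FloatTensor[n, d_image])."""
        parts = [self.tok_embed(torch.tensor([BOS]))]
        for kind, data in segments:
            if kind == "text":
                parts.append(self.tok_embed(data))
            else:  # wrap image embeddings with <image> ... </image> tokens
                parts.append(self.tok_embed(torch.tensor([IMG_BEG])))
                parts.append(self.img_proj(data))
                parts.append(self.tok_embed(torch.tensor([IMG_END])))
        parts.append(self.tok_embed(torch.tensor([EOS])))
        return torch.cat(parts, dim=0)  # [seq_len, d_model], ready for the causal decoder
```

In KOSMOS-1 itself, the image features come from a frozen CLIP encoder (described under "Model training" below), and the flattened sequence is consumed by the decoder exactly like a text-only sequence.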
Model training
First, the training datasets. These include text corpora, image-caption pairs, and interleaved image-text data. Specifically, the text corpora are The Pile and Common Crawl (CC); the image-caption pairs come from English LAION-2B, LAION-400M, COYO-700M, and Conceptual Captions; and the interleaved image-text multimodal data is drawn from Common Crawl snapshots.
With the data in place, next come the training settings. The MLLM component has 24 layers, a hidden dimension of 2048, an FFN dimension of 8192, 32 attention heads, and about 1.3B parameters. To help the model converge, image representations are obtained from a pre-trained CLIP ViT-L/14 model with a feature dimension of 1024. Images are resized to 224 × 224 during training, and all CLIP parameters except those of the last layer are frozen. The total parameter count of KOSMOS-1 is approximately 1.6B.
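A minimal sketch of that freezing step, assuming the open_clip library and reading "the last layer" as the final transformer block (both are assumptions; the article does not specify the implementation):

```python
import open_clip  # assumes the open_clip package; not specified in the article

# Load a pre-trained CLIP ViT-L/14 vision tower (feature dimension 1024).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
vision = model.visual

# Freeze everything, then unfreeze only the last transformer block
# (a hypothetical reading of "all parameters except the last layer").
for p in vision.parameters():
    p.requires_grad = False
for p in vision.transformer.resblocks[-1].parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in vision.parameters() if p.requires_grad)
print(f"trainable vision parameters: {trainable / 1e6:.1f}M")
```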
Experimental results
The study conducted a rich series of experiments to evaluate KOSMOS-1: language tasks (language understanding, language generation, OCR-free text classification); cross-modal transfer (commonsense reasoning); nonverbal reasoning (IQ test); perception-language tasks (image captioning, visual question answering, web question answering); and vision tasks (zero-shot image classification, zero-shot image classification with descriptions).
Image captioning. The table below shows the zero-shot performance of different models on COCO and Flickr30k. KOSMOS-1 achieves significant results compared with other models, and it performs well even though its parameter count is much smaller than Flamingo's.
The following table shows the few-shot performance comparison:
Visual question answering. KOSMOS-1 achieves higher accuracy and robustness than the Flamingo-3B and Flamingo-9B models:
The following table shows the few-shot performance comparison:
IQ test. Raven's Progressive Matrices is one of the most common tests for assessing nonverbal reasoning. Figure 4 shows an example.
Table 6 shows the evaluation results on the IQ-test dataset. KOSMOS-1 is able to perceive abstract conceptual patterns in a nonverbal context and then infer the next element among multiple choices. To the authors' knowledge, this is the first time a model has performed such a zero-shot Raven IQ test.
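The article does not describe how the multiple choices are scored; one plausible approach, sketched below with an entirely hypothetical model.score interface, is to complete the matrix with each candidate and pick the completion the model finds most likely:

```python
def raven_iq_predict(model, matrix_images, candidates):
    """Pick the candidate that best completes a Raven matrix.

    `model.score` is a hypothetical API returning the model's likelihood
    that the flattened image sequence forms a valid pattern.
    """
    scores = []
    for candidate in candidates:
        completed = matrix_images + [candidate]  # fill in the missing cell
        scores.append(model.score(images=completed))
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return best  # index of the predicted answer
```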
Web question answering. Web QA aims to find answers to questions from web pages, which requires the model to understand both the semantics and the structure of text. The results are as follows:
Multimodal chain-of-thought prompting. Inspired by chain-of-thought prompting, the study ran an experiment along these lines. As shown in Figure 5, it decomposes the perception-language task into two steps: in the first stage, given an image, a prompt guides the model to generate a rationale; in the second stage, the rationale is combined with a task-aware prompt to produce the final result. A sketch of this two-stage procedure follows below.
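A minimal sketch of the two stages, assuming a hypothetical model.generate(image=..., prompt=...) interface (KOSMOS-1 does not expose a public API in this form):

```python
def multimodal_chain_of_thought(model, image, question):
    """Two-stage multimodal chain-of-thought prompting.

    Stage 1 elicits a rationale from the image; stage 2 conditions on that
    rationale to produce the final answer. The prompt wording and the
    `generate` interface are illustrative assumptions.
    """
    # Stage 1: guide the model to describe/reason about the image.
    rationale = model.generate(image=image, prompt="Introduce this picture in detail:")

    # Stage 2: combine the rationale with the task prompt for the answer.
    answer = model.generate(image=image, prompt=f"{rationale} {question}")
    return answer
```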