Microsoft's multimodal ChatGPT is coming? 1.6 billion parameters to handle tasks such as visual question answering and IQ tests
In the field of NLP, large language models (LLMs) have successfully served as a general-purpose interface for a variety of natural language tasks. As long as the input and output can be converted into text, the LLM-based interface can be adapted to a task. For example, summarization takes a document as input and outputs a summary, so we can feed the input document to the language model and generate the summary.
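To make this text-in/text-out pattern concrete, here is a minimal sketch of wrapping summarization as plain-text prompting of a generic language model; the checkpoint and prompt wording are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of the "tasks as text in, text out" interface described above.
# The checkpoint and prompt are placeholders; a stronger instruction-tuned LLM
# would be a more realistic choice in practice.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

document = "Large language models can act as general-purpose interfaces once a task is cast as text."
prompt = f"Summarize the following document in one sentence.\n\nDocument: {document}\n\nSummary:"
result = generator(prompt, max_new_tokens=40)[0]["generated_text"]
print(result)
```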
Despite the successful application of LLMs in NLP tasks, researchers still struggle to use them natively for multimodal data such as images and audio. As a fundamental component of intelligence, multimodal perception is a necessary condition for achieving artificial general intelligence, both for knowledge acquisition and for dealing with the real world. More importantly, unlocking multimodal input can greatly expand the application of language models in high-value fields such as multimodal machine learning, document intelligence, and robotics.
Therefore, in the paper "Language Is Not All You Need: Aligning Perception with Language Models", the Microsoft team introduced KOSMOS-1, a multimodal large language model (MLLM) that can perceive general modalities, follow instructions (i.e., zero-shot learning), and learn in context (i.e., few-shot learning). The research goal is to align perception with LLMs so that the model can see and talk. The researchers trained KOSMOS-1 from scratch following the approach of MetaLM (see the paper "Language Models are General-Purpose Interfaces").
As shown in Figure 1 below, the researchers use a Transformer-based language model as the general interface and connect perception modules to it. They trained the model on a web-scale multimodal corpus that includes text data, arbitrarily interleaved images and text, and image-caption pairs. In addition, the researchers calibrated the model's cross-modal instruction-following ability by transferring language-only data.
Finally, the KOSMOS-1 model natively supports language, perception-language, and vision tasks in zero-shot and few-shot learning settings, as shown in Table 1 below.
The researchers show some generated examples in Figures 2 and 3 below. In addition to various natural language tasks, the KOSMOS-1 model natively handles a wide range of perception-intensive tasks, such as visual dialogue, visual explanation, visual question answering, image captioning, simple math equations, OCR, and zero-shot image classification with descriptions. They also built an IQ test benchmark based on Raven's Progressive Matrices (RPM) to assess the MLLM's non-verbal reasoning ability.
These examples demonstrate that native support for multimodal perception opens up new opportunities to apply LLMs to new tasks. In addition, compared with LLMs, the MLLM achieves better commonsense reasoning performance, indicating that cross-modal transfer facilitates knowledge acquisition.
Since the KOSMOS-1 model has only 1.6 billion parameters, some netizens expressed the hope of running this multimodal large model on their own computers.
As shown in Figure 1, KOSMOS-1 is a multimodal language model that can perceive general modalities, follow instructions, and learn in context to generate output. Specifically, the backbone of KOSMOS-1 is a Transformer-based causal language model. Besides text, other modalities can also be embedded and fed into the model; as the figure shows, in addition to language there are embeddings for vision, speech, and so on. The Transformer decoder serves as a general interface for multimodal inputs. Once trained, KOSMOS-1 can be evaluated on both language tasks and multimodal tasks in zero-shot and few-shot settings.
The Transformer decoder perceives modalities in a unified way: the input is flattened into a sequence decorated with special tokens. For example, <s> marks the beginning of a sequence and </s> marks the end, while special tokens such as <image> and </image> mark the beginning and end of encoded image embeddings.
The embedding module encodes text tokens and other input modalities into vector representations. For input tokens, the study uses a lookup table to map them into embeddings. For continuous-signal modalities (e.g., images and audio), the inputs can also be represented as discrete codes and then treated like text tokens.
The resulting sequence of input embeddings is then fed to the Transformer-based decoder, and the causal model processes the sequence autoregressively to predict the next token. In summary, the MLLM framework can flexibly handle various data types, as long as the inputs can be represented as vectors.
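A minimal sketch of this general-interface pattern is shown below. It is not the released KOSMOS-1 code: the dimensions are illustrative, and the image embeddings are simply concatenated after the text instead of being interleaved and wrapped with special tokens. The point is only that text tokens go through an embedding lookup, other modalities are projected into the same space, and a causal Transformer decoder predicts the next token over the flattened sequence.

```python
# Toy sketch (not the released KOSMOS-1 code) of a decoder-only Transformer
# used as a "general interface" over a flattened multimodal sequence.
import torch
import torch.nn as nn

class ToyMultimodalDecoder(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=4, img_feat_dim=1024):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)      # lookup table for text tokens
        self.img_proj = nn.Linear(img_feat_dim, d_model)        # map image features into the same space
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)   # decoder-only = encoder stack + causal mask
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_feats):
        # Flatten modalities into one sequence: [text embeddings] + [projected image embeddings]
        x = torch.cat([self.tok_embed(text_ids), self.img_proj(image_feats)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, mask=causal)                         # autoregressive (causal) processing
        return self.lm_head(h)                                   # next-token logits

model = ToyMultimodalDecoder()
logits = model(torch.randint(0, 32000, (1, 16)), torch.randn(1, 4, 1024))
print(logits.shape)  # torch.Size([1, 20, 32000])
```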
First, the training datasets. They include text corpora, image-caption pairs, and interleaved image-text data. Specifically, the text corpora include The Pile and Common Crawl (CC); the image-caption pairs include English LAION-2B, LAION-400M, COYO-700M, and Conceptual Captions; and the interleaved image-text multimodal data come from Common Crawl snapshots.
With the data in place, next come the training settings. The MLLM component has 24 layers, a hidden dimension of 2048, an FFN dimension of 8192, 32 attention heads, and about 1.3B parameters. To help the model converge, image representations are obtained from a pre-trained CLIP ViT-L/14 model with 1024 feature dimensions, and images are preprocessed to 224 × 224 resolution during training. In addition, all CLIP parameters except those of the last layer are frozen during training. The total number of parameters of KOSMOS-1 is approximately 1.6B.
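As a rough sketch of how this image path could be wired up with Hugging Face Transformers (this is assumed wiring, not the official training code, and treating the final ViT block as the "last layer" is an interpretation):

```python
# Sketch of the image-embedding path described above: a pre-trained CLIP ViT-L/14
# vision tower, images preprocessed to 224x224, all parameters frozen except
# those of the final transformer block.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")        # 1024-d hidden size
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")  # 224x224 preprocessing

for name, p in vision.named_parameters():
    # Keep only the last of the 24 ViT-L/14 encoder blocks trainable (assumption).
    p.requires_grad = "encoder.layers.23" in name

pixels = processor(images=Image.new("RGB", (640, 480)), return_tensors="pt")["pixel_values"]
with torch.no_grad():
    feats = vision(pixel_values=pixels).last_hidden_state  # (1, 257, 1024): CLS + 16x16 patches
print(feats.shape)
```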
The study conducted a rich series of experiments to evaluate KOSMOS-1: language tasks (language understanding, language generation, OCR-free text classification); cross-modal transfer (commonsense reasoning); non-verbal reasoning (IQ test); perception-language tasks (image captioning, visual question answering, web question answering); and vision tasks (zero-shot image classification, zero-shot image classification with descriptions).
Image captioning. The following table shows the zero-shot performance of different models on COCO and Flickr30k. Compared with the other models, KOSMOS-1 achieves remarkable results, and it performs well even though it has far fewer parameters than Flamingo.
The following table shows the few-shot performance comparison:
Visual question answering. KOSMOS-1 achieves higher accuracy and robustness than the Flamingo-3B and Flamingo-9B models:
The following table shows the few-shot performance comparison:
IQ test. Raven's Progressive Matrices is one of the most common tests for assessing nonverbal reasoning. Figure 4 shows an example.
Table 6 shows the evaluation results on the IQ test dataset. KOSMOS-1 is able to perceive abstract conceptual patterns in a nonverbal context and then reason out the following element among multiple candidates. To our knowledge, this is the first time a model has performed such a zero-shot Raven IQ test.
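The exact zero-shot scoring protocol is described in the paper; purely as a hypothetical illustration of how such a multiple-choice matrix can be turned into a likelihood comparison with a causal multimodal LM (the prompt wording, the `model` call, and the "Yes" scoring rule below are assumptions), one could score each candidate and take the argmax:

```python
# Illustrative only: score each candidate completion by the model's probability
# of an affirmative next token, then pick the highest-scoring candidate.
import torch

@torch.no_grad()
def pick_candidate(model, context_images, candidate_images, yes_token_id):
    """Hypothetical helper: `model(images)` is assumed to return next-token
    logits of shape (seq_len, vocab_size) for the flattened image sequence."""
    scores = []
    for cand in candidate_images:
        logits = model(context_images + [cand])    # complete the matrix with this candidate
        probs = torch.softmax(logits[-1], dim=-1)  # distribution over the next token
        scores.append(probs[yes_token_id].item())
    return int(torch.tensor(scores).argmax())
```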
Web question answering. Web QA aims to find answers to questions from web pages, which requires the model to understand both the semantics and the structure of the text. The results are as follows:
Multimodal chain-of-thought prompting. Inspired by chain-of-thought prompting, the paper runs an experiment along these lines. As shown in Figure 5, the perception-language task is decomposed into two steps: in the first stage, given an image, a prompt guides the model to generate a rationale; the rationale is then fed to the model together with a task-aware prompt to produce the final result.
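A minimal sketch of this two-stage prompting, assuming a hypothetical `generate(image, prompt)` wrapper around a multimodal LM (the prompt strings are illustrative, not the paper's exact wording):

```python
# Two-stage multimodal chain-of-thought: first elicit a rationale from the
# image, then answer the question conditioned on that rationale.
def chain_of_thought_answer(generate, image, question):
    # Stage 1: prompt the model to describe/reason about the image (the rationale).
    rationale = generate(image, "Describe this picture in detail:")
    # Stage 2: feed the rationale plus a task-aware prompt to get the final answer.
    return generate(image, f"{rationale}\nQuestion: {question}\nAnswer:")
```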
As can be seen from Table 9, multimodal chain-of-thought prompting scores 72.9 points, 5.8 points higher than the standard prompt. For more experimental details, please refer to the original paper.