
How can OctopusV3, with less than 1 billion parameters, compare with GPT-4V and GPT-4?

WBOY · 2024-05-02

Multimodal AI systems are characterized by their ability to process and learn from many types of data, including natural language, vision, and audio, and to use them to guide their behavioral decisions. Research on incorporating visual data into large language models (such as GPT-4V) has made important progress recently, but effectively converting image information into executable actions for an AI system remains a challenge. One common approach is to convert image data into text descriptions and have the AI system act on those descriptions; this can be done through supervised learning on existing image datasets, letting the system learn the image-to-text mapping automatically. Reinforcement learning can also be used, letting the system learn to make decisions from image information by interacting with its environment. Another approach is to combine image information directly with the language model, building a single end-to-end multimodal model.

In a recent paper, researchers proposed a multimodal model designed specifically for on-device AI agent applications, introducing the concept of the "functional token".

  • Paper title: Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent

  • Paper link: https://arxiv.org/pdf/2404.11459.pdf

  • Model weights and inference code: https://www.nexa4ai.com/apply


The model is designed to run entirely on edge devices, with its parameter count kept under 1 billion. Like GPT-4, it handles both English and Chinese. Experiments show that it runs efficiently on a variety of resource-constrained terminal devices, including the Raspberry Pi.


Research Background

The rapid development of artificial intelligence technology has completely changed the way human-computer interaction occurs, giving rise to a number of intelligent AI systems that can perform complex tasks and make decisions based on natural language, vision and other forms of input. These systems are expected to automate everything from simple tasks such as image recognition and language translation to complex applications such as medical diagnosis and autonomous driving. Multimodal language models are at the core of these intelligent systems, enabling them to understand and generate near-human responses by processing and integrating multimodal data such as text, images, and even audio and video. Compared with traditional language models that mainly focus on text processing and generation, multimodal language models are a big leap forward. By incorporating visual information, these models are able to better understand the context and semantics of the input data, resulting in more accurate and relevant output. The ability to process and integrate multimodal data is crucial for developing multimodal AI systems that can simultaneously understand tasks such as language and visual information, such as visual question answering, image navigation, multimodal sentiment analysis, etc.

One of the challenges in developing multimodal language models is how to effectively encode visual information into a format the model can process. This is usually done with neural network architectures such as vision transformers (ViT) and convolutional neural networks (CNN), whose ability to extract hierarchical features from images is widely exploited in computer vision tasks. Built on these architectures, a model can learn to extract increasingly complex representations from the input data. Transformer-based architectures in particular, which have become very popular in recent years, not only capture long-distance dependencies but also excel at modeling the relationships between objects in an image. These architectures allow a model to extract meaningful features from input images and convert them into vector representations that can be combined with the text input.

Another way to encode visual information is image tokenization, which divides the image into smaller discrete units, or tokens. This lets the model process images much as it processes text, enabling a more seamless integration of the two modalities. Image tokens can be fed into the model alongside the text input, allowing it to attend to both modalities and produce more accurate, context-aware output. For example, OpenAI's DALL-E model uses a variant of the VQ-VAE (Vector Quantized Variational Autoencoder) to tokenize images, which allows it to generate novel images from text descriptions. Developing small, efficient models that can act on user-supplied queries and images would have profound implications for the future of AI systems. Such models can be deployed on resource-constrained devices like smartphones and IoT hardware, widening their range of applications and scenarios. By leveraging the power of multimodal language models, these small systems can understand and respond to user queries in a more natural and intuitive way while taking into account the visual context the user provides. This opens up the possibility of more engaging, personalized human-machine interaction, such as virtual assistants that give visual recommendations based on user preferences, or smart home devices that adjust settings based on the user's facial expression.
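As a rough illustration of the image-tokenization idea described above, the sketch below splits an image tensor into patches and maps each patch to its nearest codebook entry, the core operation behind VQ-style tokenizers. The codebook here is random and untrained, and all sizes are illustrative; this is not the DALL-E or Octopus v3 tokenizer.

```python
import torch

# Illustrative sizes only; a real tokenizer (e.g. VQ-VAE) learns the codebook.
patch_size = 16
codebook_size = 512

image = torch.rand(1, 3, 224, 224)  # dummy RGB image, batch of 1

# 1) Split the image into non-overlapping 16x16 patches.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)  # (1, 3, 196, 16, 16)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)                    # (1, 196, 768)

# 2) Quantize each patch vector to its nearest codebook entry.
codebook = torch.randn(codebook_size, patches.shape[-1])                # untrained, for illustration
distances = torch.cdist(patches, codebook.unsqueeze(0))                 # (1, 196, 512)
token_ids = distances.argmin(dim=-1)                                    # (1, 196) discrete image tokens

print(token_ids.shape)  # each image becomes a sequence of 196 discrete tokens
```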

In addition, the development of multi-modal AI systems is expected to democratize artificial intelligence technology and benefit a wider range of users and industries. Smaller and more efficient models can be trained on hardware with weaker computing power, reducing the computing resources and energy consumption required for deployment. This may lead to the widespread application of AI systems in various fields such as medical care, education, entertainment, e-commerce, etc., ultimately changing the way people live and work.

Related Work

Multimodal models have attracted much attention due to their ability to process and learn multiple data types such as text, images, and audio. This type of model can capture the complex interactions between different modalities and use their complementary information to improve the performance of various tasks. Vision-Language Pre-trained (VLP) models such as ViLBERT, LXMERT, VisualBERT, etc. learn the alignment of visual and text features through cross-modal attention to generate rich multi-modal representations. Multi-modal transformer architectures such as MMT, ViLT, etc. have improved transformers to efficiently handle multiple modalities. Researchers have also tried to incorporate other modalities such as audio and facial expressions into models, such as multimodal sentiment analysis (MSA) models, multimodal emotion recognition (MER) models, etc. By utilizing the complementary information of different modalities, multimodal models achieve better performance and generalization capabilities than single-modal methods.

On-device (terminal) language models are defined here as models with fewer than 7 billion parameters, because the researchers found that even with quantization it is very difficult to run a 13-billion-parameter model on edge devices. Recent advances in this area include Google's Gemma 2B and 7B, Stability AI's Stable Code 3B, and Meta's Llama 7B. Interestingly, Meta's research shows that, unlike large language models, small language models perform better with deep-and-narrow architectures. Other techniques that benefit on-device models include the embedding sharing, grouped-query attention, and immediate block-wise weight sharing proposed in MobileLLM. These findings highlight that small language models for on-device applications call for optimization methods and design strategies different from those used for large models.

Octopus Method

This section outlines the main techniques used in developing the Octopus v3 model. Two key aspects of multimodal model development are integrating image information with text input and optimizing the model's ability to predict actions.

Visual information encoding

There are many ways to encode visual information in image processing; hidden-layer embeddings are commonly used. For example, the hidden-layer embeddings of the VGG-16 model are used for style-transfer tasks. OpenAI's CLIP model demonstrates the ability to align text and image embeddings, and its image encoder can be used to embed images. Methods such as ViT go further with techniques like image tokenization. The researchers evaluated a range of image encoding techniques and found the CLIP-based approach to be the most effective, so this paper uses a CLIP-based model for image encoding.
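As a concrete illustration, an image can be embedded with a CLIP image encoder as below, using the Hugging Face transformers implementation. The checkpoint choice is an assumption; the paper only states that a CLIP-based encoder is used, not this particular checkpoint or library.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint is illustrative; the paper does not name the exact CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_embedding = model.get_image_features(**inputs)  # (1, 512) image embedding

# This vector can then be projected into the language model's embedding space
# and consumed alongside text tokens.
print(image_embedding.shape)
```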

Functional token

Similar to the tokenization applied to natural language and images, specific functions can be encapsulated as functional tokens. The researchers introduce a training strategy for these tokens that draws on how natural language models handle previously unseen words. The approach is similar in spirit to word2vec, enriching a token's semantics through its context. For example, an advanced language model may initially struggle with specialized chemical terms such as PEGylation and endosomal escape, but through causal language modeling, and in particular by training on a dataset containing these terms, it can learn them. Functional tokens can likewise be learned through a parallel strategy, with the Octopus v2 model providing a strong platform for this learning process. The research shows that the definition space of functional tokens is unbounded, allowing any specific function to be represented as a token.
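In practice, functional tokens can be implemented as new entries in the tokenizer vocabulary, with the embedding table extended so that the new tokens get trainable vectors learned from context. The sketch below uses Hugging Face transformers; the base checkpoint and token names are hypothetical, since the article does not list Octopus v3's literal token strings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base checkpoint and token names are placeholders, not the Octopus v3 vocabulary.
base = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# One token per callable function, e.g. ten smartphone APIs.
functional_tokens = [f"<func_{i}>" for i in range(10)]
tokenizer.add_special_tokens({"additional_special_tokens": functional_tokens})

# Grow the embedding matrix so each new token gets a trainable vector,
# which is then learned from context much like a rare word in word2vec-style training.
model.resize_token_embeddings(len(tokenizer))

prompt = "Take a picture of this receipt and email it to my accountant. <func_0>"
ids = tokenizer(prompt, return_tensors="pt").input_ids
print(tokenizer.convert_ids_to_tokens(ids[0])[-1])  # '<func_0>' is now a single token
```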

Multi-stage training

To develop a high-performance multimodal AI system, the researchers adopted an architecture that combines a causal language model with an image encoder. Training proceeds in multiple stages. First, the causal language model and the image encoder are trained separately to establish the base components. The two are then merged and trained jointly for alignment, so that image and text processing are synchronized. On this basis, the Octopus v2 method is used to learn the functional tokens. In the final training stage, the functional tokens that interact with the environment provide feedback for further optimization: the researchers apply reinforcement learning, with another large language model serving as the reward model. This iterative training process strengthens the model's ability to process and integrate multimodal information.
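The staged schedule can be expressed roughly as follows. This is a minimal PyTorch sketch with toy stand-in modules: in the real pipeline the stand-ins would be a pretrained CLIP-style image encoder and a pretrained causal language model, and the final reinforcement-learning stage against a reward model is only indicated in a comment.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a pretrained image encoder and causal language model.
image_encoder = nn.Linear(512, 512)   # pretrained separately (stage 1)
language_model = nn.Linear(768, 768)  # pretrained separately (stage 1)
projector = nn.Linear(512, 768)       # maps image embeddings into the LM embedding space

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 2: alignment - freeze the two pretrained components and
# train only the projector so image and text representations line up.
set_trainable(image_encoder, False)
set_trainable(language_model, False)
set_trainable(projector, True)

# Stage 3: functional-token learning - unfreeze the LM so the new
# functional-token embeddings (and the rest of the model) can adapt.
set_trainable(language_model, True)

# Stage 4 (not shown): reinforcement learning with a larger LLM as the
# reward model, refining how functional tokens are emitted.

dummy_image_emb = torch.randn(1, 512)
lm_input = projector(image_encoder(dummy_image_emb))
print(lm_input.shape)  # (1, 768) - ready to be prepended to text token embeddings
```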

Model Evaluation

This section presents the model's experimental results and compares them against a baseline that combines GPT-4V and GPT-4. In the comparison, the researchers first used GPT-4V (gpt-4-turbo) to process the image information; the extracted information was then fed to GPT-4 (gpt-4-turbo-preview), which receives all function descriptions as context and applies few-shot learning to improve performance. In the demonstration, the researchers mapped 10 commonly used smartphone APIs to functional tokens and evaluated the model's performance, as detailed in the following sections.
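A baseline of this two-stage form can be sketched with the OpenAI Python client roughly as below. The model names follow the article; the prompts, image URL, function list, and variable names are illustrative placeholders, not the researchers' actual evaluation harness.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

IMAGE_URL = "https://example.com/screenshot.png"  # placeholder image
FUNCTION_DESCRIPTIONS = "send_email(to, subject, body); send_sms(number, text); ..."  # illustrative

# Step 1: GPT-4V (gpt-4-turbo) extracts the relevant information from the image.
vision = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the actionable content of this image."},
            {"type": "image_url", "image_url": {"url": IMAGE_URL}},
        ],
    }],
)
image_description = vision.choices[0].message.content

# Step 2: GPT-4 (gpt-4-turbo-preview) maps the description plus the user query
# to one of the available function calls, with the function descriptions as context.
action = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": "Pick one function call from: " + FUNCTION_DESCRIPTIONS},
        {"role": "user", "content": image_description + "\nUser request: email this to my accountant."},
    ],
)
print(action.choices[0].message.content)
```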

It is worth noting that although only 10 functional tokens are shown here, the model can be trained with many more to build a more general AI system. The researchers found that, for the selected APIs, the sub-billion-parameter model performed as a multimodal AI agent comparable to the combination of GPT-4V and GPT-4.

In addition, the scalability of the model allows a wide range of functional tokens to be included, enabling highly specialized AI systems for specific fields or scenarios. This adaptability makes the approach particularly valuable in industries such as healthcare, finance, and customer service, where AI-driven solutions can significantly improve efficiency and user experience.

In all of the examples below, Octopus outputs only functional tokens; for readability, the researchers replaced these tokens with the corresponding function names. All results are generated directly, without any output parser. Octopus v3 is a single model that handles both Chinese and English, so there is no need to train a separate Chinese model.

The paper demonstrates the model on ten scenarios (demo screenshots omitted here):

  • Send an email

  • Send a text message

  • Google Search

  • Amazon Shopping

  • Intelligent Recycling

  • Lost and Found

  • Interior Design

  • Instacart Shopping

  • DoorDash Delivery

  • Pet Care

Social Impact

Building on Octopus v2, the updated model incorporates both textual and visual information, a significant step beyond its text-only predecessor. This advance enables simultaneous processing of visual and natural-language data, paving the way for broader applications. The functional tokens introduced in Octopus v2 can be adapted to many fields, such as the medical and automotive industries; with the addition of visual data, their potential extends further to areas such as autonomous driving and robotics. Moreover, the multimodal model makes it practical to turn devices such as the Raspberry Pi into intelligent hardware in the spirit of the Rabbit R1 or the Humane AI Pin, using an on-device model rather than a cloud-based solution.

The functional tokens are currently released under a license, and the researchers encourage developers to build on this framework and innovate freely within the terms of the license agreement. In future work, the researchers aim to develop a training framework that can accommodate additional data modalities such as audio and video. They have also found that visual input introduces considerable latency and are currently optimizing inference speed.


Source: jiqizhixin.com