How can OctopusV3, with less than 1 billion parameters, compare with GPT-4V and GPT-4?

Multimodal AI systems are characterized by their ability to process and learn from multiple types of data, including natural language, vision, and audio, to guide their behavior. Research on incorporating visual data into large language models (such as GPT-4V) has recently made important progress, but effectively converting image information into executable actions for AI systems remains challenging. A common approach is to convert image data into text descriptions and have the AI system act on those descriptions. This can be done through supervised learning on existing image datasets, letting the system learn the image-to-text mapping automatically. Reinforcement learning can also be used, letting the system learn to make decisions from image information by interacting with the environment. A third approach is to combine image information directly with the language model to build an end-to-end multimodal model.

In a recent paper, researchers proposed a multimodal model designed specifically for AI applications, introducing the concept of the "functional token."

  • Paper title: Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent

  • Paper link: https://arxiv.org/pdf/2404.11459.pdf

  • Model weights and inference code: https://www.nexa4ai.com/apply


The model is designed to run fully on edge devices, and the researchers have optimized its parameter count to under 1 billion. Like GPT-4, it handles both English and Chinese. Experiments show that the model runs efficiently on a variety of resource-constrained terminal devices, including the Raspberry Pi.


Research Background

The rapid development of artificial intelligence technology has completely changed the way human-computer interaction occurs, giving rise to a number of intelligent AI systems that can perform complex tasks and make decisions based on natural language, vision and other forms of input. These systems are expected to automate everything from simple tasks such as image recognition and language translation to complex applications such as medical diagnosis and autonomous driving. Multimodal language models are at the core of these intelligent systems, enabling them to understand and generate near-human responses by processing and integrating multimodal data such as text, images, and even audio and video. Compared with traditional language models that mainly focus on text processing and generation, multimodal language models are a big leap forward. By incorporating visual information, these models are able to better understand the context and semantics of the input data, resulting in more accurate and relevant output. The ability to process and integrate multimodal data is crucial for developing multimodal AI systems that can simultaneously understand tasks such as language and visual information, such as visual question answering, image navigation, multimodal sentiment analysis, etc.

One of the challenges in developing multimodal language models is how to effectively encode visual information into a format the model can process. This is usually done with neural network architectures such as vision transformers (ViT) and convolutional neural networks (CNN), whose ability to extract hierarchical features from images is widely used in computer vision tasks. Built on these architectures, models can learn to extract increasingly complex representations from input data. Transformer-based architectures, which have become very popular in recent years, are not only capable of capturing long-distance dependencies but also excel at modeling the relationships between objects in images. These architectures enable models to extract meaningful features from input images and convert them into vector representations that can be combined with text input.

Another way to encode visual information is image tokenization, which divides the image into smaller discrete units, or tokens. This approach lets the model process images in a manner similar to text, enabling a more seamless integration of the two modalities. Image tokens can be fed into the model along with text input, allowing it to attend to both modalities and produce more accurate and contextual output. For example, the DALL-E model developed by OpenAI uses a variant of VQ-VAE (Vector Quantized Variational Autoencoder) to tokenize images, allowing the model to generate novel images from text descriptions.

Developing small, efficient models that can act on user-supplied queries and images will have profound implications for the future of AI systems. Such models can be deployed on resource-constrained devices such as smartphones and IoT devices, expanding their range of applications and scenarios. Leveraging the power of multimodal language models, these small systems can understand and respond to user queries in a more natural and intuitive way while taking into account the visual context the user provides. This opens up the possibility of more engaging, personalized human-machine interactions, such as virtual assistants that offer visual recommendations based on user preferences, or smart home devices that adjust settings based on the user's facial expressions.
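To make the tokenization idea concrete, here is a minimal sketch of ViT-style patch tokenization: an image is split into non-overlapping patches and each patch is flattened into a token vector. Real models additionally project each patch through a learned linear layer and add position embeddings; this toy version keeps only the splitting step.

```python
def patchify(image, patch_size):
    """Split an H x W image (a list of rows of pixel values) into
    non-overlapping patch tokens, ViT-style. Each token is the flattened
    pixel values of one patch_size x patch_size patch."""
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            tokens.append(patch)
    return tokens

# A 4x4 grayscale image split into 2x2 patches yields 4 tokens of length 4.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
```

Once flattened this way, the patch tokens can be interleaved with text tokens and fed through the same transformer stack.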

In addition, the development of multi-modal AI systems is expected to democratize artificial intelligence technology and benefit a wider range of users and industries. Smaller and more efficient models can be trained on hardware with weaker computing power, reducing the computing resources and energy consumption required for deployment. This may lead to the widespread application of AI systems in various fields such as medical care, education, entertainment, e-commerce, etc., ultimately changing the way people live and work.

Related Work

Multimodal models have attracted much attention due to their ability to process and learn multiple data types such as text, images, and audio. This type of model can capture the complex interactions between different modalities and use their complementary information to improve the performance of various tasks. Vision-Language Pre-trained (VLP) models such as ViLBERT, LXMERT, VisualBERT, etc. learn the alignment of visual and text features through cross-modal attention to generate rich multi-modal representations. Multi-modal transformer architectures such as MMT, ViLT, etc. have improved transformers to efficiently handle multiple modalities. Researchers have also tried to incorporate other modalities such as audio and facial expressions into models, such as multimodal sentiment analysis (MSA) models, multimodal emotion recognition (MER) models, etc. By utilizing the complementary information of different modalities, multimodal models achieve better performance and generalization capabilities than single-modal methods.

On-device language models are defined as models with fewer than 7 billion parameters, because the researchers found that even with quantization, running a 13-billion-parameter model on edge devices is very difficult. Recent advances in this area include Google's Gemma 2B and 7B, Stability AI's Stable Code 3B, and Meta's Llama 7B. Interestingly, Meta's research shows that, unlike large language models, small language models perform better with deep-and-narrow architectures. Other techniques beneficial to on-device models include embedding sharing, grouped-query attention, and the immediate block-wise weight sharing proposed in MobileLLM. These findings highlight that developing small language models for on-device applications requires different optimization methods and design strategies than developing large models.
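A back-of-envelope calculation shows why the deep-vs-wide trade-off matters at a fixed sub-billion budget. This sketch uses the common rough estimate of about 12 × d_model² parameters per decoder layer (attention plus MLP) plus the embedding table; it ignores biases, layer norms, and head tying, and the layer counts and dimensions below are illustrative, not Octopus v3's actual configuration.

```python
def transformer_params(layers, d_model, vocab=32000):
    """Rough parameter count for a decoder-only transformer:
    ~12 * d_model^2 per layer (attention + MLP) plus the embedding table.
    A back-of-envelope estimate that ignores biases, norms, and tied heads."""
    return layers * 12 * d_model ** 2 + vocab * d_model

# Deep-and-narrow vs shallow-and-wide at roughly comparable budgets:
deep_narrow = transformer_params(layers=30, d_model=1024)   # ~410M params
shallow_wide = transformer_params(layers=8, d_model=2048)   # ~468M params
```

Both configurations land comfortably under 1 billion parameters, which is the regime the Octopus work targets; Meta's finding is that, at such budgets, the deeper and narrower configuration tends to perform better.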

Octopus Method

This section outlines the main techniques used in developing the Octopus v3 model. Two key aspects of multimodal model development are integrating image information with text input and optimizing the model's ability to predict actions.

Visual information encoding

Many visual-encoding methods exist in image processing, and hidden-layer embeddings are commonly used. For example, the hidden-layer embeddings of the VGG-16 model are used in style-transfer tasks. OpenAI's CLIP model demonstrates the ability to align text and image embeddings, and its image encoder can be used to embed images. Methods such as ViT use more advanced techniques such as image tokenization. The researchers evaluated various image-encoding techniques and found the CLIP-based approach to be the most effective; this paper therefore uses a CLIP-based model for image encoding.
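The alignment property that makes CLIP attractive here is that image and text embeddings live in a shared space, so matching is just nearest-neighbor search by cosine similarity. The sketch below illustrates that matching step with tiny hand-made vectors (real CLIP embeddings have 512 or more dimensions and come from the trained encoders, which are not reproduced here).

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_match(image_embedding, text_embeddings):
    """Return the index of the text embedding closest to the image
    embedding, the way CLIP matches an image against candidate captions."""
    scores = [cosine_similarity(image_embedding, t) for t in text_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings standing in for real CLIP outputs:
image_vec = [0.9, 0.1, 0.0]
captions = [[0.0, 1.0, 0.0],   # "a dog"
            [1.0, 0.2, 0.0],   # "a cat" -- closest to image_vec
            [0.0, 0.0, 1.0]]   # "a car"
idx = best_match(image_vec, captions)
```

In Octopus v3 the image encoder's output is consumed by the language model rather than matched against captions, but the shared embedding space is what makes that fusion straightforward.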

Functional token

Just as tokenization is applied to natural language and images, specific functions can also be encapsulated as functional tokens. The researchers introduced a training strategy for these tokens, drawing on techniques natural language models use to handle unseen words. The method is similar in spirit to word2vec, enriching a token's semantics through its context. For example, a language model may initially struggle with specialized chemistry terms such as PEGylation and endosomal escape, but through causal language modeling, especially training on a dataset containing these terms, it can learn them. Functional tokens can be learned through a parallel strategy, with the Octopus v2 model providing a strong platform for this learning process. The research shows that the definition space of functional tokens is unbounded, allowing any specific function to be represented as a token.
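At inference time, a functional token emitted by the model has to be routed to actual code. A minimal sketch of that dispatch step is below; the token names and function signatures are hypothetical placeholders, since the report does not publish the exact token vocabulary.

```python
# Hypothetical functional tokens mapped to callables. The real token names
# and argument schemas used by Octopus v3 are not shown in the report.
FUNCTION_TABLE = {
    "<fn_email>": lambda to, body: f"email to {to}: {body}",
    "<fn_search>": lambda query: f"searching for {query}",
}

def dispatch(token, *args):
    """Route a functional token emitted by the model to its function.
    Unknown tokens raise instead of silently doing nothing."""
    if token not in FUNCTION_TABLE:
        raise ValueError(f"unknown functional token: {token}")
    return FUNCTION_TABLE[token](*args)

result = dispatch("<fn_search>", "nearest recycling center")
```

Because each function is a single token, the model only has to predict one vocabulary entry plus its arguments, which is what keeps action prediction tractable for a sub-billion-parameter model.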

Multi-stage training

To develop a high-performance multimodal AI system, the researchers adopted a model architecture that integrates a causal language model with an image encoder. The training process is divided into multiple stages. First, the causal language model and the image encoder are trained separately to establish base models. The two components are then merged, and alignment training synchronizes their image- and text-processing capabilities. On this foundation, the Octopus v2 method is used to drive the learning of functional tokens. In the final training stage, the functional tokens' interactions with the environment provide feedback for further optimization: the researchers applied reinforcement learning, with another large language model serving as the reward model. This iterative training method enhances the model's ability to process and integrate multimodal information.
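The staged schedule described above can be summarized as a sketch. The stage names and objectives below paraphrase the report's description; the actual training code, losses, and hyperparameters are not published, so this only records the order of operations.

```python
# A minimal sketch of the multi-stage training schedule, using a
# placeholder train() that just logs which component is trained on
# which objective, in order.
log = []

def train(component, objective):
    """Placeholder for one training stage; real code would run an
    optimizer here."""
    log.append((component, objective))

# Stage 1: pre-train each component separately.
train("image_encoder", "image pretraining")
train("language_model", "causal language modeling")
# Stage 2: merge the components and align image and text processing.
train("merged_model", "image-text alignment")
# Stage 3: learn functional tokens via the Octopus v2 recipe.
train("merged_model", "functional token learning")
# Stage 4: refine with RL, using a larger LLM as the reward model.
train("merged_model", "rl with llm reward model")
```

Separating pretraining, alignment, token learning, and RL keeps each stage's objective simple, at the cost of a longer overall pipeline.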

Model Evaluation

This section presents the model's experimental results and compares them against a baseline that combines the GPT-4V and GPT-4 models. In the comparative experiment, the researchers first used GPT-4V (gpt-4-turbo) to process the image information. The extracted data was then fed into a GPT-4 framework (gpt-4-turbo-preview) that contextualizes all function descriptions and applies few-shot learning to improve performance. In the demonstration, the researchers converted 10 commonly used smartphone APIs into functional tokens and evaluated their performance, as detailed in the following sections.
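The second stage of that baseline amounts to assembling a prompt from the GPT-4V caption, the function descriptions, and few-shot examples. The helper below sketches that assembly; the exact prompt template used in the report is not published, so the layout here is an assumption for illustration.

```python
def build_baseline_prompt(image_caption, function_descriptions,
                          few_shot_examples, user_query):
    """Assemble the second-stage GPT-4 prompt of the baseline pipeline:
    the GPT-4V image caption plus all function descriptions and few-shot
    examples. A sketch only; the report's actual template is unknown."""
    parts = ["Available functions:"]
    parts += [f"- {d}" for d in function_descriptions]
    parts.append("Examples:")
    parts += few_shot_examples
    parts.append(f"Image description: {image_caption}")
    parts.append(f"User request: {user_query}")
    return "\n".join(parts)

prompt = build_baseline_prompt(
    "a cardboard box on a table",
    ["recycle(item): arrange pickup of a recyclable item"],
    ["User: recycle this bottle -> recycle('bottle')"],
    "recycle this",
)
```

Note the contrast with Octopus v3: the baseline must carry every function description in the prompt on each call, whereas the functional-token model encodes the functions in its weights.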

It is worth noting that although this article shows only 10 functional tokens, the model can be trained with more tokens to create a more general AI system. The researchers found that, for the selected APIs, their sub-billion-parameter model performs as a multimodal AI comparable to the combination of GPT-4V and GPT-4.

In addition, the scalability of the model in this article allows the inclusion of a wide range of functional tokens, thereby enabling the creation of highly specialized AI systems suitable for specific fields or scenarios. This adaptability makes our approach particularly valuable in industries such as healthcare, finance, and customer service, where AI-driven solutions can significantly improve efficiency and user experience.

For all the function calls below, Octopus outputs only functional tokens; the researchers replaced the functional tokens with the corresponding function names for better readability. All results below are generated directly, without any output parser. Octopus v3 is a single model that handles both Chinese and English, so there is no need to train a separate Chinese model.

Send an email


Send a text message


Google Search


Amazon Shopping


Intelligent Recycling


Lost and Found


Interior Design


Instacart Shopping


DoorDash Delivery


Pet Care


Social Impact

Building on Octopus v2, the updated model incorporates both textual and visual information, a significant step forward from its text-only predecessor. This advance enables simultaneous processing of visual and natural language data, paving the way for broader applications. The functional tokens introduced in Octopus v2 can be adapted to multiple fields, such as the medical and automotive industries. With the addition of visual data, their potential further extends to fields such as autonomous driving and robotics. In addition, the multimodal model makes it practical to turn devices such as the Raspberry Pi into intelligent hardware along the lines of the Rabbit R1 and Humane AI Pin, using an on-device model rather than a cloud-based solution.

The functional token framework is currently available under license, and the researchers encourage developers to build on it and innovate freely, provided they comply with the license agreement. In future research, they aim to develop a training framework that accommodates additional data modalities such as audio and video. They have also found that visual input introduces considerable latency and are currently optimizing inference speed.

Statement

This article is reproduced from 机器之心 (Jiqizhixin).