


How can Octopus v3, with less than 1 billion parameters, compare with GPT-4V and GPT-4?
Multimodal AI systems are characterized by their ability to process and learn from multiple types of data, including natural language, vision, and audio, in order to guide their decisions. Research on incorporating visual data into large language models (such as GPT-4V) has recently made important progress, but effectively converting image information into executable actions for an AI system remains challenging. One common approach is to convert image data into text descriptions and have the AI system act on those descriptions; this can be done through supervised learning on existing image datasets, allowing the system to learn the image-to-text mapping automatically. Reinforcement learning can also be used, letting the system learn to make decisions based on image information by interacting with its environment. Another approach is to combine image information directly with the language model to build an end-to-end multimodal model.
In a recent paper, researchers proposed a multimodal model designed specifically for AI agent applications, introducing the concept of the "functional token."
Paper title: Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent
Paper link: https://arxiv.org/pdf/2404.11459.pdf
Model weights and inference code: https://www.nexa4ai.com/apply
The model is designed to run fully on edge devices, with its parameter count optimized to under 1 billion. Like GPT-4, it can handle both English and Chinese. Experiments show that the model runs efficiently on a variety of resource-constrained edge devices, including the Raspberry Pi.




Research Background
The rapid development of artificial intelligence technology has completely changed how humans and computers interact, giving rise to intelligent AI systems that can perform complex tasks and make decisions based on natural language, vision, and other forms of input. These systems are expected to automate everything from simple tasks such as image recognition and language translation to complex applications such as medical diagnosis and autonomous driving. Multimodal language models are at the core of these intelligent systems, enabling them to understand and generate near-human responses by processing and integrating multimodal data such as text, images, and even audio and video. Compared with traditional language models that focus mainly on text processing and generation, multimodal language models represent a big leap forward. By incorporating visual information, these models can better understand the context and semantics of the input data, producing more accurate and relevant output. The ability to process and integrate multimodal data is crucial for building AI systems that understand language and visual information at the same time, for tasks such as visual question answering, image-based navigation, and multimodal sentiment analysis.
One of the challenges in developing multimodal language models is how to effectively encode visual information into a format the model can process. This is usually done with neural network architectures such as vision transformers (ViT) and convolutional neural networks (CNNs). CNNs, which can extract hierarchical features from images, are widely used in computer vision tasks, and models built on them learn to extract increasingly complex representations from the input data. Transformer-based architectures, which have become very popular in recent years, not only capture long-range dependencies but also excel at modeling the relationships between objects in an image. These architectures enable a model to extract meaningful features from input images and convert them into vector representations that can be combined with the text input.
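As a rough illustration of how a ViT-style encoder turns an image into such vectors, the following is a minimal sketch of patch embedding in PyTorch. The patch size, embedding dimension, and single linear projection are illustrative assumptions, not the encoder actually used in Octopus v3.

```python
# Minimal sketch of ViT-style patch embedding: split an image into fixed-size
# patches, flatten each patch, and project it into a vector that can sit
# alongside text token embeddings. Sizes here are illustrative assumptions.
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
image = torch.randn(1, 3, 224, 224)  # batch of one RGB image

# Unfold the image into non-overlapping 16x16 patches: (1, 196, 3*16*16).
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

# A linear projection turns each flattened patch into an embedding vector.
projection = nn.Linear(3 * patch_size * patch_size, embed_dim)
patch_embeddings = projection(patches)  # (1, 196, 768)
print(patch_embeddings.shape)
```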
Another way to encode visual information is image tokenization, which divides the image into smaller discrete units, or tokens. This approach allows the model to process images in a way similar to text, enabling a more seamless integration of the two modalities. Image tokens can be fed into the model together with the text input, allowing it to attend to both modalities and produce more accurate, context-aware output. For example, the DALL-E model developed by OpenAI uses a variant of the VQ-VAE (Vector Quantized Variational Autoencoder) to tokenize images, allowing the model to generate novel images from text descriptions.

Developing small, efficient models that can act on user-supplied queries and images will have profound implications for the future development of AI systems. These models can be deployed on resource-constrained devices such as smartphones and IoT devices, expanding their range of applications and scenarios. By leveraging the power of multimodal language models, these small systems can understand and respond to user queries in a more natural and intuitive way, while taking into account the visual context provided by the user. This opens up the possibility of more engaging, personalized human-machine interactions, such as virtual assistants that provide visual recommendations based on user preferences, or smart home devices that adjust settings based on the user's facial expressions.
In addition, the development of multi-modal AI systems is expected to democratize artificial intelligence technology and benefit a wider range of users and industries. Smaller and more efficient models can be trained on hardware with weaker computing power, reducing the computing resources and energy consumption required for deployment. This may lead to the widespread application of AI systems in various fields such as medical care, education, entertainment, e-commerce, etc., ultimately changing the way people live and work.
Related Work
Multimodal models have attracted much attention for their ability to process and learn from multiple data types such as text, images, and audio. These models can capture the complex interactions between different modalities and exploit their complementary information to improve performance on a variety of tasks. Vision-language pre-trained (VLP) models such as ViLBERT, LXMERT, and VisualBERT learn to align visual and textual features through cross-modal attention, producing rich multimodal representations. Multimodal transformer architectures such as MMT and ViLT adapt the transformer to handle multiple modalities efficiently. Researchers have also tried incorporating other modalities, such as audio and facial expressions, into models, for example in multimodal sentiment analysis (MSA) and multimodal emotion recognition (MER). By exploiting the complementary information of different modalities, multimodal models achieve better performance and generalization than single-modality methods.
On-device language models are defined as models with fewer than 7 billion parameters, because the researchers found that, even with quantization, it is very difficult to run a 13-billion-parameter model on edge devices. Recent advances in this area include Google's Gemma 2B and 7B, Stability AI's Stable Code 3B, and Meta's Llama 7B. Interestingly, Meta's research shows that, unlike large language models, small language models perform better with deep and narrow architectures. Other techniques that benefit on-device models include embedding sharing, grouped-query attention, and the immediate block-wise weight sharing proposed in MobileLLM. These findings highlight that developing small language models for on-device applications requires different optimization methods and design strategies than developing large models.
Octopus Method
This section outlines the main techniques used to develop the Octopus v3 model. Two key aspects of multimodal model development are integrating image information with textual input and optimizing the model's ability to predict actions.
Visual information encoding
There are many ways to encode visual information in image processing, and hidden-layer embeddings are commonly used. For example, the hidden-layer embeddings of the VGG-16 model are used for style-transfer tasks. OpenAI's CLIP model demonstrates the ability to align text and image embeddings, and its image encoder can be used to embed images. Methods such as ViT use more advanced techniques such as image tokenization. The researchers evaluated various image encoding techniques and found the CLIP-based approach to be the most effective, so this paper uses a CLIP-based model for image encoding.
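As a rough illustration of CLIP-based image encoding, the sketch below embeds an image with a public CLIP checkpoint via the Hugging Face transformers library. The checkpoint name, input image, and preprocessing are assumptions for illustration; they are not the exact encoder or pipeline used in Octopus v3.

```python
# Minimal sketch: embed an image with a public CLIP image encoder.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("screenshot.png")  # any user-supplied image (assumed path)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_embedding = model.get_image_features(**inputs)  # shape: (1, 512)

print(image_embedding.shape)
```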
Functional token
Similar to the tokenization applied to natural language and images, specific functions can also be encapsulated as functional tokens. The researchers introduced a training strategy for these tokens, drawing on the techniques natural language models use to handle unseen words. The method is similar in spirit to word2vec, enriching a token's semantics through its context. For example, a language model may initially struggle with specialized chemical terms such as PEGylation and endosomal escape, but through causal language modeling, especially training on a dataset that contains these terms, the model can learn them. Similarly, functional tokens can be learned with a parallel strategy, and the Octopus v2 model provides a powerful platform for this learning process. The research shows that the definition space of functional tokens is unbounded, allowing any specific function to be represented as a token.
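As a rough sketch of how functional tokens can be added to a causal language model's vocabulary so that their semantics are then learned from context, the example below uses the Hugging Face transformers API. The base model (gpt2), the token names, and the number of tokens are hypothetical placeholders, not the authors' actual setup.

```python
# Minimal sketch: register functional tokens as new special tokens and give
# them trainable embeddings; their meaning is then learned via causal LM
# training on data that uses them (analogous to learning unseen words).
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "gpt2"  # hypothetical stand-in for the sub-billion-parameter backbone
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# One dedicated token per function (e.g., send_email, google_search, ...).
functional_tokens = [f"<functional_token_{i}>" for i in range(10)]
tokenizer.add_special_tokens({"additional_special_tokens": functional_tokens})

# Grow the embedding matrix so the new tokens receive trainable embeddings.
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.convert_tokens_to_ids("<functional_token_0>"))
```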
Multi-stage training
To develop a high-performance multimodal AI system, the researchers adopted a model architecture that integrates a causal language model with an image encoder. The training process is divided into multiple stages. First, the causal language model and the image encoder are trained separately to establish base models. The two components are then merged and trained for alignment, synchronizing image and text processing. On this basis, the Octopus v2 method is applied to enable the learning of functional tokens. In the final training stage, the functional tokens that interact with the environment provide feedback for further optimization; the researchers therefore adopt reinforcement learning, with another large language model serving as the reward model. This iterative training approach strengthens the model's ability to process and integrate multimodal information.
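The alignment stage described above can be pictured as projecting image-encoder features into the language model's embedding space and feeding them in alongside the text embeddings. The following is a minimal sketch under that assumption; the backbone models, the single-vector projection, and the concatenation scheme are illustrative choices, and the functional-token and reinforcement-learning stages are not shown.

```python
# Minimal sketch of image-text alignment: project CLIP image features into the
# causal LM's embedding space and prepend them to the text embeddings.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPModel, CLIPProcessor

lm = AutoModelForCausalLM.from_pretrained("gpt2")  # hypothetical backbone
tok = AutoTokenizer.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Trainable bridge from CLIP's image feature space to the LM's hidden size.
projector = nn.Linear(clip.config.projection_dim, lm.config.hidden_size)

def forward_multimodal(image, prompt):
    pixels = proc(images=image, return_tensors="pt")
    img_feat = clip.get_image_features(**pixels)        # (1, 512)
    img_embed = projector(img_feat).unsqueeze(1)        # (1, 1, hidden)

    ids = tok(prompt, return_tensors="pt").input_ids
    txt_embed = lm.get_input_embeddings()(ids)          # (1, T, hidden)

    inputs_embeds = torch.cat([img_embed, txt_embed], dim=1)
    return lm(inputs_embeds=inputs_embeds)               # next-token logits
```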
Model Evaluation
This section presents the experimental results for the model and compares them with a pipeline that combines GPT-4V and GPT-4. In the comparative experiment, the researchers first used GPT-4V (gpt-4-turbo) to process the image information. The extracted information is then fed into the GPT-4 framework (gpt-4-turbo-preview), which is given all the function descriptions in context and applies few-shot learning to improve performance. In the demonstration, the researchers converted 10 commonly used smartphone APIs into functional tokens and evaluated their performance, as detailed in the following sections.
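For orientation, the baseline pipeline could look roughly like the sketch below, which first asks GPT-4V to describe the image and then asks GPT-4 to choose a function. The prompts, message structure, and helper function are hypothetical; only the model names (gpt-4-turbo and gpt-4-turbo-preview) come from the description above.

```python
# Hypothetical sketch of the GPT-4V + GPT-4 baseline: describe the image,
# then pick a function given function descriptions and the user query.
from openai import OpenAI

client = OpenAI()

def baseline_function_call(image_url: str, query: str, function_descriptions: str) -> str:
    # Step 1: extract image information with GPT-4V (gpt-4-turbo).
    vision = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image for a function-calling agent."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    description = vision.choices[0].message.content

    # Step 2: pass the description, function descriptions, and query to GPT-4.
    answer = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": f"Available functions:\n{function_descriptions}"},
            {"role": "user", "content": f"Image description: {description}\nQuery: {query}"},
        ],
    )
    return answer.choices[0].message.content
```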
It is worth noting that although this article shows only 10 functional tokens, the model can be trained with more tokens to create a more general AI system. The researchers found that, for the selected APIs, a model with fewer than 1 billion parameters performed as a multimodal AI comparable to the combination of GPT-4V and GPT-4.
In addition, the scalability of the model allows a wide range of functional tokens to be included, enabling highly specialized AI systems for specific fields or scenarios. This adaptability makes the approach particularly valuable in industries such as healthcare, finance, and customer service, where AI-driven solutions can significantly improve efficiency and user experience.
For each of the function names below, Octopus outputs only the corresponding functional token rather than the function name itself:
Send an email
Send a text message
Google Search
Amazon Shopping
Intelligent Recycling
Lost and Found
Interior Design
Instacart Shopping
DoorDash Delivery
Pet Care
Social Impact
Building on Octopus v2, the updated model incorporates both textual and visual information, taking a significant step forward from its text-only predecessor. This advance enables simultaneous processing of visual and natural language data, paving the way for broader applications. The functional tokens introduced in Octopus v2 can be adapted to many fields, such as the medical and automotive industries; with the addition of visual data, their potential extends further to areas such as autonomous driving and robotics. In addition, the multimodal model in this article makes it possible to turn devices such as the Raspberry Pi into intelligent hardware in the style of the Rabbit R1 and Humane AI Pin, using an on-device model rather than a cloud-based solution.
The functional token approach is currently licensed, and the researchers encourage developers to work within the framework presented here and innovate freely, provided they comply with the license agreement. In future research, they aim to develop a training framework that can accommodate additional data modalities such as audio and video. They have also found that visual input can introduce considerable latency and are currently optimizing inference speed.
This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.