Beyond GPT-4, the Stanford team's large model that can run on mobile phones becomes popular, with over 2k downloads overnight
In the deployment of large models, on-device AI is a very important direction.
Recently, Octopus v2, released by researchers at Stanford University, has attracted considerable attention from the developer community, with the model downloaded over 2k times overnight.
The 2-billion-parameter Octopus v2 can run on smartphones, cars, PCs, and other devices, surpasses GPT-4 in both accuracy and latency, and reduces context length by 95%. Furthermore, Octopus v2 is 36 times faster than the Llama-7B + RAG scheme.
Paper: Octopus v2: On-device language model for super agent
Paper address: https://arxiv.org/abs/2404.01744
Model homepage: https://huggingface.co/NexaAIDev/Octopus-v2
Model Overview
Octopus-V2-2B is an open source language model with 2 billion parameters, tailored for the Android API. It runs seamlessly on Android devices and extends its utility to a variety of applications ranging from Android system management to orchestration of multiple devices.
Typically, Retrieval Augmented Generation (RAG) methods require detailed descriptions of potential function parameters (sometimes requiring up to tens of thousands of input tokens). Based on this, Octopus-V2-2B introduces a unique function token strategy in the training and inference phases, which not only enables it to achieve a performance level comparable to GPT-4, but also significantly improves the inference speed, surpassing RAG-based methods. This makes it particularly beneficial for edge computing devices.
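The core idea behind this functional token strategy is that each callable function is represented by a single dedicated token learned during fine-tuning, so the prompt no longer has to carry full function descriptions the way a RAG pipeline does. The following is a minimal sketch of that idea using standard Hugging Face APIs; the token names and function set here are illustrative, not those of the released model.

from transformers import AutoTokenizer, GemmaForCausalLM

# Illustrative function set; the released model covers 20 Android APIs.
functions = ["get_trending_news", "take_a_photo", "send_text_message"]
function_tokens = [f"<func_{i}>" for i in range(len(functions))]  # one dedicated token per function

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = GemmaForCausalLM.from_pretrained("google/gemma-2b")

# Register the new tokens and grow the embedding matrix so they can be learned
# during fine-tuning; at inference the model emits a function token plus its
# arguments instead of reading long retrieved descriptions.
tokenizer.add_special_tokens({"additional_special_tokens": function_tokens})
model.resize_token_embeddings(len(tokenizer))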
Octopus-V2-2B is capable of generating individual, nested and parallel function calls in a variety of complex scenarios.
Dataset
To obtain high-quality datasets for the training, validation, and testing phases, and in particular to enable efficient training, the research team created the dataset in three key stages:
Generate relevant queries and their associated function call parameters;
Generate unrelated queries from the appropriate function components;
Binary verification of each pair via Google Gemini (a sketch of this step follows the list).
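The article does not show how the Gemini-based verification was implemented; as one possible illustration, the binary check could be scripted with the google-generativeai Python client along these lines (the prompt wording and model name are assumptions):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
verifier = genai.GenerativeModel("gemini-pro")

def verify_pair(query: str, function_call: str) -> bool:
    """Ask Gemini for a yes/no judgment on whether the generated function call
    actually satisfies the query (illustrative prompt only)."""
    prompt = (
        "Does the following function call correctly satisfy the user query? "
        "Answer with exactly 'yes' or 'no'.\n"
        f"Query: {query}\nFunction call: {function_call}"
    )
    answer = verifier.generate_content(prompt).text.strip().lower()
    return answer.startswith("yes")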
The research team wrote 20 Android API descriptions for training the model. The following is an example of Android API description:
def get_trending_news(category=None, region='US', language='en', max_results=5):
    """Fetches trending news articles based on category, region, and language.

    Parameters:
    - category (str, optional): News category to filter by; by default None for all categories. Optional to provide.
    - region (str, optional): ISO 3166-1 alpha-2 country code for region-specific news; by default 'US'. Optional to provide.
    - language (str, optional): ISO 639-1 language code for article language; by default 'en'. Optional to provide.
    - max_results (int, optional): Maximum number of articles to return; by default 5. Optional to provide.

    Returns:
    - list[str]: A list of strings, each representing an article. Each string contains the article's heading and URL.
    """
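For illustration, a query generated for this API and its associated function-call parameters (stage 1 of the pipeline above) might be paired as follows; the exact dataset format is not shown in the article, so both fields below are hypothetical:

# Hypothetical (query, target function call) pair for the API above;
# the real dataset format used by the authors may differ.
training_example = {
    "query": "Show me the top three technology headlines for the UK",
    "target_call": "get_trending_news(category='technology', region='GB', max_results=3)",
}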
Model development and training
This research uses Google's Gemma-2B model as the pretrained base model in the framework and trains it with two different methods: full model training and LoRA training.
For full model training, the study uses the AdamW optimizer with the learning rate set to 5e-5, the number of warm-up steps set to 10, and a linear learning rate scheduler.
LoRA training uses the same optimizer and learning-rate configuration as full model training, with the LoRA rank set to 16 and LoRA applied to the following modules: q_proj, k_proj, v_proj, o_proj, up_proj, and down_proj. The LoRA alpha parameter is set to 32.
For both training methods, the number of epochs is set to 3.
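Putting the reported hyperparameters together, the LoRA variant of the fine-tuning setup could be expressed with Hugging Face transformers and peft roughly as follows; the actual training script is not included with the article, so this is only a sketch built from the numbers above:

from transformers import TrainingArguments
from peft import LoraConfig, get_peft_model

# Reported settings: AdamW, learning rate 5e-5, 10 warm-up steps, linear schedule, 3 epochs.
training_args = TrainingArguments(
    output_dir="octopus-v2-finetune",
    optim="adamw_torch",
    learning_rate=5e-5,
    warmup_steps=10,
    lr_scheduler_type="linear",
    num_train_epochs=3,
)

# Reported LoRA settings: rank 16, alpha 32, applied to the listed projection modules.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# model = get_peft_model(model, lora_config)  # wrap a loaded Gemma-2B model before training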
Using the following code, you can run the Octopus-V2-2B model on a single GPU.
from transformers import AutoTokenizer, GemmaForCausalLM
import torch
import time

def inference(input_text):
    start_time = time.time()
    # Tokenize the query and record its length so only newly generated tokens are decoded.
    input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
    input_length = input_ids["input_ids"].shape[1]
    # Greedy decoding, capped at 1024 tokens.
    outputs = model.generate(
        input_ids=input_ids["input_ids"],
        max_length=1024,
        do_sample=False,
    )
    generated_sequence = outputs[:, input_length:].tolist()
    res = tokenizer.decode(generated_sequence[0])
    end_time = time.time()
    return {"output": res, "latency": end_time - start_time}

model_id = "NexaAIDev/Octopus-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GemmaForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

input_text = "Take a selfie for me with front camera"
nexa_query = f"Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: {input_text} \n\nResponse:"

start_time = time.time()
print("nexa model result:\n", inference(nexa_query))
print("latency:", time.time() - start_time, "s")
Evaluation
In benchmark tests, Octopus-V2-2B demonstrated superior inference speed, running 36 times faster than the Llama-7B RAG solution on a single A100 GPU. Additionally, Octopus-V2-2B is 168% faster than GPT-4-turbo, which relies on clusters of A100/H100 GPUs. This efficiency breakthrough is attributed to Octopus-V2-2B's functional token design.
Octopus-V2-2B excels not only in speed but also in accuracy, surpassing the Llama-7B RAG solution in function-call accuracy by 31%, and achieving function-calling accuracy comparable to GPT-4 and RAG + GPT-3.5.
Interested readers can read the original text of the paper to learn more about the research content.