


Since the ChatGPT API was opened to the public, a large number of studies have used the outputs of large foundation models (LFMs) such as ChatGPT and GPT-4 as training data, improving the capabilities of small models through imitation learning.
However, due to problems such as superficial imitation signals, insufficient training data, and a lack of rigorous evaluation standards, the actual capabilities of these small models have been overestimated.
In practice, the small models tend to imitate the output style of LFMs rather than their reasoning process.
Paper link: https://arxiv.org/pdf/2306.02707.pdf
To address these challenges, Microsoft recently released a 51-page paper proposing Orca, a 13-billion-parameter model that learns to imitate the reasoning process of LFMs.
The researchers designed rich training signals so that Orca can learn explanation traces, step-by-step thought processes, complex instructions, and more from GPT-4, with ChatGPT acting as a teaching assistant to guide the process; mining large-scale, diverse imitation data through sampling and selection further enhances the progressive learning effect.
In the experimental evaluation, Orca outperformed other SOTA instruction-tuned models, achieving double the performance of Vicuna-13B on complex zero-shot reasoning benchmarks such as BigBench Hard (BBH), and a 42% improvement on AGIEval.
Additionally, Orca achieved performance on par with ChatGPT on the BBH benchmark, and on professional and academic exams such as the SAT, LSAT, GRE, and GMAT the gap is only 4%, all measured in a zero-shot setting without chain-of-thought prompting.
The findings show that letting models learn from step-by-step explanations, whether those explanations are generated by humans or by more advanced AI models, is a promising research direction for improving model capabilities and skills.
Explanation Tuning
Dataset construction
In the training data, each instance consists of three parts: a system message, a user query, and an LFM reply.
The system message, placed at the beginning of the prompt, provides the LFM with basic context, guidance, and other relevant details.
System messages can be used to change the length of responses, describe the personality of the AI assistant, establish acceptable and unacceptable LFM behavior, and determine the AI model's response format.
The researchers hand-crafted 16 system messages to elicit different types of LFM responses, covering creative content generation and information-query problems; most importantly, they elicit explained, step-by-step reasoned answers based on the prompt.
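A minimal sketch of how one such training instance might be assembled. The prompt template, field names, and system message text here are illustrative assumptions, not taken from the paper:

```python
# Sketch: assembling one Explanation Tuning training instance.
# The system message below is illustrative; the paper hand-crafts 16 of them.
def build_prompt(system_message: str, user_query: str) -> str:
    """Concatenate the system message and user query into a single prompt."""
    return (
        f"### System:\n{system_message}\n\n"
        f"### User:\n{user_query}\n\n"
        f"### Response:\n"
    )

instance = {
    "system_message": "You are a helpful assistant. Think step by step and justify your answer.",
    "user_query": "If a train travels 60 km in 45 minutes, what is its average speed in km/h?",
}
prompt = build_prompt(instance["system_message"], instance["user_query"])
# The LFM (ChatGPT or GPT-4) reply to this prompt becomes the training target.
```

The reply collected from the teacher model completes the (system message, user query, LFM reply) triple described above.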
The user query defines the actual task the LFM is asked to perform.
To obtain a large and diverse set of user queries, the researchers sampled 5 million user queries from the FLAN-v2 collection and collected ChatGPT responses for them (FLAN-5M); they then further sampled 1 million instructions from those 5 million (FLAN-1M) and collected GPT-4 responses.
The FLAN-v2 collection consists of five sub-collections: CoT, NiV2, T0, Flan 2021, and Dialogue. Each sub-collection contains multiple tasks, and each task is a collection of queries.
Each sub-collection is drawn from multiple academic datasets, and each dataset yields one or more tasks, focusing mainly on zero-shot and few-shot queries.
In this work, the researchers sampled only zero-shot queries for training Orca, and did not sample from the Dialogue sub-collection because those queries often lack the context needed to elicit useful replies from ChatGPT.
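The two-stage sampling can be sketched as follows. The toy query lists and sample sizes are placeholders (the real sub-collections hold millions of queries, sampled down to 5M and then 1M):

```python
import random

# Sketch: two-stage sampling of user queries from FLAN-v2 (toy data).
flan_v2 = {
    "CoT":      ["q1", "q2", "q3", "q4"],
    "NiV2":     ["q5", "q6", "q7", "q8"],
    "T0":       ["q9", "q10", "q11", "q12"],
    "Flan2021": ["q13", "q14", "q15", "q16"],
    "Dialogue": ["q17", "q18"],  # excluded: lacks context for useful replies
}

def sample_queries(collection, n, exclude=("Dialogue",), seed=0):
    """Pool queries from the non-excluded sub-collections, then sample n."""
    pool = [q for name, qs in collection.items() if name not in exclude for q in qs]
    return random.Random(seed).sample(pool, min(n, len(pool)))

# Stage 1: a large sample answered by ChatGPT (FLAN-5M in the paper).
flan_5m = sample_queries(flan_v2, 12)
# Stage 2: a subset of stage 1 answered by GPT-4 (FLAN-1M in the paper).
flan_1m = random.Random(1).sample(flan_5m, 4)
```

Because stage 2 samples from stage 1's output, every GPT-4-answered query also has a ChatGPT answer, which is what makes the teacher-assistant curriculum in the next section possible.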
Let ChatGPT Act as a Teaching Assistant
Orca is first trained on the FLAN-5M data (ChatGPT augmentations), followed by a second stage of training on FLAN-1M (GPT-4 augmentations).
There are two main reasons for using ChatGPT as an intermediate teaching assistant:
1. Capability gap
Although the parameter count of GPT-4 has not been disclosed, Orca's 13 billion parameters are certainly many times fewer than GPT-4's. The capability gap between ChatGPT and Orca is smaller, making ChatGPT better suited as an intermediate teacher, and this approach has been shown to improve the imitation-learning performance of smaller student models in knowledge distillation.
This approach can also be seen as a form of progressive or curriculum learning, in which the student first learns from easier examples before moving on to harder ones: assuming that longer responses are more difficult to imitate than shorter ones, Orca can gradually acquire the reasoning and step-by-step explanation skills of the larger teacher model.
2. Cost and time
Large-scale data collection via the Azure OpenAI API is subject to several restrictions: a per-minute request rate limit to prevent excessive traffic; a cap on available tokens per minute due to service latency; and the monetary cost of prompt and completion tokens.
By comparison, the ChatGPT API is faster and cheaper than the GPT-4 endpoint, so the researchers collected 5 times more data from ChatGPT than from GPT-4.
From the distribution of reply lengths for ChatGPT and GPT-4 across different system messages, it can be observed that GPT-4's replies are on average 1.5x longer than ChatGPT's. This lets Orca progressively learn from increasingly complex teacher explanations, and ablation experiments demonstrate the impact of the teacher's assistance.
Training
In the tokenization stage, the researchers used LLaMA's byte-pair-encoding (BPE) tokenizer to process input samples; multi-digit numbers are split into individual digits, and unknown UTF-8 characters fall back to byte-level decomposition.
To handle variable-length sequences, a padding token [[PAD]] was added to the LLaMA tokenizer's vocabulary; the final vocabulary contains 32,001 tokens.
To optimize the training process and make effective use of available computing resources, the researchers used a packing technique that concatenates multiple input instances into a single sequence before training the model.
During packing, the total length of each concatenated sequence does not exceed max_len = 2048 tokens: input samples are randomly shuffled and partitioned into groups such that each group's concatenated length is at most max_len.
Given the length distribution of the augmented instructions in the training data, the packing factor is 2.7 examples per sequence.
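The packing step can be sketched with a simple greedy grouping over toy token lists. This is an illustrative reconstruction, not the paper's implementation, and it assumes every individual example fits within max_len:

```python
import random

MAX_LEN = 2048  # maximum tokens per packed sequence (as in the paper)

def pack_examples(token_lists, max_len=MAX_LEN, seed=0):
    """Shuffle tokenized examples, then greedily concatenate them into
    groups whose total length never exceeds max_len."""
    examples = list(token_lists)
    random.Random(seed).shuffle(examples)
    packed, current = [], []
    for toks in examples:
        if current and len(current) + len(toks) > max_len:
            packed.append(current)  # current group is full; start a new one
            current = []
        current = current + list(toks)
    if current:
        packed.append(current)
    return packed

# Toy token lists standing in for tokenized (prompt, reply) instances.
examples = [[1] * n for n in (900, 700, 1200, 400, 600)]
sequences = pack_examples(examples)
# Packing factor = examples per packed sequence (2.7 on the real data).
packing_factor = len(examples) / len(sequences)
```

Real packing implementations also track example boundaries (via attention masks or position resets) so that tokens from different instances do not attend to each other; that bookkeeping is omitted here.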
To train Orca, the researchers computed the loss only on the tokens generated by the teacher model; that is, the model learns to generate responses conditioned on the system message and task instructions. This ensures the model focuses on learning from the most relevant and informative tokens, improving the overall efficiency and effectiveness of training.
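This loss masking is commonly implemented by labeling prompt positions with an ignore index so they contribute nothing to the cross-entropy loss (-100 is the conventional ignore index in common implementations such as PyTorch's). A minimal sketch with made-up token ids:

```python
IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy losses

def make_labels(prompt_ids, response_ids):
    """Labels for next-token loss: mask the prompt, keep the teacher reply."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)

# Toy token ids (illustrative, not real tokenizer output).
prompt_ids = [5, 17, 42, 8]   # system message + user query
response_ids = [99, 23, 7]    # the teacher (LFM) reply
labels = make_labels(prompt_ids, response_ids)
# labels == [-100, -100, -100, -100, 99, 23, 7]
```

Only the last three positions carry real labels, so gradient updates come exclusively from imitating the teacher's reply.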
Finally, Orca was trained on 20 NVIDIA A100 GPUs with 80GB of memory each: first on FLAN-5M (ChatGPT augmentations) for 4 epochs, which took 160 hours, then on FLAN-1M (GPT-4 augmentations) for another 4 epochs.
Due to rate limits, endpoint load, and reply-length issues, collecting data from multiple GPT-3.5-turbo (ChatGPT) and GPT-4 endpoints took 2 and 3 weeks, respectively.
Experiments
The researchers mainly evaluated Orca's reasoning capabilities.
In the AGIEval experiments, Orca's performance is on par with Text-da-Vinci-003, reaching 88% of ChatGPT's performance, but significantly behind GPT-4.
On analysis and reasoning tasks, Vicuna performed significantly worse, retaining only 62% of ChatGPT's quality, indicating that this open-source language model's reasoning ability is very weak.
While Orca performs on par with Text-da-Vinci-003, it is still 5 points below ChatGPT; on math-related tasks (in the SAT, GRE, and GMAT), there is a large gap between Orca and ChatGPT.
Compared to Vicuna, Orca shows stronger performance, outperforming it in every category with an average relative improvement of 42%.
GPT-4 far outperforms all other models, but there is still significant room for improvement on this benchmark, as all models currently score well below humans.
Orca's performance varies greatly depending on the type of system message; for the trained model, an empty system message tends to work well.
Orca outperforms ChatGPT on 325 samples across different tasks, most of which come from LogiQA (29%), while the other LSAT tasks and the SAT-English task each account for less than 10%.
Reasoning-evaluation results on the Big-Bench Hard (BBH) dataset show that Orca's overall performance across all tasks is slightly better than ChatGPT but significantly behind GPT-4, and 113% higher than Vicuna's.
The above is the detailed content of "Is 'imitation learning' just a cliché? Explanation fine-tuning + a 13-billion-parameter Orca: reasoning ability on par with ChatGPT", from the PHP Chinese website.
