The Xiaohongshu search team reveals: verifying the importance of negative samples in large model distillation
Large language models (LLMs) perform well on reasoning tasks, but their black-box nature and enormous parameter counts limit their practical application. In particular, when tackling complex mathematical problems, LLMs sometimes produce faulty reasoning chains. Traditional research transfers knowledge only from positive samples, ignoring the valuable information carried by the wrong-answer portion of synthetic data. To improve the performance and reliability of LLMs, synthetic data should therefore be considered and exploited more comprehensively, not just its positive samples, so that models can better understand and reason about complex problems. This will help address the practical challenges of LLMs and promote their wider application.
At AAAI 2024, the Xiaohongshu search algorithm team proposed an innovative framework that makes full use of negative-sample knowledge when distilling the reasoning capabilities of large models. Negative samples, i.e., data that fail to produce the correct answer during inference, are often regarded as useless, yet in fact they contain valuable information.
The paper proposes and verifies the value of negative samples in large-model distillation and builds a model-specialization framework: in addition to the usual positive samples, negative samples are also fully utilized to distill the knowledge of the LLM. The framework consists of three sequential steps, Negative Assisted Training (NAT), Negative Calibration Enhancement (NCE), and Adaptive Self-Consistency (ASC), covering the entire process from training to inference. Through an extensive series of experiments, we demonstrate the critical role of negative data in LLM knowledge distillation.
Guided by chain-of-thought (CoT) prompting, large language models (LLMs) have demonstrated powerful reasoning capabilities. However, prior work has shown that this emergent capability only appears in models with hundreds of billions of parameters. Since such models require huge computing resources and incur high inference costs, they are difficult to apply under resource constraints. Our research goal is therefore to develop small models capable of complex arithmetic reasoning that can be deployed at scale in real-world applications.
Knowledge distillation provides an efficient way to transfer specific capabilities of LLMs into smaller models. This process, also known as model specialization, forces small models to focus on certain capabilities. Previous research uses the in-context learning (ICL) ability of LLMs to generate reasoning paths for mathematical problems and uses them as training data, which helps small models acquire complex reasoning capabilities. However, these studies only use the generated reasoning paths with correct answers (i.e., positive samples) as training samples, ignoring the valuable knowledge in the reasoning steps that lead to wrong answers (i.e., negative samples). Researchers have therefore begun to explore how to exploit the reasoning steps in negative samples to improve small models. One approach is adversarial training, in which a generator model produces reasoning paths for wrong answers, and these paths are then used together with the positive samples to train the small model, allowing it to learn from the erroneous reasoning steps. Another approach is self-supervised learning, in which the small model compares correct and incorrect answers, learns to distinguish between them, and extracts useful information from the contrast. In short, exploiting the reasoning steps in negative samples gives small models more comprehensive training and stronger reasoning ability.
[Table 1: overlap of correctly answered MATH questions between models trained on positive data and on negative data]
As shown above, Table 1 reveals an interesting phenomenon: for models trained separately on positive and negative data, the overlap of correctly answered questions on the MATH test set is very small. Although the model trained on negative samples has lower overall accuracy, it can solve some questions that the positive-sample model cannot, which confirms that negative samples contain valuable knowledge. In addition, the erroneous steps in negative samples can help the model avoid making similar mistakes. Another reason to exploit negative samples is OpenAI's token-based pricing strategy: even GPT-4 achieves less than 50% accuracy on the MATH dataset, which means a large number of tokens are wasted if only positive-sample knowledge is used. Therefore, instead of discarding negative samples, a better approach is to extract and utilize the valuable knowledge they contain to enhance the specialization of small models.
The model specialization process can generally be summarized into three steps:
1) Chain-of-Thought Distillation: small models are trained on reasoning chains generated by LLMs.
2) Self-Enhancement: the model is further optimized via self-distillation or data self-augmentation.
3) Self-Consistency: widely used as an effective decoding strategy to improve model performance on reasoning tasks.
In this work, we propose a new model specialization framework that can fully exploit negative samples and facilitate the extraction of complex inference capabilities from LLMs.
2.1 Negative Assisted Training (NAT)
NAT consists of two components: negative knowledge absorption and a dynamic integration unit.
2.1.1 Negative Knowledge Absorption
By maximizing the following expectation over the negative data, the knowledge in the negative samples is absorbed into a negative LoRA module; during this process, the parameters of LLaMA remain frozen.
[Formula: log-likelihood objective over the negative data]
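Written out, the objective plausibly takes the following form; this is a reconstruction for illustration, and the notation (θ_neg for the negative LoRA, D_neg for the negative data) is ours rather than the paper's:

```latex
\max_{\theta_{\mathrm{neg}}}\;
\mathbb{E}_{(q,\,\hat{r},\,\hat{y})\sim D_{\mathrm{neg}}}
\left[\log P\!\left(\hat{r},\hat{y}\mid q;\ \theta_{\mathrm{neg}}\right)\right]
```

Here q is the question and (r̂, ŷ) is a sampled reasoning chain and answer from the negative data; only the LoRA parameters receive gradients.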
2.1.2 Dynamic Integration Unit
Since it is impossible to determine in advance which mathematical problems the negative LoRA module is good at, we design a dynamic integration unit, shown in the figure below, so that knowledge from the negative LoRA is integrated dynamically while the positive-sample knowledge is being learned:
[Figure: dynamic integration unit]
We freeze the negative LoRA module to prevent its internal knowledge from being forgotten, and additionally introduce a positive LoRA module. Ideally, the positive and negative LoRA modules (whose outputs in each LLaMA layer we refer to as the positive and negative LoRA outputs) should be integrated in the positive direction, to supplement beneficial knowledge that is missing from the positive samples but present in the negative LoRA. When the negative LoRA contains harmful knowledge, the positive and negative LoRA modules should instead be integrated in the negative direction, to help suppress possible bad behaviors in the positive samples.
We propose a corrective attention mechanism to achieve this goal, as follows:
[Formulas: corrective attention and the integrated layer output]
Taking the layer's hidden state as the query, we compute attention weights over the positive and negative LoRA outputs. By adding the correction term [0.5; -0.5], the attention weight of the negative LoRA output is restricted to the range [-0.5, 0.5], so that knowledge from the negative LoRA can be adaptively integrated in either the positive or the negative direction. Finally, the sum of the weighted LoRA outputs and the LLaMA layer output forms the output of the dynamic integration unit.
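To make the mechanism concrete, here is a minimal PyTorch sketch of a dynamic integration unit with corrective attention. The module structure, the use of dot-product attention with linear projections, and the names h_base, h_pos, h_neg are illustrative assumptions; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicIntegrationUnit(nn.Module):
    """Sketch of the corrective-attention integration described above."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Projections for the query (base hidden state) and keys (LoRA outputs).
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        # Correction term: shifts the softmax weights so that the positive LoRA
        # weight lies in [0.5, 1.5] and the negative LoRA weight in [-0.5, 0.5].
        self.register_buffer("correction", torch.tensor([0.5, -0.5]))

    def forward(self, h_base, h_pos, h_neg):
        # h_base: frozen LLaMA layer output; h_pos / h_neg: positive / negative LoRA outputs.
        q = self.q_proj(h_base)                                   # (..., d)
        keys = torch.stack([self.k_proj(h_pos),
                            self.k_proj(h_neg)], dim=-2)          # (..., 2, d)
        scores = (keys @ q.unsqueeze(-1)).squeeze(-1)             # (..., 2)
        weights = F.softmax(scores, dim=-1) + self.correction     # (..., 2)
        # Weighted combination of the LoRA outputs, added to the base output.
        mixed = weights[..., :1] * h_pos + weights[..., 1:] * h_neg
        return h_base + mixed
```

The shifted weight lets the negative branch contribute with either sign, which is the adaptive positive/negative integration described in the text above.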
2.2 Negative Calibration Enhancement (NCE)
To further enhance the reasoning ability of the model, we propose Negative Calibration Enhancement (NCE), which uses negative knowledge to aid the self-enhancement process. We first use NAT to generate (r̂, ŷ) pairs as augmentation samples for each question in the training set and add them to the training data. For the self-distillation step, we note that some samples contain more critical reasoning steps, which are crucial for improving the model's reasoning capability; our main goal is to identify these critical samples and strengthen their learning during self-distillation.
Since NAT already absorbs the useful knowledge of the negative LoRA, the factors that make NAT reason better than the negative LoRA are implicit in the reasoning steps where the two models disagree. Therefore, we use the KL divergence to measure this inconsistency and maximize the expectation of the following formula:
[Formulas: KL-divergence inconsistency measure and the sample weight β]
The larger the value of β, the greater the difference between the two models, which means the sample contains more critical knowledge. By introducing β to adjust the loss weight of different samples, NCE can selectively learn and reinforce the knowledge embedded in NAT.
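A rough sketch of how such a β-weighted self-distillation loss could be computed, assuming per-token output distributions from the NAT model and the negative LoRA model; the reduction choices, the temperature, and the exact way β scales the loss are illustrative assumptions rather than the paper's implementation:

```python
import torch.nn.functional as F

def nce_weighted_loss(nat_logits, neg_logits, student_logits, labels, tau=1.0):
    """beta-weighted self-distillation step sketched from the description above.

    nat_logits / neg_logits / student_logits: (batch, seq, vocab); labels: (batch, seq).
    """
    nat_p = F.softmax(nat_logits / tau, dim=-1)
    neg_logp = F.log_softmax(neg_logits / tau, dim=-1)
    # beta: per-sample inconsistency between the NAT model and the negative LoRA model.
    beta = F.kl_div(neg_logp, nat_p, reduction="none").sum(-1).mean(-1)   # (batch,)
    # Token-level cross-entropy of the student against the training targets.
    ce = F.cross_entropy(
        student_logits.transpose(1, 2), labels, reduction="none"
    ).mean(-1)                                                            # (batch,)
    # Samples with larger beta (more critical knowledge) receive larger weight.
    return (beta.detach() * ce).mean()
```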
2.3 Adaptive Self-Consistency (ASC)
Self-consistency (SC) is effective in further improving model performance on complex reasoning. However, current methods either assign equal weight to every candidate or simply weight candidates by their generation probability. These strategies cannot adjust the candidate weights according to the quality of (r̂, ŷ) during the voting stage, which may make it hard for the correct candidate to be selected. To this end, we propose adaptive self-consistency (ASC), which uses positive and negative data to train a ranking model that can adaptively reweight the candidate reasoning chains.
2.3.1 Ranking model training
Ideally, we want the ranking model to assign a higher weight to reasoning chains that reach the correct answer, and a lower weight to those that do not. Therefore, we construct the training samples in the following way:
[Formula: construction of the ranking-model training samples]
and use an MSE loss to train the ranking model:
[Formula: MSE training loss of the ranking model]
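One plausible instantiation, sketched for illustration only: score each sampled chain with a ranking model s_φ and regress the score onto a 0/1 target indicating whether the chain's final answer matches the ground truth. The score function, target construction, and expectation below are assumptions, not the paper's exact formulas:

```latex
\mathcal{L}_{\mathrm{rank}}(\phi)=
\mathbb{E}_{(q,\,\hat{r},\,\hat{y})}
\left[\left(s_\phi(q,\hat{r})-\mathbb{1}[\hat{y}=y]\right)^{2}\right]
```

where y is the ground-truth answer to question q.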
2.3.2 Weighted Strategy
We modify the voting strategy to the following formula, so that the candidate reasoning chains are adaptively reweighted:
[Formula: ranking-weighted voting]
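As a sketch of how the reweighted vote could be implemented (the candidates list, the rank_model.score interface, and per-answer score summation are illustrative assumptions):

```python
from collections import defaultdict

def asc_vote(candidates, rank_model):
    """Ranking-weighted answer aggregation (ASC) as described above.

    candidates: list of (reasoning_chain, answer) samples from the student model.
    rank_model: assumed to expose score(chain) -> float, higher for better chains.
    """
    votes = defaultdict(float)
    for chain, answer in candidates:
        votes[answer] += rank_model.score(chain)   # higher score => larger vote
    return max(votes, key=votes.get)
```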
The figure below shows the process of the ASC strategy:
[Figure: ASC strategy pipeline]
From the perspective of knowledge transfer, ASC further exploits the knowledge, both positive and negative, contained in the data generated by LLMs, helping small models achieve better performance.
This study focuses on the challenging mathematical reasoning dataset MATH, which has a total of 12,500 questions involving seven different subjects. In addition, we introduce the following four datasets to evaluate the generalization ability of the proposed framework to out-of-distribution (OOD) data: GSM8K, ASDiv, MultiArith, and SVAMP.
For the teacher model, we use OpenAI's gpt-3.5-turbo and gpt-4 APIs to generate reasoning chains. For the student model, we choose LLaMA-7b.
Our study involves two main types of baselines: large language models (LLMs) and methods based on LLaMA-7b. For LLMs, we compare with two popular models, GPT-3 and PaLM. For LLaMA-7b, we first compare our method against three settings: Few-shot, Fine-tune (on the original training samples), and CoT KD (chain-of-thought distillation). For learning from the negative perspective, four baseline methods are also included: MIX (training LLaMA directly on a mixture of positive and negative data), CL (contrastive learning), NT (negative training), and UL (unlikelihood loss).
All methods use greedy search (i.e., temperature = 0). The NAT experimental results are shown in the figure below; they show that the proposed NAT method improves task accuracy over all baselines.
As can be seen from the low scores of GPT-3 and PaLM, MATH is a very difficult mathematical dataset, yet NAT still performs well with very few trainable parameters. Compared with fine-tuning on the raw data, NAT achieves approximately a 75.75% improvement under the two different CoT sources. NAT also significantly improves accuracy compared with CoT KD on positive samples, demonstrating the value of negative samples.
Among the baselines that utilize negative information, the low performance of MIX indicates that training directly on negative samples makes the model perform poorly. The other methods are also mostly inferior to NAT, which shows that using negative samples only in the negative direction is insufficient for complex reasoning tasks.
[Figure: NAT results on MATH]
As shown in the figure, compared with knowledge distillation (KD), NCE achieves an average improvement of 10% (0.66), which demonstrates the effectiveness of distilling with the calibration information provided by negative samples. Compared with NAT, although NCE removes some parameters, it still achieves a 6.5% improvement, compressing the model while improving performance.
[Figure: NCE results]
To evaluate ASC, we compare it with basic SC and weighted (WS) SC, using sampling temperature T = 1 to generate 16 samples. As shown in the figure, the results indicate that ASC is a more promising strategy for aggregating answers from different samples.
[Figure: ASC results]
Beyond the MATH dataset, we also evaluated the generalization ability of the framework on other mathematical reasoning tasks; the experimental results are as follows.
[Figure: generalization results on out-of-distribution datasets]
This work explores the effectiveness of using negative samples to distill complex reasoning capabilities from large language models and transfer them into specialized small models. The Xiaohongshu search algorithm team proposed a brand-new framework consisting of three sequential steps that fully utilizes negative information throughout the entire process of model specialization. Negative Assisted Training (NAT) provides a more comprehensive way of exploiting negative information from two perspectives. Negative Calibration Enhancement (NCE) calibrates the self-distillation process so that key knowledge is mastered in a more targeted manner. A ranking model trained from both perspectives assigns more appropriate weights during answer aggregation, achieving adaptive self-consistency (ASC). Extensive experiments show that the framework improves the effectiveness of distilling reasoning capabilities by exploiting the generated negative samples.
Paper address: https://www.php.cn/link/8fa2a95ee83cd1633cfd64f78e856bd3