
The Xiaohongshu search team reveals: the importance of verifying negative samples in large-scale model distillation

WBOY (reposted)
2024-01-17 11:51:05 · 1426 views


Large language models (LLMs) perform well on reasoning tasks, but their black-box nature and enormous parameter counts limit their practical use. On complex mathematical problems in particular, LLMs sometimes produce faulty reasoning chains. Conventional distillation methods transfer knowledge only from positive samples and ignore the synthetic data whose answers are wrong, even though that data also carries important information. Using synthetic data more fully, negative samples included, can help distilled models understand and reason about complex problems more reliably, which in turn makes them easier to deploy in practice.

At AAAI 2024, the Xiaohongshu search algorithm team proposed an innovative framework that makes full use of negative-sample knowledge when distilling the reasoning capabilities of large models. Negative samples, that is, generated data whose reasoning fails to reach the correct answer, are often regarded as useless, but they in fact contain valuable information.

The paper proposes and verifies the value of negative samples in large-model distillation and builds a model specialization framework that uses negative samples, in addition to positive ones, to distill LLM knowledge. The framework consists of three sequential steps, Negative Assisted Training (NAT), Negative Calibration Enhancement (NCE), and Adaptive Self-Consistency (ASC), covering the entire process from training to inference. Through an extensive series of experiments, we demonstrate the critical role of negative data in LLM knowledge distillation.

1. Background

Guided by Chain-of-Thought (CoT) prompting, large language models (LLMs) have demonstrated powerful reasoning capabilities. However, it has been shown that this emergent capability appears only in models with hundreds of billions of parameters. Because these models require huge computing resources and incur high inference costs, they are difficult to apply under resource constraints. Our research goal is therefore to develop small models capable of complex arithmetic reasoning that can be deployed at scale in real-world applications.

Knowledge distillation provides an efficient way to transfer specific capabilities of LLMs into smaller models. This process, also known as model specialization, forces small models to focus on certain capabilities. Previous research uses the in-context learning (ICL) ability of LLMs to generate reasoning paths for mathematical problems and uses them as training data, which helps small models acquire complex reasoning capabilities. However, these studies used only the generated reasoning paths with correct answers (positive samples) as training data and ignored the valuable knowledge in reasoning steps that lead to wrong answers (negative samples).

Researchers have therefore begun to explore how to use the reasoning steps in negative samples to improve small models. One approach is adversarial training, in which a generator model produces reasoning paths for wrong answers and these paths are used together with positive examples to train a small model, so that the small model learns from the erroneous reasoning steps. Another approach is self-supervised learning, in which a small model compares correct and incorrect answers, learns to distinguish between them, and extracts useful information from the contrast. In short, using the reasoning steps in negative samples gives small models more comprehensive training and improves their reasoning ability.

[Table 1: overlap of correctly answered MATH test questions between models trained on positive samples and on negative samples]

Table 1 shows an interesting phenomenon: for models trained separately on positive and on negative samples, the sets of correctly answered questions on the MATH test set overlap very little. Although the model trained on negative samples has lower accuracy, it solves some questions that the positive-sample model cannot answer correctly, which confirms that negative samples contain valuable knowledge. In addition, the erroneous reasoning steps in negative samples can help the model avoid making similar mistakes. Another reason to exploit negative samples is OpenAI's token-based pricing. Even GPT-4's accuracy on the MATH dataset is below 50%, which means that a large share of the purchased tokens is wasted if only positive-sample knowledge is used. We therefore propose that, instead of discarding negative samples, a better approach is to extract and use the valuable knowledge in them to strengthen the specialization of small models.
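The positive/negative split that this argument rests on can be illustrated with a minimal sketch. It assumes each teacher generation is stored as a record holding a question id, the generated rationale, the predicted answer and the gold answer; the `TeacherSample` type, its field names and the exact-match correctness check are hypothetical choices for illustration, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeacherSample:
    """One teacher generation (hypothetical record layout)."""
    question_id: str
    rationale: str
    predicted_answer: str
    gold_answer: str


def split_by_correctness(samples):
    """Partition generations into positive (correct) and negative (wrong) sets."""
    positives, negatives = [], []
    for sample in samples:
        # Assumption: an answer is correct when it matches the gold string exactly.
        is_correct = sample.predicted_answer.strip() == sample.gold_answer.strip()
        (positives if is_correct else negatives).append(sample)
    return positives, negatives


def wasted_fraction(samples):
    """Share of generations thrown away when only positive samples are kept."""
    if not samples:
        return 0.0
    _, negatives = split_by_correctness(samples)
    return len(negatives) / len(samples)


samples = [
    TeacherSample("q1", "2 + 2 = 4", "4", "4"),
    TeacherSample("q1", "2 + 2 = 5", "5", "4"),
    TeacherSample("q2", "3 * 3 = 6", "6", "9"),
    TeacherSample("q2", "3 * 3 = 9", "9", "9"),
]
positives, negatives = split_by_correctness(samples)
print(len(positives), len(negatives), wasted_fraction(samples))
```

With a teacher that is right less than half the time, `wasted_fraction` exceeds 0.5, which is the cost argument made above in concrete form.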

The model specialization process can generally be summarized into three steps:

1) Chain-of-Thought Distillation: small models are trained on reasoning chains generated by LLMs.

2) Self-Enhancement: self-distillation or self-augmentation of the data further optimizes the model.

3) Self-Consistency: widely used as an effective decoding strategy to improve model performance on reasoning tasks.

In this work, we propose a new model specialization framework that fully exploits negative samples and facilitates the distillation of complex reasoning capabilities from LLMs.

  • We first designed the Negative Assisted Training (NAT) method, in which a dual-LoRA structure acquires knowledge from both the positive and the negative side. As an auxiliary module, the negative LoRA's knowledge is dynamically integrated into the training of the positive LoRA through a corrective attention mechanism.
  • For self-enhancement, we designed Negative Calibration Enhancement (NCE), which takes the negative output as a baseline to strengthen the distillation of key positive reasoning steps.
  • Beyond the training phase, we also use negative information at inference time. Traditional self-consistency methods assign equal or probability-based weights to all candidate outputs, so some unreliable answers receive votes. To alleviate this problem, the Adaptive Self-Consistency (ASC) method ranks candidates before voting, using a ranking model trained on positive and negative samples.

2. Method

The framework we propose is based on LLaMA and mainly consists of three parts, as shown in the figure:

  • Step 1: Train the negative LoRA and, through the integration unit, use it to help learn the reasoning knowledge in positive samples;
  • Step 2: Use the negative LoRA as a baseline to calibrate the self-enhancement process;
  • Step 3: Train the ranking model on positive and negative samples, and adaptively weight candidate reasoning chains by their scores at inference time.

[Figure: overview of the proposed framework]

2.1 Negative Assisted Training (NAT)

We propose a two-stage Negative Assisted Training (NAT) paradigm, which is divided into two parts: Negative Knowledge Absorption and the Dynamic Integration Unit.

2.1.1 Negative knowledge absorption

By maximizing the following expectation over the negative data, the knowledge of the negative samples is absorbed by the negative LoRA θ_neg. During this process, the parameters of LLaMA remain frozen.

[Equation: training objective of the negative LoRA θ_neg]
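A plausible written-out form of this objective is given below. It assumes the standard maximum-likelihood objective used for chain-of-thought fine-tuning, with $x$ the question, $\hat{r}$ the generated rationale, $\hat{y}$ the generated answer and $\mathcal{D}_{neg}$ the set of negative samples; all of this notation is an assumption made here for illustration.

```latex
\max_{\theta_{neg}} \;
\mathbb{E}_{(x,\hat{r},\hat{y}) \sim \mathcal{D}_{neg}}
\Big[ \log P\big(\hat{y}, \hat{r} \mid x;\; \theta_{neg}\big) \Big]
```

Only $\theta_{neg}$ would receive gradients under this reading, since the LLaMA weights stay frozen.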

2.1.2 Dynamic Integration Unit

Since it is impossible to determine in advance which mathematical problems θ_neg is good at, we designed a dynamic integration unit, shown in the figure below, that dynamically integrates knowledge from θ_neg while the model learns from positive samples:

[Figure: dynamic integration unit]

We freeze θ_neg to prevent its internal knowledge from being forgotten, and additionally introduce the positive LoRA module θ_pos. Ideally, we should positively integrate the positive and negative LoRA modules (whose outputs in each LLaMA layer we write as h_pos and h_neg) to supplement beneficial knowledge that is lacking in the positive samples but available from the negative ones. When θ_neg contains harmful knowledge, we should instead negatively integrate the two modules to help reduce possible bad behaviors learned from positive samples.

We propose a corrective attention mechanism to achieve this goal, as follows:

[Equations: corrective attention weights and the output of the dynamic integration unit]

We use the layer's input hidden state as the query to calculate the attention weights of h_pos and h_neg. By adding the correction term [0.5; -0.5], the attention weight of h_neg is limited to the range [-0.5, 0.5], which lets the unit adaptively integrate knowledge from θ_neg in both the positive and the negative direction. Finally, the sum of the attention output and the LLaMA layer output forms the output of the dynamic integration unit.
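The effect of the correction term can be seen in a minimal sketch. It assumes a two-way softmax over the positive and negative LoRA outputs and uses the raw vectors as queries, keys and values; a real unit would apply learned projections first. The function names and the scaled dot-product scoring are illustrative assumptions, not the paper's exact formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def corrective_attention(h_input, h_pos, h_neg):
    """Blend positive and negative LoRA outputs with corrected attention weights.

    Assumption: no learned projections; scores are scaled dot products.
    """
    scale = math.sqrt(len(h_input))
    scores = [dot(h_input, h_pos) / scale, dot(h_input, h_neg) / scale]
    w_pos, w_neg = softmax(scores)
    # Correction term [0.5; -0.5]: the positive weight moves into (0.5, 1.5)
    # and the negative weight into (-0.5, 0.5), so it can change sign.
    a_pos, a_neg = w_pos + 0.5, w_neg - 0.5
    mixed = [a_pos * p + a_neg * n for p, n in zip(h_pos, h_neg)]
    return (a_pos, a_neg), mixed


def integration_unit_output(h_llama, h_input, h_pos, h_neg):
    """Add the attention-mixed LoRA output to the frozen LLaMA layer output."""
    _, mixed = corrective_attention(h_input, h_pos, h_neg)
    return [base + m for base, m in zip(h_llama, mixed)]


h_input = [1.0, 0.0]
h_pos = [0.5, 0.5]
h_neg = [-2.0, 1.0]
(a_pos, a_neg), mixed = corrective_attention(h_input, h_pos, h_neg)
output = integration_unit_output([0.0, 0.0], h_input, h_pos, h_neg)
print(a_pos, a_neg)
```

In this toy input the negative output disagrees with the query, so its corrected weight comes out negative and the unit subtracts part of h_neg rather than adding it.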

2.2 Negative Calibration Enhancement (NCE)

To further strengthen the model's reasoning ability, we propose Negative Calibration Enhancement (NCE), which uses negative knowledge to aid the self-enhancement process. We first use NAT to generate (r̂, ŷ) pairs as augmentation samples for each question in the training set and add them to the training data. For the self-distillation part, we note that some samples contain more critical reasoning steps, which matter most for improving the model's reasoning ability. Our main goal is to identify these critical reasoning steps and strengthen their learning during self-distillation.

Considering that NAT already contains the useful knowledge of θ_neg, the factors that make NAT a stronger reasoner than θ_neg are implicit in the reasoning steps on which the two disagree. We therefore use KL divergence to measure this inconsistency and maximize the expectation of the following formula:

[Equations: KL-based inconsistency β between NAT and θ_neg, and the β-weighted self-distillation objective]

The larger the value of β, the greater the difference between the two, which means that the sample contains more critical knowledge. By introducing β to adjust the loss weight of different samples, NCE can selectively learn and reinforce the knowledge embedded in NAT.
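A minimal sketch of β-weighted self-distillation follows. It assumes that per-token output distributions are available from both the negative model and NAT, that β is the mean per-token KL divergence between them, and that a `tanh` squashing keeps the weights bounded. The KL direction, the averaging and the squashing are all assumptions made for illustration, and the objective is written as a loss to minimize.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def sample_beta(neg_token_dists, nat_token_dists):
    """Mean per-token KL between the negative model and NAT on one sample."""
    kls = [kl_divergence(p, q) for p, q in zip(neg_token_dists, nat_token_dists)]
    return sum(kls) / len(kls)


def nce_weighted_loss(betas, distill_losses):
    """Weight each sample's self-distillation loss by a bounded function of beta.

    Assumption: tanh is used only to keep the weights in [0, 1).
    """
    weights = [math.tanh(b) for b in betas]
    return sum(w * l for w, l in zip(weights, distill_losses)) / len(betas)


# Sample A: both models agree, so it carries no extra signal.
beta_a = sample_beta([[0.5, 0.5]], [[0.5, 0.5]])
# Sample B: the models disagree strongly on the next token.
beta_b = sample_beta([[0.9, 0.1]], [[0.1, 0.9]])
loss = nce_weighted_loss([beta_a, beta_b], [1.0, 1.0])
print(beta_a, beta_b, loss)
```

Sample A receives zero weight and sample B nearly full weight, which matches the intent that training effort concentrates where NAT and the negative model diverge.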

2.3 Adaptive Self-Consistency (ASC)

Self-consistency (SC) is effective at further improving model performance on complex reasoning. However, current methods either assign equal weights to every candidate or simply weight candidates by generation probability. These strategies cannot adjust the candidate weights according to the quality of (r̂, ŷ) during the voting stage, which may make it difficult for the correct candidate to be selected. To this end, we propose the Adaptive Self-Consistency (ASC) method, which trains a ranking model on positive and negative data and adaptively reweights candidate reasoning chains.

2.3.1 Ranking model training

Ideally, we want the ranking model to assign a higher weight to reasoning chains that reach the correct answer and a lower weight to those that do not. Therefore, we construct the training samples in the following way:

[Equation: construction of the ranking model's training samples]

and use MSE loss to train the ranking model:

[Equation: MSE loss of the ranking model]
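A minimal sketch of this training setup follows, assuming each candidate is a (rationale, answer) pair labeled 1.0 when its answer matches the gold answer and 0.0 otherwise. The 0/1 targets and the function names are illustrative assumptions, and the ranking model itself is replaced by hand-written scores.

```python
def build_ranking_examples(candidates, gold_answer):
    """Label each (rationale, answer) candidate 1.0 if correct, else 0.0."""
    return [
        ((rationale, answer), 1.0 if answer == gold_answer else 0.0)
        for rationale, answer in candidates
    ]


def mse_loss(predicted_scores, labels):
    """Mean squared error between ranking scores and their 0/1 targets."""
    return sum((p - y) ** 2 for p, y in zip(predicted_scores, labels)) / len(labels)


candidates = [("6 * 7 = 42", "42"), ("6 * 7 = 36", "36")]
examples = build_ranking_examples(candidates, gold_answer="42")
labels = [label for _, label in examples]
# Hypothetical scores a ranking model might output for the two candidates.
loss = mse_loss([0.9, 0.2], labels)
print(labels, loss)
```

Both positive and negative samples contribute to the loss here, which is how the negative data enters the ranking model.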

2.3.2 Weighting Strategy

We modify the voting strategy to the following formula to adaptively reweight candidate reasoning chains:

[Equation: score-weighted voting rule]

The figure below shows the process of the ASC strategy:

[Figure: the ASC process]

From the perspective of knowledge transfer, ASC makes further use of the knowledge (positive and negative) obtained from LLMs to help small models achieve better performance.
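The difference between plain self-consistency and score-weighted voting can be sketched as follows. The sketch assumes the ranking model returns one scalar score per candidate; the scores below are made up for illustration, and summing scores per answer is one straightforward reading of the weighting rule rather than a confirmed reproduction of it.

```python
from collections import defaultdict


def majority_vote(answers):
    """Plain self-consistency: every candidate counts equally."""
    totals = defaultdict(float)
    for answer in answers:
        totals[answer] += 1.0
    return max(totals, key=totals.get)


def adaptive_vote(answers, rank_scores):
    """Score-weighted vote: each candidate contributes its ranking score."""
    totals = defaultdict(float)
    for answer, score in zip(answers, rank_scores):
        totals[answer] += score
    return max(totals, key=totals.get)


answers = ["12", "12", "12", "15", "15"]
rank_scores = [0.1, 0.2, 0.1, 0.9, 0.8]  # hypothetical ranking-model outputs
print(majority_vote(answers))
print(adaptive_vote(answers, rank_scores))
```

Three low-scoring candidates outvote two high-scoring ones under majority voting, while the weighted vote picks the answer backed by the more reliable reasoning chains.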

3. Experiment

This study focuses on the challenging mathematical reasoning dataset MATH, which has a total of 12,500 questions involving seven different subjects. In addition, we introduce the following four datasets to evaluate the generalization ability of the proposed framework to out-of-distribution (OOD) data: GSM8K, ASDiv, MultiArith, and SVAMP.

For the teacher model, we use OpenAI's gpt-3.5-turbo and gpt-4 APIs to generate reasoning chains. For the student model, we choose LLaMA-7b.

There are two main types of baselines in our research: large language models (LLMs) and methods based on LLaMA-7b. For LLMs, we compare against two popular models, GPT3 and PaLM. For LLaMA-7b, we first compare our method with three settings: Few-shot, Fine-tune (on the original training samples), and CoT KD (Chain-of-Thought distillation). For learning from the negative perspective, four baseline methods are also included: MIX (training LLaMA directly on a mixture of positive and negative data), CL (contrastive learning), NT (negative training), and UL (unlikelihood loss).

3.1 NAT experimental results

All methods use greedy search (i.e., temperature = 0). The NAT experimental results are shown in the figure: the proposed NAT method improves task accuracy over all baselines.

As the low scores of GPT3 and PaLM show, MATH is a very difficult mathematical dataset, yet NAT still performs well with very few parameters. Compared with fine-tuning on the raw data, NAT achieves an improvement of approximately 75.75% under two different CoT sources. NAT also significantly improves accuracy compared with CoT KD on positive samples, which demonstrates the value of negative samples.

Among the baselines that use negative information, the low performance of MIX indicates that training directly on negative samples makes the model perform poorly. The other methods are also mostly inferior to NAT, which shows that using negative samples only in the negative direction is not enough for complex reasoning tasks.

[Figure: NAT experimental results]

3.2 NCE experimental results

As shown in the figure, compared with knowledge distillation (KD), NCE achieves an average improvement of 10% (0.66), which demonstrates the effectiveness of distillation using the calibration information provided by negative samples. Compared with NAT, although NCE reduces some parameters, it still shows a 6.5% improvement, achieving the goal of compressing the model while improving performance.

[Figure: NCE experimental results]

3.3 ASC Experimental Results

To evaluate ASC, we compare it with base SC and weighted SC (WS), using sampling temperature T = 1 to generate 16 samples. As shown in the figure, the results indicate that using ASC to aggregate answers from different samples is a more promising strategy.

[Figure: ASC experimental results]

3.4 Generalization Experiment Results

Besides the MATH dataset, we evaluated the generalization ability of the framework on other mathematical reasoning tasks. The experimental results are as follows.

[Figure: generalization results on other mathematical reasoning datasets]

4. Conclusion

This work explores the effectiveness of using negative samples to distill complex reasoning capabilities from large language models and transfer them to specialized small models. The Xiaohongshu search algorithm team proposed a new framework consisting of three sequential steps that makes full use of negative information throughout the model specialization process. Negative Assisted Training (NAT) provides a more comprehensive way to use negative information from two perspectives. Negative Calibration Enhancement (NCE) calibrates the self-distillation process so that it masters key knowledge in a more targeted manner. A ranking model trained on both kinds of samples assigns more appropriate weights during answer aggregation, achieving Adaptive Self-Consistency (ASC). Extensive experiments show that our framework improves the distillation of reasoning capabilities by using the generated negative samples.

Paper address: https://www.php.cn/link/8fa2a95ee83cd1633cfd64f78e856bd3

5. About the authors

  • Li Yiwei:
    Currently a Ph.D. student at Beijing Institute of Technology and a community search intern at Xiaohongshu. He has published several papers at top conferences and journals in machine learning and natural language processing, such as AAAI, ACL, EMNLP, NAACL, NeurIPS, and KBS. His main research directions include large language model distillation and reasoning, and open-domain dialogue generation.
  • Yuan Peiwen:
    Currently a Ph.D. student at Beijing Institute of Technology and a community search intern at Xiaohongshu. He has published many first-author papers at NeurIPS, AAAI, and other venues, and won second place in DSTC11 Track 4. His main research direction is large language model reasoning and evaluation.
  • Feng Shaoxiong:
    Responsible for vector recall in Xiaohongshu community search. He has published several papers at top conferences and journals in machine learning and natural language processing, such as AAAI, EMNLP, ACL, NAACL, and KBS.
  • Daoxuan (Pan Boyuan):
    Head of Xiaohongshu transaction search. He has published several first-author papers at top conferences in machine learning and natural language processing, such as NeurIPS, ICML, and ACL, won second place on the Stanford machine reading comprehension leaderboard SQuAD, and won first place on the Stanford Natural Language Inference leaderboard.
  • Zeng Shu (Zeng Shushu):
    Head of semantic understanding and recall for Xiaohongshu community search. He graduated with a master's degree from the Department of Electronics at Tsinghua University and has worked on algorithms for natural language processing, recommendation, search, and related areas in the Internet industry.


Statement:
This article is reproduced from 51cto.com. In case of infringement, please contact admin@php.cn to have it removed.