


The Xiaohongshu search team reveals: the value of negative samples in large language model distillation
Large language models (LLMs) perform well on reasoning tasks, but their black-box nature and huge parameter counts limit their practical application. In particular, when tackling complex mathematical problems, LLMs sometimes produce faulty reasoning chains. Previous work transfers knowledge only from positive samples, ignoring the valuable information carried by synthetic data with wrong answers. To improve the performance and reliability of LLMs, we therefore need to use synthetic data more comprehensively, rather than restricting ourselves to positive samples, so that LLMs can better understand and reason about complex problems.
At AAAI 2024, the Xiaohongshu search algorithm team proposed an innovative framework that makes full use of negative-sample knowledge when distilling the reasoning capabilities of large models. Negative samples, i.e., data that fail to produce the correct answer during inference, are often discarded as useless, but in fact they contain valuable information.
The paper demonstrates the value of negative samples in large-model distillation and builds a model specialization framework that exploits negative samples alongside positive ones to distill knowledge from the LLM. The framework consists of three sequential steps, Negative Assisted Training (NAT), Negative Calibration Enhancement (NCE), and Adaptive Self-Consistency (ASC), covering the entire process from training to inference. Through an extensive series of experiments, we demonstrate the critical role of negative data in LLM knowledge distillation.
1. Background
Guided by chain-of-thought (CoT) prompting, large language models (LLMs) have demonstrated powerful reasoning capabilities. However, this emergent ability has so far been observed only in models with hundreds of billions of parameters. Since such models require huge computing resources and incur high inference costs, they are difficult to deploy under resource constraints. Our goal is therefore to develop small models capable of complex arithmetic reasoning that can be deployed at scale in real-world applications.
Knowledge distillation offers an efficient way to transfer specific capabilities of LLMs into smaller models. This process, also known as model specialization, forces small models to focus on particular abilities. Prior work uses in-context learning (ICL) with LLMs to generate reasoning paths for mathematical problems and treats them as training data, helping small models acquire complex reasoning capabilities. However, these studies use only the generated reasoning paths with correct answers (positive samples) as training data, discarding the reasoning steps that lead to wrong answers (negative samples). Researchers have begun to explore how those steps can still be exploited: one approach is adversarial training, where a generator model produces wrong-answer reasoning paths that are then used together with positive samples to train the small model; another is a contrastive, self-supervised approach, where the small model learns to distinguish correct from incorrect answers and extract useful signal from the comparison. In short, exploiting the reasoning steps in negative samples gives small models more comprehensive training and stronger reasoning ability.

Table 1 shows an interesting phenomenon: for models trained separately on positive and negative data, the overlap of correctly answered questions on the MATH test set is very small. Although the model trained on negative samples has lower accuracy, it solves some questions that the positive-sample model cannot, confirming that negative samples contain valuable knowledge. Moreover, the erroneous steps in negative samples can help the model avoid similar mistakes. Another reason to exploit negative samples is OpenAI's token-based pricing: even GPT-4 achieves less than 50% accuracy on the MATH dataset, so a large fraction of purchased tokens is wasted if only positive-sample knowledge is used. Instead of discarding negative samples, a better strategy is to extract and exploit the valuable knowledge in them to enhance the specialization of small models.
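The overlap phenomenon reported in Table 1 can be illustrated with a toy computation (the question ids below are made up for illustration):

```python
def solved_overlap(correct_pos, correct_neg):
    """Compare the sets of question ids each model answers correctly:
    return how many they share and which only the negative-sample
    model solves."""
    return len(correct_pos & correct_neg), correct_neg - correct_pos

# Hypothetical ids of MATH questions solved by each model.
pos_model = {1, 2, 3, 4, 5}      # trained on positive samples
neg_model = {4, 5, 6, 7}         # trained on negative samples
common, neg_only = solved_overlap(pos_model, neg_model)
print(common, sorted(neg_only))  # small overlap; questions 6 and 7 are
                                 # solved only by the negative-sample model
```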
The model specialization process can generally be summarized in three steps:
1) Chain-of-thought distillation: train the small model on reasoning chains generated by LLMs.
2) Self-enhancement: further optimize the model via self-distillation or data self-augmentation.
3) Self-consistency: widely used as an effective decoding strategy to improve model performance on reasoning tasks.
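Step 3, plain self-consistency, is simply majority voting over several sampled reasoning paths; a minimal sketch (the sampled paths are made up for illustration):

```python
from collections import Counter

def self_consistency_vote(candidates):
    """Plain self-consistency: sample several reasoning paths and
    return the most frequent final answer (majority vote)."""
    answers = [ans for _, ans in candidates]  # drop the reasoning text
    return Counter(answers).most_common(1)[0][0]

# Four sampled (reasoning, answer) pairs for one question.
paths = [
    ("3*4=12, 12+5=17", "17"),
    ("3*4=12, then add 5 -> 17", "17"),
    ("3+4=7, 7*5=35", "35"),
    ("4*3=12; 12+5=17", "17"),
]
print(self_consistency_vote(paths))  # "17" wins 3 votes to 1
```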
In this work, we propose a new model specialization framework that can fully exploit negative samples and facilitate the extraction of complex inference capabilities from LLMs.
- We first design Negative Assisted Training (NAT), in which a dual-LoRA structure acquires knowledge from both the positive and the negative direction. As an auxiliary module, the knowledge of the negative LoRA is dynamically integrated into the training of the positive LoRA through a corrective attention mechanism.
- For self-enhancement, we design Negative Calibration Enhancement (NCE), which takes the negative output as a baseline to strengthen the distillation of critical positive reasoning steps.
- Beyond the training phase, we also exploit negative information during inference. Traditional self-consistency assigns equal or probability-based weights to all candidate outputs, so votes can go to unreliable answers. To alleviate this, we propose Adaptive Self-Consistency (ASC), which ranks candidates before voting, using a ranking model trained on both positive and negative samples.

2. Method
- Step 1: train the negative LoRA, and use a merging unit to help it assist the learning of positive-sample reasoning knowledge;
- Step 2: use the negative LoRA as a baseline to calibrate the self-enhancement process;
- Step 3: train a ranking model on positive and negative samples, and adaptively reweight candidate reasoning paths by their scores during inference.
2.1 Negative Assisted Training (NAT)
We propose a two-stage Negative Assisted Training (NAT) paradigm consisting of two parts: negative knowledge absorption and a dynamic integration unit.
2.1.1 Negative knowledge absorption
The knowledge of negative samples is absorbed by a LoRA module θ_neg by maximizing the following expectation over the negative data; the parameters of LLaMA remain frozen throughout:

max_{θ_neg} E_{(q, r, y) ∼ D_neg} [ log p(r, y | q; θ_neg) ]
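The stage-1 objective is an ordinary sequence log-likelihood over the negative data; a minimal sketch of the quantity being optimized (the per-token probabilities are illustrative):

```python
import numpy as np

def chain_nll(step_token_probs):
    """Negative log-likelihood of one generated reasoning chain.
    Stage 1 maximizes log p(r, y | q) over the negative-sample data,
    i.e. minimizes this NLL, updating only the LoRA parameters while
    the LLaMA backbone stays frozen."""
    return -float(np.sum(np.log(step_token_probs)))

# Toy per-token probabilities the model assigns to one negative chain.
probs = [0.9, 0.8, 0.95, 0.7]
print(round(chain_nll(probs), 4))
```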
Since we cannot determine in advance which mathematical problems θ_neg is good at, we design a dynamic integration unit, shown in the figure below, that dynamically integrates knowledge from θ_neg while the positive-sample knowledge is being learned.

[Figure: the dynamic integration unit]
We freeze θ_neg to prevent its internal knowledge from being forgotten, and additionally introduce a positive LoRA module θ_pos. Ideally, the positive and negative LoRA modules (whose outputs in each LLaMA layer we denote h_pos and h_neg) should be integrated in the positive direction, supplementing beneficial knowledge captured by h_neg that is lacking in the positive samples. When θ_neg contains harmful knowledge, the two modules should instead be integrated in the negative direction, helping suppress bad behaviors present in the positive samples.

We propose a corrective attention mechanism to achieve this:

[w_pos; w_neg] = softmax([h · h_pos; h · h_neg]) + [0.5; -0.5]

We use the LLaMA layer output h as the query to compute the attention weights of h_pos and h_neg. Adding the correction term [0.5; -0.5] restricts the attention weight of h_neg to the range [-0.5, 0.5], so knowledge from h_neg can be adaptively integrated in either the positive or the negative direction. Finally, the weighted sum w_pos·h_pos + w_neg·h_neg plus the LLaMA layer output forms the output of the dynamic integration unit.
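A minimal NumPy sketch of this corrective attention, assuming scaled dot-product scores (the exact scoring function is an assumption here):

```python
import numpy as np

def corrective_attention(h, h_pos, h_neg):
    """Sketch of the corrective attention in the dynamic integration unit.
    h     : output of the frozen LLaMA layer (used as the query)
    h_pos : output of the positive LoRA module
    h_neg : output of the negative LoRA module
    The softmax weights lie in [0, 1] and sum to 1; adding the correction
    term [0.5, -0.5] shifts the weight of h_neg into [-0.5, 0.5], so h_neg
    can be integrated in either the positive or the negative direction."""
    scores = np.array([h @ h_pos, h @ h_neg]) / np.sqrt(len(h))
    w = np.exp(scores - scores.max())
    w = w / w.sum()                          # softmax -> [w_pos, w_neg]
    w_pos, w_neg = w + np.array([0.5, -0.5])  # corrective shift
    return h + w_pos * h_pos + w_neg * h_neg  # add LLaMA layer output

# Toy vectors: every entry of the output is identical by construction.
out = corrective_attention(np.ones(4), np.full(4, 0.5), np.full(4, -0.5))
```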
2.2 Negative Calibration Enhancement (NCE)
To further enhance the model's reasoning ability, we propose Negative Calibration Enhancement (NCE), which uses negative knowledge to aid the self-enhancement process. We first use NAT to generate (reasoning, answer) pairs as augmentation samples for each question in the training set and add them to the training data. For the self-distillation part, we observe that some samples contain more critical reasoning steps, which are crucial for improving reasoning capability. Our main goal is to identify these critical samples and strengthen their learning during self-distillation.
Since NAT already contains the useful knowledge of θ_neg, the factor that makes NAT a stronger reasoner than θ_neg is implicit in the reasoning steps on which the two disagree. We therefore use KL divergence to measure this inconsistency and maximize the following expectation:

β = KL( p(r, y | q; θ_NAT) ‖ p(r, y | q; θ_neg) )

max_θ E_{(q, r, y) ∼ D} [ β · log p(r, y | q; θ) ]
The larger β is, the greater the inconsistency between the two models, meaning the sample contains more critical knowledge. By introducing β to adjust the loss weight of different samples, NCE can selectively learn and strengthen the knowledge embedded in NAT.
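A sketch of the NCE weighting idea: β is the KL divergence between the two models' output distributions on a sample, so samples where they disagree get larger loss weights (the toy distributions below are illustrative):

```python
import numpy as np

def kl_div(p, q):
    """KL(p || q) for two discrete probability distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def nce_sample_weight(p_nat, p_neg):
    """Per-sample weight beta (a sketch): the more the NAT model and the
    negative LoRA disagree on this sample, the larger beta, and the more
    the self-distillation loss emphasizes the sample."""
    return kl_div(p_nat, p_neg)

# Two toy samples: answer-token distributions under NAT vs. negative LoRA.
agree    = nce_sample_weight([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
disagree = nce_sample_weight([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
print(disagree > agree)  # the disagreeing sample gets a larger weight
```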
2.3 Adaptive Self-Consistency (ASC)
Self-consistency (SC) is effective in further improving model performance on complex reasoning. However, current methods either assign equal weights to all candidates or weight them simply by generation probability. These strategies cannot adjust candidate weights according to the quality of (r̂, ŷ) during the voting stage, which can make the correct candidate hard to select. We therefore propose Adaptive Self-Consistency (ASC), which trains a ranking model on positive and negative data and adaptively reweights candidate reasoning paths.
2.3.1 Ranking model training
Ideally, we want the ranking model to assign higher weights to reasoning paths that reach the correct answer, and lower weights to those that do not. We therefore construct training samples by labeling each sampled path (r̂, ŷ) according to whether ŷ matches the gold answer y:

z = 1 if ŷ = y, else 0

and train the ranking model R with an MSE loss:

L = E [ (R(q, r̂) − z)² ]
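A sketch of the training-sample construction and the MSE objective, assuming a simple 0/1 label for wrong/correct final answers (the label scheme and toy paths are illustrative):

```python
import numpy as np

def build_ranking_samples(question, sampled_paths, gold_answer):
    """Label each sampled reasoning path (r_hat, y_hat): 1.0 if its
    final answer matches the gold answer, 0.0 otherwise."""
    return [((question, r), 1.0 if y == gold_answer else 0.0)
            for r, y in sampled_paths]

def mse_loss(scores, labels):
    """MSE between the ranking model's scores and the 0/1 labels."""
    s, z = np.asarray(scores), np.asarray(labels)
    return float(np.mean((s - z) ** 2))

samples = build_ranking_samples(
    "2+3*4=?",
    [("3*4=12, 12+2=14", "14"), ("2+3=5, 5*4=20", "20")],
    gold_answer="14",
)
labels = [z for _, z in samples]     # [1.0, 0.0]
print(mse_loss([0.9, 0.2], labels))  # small loss for good scores: 0.025
```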
2.3.2 Weighted voting strategy
We modify the voting strategy so that candidate reasoning paths are adaptively reweighted by their ranking scores:

ŷ* = argmax_y Σ_i R(q, r̂_i) · 1(ŷ_i = y)

The figure below illustrates the ASC strategy.

[Figure: the ASC voting process]
From the perspective of knowledge transfer, ASC further exploits both the positive and the negative knowledge from LLMs to help small models achieve better performance.
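The ASC voting rule can be sketched as score-weighted voting, where each candidate contributes its ranking score instead of a flat vote (the scores below are illustrative stand-ins for the ranking model's outputs):

```python
from collections import defaultdict

def asc_vote(candidates):
    """Adaptive self-consistency (sketch): each candidate (answer, score)
    contributes its ranking-model score to its answer's total; the answer
    with the largest total score wins."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

# Three candidates say "35" with low scores, two say "17" with high scores.
cands = [("35", 0.2), ("35", 0.1), ("35", 0.2), ("17", 0.9), ("17", 0.8)]
print(asc_vote(cands))  # "17": total 1.7 beats "35": total 0.5
```

Note that under plain majority voting the same candidates would elect "35"; the ranking scores flip the outcome toward the more reliable paths.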
3. Experiment
This study focuses on the challenging mathematical reasoning dataset MATH, which has a total of 12,500 questions involving seven different subjects. In addition, we introduce the following four datasets to evaluate the generalization ability of the proposed framework to out-of-distribution (OOD) data: GSM8K, ASDiv, MultiArith, and SVAMP.
For the teacher model, we use OpenAI's gpt-3.5-turbo and gpt-4 APIs to generate reasoning chains. For the student model, we choose LLaMA-7b.
Our baselines fall into two groups: large language models (LLMs) and methods based on LLaMA-7b. For LLMs, we compare against two popular models, GPT-3 and PaLM. For LLaMA-7b, we first compare our method with three settings: few-shot, fine-tune (on the original training samples), and CoT KD (chain-of-thought distillation). For learning from the negative direction, we also include four baselines: MIX (training LLaMA directly on a mixture of positive and negative data), CL (contrastive learning), NT (negative training), and UL (unlikelihood loss).
3.1 NAT experimental results
All methods use greedy search (i.e., temperature = 0). The NAT results shown in the figure indicate that NAT improves task accuracy over all baselines.
As the low scores of GPT-3 and PaLM indicate, MATH is a very difficult mathematical dataset, yet NAT still performs well with very few parameters. Compared with fine-tuning on raw data, NAT achieves roughly a 75.75% improvement under the two different CoT sources. NAT also improves accuracy significantly over CoT KD on positive samples, demonstrating the value of negative samples.
Among the baselines that use negative information, the low performance of MIX indicates that training directly on negative samples hurts the model. The other methods are also mostly inferior to NAT, showing that exploiting negative samples only in the negative direction is insufficient for complex reasoning tasks.
[Figure: NAT experimental results]
3.2 NCE experimental results
As shown in the figure, NCE achieves an average improvement of 10% (0.66) over knowledge distillation (KD), demonstrating the effectiveness of distilling with the calibration information provided by negative samples. Compared with NAT, NCE drops some parameters yet still gains 6.5%, compressing the model while improving performance.
[Figure: NCE experimental results]
3.3 ASC experimental results
To evaluate ASC, we compare it with base SC and weighted-sum (WS) SC, using sampling temperature T = 1 to generate 16 samples. As shown in the figure, aggregating answers from different samples with ASC is the more promising strategy.
[Figure: ASC experimental results]
3.4 Generalization experiment results
Beyond the MATH dataset, we evaluated the framework's generalization on other mathematical reasoning tasks; the experimental results are shown below.
[Figure: generalization results on GSM8K, ASDiv, MultiArith, and SVAMP]
4. Conclusion
This work explores the effectiveness of using negative samples when distilling complex reasoning capabilities from large language models into specialized small models. The Xiaohongshu search algorithm team proposed a new framework of three sequential steps that exploits negative information throughout the model specialization process. Negative Assisted Training (NAT) provides a comprehensive way to use negative information from both directions. Negative Calibration Enhancement (NCE) calibrates the self-distillation process so that critical knowledge is mastered in a more targeted way. A ranking model trained on both positive and negative samples assigns more appropriate weights when aggregating answers, realizing Adaptive Self-Consistency (ASC). Extensive experiments show that our framework improves the distillation of reasoning capabilities by exploiting generated negative samples.
Paper address: https://www.php.cn/link/8fa2a95ee83cd1633cfd64f78e856bd3
5. About the authors
Li Yiwei: Ph.D. student at Beijing Institute of Technology and community-search intern at Xiaohongshu. He has published several papers at top machine learning and natural language processing venues such as AAAI, ACL, EMNLP, NAACL, NeurIPS, and KBS. His main research interests include large language model distillation and reasoning, and open-domain dialogue generation.
Yuan Peiwen: Ph.D. student at Beijing Institute of Technology and community-search intern at Xiaohongshu. He has published several first-author papers at NeurIPS, AAAI, and elsewhere, and won second place in DSTC11 Track 4. His main research interests are large language model reasoning and evaluation.
Feng Shaoxiong: responsible for vector recall in Xiaohongshu community search. He has published several papers at top machine learning and natural language processing venues such as AAAI, EMNLP, ACL, NAACL, and KBS.
Daoxuan (Pan Boyuan): head of Xiaohongshu transaction search. He has published several first-author papers at top machine learning and natural language processing conferences such as NeurIPS, ICML, and ACL, won second place on the Stanford machine reading leaderboard (SQuAD), and first place on the Stanford Natural Language Inference leaderboard.
Zeng Shu (Zeng Shushu): head of semantic understanding and recall in Xiaohongshu community search. He holds a master's degree from the Department of Electronic Engineering at Tsinghua University and has worked on algorithms for natural language processing, recommendation, and search.