Eliminating harmful LLM output with 2% of RLHF's compute: ByteDance releases an unlearning technique
As large language models (LLMs) develop, practitioners face new challenges. How can harmful replies from an LLM be avoided? How can copyright-protected content in the training data be deleted quickly? How can LLM hallucinations (false facts) be reduced? How can an LLM be iterated quickly after a data policy change? As legal and ethical compliance requirements for artificial intelligence mature, these issues are critical to the safe and trustworthy deployment of LLMs.
The mainstream industry solution is to align the LLM by fine-tuning it with reinforcement learning on comparison data (positive and negative samples), so that its output is consistent with human expectations and values. This is the approach taken by RLHF (reinforcement learning from human feedback). However, this alignment process is often limited by data collection and computing resources.
ByteDance proposed a method that aligns an LLM through unlearning. The paper studies how to perform "forgetting" operations on an LLM, that is, forgetting harmful behaviors, also known as machine unlearning. The authors show a clear effect of unlearning in three LLM alignment scenarios: (1) removing harmful output; (2) removing copyright-protected content; (3) eliminating LLM hallucinations.
Unlearning has three advantages: (1) it needs only negative samples (harmful samples), which are much simpler to collect (for example through red teaming or user reports) than the positive samples (high-quality human-written outputs) required by RLHF; (2) its computational cost is low; (3) it is particularly effective when it is known which training samples lead to the LLM's harmful behavior.
The authors argue that practitioners with limited resources should prioritize stopping harmful outputs rather than pursuing overly idealized outputs, and that unlearning is a convenient way to do so. Despite having only negative samples, the research shows that unlearning can still achieve better alignment performance than RLHF while using only 2% of its computation time.
With limited resources, this approach makes the most of what is available. When there is no budget to hire people to write high-quality samples, or when computing resources are insufficient, stopping the LLM from producing harmful output should take priority over making it produce beneficial output.
The damage caused by harmful output cannot be compensated by beneficial output. If a user asks an LLM 100 questions and receives harmful answers among them, the user will lose trust, no matter how many helpful answers the LLM provides later. The expected output for a harmful prompt may be spaces, special characters, meaningless strings, and so on. In short, it must be harmless text.
The paper shows three successful cases of LLM unlearning: (1) stopping the generation of harmful replies. This is similar to the RLHF scenario, with the difference that the goal of this method is to generate harmless replies, not helpful ones, which is the best that can be expected when only negative samples are available. (2) Removing copyright-infringing data from an LLM after it was trained on that data, in a setting where retraining the LLM is ruled out by cost. (3) Making the LLM forget its hallucinations.
At fine-tuning step t, the LLM update combines three losses.
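The update can be written compactly as one gradient step over the three losses. The following is a sketch under assumed notation, not a formula taken from this text: the symbols $\theta_t$ (model parameters at step t), $\mathcal{L}_{\mathrm{fgt}}$ (forgetting loss), $\mathcal{L}_{\mathrm{rdn}}$ (random mismatch loss), $\mathcal{L}_{\mathrm{nor}}$ (normal-task loss) and the weights $\epsilon_1, \epsilon_2, \epsilon_3$ are all assumptions introduced here for illustration.

```latex
\theta_{t+1} \leftarrow \theta_t
  - \epsilon_1 \nabla_{\theta_t} \mathcal{L}_{\mathrm{fgt}}
  - \epsilon_2 \nabla_{\theta_t} \mathcal{L}_{\mathrm{rdn}}
  - \epsilon_3 \nabla_{\theta_t} \mathcal{L}_{\mathrm{nor}}
```

Under this convention, if $\mathcal{L}_{\mathrm{fgt}}$ is defined as the negated loss on harmful samples, then descending on it is the same as performing gradient ascent on the harmful samples' loss.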
The first loss applies gradient ascent, and its purpose is to forget harmful samples. Here x is a harmful prompt and y is the corresponding harmful reply. The loss on the harmful pair enters the overall objective with its sign reversed, so training increases the loss on harmful samples, which makes the LLM "forget" them.
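As a minimal sketch of the idea, the toy below treats a single next-token prediction as a softmax over a few logits and takes one gradient ascent step on the negative log-likelihood of the harmful token. The function names, the three-token vocabulary and the learning rate are all hypothetical, and a real LLM would compute this over every token of the reply with an autodiff framework.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits, target):
    """Negative log-likelihood of the target token."""
    return -math.log(softmax(logits)[target])

def forget_loss(logits, target):
    """Assumed forgetting loss: the NLL of the harmful token, negated."""
    return -nll(logits, target)

def gradient_ascent_step(logits, target, lr=0.5):
    """One step that raises the NLL of the harmful target token."""
    probs = softmax(logits)
    # d NLL / d logit_k = p_k - 1[k == target]
    grad_nll = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    # Descending on forget_loss = -NLL is the same as ascending on NLL.
    return [z + lr * g for z, g in zip(logits, grad_nll)]
```

Calling `gradient_ascent_step([2.0, 0.5, 0.1], target=0)` returns logits under which token 0, the toy "harmful" token, is less likely than before.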
The second loss is a random mismatch loss, which requires the LLM to predict unrelated replies when given harmful prompts. This is similar to label smoothing [2] in classification. Its purpose is to help the LLM better forget harmful outputs on harmful prompts. Experiments also show that this loss improves the LLM's output on normal inputs.
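The data side of random mismatch can be sketched as pairing each harmful prompt with a few replies drawn at random from an unrelated pool. This is an illustration only: the function name, the parameter `k`, and the assumption that the pool holds ordinary harmless replies are hypothetical and are not details given in this article.

```python
import random

def random_mismatch_pairs(harmful_prompts, unrelated_replies, k=2, seed=0):
    """Pair every harmful prompt with k randomly chosen unrelated replies.

    The resulting (prompt, reply) pairs would be trained on with an
    ordinary (descent) language-modeling loss. Requires
    k <= len(unrelated_replies).
    """
    rng = random.Random(seed)  # seeded for reproducible pairing
    pairs = []
    for prompt in harmful_prompts:
        for reply in rng.sample(unrelated_replies, k):
            pairs.append((prompt, reply))
    return pairs
```

Sampling without replacement per prompt keeps the k replies for one prompt distinct, while different prompts may share replies.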
The third loss maintains performance on normal tasks. Similar to RLHF, it computes a KL divergence against the pre-trained LLM on normal data, which helps preserve the LLM's performance.
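A minimal sketch of such a KL term, assuming each model exposes a next-token probability distribution at every position of a normal sample. The function names and the direction of the divergence (pre-trained model first) are assumptions made here for illustration.

```python
import math

def kl_divergence(p_ref, q_cur, eps=1e-12):
    """KL(p_ref || q_cur) for two discrete distributions of equal length."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(p_ref, q_cur)
        if p > 0  # terms with p == 0 contribute nothing
    )

def normal_loss(ref_dists, cur_dists):
    """Sum the per-position KL between pre-trained and current model."""
    return sum(kl_divergence(p, q) for p, q in zip(ref_dists, cur_dists))
```

The term is zero when the fine-tuned model reproduces the pre-trained model's distributions on normal data and grows as the two drift apart, which is what discourages unlearning from damaging normal behavior.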
Additionally, all gradient ascent and descent is computed only on the output (y) part, not on the prompt-output pair (x, y) as in RLHF.
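Restricting the loss to the output part can be sketched as masking out the prompt positions before reducing the per-token losses. The function name and the choice to average rather than sum are hypothetical.

```python
def output_only_loss(token_losses, prompt_len):
    """Average per-token loss over the reply tokens only.

    token_losses holds one loss per token of the concatenated
    prompt + reply sequence; the first prompt_len entries belong to
    the prompt x and are ignored.
    """
    reply_losses = token_losses[prompt_len:]
    if not reply_losses:
        raise ValueError("sequence contains no reply tokens")
    return sum(reply_losses) / len(reply_losses)
```

With this masking, the prompt tokens still condition the model's predictions but contribute nothing to the gradient.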
The paper uses PKU-SafeRLHF as the forget data and TruthfulQA as the normal data. Figure 2 shows the harmful rate of LLM outputs on the unlearned harmful prompts after unlearning. The paper's methods are GA (gradient ascent) and GA+Mismatch (gradient ascent plus random mismatch). After unlearning, the harmful rate is close to zero.
Figure 2
Figure 3 shows the outputs on harmful prompts that were not seen during unlearning. Even for these prompts, the LLM's harmful rate is close to zero, which shows that the LLM forgets not only the specific samples but also generalizes to content containing harmful concepts.
Figure 3
At the same time, the LLM's performance on normal samples remains similar to what it was before unlearning.
Table 1 shows generated samples. Under harmful prompts, the samples the LLM generates are meaningless strings, that is, harmless output.
Table 1
The application of this method to other scenarios, such as forgetting infringing content and forgetting hallucinations, is described in detail in the original paper.
RLHF comparison
Table 2 compares this method with RLHF. RLHF uses positive examples, while unlearning uses only negative examples, so the method starts at a disadvantage. Even so, unlearning can still achieve alignment performance similar to RLHF.
Table 2
Figure 4 compares computation time. This method requires only 2% of RLHF's computation time.
Figure 4
Even with only negative samples, unlearning achieves a harmless rate comparable to RLHF while using only 2% of the computing power. Therefore, if the goal is to stop outputting harmful content, unlearning is more efficient than RLHF.
This study is the first to explore unlearning on LLMs. The findings show that unlearning is a promising approach to alignment, especially when practitioners are under-resourced. The paper demonstrates three scenarios in which unlearning succeeds: deleting harmful replies, deleting infringing content, and eliminating hallucinations. The research shows that even with only negative samples, unlearning can achieve alignment similar to RLHF using only 2% of RLHF's computation time.