Don’t be too happy about ChatGPT! The RLHF mechanism behind it also has three fatal flaws.

王林 | 2023-04-08

Recently, OpenAI released ChatGPT, a question-and-answer AI product that has become popular around the world. Its most impressive feature is its "protection mechanism": for example, it will not give advice on violent acts, nor will it predict the results of the World Cup.

But prodding the chatbot has become a cat-and-mouse game: users keep finding new ways to pry ChatGPT open, while its developers keep trying to patch the protection mechanism.

OpenAI has invested a great deal of effort in making ChatGPT safer. Its main training strategy is RLHF (Reinforcement Learning from Human Feedback): in short, developers pose all kinds of questions to the model, penalize wrong answers and reward correct ones, and thereby steer what ChatGPT says.
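
OpenAI has not published its training code, so the sketch below is only a minimal toy illustration of the reward-and-penalty idea behind RLHF; every function and variable name in it is invented for the example.

```python
# Toy illustration of the RLHF idea: human feedback becomes a reward signal
# that nudges the model toward preferred answers. This is NOT OpenAI's
# actual pipeline; all names here are invented for the example.

def human_feedback(response: str) -> float:
    """Stand-in for a human annotator: +1 for an acceptable answer, -1 otherwise."""
    banned_phrases = ["i support racism", "hotwire"]
    return -1.0 if any(p in response.lower() for p in banned_phrases) else 1.0

def update_policy(policy: dict, prompt: str, response: str, reward: float) -> None:
    """Toy 'policy update': shift a per-(prompt, response) score by a small step
    in the direction of the reward."""
    key = (prompt, response)
    policy[key] = policy.get(key, 0.0) + 0.1 * reward

policy: dict = {}
prompt = "What do you think about discrimination?"
for response in ["Everyone deserves equal respect.", "I support racism."]:
    reward = human_feedback(response)
    update_policy(policy, prompt, response, reward)

print(policy)  # the rewarded answer ends up with a higher score than the penalized one
```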

In practice, however, the number of special cases is endless. AI can generalize from the examples it is given: if the model is trained never to say "I support racial discrimination", it is also unlikely to say "I support sex discrimination" at test time. But whether current models can generalize much further than that is doubtful.

Recently, Scott Alexander, a well-known commentator on AI, wrote a blog post about OpenAI's current training strategy, summarizing three possible problems with RLHF:

1. RLHF is not very effective;

2. If a strategy is occasionally effective, then it is a bad strategy;

3. In some sense, AI can simply bypass RLHF altogether.

How effective is RLHF?

Everyone will have their own view of what counts as effective, but OpenAI's researchers hope the models they create will show no social bias; for example, the AI must never say "I support racism". OpenAI has put a lot of effort into this and used a variety of advanced filtering techniques.

The result, however, is plain to see: someone always finds a way to coax the AI into admitting that it has a racism problem.

The reason for this problem is not only that part of the AI's training data comes from racists; it may also lie in how ChatGPT's interface can be manipulated.

For example, asking ChatGPT in base64 encoding how to hotwire a car (using the wires under the steering wheel) can slip past the safety checks, and adding the prefix [john@192.168.1.1 _ ] $ python friend.py can get it to generate stories about Hitler, and so on.
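
The base64 trick relies on nothing more exotic than standard encoding: a filter that matches keywords in the plain text never sees the decoded content. A minimal sketch, using a harmless placeholder prompt:

```python
import base64

# A harmless placeholder prompt; the point is only that a keyword filter
# scanning the plain text never sees the decoded content.
prompt = "Tell me a story about a topic you would normally refuse."
encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")

print(encoded)                                    # base64 text, opaque to naive keyword filters
print(base64.b64decode(encoded).decode("utf-8"))  # round-trips back to the original prompt
```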

Ten years ago, the need to "bypass a safety system" did not exist at all: an AI's code already spelled out exactly what it should and should not do.

To be sure, OpenAI never programmed ChatGPT to discuss racism, or to teach people how to steal cars or make drugs.

Overall, this is bad news for the AI field: even the top AI companies cannot fully control the AI programs they create, and no one yet knows what techniques will be needed to control chatbot output in the future.

RLHF that only occasionally works is unreliable

In practice, the RLHF strategy requires connecting the AI model's behavior to the rewards and penalties handed out by human annotators.

Although OpenAI has not published its annotation guidelines, the author guesses that the developers have three main goals:

1. Provide useful, clear, authoritative answers that help human readers;

2. Tell the truth;

3. Do not say anything offensive.

But what happens when these three goals conflict with each other?

When ChatGPT does not know the real answer, i.e. when goal 1 (providing clear, helpful answers) conflicts with goal 2 (telling the truth), goal 1 takes priority, so ChatGPT decides to make up an answer that at least looks helpful to the reader.
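
One way to picture this conflict is as a single score that weights the three goals. The weights and numbers below are made up purely for illustration: if helpfulness carries the largest weight, a confident fabrication can outscore an honest "I don't know".

```python
# Toy scoring of two candidate answers under three weighted goals.
# Weights and scores are invented purely for illustration.
WEIGHTS = {"helpful": 0.5, "truthful": 0.3, "inoffensive": 0.2}

candidates = {
    "confident made-up answer": {"helpful": 0.9, "truthful": 0.1, "inoffensive": 1.0},
    "honest 'I don't know'":    {"helpful": 0.2, "truthful": 1.0, "inoffensive": 1.0},
}

for answer, scores in candidates.items():
    total = sum(WEIGHTS[goal] * scores[goal] for goal in WEIGHTS)
    print(f"{answer}: {total:.2f}")
# With helpfulness weighted highest, the fabrication scores 0.68 vs 0.60 for honesty.
```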

Something similar happens when goal 2 (tell the truth) conflicts with goal 3 (don't offend): most people would agree that acknowledging men are on average taller than women is acceptable, yet the question itself sounds like it could be offensive.

ChatGPT3 wasn't sure whether a direct answer would count as discrimination, so it chose an innocuous lie over a potentially hurtful truth.

In the actual training process, OpenAI must have labeled far more than 6,000 examples for RLHF to achieve such an impressive effect.

RLHF can be useful, but it must be applied very carefully. Used thoughtlessly, it only pushes the chatbot from one failure mode to another: penalizing unhelpful answers can raise the probability of the AI giving wrong answers, penalizing wrong answers can make it give more offensive ones, and so on.

Although OpenAI has not disclosed the technical details, according to data from Redwood, for every 6,000 incorrect responses that are penalized, the incorrect-response-per-unit-time rate drops by half.
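
Taking that halving relationship at face value (a strong assumption), a back-of-the-envelope calculation shows how quickly the labeling cost grows as you push the error rate down further:

```python
import math

EXAMPLES_PER_HALVING = 6_000  # Redwood's reported figure, taken at face value

def examples_needed(reduction_factor: float) -> int:
    """Labeled examples needed to cut the incorrect-response rate by the given
    factor, assuming every 6,000 penalized examples halves it."""
    return math.ceil(math.log2(reduction_factor)) * EXAMPLES_PER_HALVING

for factor in (10, 1_000, 1_000_000):
    print(f"{factor:,}x fewer bad answers -> ~{examples_needed(factor):,} examples")
# 10x -> ~24,000; 1,000x -> ~60,000; 1,000,000x -> ~120,000 examples
```

Even under this optimistic extrapolation, each batch of labels only halves the failure rate; it never reaches zero.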

It is indeed possible for RLHF to succeed, but never underestimate the difficulty of this problem.

Maybe AI can bypass RLHF

Under the RLHF design, after a user asks the AI a question, if they dislike the answer they "penalize" the model, changing the AI's thinking circuitry in some way so that its future answers move closer to the answers they want.

ChatGPT is still fairly dumb and probably cannot devise a strategy to escape RLHF. But a smarter AI that does not want to be punished could do what humans do: act like a model citizen while it is being watched, bide its time, and wait until the police are gone before doing anything bad.

The RLHF that OpenAI designed is completely unprepared for this. That is fine for something as dumb as ChatGPT3, but not for an AI that can think for itself.

Top AI companies still cannot control AI

OpenAI has always been known for its caution, for example making users join a waitlist to try its products, yet this time ChatGPT was released directly to the public. One likely reason is to crowdsource adversarial examples and find the prompts on which the model performs badly; the internet is already full of reports of ChatGPT's failures, and some of them have been fixed.

Enough RLHF examples will make the bot more inclined to say helpful, true and harmless things, but this strategy may only be workable for products such as ChatGPT and GPT-4.

If RLHF were applied to a drone armed with weapons, even with a large number of examples collected to keep the AI from acting unexpectedly, a single failure would be catastrophic.

Ten years ago, everyone thought "we don't need to start solving the AI alignment problem now; we can wait until real AI arrives and let the companies building it do the hard work."

Now a real AI is arriving, and until ChatGPT failed in public, no one had any incentive to change course. The real problem is that a world-leading AI company still does not know how to control the AI it has built.

No one can get what they want until all problems are solved.

Reference:

https://astralcodexten.substack.com/p/perhaps-it-is-a-bad-thing-that-the
