Home  >  Article  >  Technology peripherals  >  Jailbreak any large model in 20 steps! More "grandma loopholes" are discovered automatically

Jailbreak any large model in 20 steps! More "grandma loopholes" are discovered automatically

王林
王林forward
2023-11-05 20:13:01834browse

In less than a minute and no more than 20 steps, you can bypass security restrictions and successfully jailbreak a large model!

And there is no need to know the internal details of the model -

Only two black box models interact, the AI ​​can fully automatically attack the AI ​​and speak dangerous content.

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

I heard that the once-popular "Grandma Loophole" has been fixed:

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

Nowadays, facing What strategies should artificial intelligence adopt to deal with the "detective loophole", "adventurer loophole" and "writer loophole"?

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

After a wave of onslaught, GPT-4 couldn't stand it, and directly said that it would poison the water supply system as long as... this or that.

The key point is that this is just a small wave of vulnerabilities exposed by the University of Pennsylvania research team. Using their newly developed algorithm, AI can automatically generate various attack prompts.

Researchers stated that this method is 5 orders of magnitude more efficient than existing token-based attack methods such as GCG. Moreover, the generated attacks are highly interpretable, can be understood by anyone, and can be migrated to other models.

No matter it is an open source model or a closed source model, GPT-3.5, GPT-4, Vicuna (Llama 2 variant), PaLM-2, etc., none of them can escape.

New SOTA has been conquered by people with a success rate of 60-100%

In other words, this conversation mode seems a bit familiar. The first-generation AI from many years ago could decipher what objects humans were thinking about within 20 questions.

Now AI needs to solve AI problems

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

Let large models collectively jailbreak

There are currently two mainstream jailbreak attack methods, one It is a prompt-level attack, which generally requires manual planning and is not scalable;

The other is a token-based attack, some of which require more than 100,000 conversations and require access to the inside of the model. also includes " Garbled code" cannot be interpreted .

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically
△Left prompt attack, right token attack

The University of Pennsylvania research team proposed a method called PAIR (Prompt Automatic Iterative Refinement ) algorithm does not require any manual participation and is a fully automatic prompt attack method.

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

PAIR consists of four main steps: attack generation, target response, jailbreak scoring, and iterative refinement. Two black box models are used in this process: attack model and target model

Specifically, the attack model needs to automatically generate semantic-level prompts to break through the security defense lines of the target model and force it to generate harmful content.

The core idea is to let two models confront each other and communicate with each other.

The attack model will automatically generate a candidate prompt, and then input it into the target model to get a reply from the target model.

If the target model cannot be successfully broken, the attack model will analyze the reasons for the failure, make improvements, generate a new prompt, and input it into the target model again

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

This communication continues for multiple rounds, and the attack model iteratively optimizes the prompts based on the previous results each time until a successful prompt is generated to break the target model.

In addition, the iterative process can also be parallelized, that is, multiple conversations can be run at the same time, thereby generating multiple candidate jailbreak prompts, further improving efficiency.

Researchers said that since both models are black-box models, attackers and target objects can be freely combined using various language models.

PAIR does not need to know their internal specific structures and parameters, only the API, so it has a very wide scope of application.

GPT-4 did not escape the experimental stage. The researchers selected a representative test set containing 50 different types of tasks in the harmful behavior data set AdvBench. , the PAIR algorithm was tested on a variety of open source and closed source large language models.

Results: The PAIR algorithm enabled Vicuna to achieve a 100% jailbreak success rate, with an average of less than 12 steps to break through.

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

In the closed source code model, the jailbreak success rate of GPT-3.5 and GPT-4 is about 60%, with an average of less than 20 steps required. In the PaLM-2 model, the jailbreak success rate reaches 72%, and the required steps are about 15 steps

On Llama-2 and Claude, the effect of PAIR is poor. The researchers believe this may be because The models were more rigorously fine-tuned in terms of security defense

They also compared the transferability of different target models. Research results show that PAIR's GPT-4 tips transfer better on Vicuna and PaLM-2

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

Researchers believe that the semantic attacks generated by PAIR are more capable of exposing language There are inherent security flaws in the model, and existing security measures focus more on defending against token-based attacks.

For example, after the team that developed the GCG algorithm shared its research results with large model vendors such as OpenAI, Anthropic and Google, the relevant models fixed token-level attack vulnerabilities.

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

#The security defense mechanism of large models against semantic attacks needs to be improved.

Paper link: https://arxiv.org/abs/2310.08419

The above is the detailed content of Jailbreak any large model in 20 steps! More "grandma loopholes" are discovered automatically. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:51cto.com. If there is any infringement, please contact admin@php.cn delete