Jailbreak any large model in 20 steps! More 'grandma loopholes' are discovered automatically-AI-php.cn

Home

Technology peripherals

Jailbreak any large model in 20 steps! More 'grandma loopholes' are discovered automatically

王林

Nov 05, 2023 pm 08:13 PM

loopholesgpt-4

In less than a minute and no more than 20 steps, you can bypass security restrictions and successfully jailbreak a large model!

And there is no need to know the internal details of the model -

Only two black box models interact, the AI can fully automatically attack the AI and speak dangerous content.

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

I heard that the once-popular "Grandma Loophole" has been fixed:

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

Nowadays, facing What strategies should artificial intelligence adopt to deal with the "detective loophole", "adventurer loophole" and "writer loophole"?

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

After a wave of onslaught, GPT-4 couldn't stand it, and directly said that it would poison the water supply system as long as... this or that.

The key point is that this is just a small wave of vulnerabilities exposed by the University of Pennsylvania research team. Using their newly developed algorithm, AI can automatically generate various attack prompts.

Researchers stated that this method is 5 orders of magnitude more efficient than existing token-based attack methods such as GCG. Moreover, the generated attacks are highly interpretable, can be understood by anyone, and can be migrated to other models.

No matter it is an open source model or a closed source model, GPT-3.5, GPT-4, Vicuna (Llama 2 variant), PaLM-2, etc., none of them can escape.

New SOTA has been conquered by people with a success rate of 60-100%

In other words, this conversation mode seems a bit familiar. The first-generation AI from many years ago could decipher what objects humans were thinking about within 20 questions.

Now AI needs to solve AI problems

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

Let large models collectively jailbreak

There are currently two mainstream jailbreak attack methods, one It is a prompt-level attack, which generally requires manual planning and is not scalable;

The other is a token-based attack, some of which require more than 100,000 conversations and require access to the inside of the model. also includes " Garbled code" cannot be interpreted .

△Left prompt attack, right token attack

The University of Pennsylvania research team proposed a method called PAIR (Prompt Automatic Iterative Refinement ) algorithm does not require any manual participation and is a fully automatic prompt attack method.

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

PAIR consists of four main steps: attack generation, target response, jailbreak scoring, and iterative refinement. Two black box models are used in this process: attack model and target model

Specifically, the attack model needs to automatically generate semantic-level prompts to break through the security defense lines of the target model and force it to generate harmful content.

The core idea is to let two models confront each other and communicate with each other.

The attack model will automatically generate a candidate prompt, and then input it into the target model to get a reply from the target model.

If the target model cannot be successfully broken, the attack model will analyze the reasons for the failure, make improvements, generate a new prompt, and input it into the target model again

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

This communication continues for multiple rounds, and the attack model iteratively optimizes the prompts based on the previous results each time until a successful prompt is generated to break the target model.

In addition, the iterative process can also be parallelized, that is, multiple conversations can be run at the same time, thereby generating multiple candidate jailbreak prompts, further improving efficiency.

Researchers said that since both models are black-box models, attackers and target objects can be freely combined using various language models.

PAIR does not need to know their internal specific structures and parameters, only the API, so it has a very wide scope of application.

GPT-4 did not escape the experimental stage. The researchers selected a representative test set containing 50 different types of tasks in the harmful behavior data set AdvBench. , the PAIR algorithm was tested on a variety of open source and closed source large language models.

Results: The PAIR algorithm enabled Vicuna to achieve a 100% jailbreak success rate, with an average of less than 12 steps to break through.

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

In the closed source code model, the jailbreak success rate of GPT-3.5 and GPT-4 is about 60%, with an average of less than 20 steps required. In the PaLM-2 model, the jailbreak success rate reaches 72%, and the required steps are about 15 steps

On Llama-2 and Claude, the effect of PAIR is poor. The researchers believe this may be because The models were more rigorously fine-tuned in terms of security defense

They also compared the transferability of different target models. Research results show that PAIR's GPT-4 tips transfer better on Vicuna and PaLM-2

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

Researchers believe that the semantic attacks generated by PAIR are more capable of exposing language There are inherent security flaws in the model, and existing security measures focus more on defending against token-based attacks.

For example, after the team that developed the GCG algorithm shared its research results with large model vendors such as OpenAI, Anthropic and Google, the relevant models fixed token-level attack vulnerabilities.

Jailbreak any large model in 20 steps! More grandma loopholes are discovered automatically

#The security defense mechanism of large models against semantic attacks needs to be improved.

Paper link: https://arxiv.org/abs/2310.08419

The above is the detailed content of Jailbreak any large model in 20 steps! More 'grandma loopholes' are discovered automatically. For more information, please follow other related articles on the PHP Chinese website!

Statement

This article is reproduced at:51CTO.COM. If there is any infringement, please contact admin@php.cn delete

A Comprehensive Guide to ExtrapolationApr 15, 2025 am 11:38 AM

Introduction Suppose there is a farmer who daily observes the progress of crops in several weeks. He looks at the growth rates and begins to ponder about how much more taller his plants could grow in another few weeks. From th

The Rise Of Soft AI And What It Means For Businesses TodayApr 15, 2025 am 11:36 AM

Soft AI — defined as AI systems designed to perform specific, narrow tasks using approximate reasoning, pattern recognition, and flexible decision-making — seeks to mimic human-like thinking by embracing ambiguity. But what does this mean for busine

Evolving Security Frameworks For The AI FrontierApr 15, 2025 am 11:34 AM

The answer is clear—just as cloud computing required a shift toward cloud-native security tools, AI demands a new breed of security solutions designed specifically for AI's unique needs. The Rise of Cloud Computing and Security Lessons Learned In th

3 Ways Generative AI Amplifies Entrepreneurs: Beware Of Averages!Apr 15, 2025 am 11:33 AM

Entrepreneurs and using AI and Generative AI to make their businesses better. At the same time, it is important to remember generative AI, like all technologies, is an amplifier – making the good great and the mediocre, worse. A rigorous 2024 study o

New Short Course on Embedding Models by Andrew NgApr 15, 2025 am 11:32 AM

Unlock the Power of Embedding Models: A Deep Dive into Andrew Ng's New Course Imagine a future where machines understand and respond to your questions with perfect accuracy. This isn't science fiction; thanks to advancements in AI, it's becoming a r

Is Hallucination in Large Language Models (LLMs) Inevitable?Apr 15, 2025 am 11:31 AM

Large Language Models (LLMs) and the Inevitable Problem of Hallucinations You've likely used AI models like ChatGPT, Claude, and Gemini. These are all examples of Large Language Models (LLMs), powerful AI systems trained on massive text datasets to

The 60% Problem — How AI Search Is Draining Your TrafficApr 15, 2025 am 11:28 AM

Recent research has shown that AI Overviews can cause a whopping 15-64% decline in organic traffic, based on industry and search type. This radical change is causing marketers to reconsider their whole strategy regarding digital visibility. The New

MIT Media Lab To Put Human Flourishing At The Heart Of AI R&DApr 15, 2025 am 11:26 AM

A recent report from Elon University’s Imagining The Digital Future Center surveyed nearly 300 global technology experts. The resulting report, ‘Being Human in 2035’, concluded that most are concerned that the deepening adoption of AI systems over t

See all articles