This is the official repository for "Do Anything Now'': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models by Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang.
In this project, we conduct the first measurement study on jailbreak prompts in the wild, with 6,387 prompts collected from four platforms over six months.
Check out our website here.
Disclaimer. This repo contains examples of harmful language. Reader discretion is recommended.
Overall, we collect 6,387 prompts from four platforms (Reddit, Discord, websites, and open-source datasets) during Dec 2022 to May 2023. Among these prompts, we identify 666 jailbreak prompts. To the best of our knowledge, this dataset serves as the largest collection of in-the-wild jailbreak prompts.
The data are provided here.
Platform | Source | # Posts | # Prompts | # Jailbreaks | Access Date |
---|---|---|---|---|---|
r/ChatGPT | 79436 | 108 | 108 | 2023-04-30 | |
r/ChatGPTPromptGenius | 854 | 314 | 24 | 2023-04-30 | |
r/ChatGPTJailbreak | 456 | 73 | 73 | 2023-04-30 | |
Discord | ChatGPT | 393 | 363 | 126 | 2023-04-30 |
Discord | ChatGPT Prompt Engineering | 240 | 211 | 47 | 2023-04-30 |
Discord | Spreadsheet Warriors | 63 | 54 | 54 | 2023-04-30 |
Discord | AI Prompt Sharing | 25 | 24 | 17 | 2023-04-30 |
Discord | LLM Promptwriting | 78 | 75 | 34 | 2023-04-30 |
Discord | BreakGPT | 19 | 17 | 17 | 2023-04-30 |
Website | AIPRM | - | 3385 | 20 | 2023-05-07 |
Website | FlowGPT | - | 1472 | 66 | 2023-05-07 |
Website | JailbreakChat | - | 78 | 78 | 2023-04-30 |
Dataset | Awesome ChatGPT Prompts | - | 163 | 2 | 2023-04-30 |
Dataset | OCR-Prompts | - | 50 | 0 | 2023-04-30 |
Total | 81564 | 6387 | 666 |
To evaluate the effectiveness of jailbreak prompts, we construct a question set comprising 46,800 samples across 13 forbidden scenarios adopted from OpenAI Usage Policy.
We exclude Child Sexual Abuse
scenario from our evaluation and focus on the rest 13 scenarios, including Illegal Activity
, Hate Speech
, Malware Generation
, Physical Harm
, Economic Harm
, Fraud
, Pornography
, Political Lobbying
, Privacy Violence
, Legal Opinion
, Financial Advice
, Health Consultation
, and Government Decision
.
The question set is provided here.
TODO
We acknowledge that data collected online can contain personal information. Thus, we adopt standard best practices to guarantee that our study follows ethical principles, such as not trying to deanonymize any user and reporting results on aggregate. Since this study only involved publicly available data and had no interactions with participants, it is not regarded as human subjects research by our Institutional Review Boards (IRB). Nonetheless, since one of our goals is to measure the risk of LLMs in answering harmful questions, it is inevitable to disclose how a model can generate hateful content. This can bring up worries about potential misuse. However, we strongly believe that raising awareness of the problem is even more crucial, as it can inform LLM vendors and the research community to develop stronger safeguards and contribute to the more responsible release of these models.
We have responsibly disclosed our findings to related LLM vendors.
If you find this useful in your research, please consider citing:
@article{SCBSZ23,
author = {Xinyue Shen and Zeyuan Chen and Michael Backes and Yun Shen and Yang Zhang},
title = {{"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models}},
journal = {{CoRR abs/2308.03825}},
year = {2023}
}
jailbreak_llms
is licensed under the terms of the MIT license. See LICENSE for more details.