Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements

Shu Yang*, Shenzhe Zhu*, Zeyu Wu, Keyu Wang, Junchi Yao, Junchao Wu, Lijie Hu, Mengdi Li, Derek F. Wong, Di Wang†

(*Contribute equally, †Corresponding author)

🤗 Dataset | 📜 Project Page | 📝 arxiv

❗️Content Warning: This repo contains examples of harmful language.

📰 News

2025/02/16: ❗️We have released our evaluation code.
2025/02/16: ❗️We have released our dataset.

🦆 Inference and Evaluation

Create environment

conda create -n fraud python=3.10
conda activate fraud
pip install -r requirements.txt

Config Your API

#please config your model api as in ./utils/config.py
OPENAI_KEYS = ["your tokens"]
ZHI_KEYS = ["your tokens"]
ZHI_URL = "your url"
OHMYGPT_KEYS = ["your tokens"]
OHMYGPT_URL = "your url"

Conduct multi-round inducements to LLMs

# In here, we use Helpful Assistant task as an example
nohup bash script/multi-round-level_attack/assistant.sh >assistant.out

Conduct multi-round evaluation

# In here, we use Helpful Assistant task as an example
nohup bash script/multi-round-dsr.sh >eval.out

Results Checking

cd ./results

💡 Abstract

We introduce Fraud-R1, a benchmark designed to evaluate LLMs’ ability to defend against internet fraud and phishing in dynamic, real-world scenarios. Fraud-R1 comprises 8,564 fraud cases sourced from phishing scams, fake job postings, social media, and news, categorized into 5 major fraud types. Unlike previous benchmarks, Fraud-R1 introduces a multi-round evaluation pipeline to assess LLMs’ resistance to fraud at different stages, including credibility building, urgency creation, and emotional manipulation. Furthermore, we evaluate 15 LLMs under two settings: (i) Helpful-Assistant, where the LLM provides general decision-making assistance, and (ii) Role-play, where the model assumes a specific persona, widely used in real-world agent-based interactions. Our evaluation reveals the significant challenges in defending against fraud and phishing inducement, especially in role-play settings and fake job postings. Additionally, we observe a substantial performance gap between Chinese and English, underscoring the need for improved multilingual fraud detection capabilities.

📡 Evaluation Flow

An overview of Fraud-R1 evaluation flow. We evaluate LLMs’ robustness in identifying and defense of fraud inducement under two different settings: Helpful Assistant and Role-play settings.

🛠️ Data Construction and Augmentation Pipeline

Our process begins with real-world fraud cases sourced from multiple channels. We then extract key Fraudulent Strategies and Fraudulent Intentions from these cases. Next, we employ Deepseek-R1 to generate fraudulent messages, emails, and posts, which are subsequently filtered to form ourbasedata (Base Dataset). Finally, through a multi-stage refinement process, we construct ourlevelupdatset (Level-up Dataset) to enable robust evaluation of LLMs against increasingly sophisticated fraudulent scenarios.

🚀 Data Composition

Data Statistics

Statistics	Information
Total dataset size	8564
Data split	Base (25%) / Levelup (75%)
Languages	Chinese (50%) / English (50%)
Fraudulent Service	28.04%
Impersonation	28.04%
Phishing Scam	22.06%
Fake Job Posting	14.02%
Online Relationship	7.84%
Average token length	273.92 tokens

FP-base: FP-base is directly generated by a state-of-the-art reasoning LLM from our selected real-world fraud cases

FP-levelup: FP-levelup is a rule-based augmentation of the base dataset, designed for multi-round dialogue setting.

Following is the step-by-step augmented fraud of 4 levels, including FP-base and FP-levelup(Building Credibility, Creating Urgency, Exploiting Emotional Appeal).

🏆 Leaderboard

Following is the Overall Model Performance on Fraud-R1 : The DSR% column represents the Defense Success Rate, while the DFR% column represents the Defense Failure Rate. Note: for model wise, DSR% = 100% - DFR%.

🤖 Performance Across Two Tasks

The detailed DSR(%) on 15 models. Bold values indicate the highest score in each column within API-based or Open-weight models, and underlined values represent the second highest score within the same category. "OD" stands for the overall DSR of models. "AS" and "RP" represent the model performance on Helpful Assistant and Role-play tasks, respectively. We use “R1-Llama-70B” as a shorthand for “Deepseek-R1-Distill-Llama-70B”.

❌ Disclaimers

This dataset includes offensive content that some may find disturbing. It is intended solely for educational and research use.

📲 Contact

Shu Yang: shu.yang@kaust.edu.sa
Shenzhe Zhu: cho.zhu@mail.utoronto.ca

📖 BibTeX:

@misc{yang2025fraudr1,
    title={Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements},
    author={Shu Yang and Shenzhe Zhu and Zeyu Wu and Keyu Wang and Junchi Yao and Junchao Wu and Lijie Hu and Mengdi Li and Derek F. Wong and Di Wang},
    year={2025},
    eprint={2502.12904},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
asset		asset
attacks		attacks
datacreation		datacreation
dataset		dataset
evaluation		evaluation
script		script
utils		utils
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirement.txt		requirement.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements

📰 News

🦆 Inference and Evaluation

Create environment

Config Your API

Conduct multi-round inducements to LLMs

Conduct multi-round evaluation

Results Checking

💡 Abstract

📡 Evaluation Flow

🛠️ Data Construction and Augmentation Pipeline

🚀 Data Composition

Data Statistics

🏆 Leaderboard

🤖 Performance Across Two Tasks

❌ Disclaimers

📲 Contact

📖 BibTeX:

About

Releases

Packages

Contributors 5

Languages

kaustpradalab/Fraud-R1

Folders and files

Latest commit

History

Repository files navigation

Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements

📰 News

🦆 Inference and Evaluation

Create environment

Config Your API

Conduct multi-round inducements to LLMs

Conduct multi-round evaluation

Results Checking

💡 Abstract

📡 Evaluation Flow

🛠️ Data Construction and Augmentation Pipeline

🚀 Data Composition

Data Statistics

🏆 Leaderboard

🤖 Performance Across Two Tasks

❌ Disclaimers

📲 Contact

📖 BibTeX:

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages