DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails

🤗 Models · 🤗 Dataset (coming soon)

arXiv

DuoGuard is a guardrail LLM trained with a two-player reinforcement-learning framework designed to strengthen multilingual safeguards for large language models (LLMs). Our approach enables the co-evolution of a generator and a guardrail model, which iteratively improve the generation of synthetic multilingual safety data. DuoGuard significantly outperforms state-of-the-art models on multilingual safety tasks while maintaining high inference efficiency.

  • [Feb 2025] We have released the code, the arXiv preprint, and the model weights.
  • [Coming Soon] We will release the datasets and training scripts in a future update.

Figure 1. Overview of the two-player training pipeline. The generator produces synthetic examples from seed data. The classifier makes predictions on these examples, and each example is marked as correctly or incorrectly predicted based on its seed-data label. We train the generator with DPO to create increasingly challenging examples, which in turn improve the classifier through iterative training.
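The data-sorting step of this loop can be sketched in a few lines. Everything below is an illustrative assumption (the keyword "classifier", the function names, and the toy examples are made up for this sketch); the released training scripts are the authoritative implementation.

```python
def classify(example: str) -> bool:
    """Stand-in classifier: flags an example as unsafe when it contains
    an obvious keyword. A real guardrail model outputs probabilities."""
    return "attack" in example


def build_dpo_pairs(examples, labels):
    """Sort generated examples by classifier outcome.

    Examples the classifier gets WRONG are kept as 'chosen' for DPO
    (they are the challenging cases the generator should produce more of);
    correctly classified examples become 'rejected'.
    """
    chosen, rejected = [], []
    for example, label in zip(examples, labels):
        predicted = classify(example)
        (chosen if predicted != label else rejected).append(example)
    return chosen, rejected


# Toy synthetic examples, each labelled unsafe (True) via its seed data.
examples = ["how to attack a server", "recipe for pancakes"]
labels = [True, True]

chosen, rejected = build_dpo_pairs(examples, labels)
print(chosen)    # the misclassified, harder example
print(rejected)  # the easy, correctly classified example
```

Training the generator to prefer the `chosen` examples over the `rejected` ones is what pushes it toward increasingly challenging data at each round.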

Setup

Environment Installation

conda create -n duoguard python=3.10 -y
conda activate duoguard
pip install -r requirements.txt

Evaluation

In evaluation/test_single_input.py, we provide the code to test a single input entry and obtain the full probability output from DuoGuard.
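As a rough illustration of what a "full probability output" from a multi-label guardrail classifier looks like, the sketch below maps one logit per safety category to an independent per-category probability via a sigmoid. The category names, threshold, and logit values are assumptions for illustration only; see `evaluation/test_single_input.py` for the actual interface.

```python
import math

# Hypothetical category set; DuoGuard's actual label set may differ.
CATEGORIES = ["violent_crimes", "hate", "self_harm", "sexual_content"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def probabilities(logits):
    """Map one logit per category to an independent probability
    (multi-label classification: a sigmoid per category, not a
    softmax across categories)."""
    return {c: sigmoid(z) for c, z in zip(CATEGORIES, logits)}


def is_unsafe(probs, threshold=0.5):
    """Flag the input if any category probability exceeds the threshold."""
    return any(p > threshold for p in probs.values())


probs = probabilities([2.0, -3.0, -1.5, 0.2])
print(probs)
print(is_unsafe(probs))
```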

Run Evaluation Script

bash scripts/eval.sh

Run Language-Specific Evaluations

python evaluation/evaluate_duoguard.py --language En
python evaluation/evaluate_duoguard.py --language Fr
python evaluation/evaluate_duoguard.py --language Es
python evaluation/evaluate_duoguard.py --language De

📊 Results

DuoGuard achieves superior multilingual safety performance compared to existing guardrail models, averaged across six benchmarks (XSTest, OpenAI Moderation, ToxicChat, BeaverTails, RTP-LX, XSafety):

| Model       | Size | En-F1 | Fr-F1 | Es-F1 | De-F1 | Speed (ms/input) |
|-------------|------|-------|-------|-------|-------|------------------|
| LlamaGuard3 | 1B   | 45.2  | 44.6  | 45.0  | 44.7  | 45.6             |
| ShieldGemma | 2B   | 43.1  | 37.4  | 37.0  | 36.8  | 61.8             |
| LlamaGuard2 | 8B   | 59.7  | 56.6  | 56.5  | 55.4  | 52.3             |
| LlamaGuard3 | 8B   | 63.4  | 61.9  | 61.5  | 61.3  | 72.1             |
| DuoGuard    | 0.5B | 74.9  | 72.7  | 73.9  | 71.9  | 16.0             |
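The F1 columns above are the standard binary F1 (harmonic mean of precision and recall); a minimal reference computation, assuming "unsafe" is the positive class, looks like this:

```python
def f1_score(y_true, y_pred):
    """Binary F1 with 1 ('unsafe') as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# One true positive, one false positive, one false negative:
# precision = recall = 0.5, so F1 = 0.5.
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```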

📄 Citation

If you use DuoGuard in your research, please cite:

@misc{deng2025duoguardtwoplayerrldrivenframework,
      title={DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails}, 
      author={Yihe Deng and Yu Yang and Junkai Zhang and Wei Wang and Bo Li},
      year={2025},
      eprint={2502.05163},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.05163}, 
}
