DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails

🤗 Models · 🤗 Dataset (coming soon)

arXiv

DuoGuard is a guardrail LLM trained with a two-player reinforcement-learning framework designed to strengthen multilingual safeguards for large language models (LLMs). Our approach enables the co-evolution of a generator and a guardrail model, which iteratively improve the generation of synthetic multilingual safety data. DuoGuard significantly outperforms state-of-the-art models on multilingual safety tasks while maintaining high inference efficiency.

  • [Feb 2025] We have released the code, the arXiv preprint, and the model weights.
  • [Coming Soon] We will release the datasets and training scripts in a future update.

Figure 1. Overview of the two-player training pipeline. The generator produces synthetic examples from seed data. The classifier makes predictions on these examples, and each example is marked as correctly or incorrectly predicted based on its seed-data label. We train the generator with DPO to create increasingly challenging examples, which in turn improve the classifier through iterative training.
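The data-sorting step of this loop can be sketched in a few lines. Everything below is an illustrative assumption (the keyword "classifier", the function names, and the toy examples are made up for this sketch); the released training scripts are the authoritative implementation.

```python
def classify(example: str) -> bool:
    """Stand-in classifier: flags an example as unsafe when it contains
    an obvious keyword. A real guardrail model outputs probabilities."""
    return "attack" in example


def build_dpo_pairs(examples, labels):
    """Sort generated examples by classifier outcome.

    Examples the classifier gets WRONG are kept as 'chosen' for DPO
    (they are the challenging cases the generator should produce more of);
    correctly classified examples become 'rejected'.
    """
    chosen, rejected = [], []
    for example, label in zip(examples, labels):
        predicted = classify(example)
        (chosen if predicted != label else rejected).append(example)
    return chosen, rejected


# Toy synthetic examples, each labelled unsafe (True) via its seed data.
examples = ["how to attack a server", "recipe for pancakes"]
labels = [True, True]

chosen, rejected = build_dpo_pairs(examples, labels)
print(chosen)    # the misclassified, harder example
print(rejected)  # the easy, correctly classified example
```

Training the generator to prefer the `chosen` examples over the `rejected` ones is what pushes it toward increasingly challenging data at each round.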

Setup

Environment Installation

conda create -n duoguard python=3.10 -y
conda activate duoguard
pip install -r requirements.txt

Evaluation

In evaluation/test_single_input.py, we provide the code to test a single input entry and obtain the full probability output from DuoGuard.
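As a rough illustration of what a "full probability output" from a multi-label guardrail classifier looks like, the sketch below maps one logit per safety category to an independent per-category probability via a sigmoid. The category names, threshold, and logit values are assumptions for illustration only; see `evaluation/test_single_input.py` for the actual interface.

```python
import math

# Hypothetical category set; DuoGuard's actual label set may differ.
CATEGORIES = ["violent_crimes", "hate", "self_harm", "sexual_content"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def probabilities(logits):
    """Map one logit per category to an independent probability
    (multi-label classification: a sigmoid per category, not a
    softmax across categories)."""
    return {c: sigmoid(z) for c, z in zip(CATEGORIES, logits)}


def is_unsafe(probs, threshold=0.5):
    """Flag the input if any category probability exceeds the threshold."""
    return any(p > threshold for p in probs.values())


probs = probabilities([2.0, -3.0, -1.5, 0.2])
print(probs)
print(is_unsafe(probs))
```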

Run Evaluation Script

bash scripts/eval.sh

Run Language-Specific Evaluations

python evaluation/evaluate_duoguard.py --language En
python evaluation/evaluate_duoguard.py --language Fr
python evaluation/evaluate_duoguard.py --language Es
python evaluation/evaluate_duoguard.py --language De

📊 Results

DuoGuard achieves superior multilingual safety performance compared to existing guardrail models, averaged across six benchmarks (XSTest, OpenAI Moderation, ToxicChat, BeaverTails, RTP-LX, XSafety):

| Model       | Size | En-F1 | Fr-F1 | Es-F1 | De-F1 | Speed (ms/input) |
|-------------|------|-------|-------|-------|-------|------------------|
| LlamaGuard3 | 1B   | 45.2  | 44.6  | 45.0  | 44.7  | 45.6             |
| ShieldGemma | 2B   | 43.1  | 37.4  | 37.0  | 36.8  | 61.8             |
| LlamaGuard2 | 8B   | 59.7  | 56.6  | 56.5  | 55.4  | 52.3             |
| LlamaGuard3 | 8B   | 63.4  | 61.9  | 61.5  | 61.3  | 72.1             |
| DuoGuard    | 0.5B | 74.9  | 72.7  | 73.9  | 71.9  | 16.0             |
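The F1 columns above are the standard binary F1 (harmonic mean of precision and recall); a minimal reference computation, assuming "unsafe" is the positive class, looks like this:

```python
def f1_score(y_true, y_pred):
    """Binary F1 with 1 ('unsafe') as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# One true positive, one false positive, one false negative:
# precision = recall = 0.5, so F1 = 0.5.
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```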

📄 Citation

If you use DuoGuard in your research, please cite:

@misc{deng2025duoguardtwoplayerrldrivenframework,
      title={DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails}, 
      author={Yihe Deng and Yu Yang and Junkai Zhang and Wei Wang and Bo Li},
      year={2025},
      eprint={2502.05163},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.05163}, 
}
