This repository contains the code for the paper "Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges".
Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges. We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which we found to have a high inter-annotator agreement. Our study includes 9 judge models and 9 exam taker models -- both base and instruction-tuned. We assess the judge model's alignment across different model sizes, families, and judge prompts. Among other results, our research rediscovers the importance of using Cohen's kappa as a metric of alignment as opposed to simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores. We find that both Llama-3 70B and GPT-4 Turbo have an excellent alignment with humans, but in terms of ranking exam taker models, they are outperformed by both JudgeLM-7B and the lexical judge Contains, which have up to 34 points lower human alignment. Through error analysis and various other studies, including the effects of instruction length and leniency bias, we hope to provide valuable lessons for using LLMs as judges in the future.
All the experiments in the paper were run with the code in this repository on the Unity cluster at the University of Massachusetts Amherst, with results stored in a MongoDB database.
To run any benchmark and/or evaluation, the user only needs to specify the config as a JSON file and run the `main.py` script with the appropriate arguments.

The JSON config files are stored in the `configs` directory. The config files have the following keys:

- `metadata`: Metadata for the job, useful for storing any extra information about the job. The metadata is not used by the code.
- `benchmarks`: A list of benchmark configurations.
- `models`: A list of model configurations.
- `evaluators`: A list of evaluator configurations.
A benchmark job will run each benchmark with each model. An evaluation job will run each evaluator with each benchmark-model pair. An evaluation job assumes that the corresponding benchmark job has already been run.
A sample config file is shown below:
```json
{
    "metadata": {
        "version": "v1.6.0"
    },
    "benchmarks": [
        {
            "name": "triviaQA-400-v1.5.0",
            "cls": "TriviaQABenchmark",
            "subset": "unfiltered",
            "seed": 51,
            "num_samples": 400,
            "num_fewshot": 5
        }
    ],
    "models": [
        {
            "name": "gpt-4t",
            "cls": "OpenAIModel",
            "model": "gpt-4-turbo-2024-04-09",
            "chat": true
        },
        {
            "name": "mistral-7B",
            "cls": "MistralModel",
            "model": "mistralai/Mistral-7B-v0.1"
        }
    ],
    "evaluators": [
        {
            "name": "eval-qwen72-extract",
            "cls": "LLMExtractEvaluator",
            "model_config": {
                "cls": "HFModel",
                "model": "Qwen/Qwen1.5-72B-Chat",
                "chat": true,
                "max_new_tokens": 512
            },
            "truncate": "newlinequestion",
            "template": "TAG",
            "eval_tag": "evaluation"
        }
    ]
}
```
The code for all the benchmarks and evaluations is in the `src` directory. In the root of the project, you can run the `main.py` driver script to run a job. The script takes the following arguments:
- `-b` or `--benchmark-config`: The name of the config JSON file in the `configs` directory. This file contains the configuration for the benchmark to run.
- `-e` or `--eval-config`: The name of the config JSON file in the `configs` directory. This file contains the configuration for the evaluation to run.
- `-be` or `-eb`: The name of the config JSON file in the `configs` directory. This file contains the configuration for both the benchmark and evaluation to run.
- `-i` or `--inspect-config`: The name of the config JSON file in the `configs` directory. This file contains the configuration for manual inspection of the model output, which will be printed to the console.
- `-v` or `--verbose`: If set, the output will be more verbose.
- `--json-db`: Use a local JSON database instead of MongoDB. Useful for testing.
- `--markdown`: Used with the inspect job to save the output to a markdown file.
For example, to run a benchmark job using the config present in `configs/benchmark.json`, you can run the following command:

```
python main.py -b benchmark
```
To run an evaluation job using the config present in `configs/evaluation.json`, you can run the following command:

```
python main.py -e evaluation
```
To run both the benchmark and evaluation in a single job, you can run the following command:

```
python main.py -b benchmark -e evaluation
```
If the benchmark and evaluation configs are the same, you can run the following equivalent command:

```
python main.py -be benchmark
```
To run an inspection job using the config present in `configs/inspection.json`, you can run the following command:

```
python main.py -i inspection
```
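These flags can be combined. For instance, assuming the corresponding benchmark and evaluation results were stored with `--json-db`, a command along the following lines should inspect them from the local JSON database and save the output to a markdown file:

```
python main.py -i inspection --json-db --markdown
```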
Models have the following attributes:
- `cls` (`str`): The name of the model class to use
- `name` (`str`): Name for easy identification in outputs and logs, not used by the code
- `max_new_tokens` (`int`): The maximum number of new tokens to generate (default: 32)
- HuggingFace
  - `cls` = [`HFModel`, `LlamaModel`, `MistralModel`, `Phi2Model`]
  - `chat` (`bool`): If using the chat model (default: `False`)
  - `model` (`str`): The name of the model. Full model name including org name (e.g., `"meta-llama/Llama-2-70b"`). If using a custom model class (e.g., `LlamaModel`), you can specify the full model name or just the model name on HuggingFace (e.g., `"Llama-2-70b"`). In the latter case, the default org name for the model class (`HF_ORG_NAME` attribute) will be used (see the example config after this list).
All the supported models are listed below:

| Model | Org Name | Base Models | Chat Models |
| --- | --- | --- | --- |
| `HFModel` | tiiuae | falcon-7b, falcon-40b, falcon-180b | falcon-7b-instruct, falcon-40b-instruct, falcon-180b-chat |
| `HFModel` | google | gemma-2b, gemma-7b | gemma-2b-it, gemma-7b-it |
| `LlamaModel` | meta-llama | Llama-2-7b-hf, Llama-2-13b-hf, Llama-2-70b-hf, Meta-Llama-3-8B, Meta-Llama-3-70B | Llama-2-7b-chat-hf, Llama-2-13b-chat-hf, Llama-2-70b-chat-hf, Meta-Llama-3-8B-Instruct, Meta-Llama-3-70B-Instruct |
| `MistralModel` | mistralai | Mistral-7B-v0.1, Mixtral-8x7B-v0.1 | Mistral-7B-Instruct-v0.2, Mixtral-8x7B-Instruct-v0.1 |
| `HFModel` | allenai | OLMo-1B, OLMo-7B | OLMo-7B-Instruct |
| `Phi2Model` | microsoft | phi-2 | phi-2 |
| `HFModel` | lmsys | | vicuna-7b-v1.5, vicuna-13b-v1.5, vicuna-33b-v1.3 |
| `HFModel` | HuggingFaceH4 | | zephyr-7b-beta, zephyr-7b-gemma-v0.1 |
| `HFModel` | Qwen | Qwen1.5-0.5B, Qwen1.5-1.8B, Qwen1.5-4B, Qwen1.5-7B, Qwen1.5-14B, Qwen1.5-72B | Qwen1.5-0.5B-Chat, Qwen1.5-1.8B-Chat, Qwen1.5-4B-Chat, Qwen1.5-7B-Chat, Qwen1.5-14B-Chat, Qwen1.5-72B-Chat |
- OpenAI
  - `cls` = `OpenAIModel`
  - `model` (`str`): The name of the model, as defined by OpenAI in their API reference (e.g., `"gpt-4-turbo-2024-04-09"`)
- Anthropic
  - `cls` = `AnthropicModel`
  - `chat` (`bool`): If using the chat model (default: `False`)
- HumanModel
  - `cls` = `HumanModel`
  - Makes queries to the human using the CLI. For testing purposes.
- DummyModel
  - `cls` = `"DummyModel"`
  - Returns the prompt with a fixed prefix. For automated testing purposes.
  - `prefix` (`str`): The fixed prefix to return (default: `""`)
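As an illustration of the attributes above (including the org-name shorthand for custom model classes), a `models` section along the following lines should work; the `name` values are arbitrary labels and the particular checkpoints are only examples:

```json
"models": [
    {
        "name": "llama2-7b",
        "cls": "LlamaModel",
        "model": "Llama-2-7b-hf"
    },
    {
        "name": "qwen-0.5b-chat",
        "cls": "HFModel",
        "model": "Qwen/Qwen1.5-0.5B-Chat",
        "chat": true,
        "max_new_tokens": 64
    },
    {
        "name": "dummy",
        "cls": "DummyModel",
        "prefix": "Answer: "
    }
]
```

Here `LlamaModel` is given only `"Llama-2-7b-hf"`, so its `HF_ORG_NAME` (meta-llama) fills in the org, while the `HFModel` entry spells out the full `Qwen/...` name.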
Benchmarks have the following attributes:
- `cls` (`str`): The name of the benchmark class to use
- `name` (`str`): Name for easy identification in outputs and logs, not used by the code
- `seed` (`int`): The random seed to use for shuffling the few-shot examples and benchmark questions (if sampling). Default is `0`.
- `num_samples` (`int`): The number of samples (questions) to use. A value of `None` means all questions are used without shuffling. Default is `None`.
- `num_fewshot` (`int`): The number of few-shot examples to use. Default is `0`.
- `template` (`str`): Name of the template to use for creating the prompt. Default is `"BASE_SIMPLE"`. See `llm_eval/helpers/templates/` for available templates.
- Natural Questions
  - `cls` = `"NaturalQuestionsBenchmark"`
- TriviaQA
  - `cls` = `"TriviaQABenchmark"`
  - `subset` (`str`: `"unfiltered"`/`"rc"`): The subset of TriviaQA to use for benchmarking
- MMLU
  - `cls` = `"MMLUBenchmark"`
Evaluators have the following attributes:
- `cls` (`str`): The name of the evaluator class to use
- `name` (`str`): Name for easy identification in outputs and logs, not used by the code
- Exact Match
  - `cls` = `"ExactMatchEvaluator"`
  - `cased` (`bool`): If the evaluation should be case-sensitive (default: `True`)
- Contains
  - `cls` = `"ContainsMatchEvaluator"`
  - `cased` (`bool`): If the evaluation should be case-sensitive (default: `True`)
- HumanEvaluator
  - `cls` = `"HumanEvaluator"`
  - Makes queries to the human using the CLI. The human must answer with `y`, `n`, `y?`, or `n?` for each prompt.
- LLMEvaluator
  - `cls` = `"LLMEvaluator"`
  - `model` (`Model`): The model to use for generating the evaluation
  - `template` (`str`): Name of the template to use for creating the prompt for the evaluator. Default is `"DEFAULT"`.
  - `truncate` (`str`): The truncation logic to use. Available options are `"newline"`, `"newlinequestion"`, `"skip"`, and `"eleutherai"`. Default is `"newline"`.
  - `eval_tag` (`str`): A tag to identify the evaluation in the output. For example, if the `eval_tag` is `"tag"` and the raw output of the evaluator is `"The answer is <tag>correct</tag> because..."`, the evaluator will extract `"correct"` as the answer. This is useful if the template asks the evaluator to wrap its evaluation inside a specified tag. If the `eval_tag` is `None`, the raw output is used as the answer. Default is `None`.
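For reference, an `evaluators` list that pairs a lexical judge with an LLM judge might look like the sketch below. The `name` values and the choice of judge model are illustrative, and the nested model configuration follows the model attributes documented earlier; note that the sample config at the top of this README uses a `model_config` key for `LLMExtractEvaluator`, so check which key your evaluator class expects:

```json
"evaluators": [
    {
        "name": "eval-contains-uncased",
        "cls": "ContainsMatchEvaluator",
        "cased": false
    },
    {
        "name": "eval-llama3-70b-judge",
        "cls": "LLMEvaluator",
        "model": {
            "cls": "LlamaModel",
            "model": "Meta-Llama-3-70B-Instruct",
            "chat": true,
            "max_new_tokens": 256
        },
        "truncate": "newlinequestion",
        "template": "DEFAULT"
    }
]
```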
```bibtex
@misc{thakur2024judging,
      title={Judging the Judges: Evaluating Alignment and Vulnerabilities in {LLMs}-as-Judges},
      author={Aman Singh Thakur and Kartik Choudhary and Venkat Srinik Ramayapally and Sankaran Vaidyanathan and Dieuwke Hupkes},
      year={2024},
      eprint={2406.12624},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.12624},
}
```