Problems and answers from OpenAI's SimpleQA eval translated into Czech. This work is based on the data from the paper:
Measuring short-form factuality in large language models. Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus. arXiv preprint arXiv:2411.04368, 2024. https://arxiv.org/abs/2411.04368
model | SimpleQA [1] | Czech-SimpleQA |
---|---|---|
gpt-4o-mini-2024-07-18 | 9.5 | 8.1 |
gpt-4o-2024-11-20 | 38.8 | 31.4 |
claude-3-5-sonnet-20240620 | 35.0 | 25.8 |
claude-3-5-sonnet-20241022 | N/A | 31.1 |
claude-3-5-haiku-20241022 | N/A | 9.3 |
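To make the difference between the two columns easier to read, the gaps can be computed directly from the table. The sketch below hard-codes the three rows that have both scores; the names `scores`, `score_gap`, and `gaps` are illustrative and not part of the package.

```python
# Scores copied from the table above; rows with N/A are left out.
scores = {
    "gpt-4o-mini-2024-07-18": (9.5, 8.1),
    "gpt-4o-2024-11-20": (38.8, 31.4),
    "claude-3-5-sonnet-20240620": (35.0, 25.8),
}


def score_gap(simpleqa: float, czech_simpleqa: float) -> float:
    """Return the Czech-SimpleQA score minus the SimpleQA score, rounded to one decimal."""
    return round(czech_simpleqa - simpleqa, 1)


gaps = {model: score_gap(en, cs) for model, (en, cs) in scores.items()}

for model, gap in gaps.items():
    print(f"{model}: {gap:+.1f}")
```

For each of the three models with both scores, the gap is negative, meaning the Czech-SimpleQA score is lower than the SimpleQA score.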
There is a post on my blog with more detailed results!
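The data file lives at src/czech_simpleqa/czech_simpleqa.csv.gz in the repository; its full URL is https://raw.githubusercontent.com/jancervenka/czech-simpleqa/refs/heads/main/src/czech_simpleqa/czech_simpleqa.csv.gz.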
Loading it with pandas looks like this:
import pandas as pd

eval_data = pd.read_csv(
    "https://raw.githubusercontent.com/jancervenka/"
    "czech-simpleqa/refs/heads/main/src/czech_simpleqa/czech_simpleqa.csv.gz"
)
problem | target | czech_problem | czech_target |
---|---|---|---|
What was the population count in the 2011 census of the Republic of Nauru? | 10,084 | Jaký byl počet obyvatel při sčítání lidu v roce 2011 v Republice Nauru? | 10 084 |
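Each row pairs the original English problem and target with their Czech translations. As a small illustration, the sketch below rebuilds the example row as a one-row DataFrame so that it runs without network access, then extracts the Czech question and answer pairs. With the real data you would pass the loaded `eval_data` instead. The helper name `czech_pairs` is hypothetical and not part of the package.

```python
import pandas as pd

# One-row stand-in for the real dataset, using the example row from the table above.
sample = pd.DataFrame(
    [
        {
            "problem": "What was the population count in the 2011 census of the Republic of Nauru?",
            "target": "10,084",
            "czech_problem": "Jaký byl počet obyvatel při sčítání lidu v roce 2011 v Republice Nauru?",
            "czech_target": "10 084",
        }
    ]
)


def czech_pairs(df: pd.DataFrame) -> list[tuple[str, str]]:
    """Return (czech_problem, czech_target) pairs as plain strings."""
    return list(zip(df["czech_problem"].astype(str), df["czech_target"].astype(str)))


pairs = czech_pairs(sample)
print(pairs[0])
```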
The package contains everything required to run the eval end-to-end and collect the results.
You can install it with pip or any other Python package manager:
pip install czech-simpleqa
The eval is then run as a module from the command line:
python -m czech_simpleqa.eval \
--answering_model claude-3-5-haiku-20241022 \
--grading_model gpt-4o \
--output_file_path output/claude-3-5-haiku-20241022.csv \
--max_concurrent_tasks 30
- --answering_model: Model that will generate predicted answers to the problems in the eval.
- --grading_model: Model that will grade the predicted answers from the answering model.
- --output_file_path: Where to store the .csv file with the eval results.
- --max_concurrent_tasks: Maximum number of concurrent model calls (default 20).
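If you want to launch the eval from a Python script, for example to loop over several answering models, the command can be assembled programmatically. This is a minimal sketch that uses only the flags listed above; the helper `build_eval_command` is hypothetical and not part of the package.

```python
import sys


def build_eval_command(
    answering_model: str,
    grading_model: str,
    output_file_path: str,
    max_concurrent_tasks: int = 20,
) -> list[str]:
    """Assemble the argument list for `python -m czech_simpleqa.eval`."""
    return [
        sys.executable, "-m", "czech_simpleqa.eval",
        "--answering_model", answering_model,
        "--grading_model", grading_model,
        "--output_file_path", output_file_path,
        "--max_concurrent_tasks", str(max_concurrent_tasks),
    ]


command = build_eval_command(
    answering_model="claude-3-5-haiku-20241022",
    grading_model="gpt-4o",
    output_file_path="output/claude-3-5-haiku-20241022.csv",
)
print(" ".join(command))

# To actually launch the eval (requires the installed package and API keys):
# import subprocess
# subprocess.run(command, check=True)
```

Building the arguments as a list rather than a single string avoids shell quoting issues when the command is passed to `subprocess.run`.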
problem | target | predicted_answer | grade |
---|---|---|---|
Jaké je rozlišení Cat B15 Q v pixelech? | 480 x 800 | Cat B15 Q má rozlišení 480 x 800 pixelů. | A |
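The results file can be summarized by counting grades. The sketch below assumes the grade letters follow the original SimpleQA grader (A for correct, B for incorrect, C for not attempted); only the A grade is visible in the example row above, so treat the other two as an assumption. The frame `toy_results` is made-up data with the same columns as the results file, and `grade_shares` is a hypothetical helper.

```python
import pandas as pd

# Assumed meaning of the grade letters, following the original SimpleQA grader.
# Only "A" appears in the example row above; "B" and "C" are assumptions here.
GRADE_LABELS = {"A": "correct", "B": "incorrect", "C": "not attempted"}


def grade_shares(results: pd.DataFrame) -> dict[str, float]:
    """Return the share of graded rows that fall under each grade label."""
    counts = results["grade"].map(GRADE_LABELS).value_counts(normalize=True)
    return {label: float(counts.get(label, 0.0)) for label in GRADE_LABELS.values()}


# Tiny made-up frame with the same columns as the results file; not real eval output.
toy_results = pd.DataFrame(
    {
        "problem": ["q1", "q2", "q3", "q4"],
        "target": ["t1", "t2", "t3", "t4"],
        "predicted_answer": ["t1", "x", "t3", "I don't know"],
        "grade": ["A", "B", "A", "C"],
    }
)

shares = grade_shares(toy_results)
print(shares)
```

With a real run, `toy_results` would be replaced by `pd.read_csv(...)` on the file written to `--output_file_path`.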
Models from OpenAI and Anthropic are currently supported. Environment variables OPENAI_API_KEY or ANTHROPIC_API_KEY need to be configured.
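Before starting a long run, it can help to check which of these variables are set. The following is a minimal sketch using only the two variable names mentioned above; the helper `unset_api_keys` is hypothetical, and which keys you actually need presumably depends on the providers of your answering and grading models.

```python
import os

# Variable names taken from the text above.
API_KEY_VARIABLES = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def unset_api_keys(environ=os.environ) -> list[str]:
    """Return the names of the API key variables that are missing or empty."""
    return [name for name in API_KEY_VARIABLES if not environ.get(name)]


missing = unset_api_keys()
if missing:
    print("Not set:", ", ".join(missing))
```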
Answers with their grades from all the evaluated models can be found in the model_results/ directory.
Footnotes
[1] As reported in the SimpleQA README.md and in the paper.