Czech-SimpleQA

Problems and answers from OpenAI's SimpleQA eval translated into Czech. This work is based on the data from the paper:

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024. https://arxiv.org/abs/2411.04368

model	SimpleQA¹	Czech-SimpleQA
gpt-4o-mini-2024-07-18	9.5	8.1
gpt-4o-2024-11-20	38.8	31.4
claude-3-5-sonnet-20240620	35.0	25.8
claude-3-5-sonnet-20241022	N/A	31.1
claude-3-5-haiku-20241022	N/A	9.3

There is a post on my blog with more detailed results!

I Just Want the Eval Data

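The data file lives at src/czech_simpleqa/czech_simpleqa.csv.gz in the repository. The full URL is https://raw.githubusercontent.com/jancervenka/czech-simpleqa/refs/heads/main/src/czech_simpleqa/czech_simpleqa.csv.gz. Loading it with pandas looks like this: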

import pandas as pd

eval_data = pd.read_csv(
    "https://raw.githubusercontent.com/jancervenka/"
    "czech-simpleqa/refs/heads/main/src/czech_simpleqa/czech_simpleqa.csv.gz"
)

problem	target	czech_problem	czech_target
What was the population count in the 2011 census of the Republic of Nauru?	10,084	Jaký byl počet obyvatel při sčítání lidu v roce 2011 v Republice Nauru?	10 084

I Want to Use the Python Package

The package contains everything required to run the eval end-to-end and collect the results. You can install it with pip or any other Python package manager:

pip install czech-simpleqa

Then run the eval from the command line:

python -m czech_simpleqa.eval \
    --answering_model claude-3-5-haiku-20241022 \
    --grading_model gpt-4o \
    --output_file_path output/claude-3-5-haiku-20241022.csv \
    --max_concurrent_tasks 30

CLI Arguments

--answering_model: Model that will generate predicted answers to the problems in the eval.
--grading_model: Model that will grade the predicted answers from the answering model.
--output_file_path: Where to store the .csv file with the eval results.
--max_concurrent_tasks: Maximum number of concurrent model calls (default 20).
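
To illustrate what a cap like --max_concurrent_tasks does, here is a minimal sketch of one common way to limit concurrent calls, using asyncio.Semaphore. This is not necessarily how the package implements it; run_with_limit and fake_model_call are hypothetical names, and the sleep stands in for a real model request.

```python
import asyncio


async def run_with_limit(coros, max_concurrent_tasks=20):
    """Await all coroutines, never running more than the limit at once."""
    semaphore = asyncio.Semaphore(max_concurrent_tasks)

    async def guarded(coro):
        # Each coroutine waits here until a slot is free.
        async with semaphore:
            return await coro

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(c) for c in coros))


# Hypothetical stand-in for a model call; it records peak concurrency.
stats = {"in_flight": 0, "peak": 0}


async def fake_model_call(i):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulated network latency
    stats["in_flight"] -= 1
    return i * 2


results = asyncio.run(
    run_with_limit([fake_model_call(i) for i in range(10)], max_concurrent_tasks=3)
)
```

A lower limit is gentler on provider rate limits, while a higher one finishes the eval sooner.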

Output File Schema

problem	target	predicted_answer	grade
Jaké je rozlišení Cat B15 Q v pixelech?	480 x 800	Cat B15 Q má rozlišení 480 x 800 pixelů.	A
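
A short sketch of how the grade column could be summarized once the eval has finished. It assumes the letters follow the original SimpleQA grader convention (A = correct, B = incorrect, C = not attempted), which this README does not state explicitly. The function name summarize_grades and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Assumed label meanings, borrowed from the original SimpleQA grader.
GRADE_LABELS = {"A": "correct", "B": "incorrect", "C": "not_attempted"}


def summarize_grades(csv_file):
    """Return the share of each grade among the rows of an output file."""
    counts = Counter(row["grade"] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts.get(grade, 0) / total for grade, label in GRADE_LABELS.items()}


# Made-up rows that mimic the output file schema.
sample = io.StringIO(
    "problem,target,predicted_answer,grade\n"
    "q1,t1,a1,A\n"
    "q2,t2,a2,B\n"
    "q3,t3,a3,A\n"
    "q4,t4,a4,C\n"
)
summary = summarize_grades(sample)
```

For a real run, pass a file opened with open(path, newline="", encoding="utf-8") instead of the in-memory sample.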

Supported Models

Models from OpenAI and Anthropic are currently supported. Set the OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable for the provider you use (both, if the answering and grading models come from different providers).
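
Before starting a long run, it can help to check that the needed keys are present. The sketch below is an illustration only: the mapping from model name prefixes to environment variables is an assumption, not the package's actual routing logic, and missing_api_keys is a hypothetical helper.

```python
import os

# Hypothetical prefix-to-variable mapping, for illustration only.
KEY_BY_PREFIX = {"gpt": "OPENAI_API_KEY", "claude": "ANTHROPIC_API_KEY"}


def missing_api_keys(model_names, environ=os.environ):
    """Return the sorted names of required variables that are unset or empty."""
    required = set()
    for name in model_names:
        for prefix, variable in KEY_BY_PREFIX.items():
            if name.startswith(prefix):
                required.add(variable)
    return sorted(v for v in required if not environ.get(v))


# Example: answering with a Claude model, grading with an OpenAI model,
# while only the OpenAI key is set (placeholder value).
missing = missing_api_keys(
    ["claude-3-5-haiku-20241022", "gpt-4o"],
    environ={"OPENAI_API_KEY": "sk-placeholder"},
)
```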

Model Results

Answers with their grades from all the evaluated models can be found in the model_results/ directory.

¹ As reported in the SimpleQA README.md and in the paper.
