Commit 65f5897: [nanoeval] update readme (#45)

kliu128 authored Feb 26, 2025 · 1 parent 25ce1b8
Showing 1 changed file with 43 additions and 18 deletions: project/nanoeval/README.md

# nanoeval

Simple, ergonomic, and high performance evals. We use it at OpenAI as part of our infrastructure to run Preparedness evaluations.

# Installation

```bash
# Using https://github.com/astral-sh/uv (recommended)
uv add "git+https://github.com/openai/SWELancer-Benchmark#egg=nanoeval&subdirectory=project/nanoeval"
# Using pip
pip install "git+https://github.com/openai/SWELancer-Benchmark#egg=nanoeval&subdirectory=project/nanoeval"
```

nanoeval is pre-release software and may have breaking changes, so it's recommended that you pin your installation to a specific commit. The uv command above will do this for you.
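If you install with pip, you can pin to a commit yourself by adding an `@<revision>` to the URL. The SHA below (this commit) is only an example:

```bash
pip install "git+https://github.com/openai/SWELancer-Benchmark@65f5897#egg=nanoeval&subdirectory=project/nanoeval"
```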

# Principles

# Concepts

- `Eval` - A [chz](https://github.com/openai/chz) class. Enumerates a set of tasks, and (typically) uses a "Solver" to solve them and then records the results. Can be configured in code or on the CLI using a chz entrypoint.
- `EvalSpec` - An eval to run and runtime characteristics of how to run it (e.g. concurrency, recording, other administrivia)
- `Task` - A single scoreable unit of work.
- `Solver` - A strategy (usually involving sampling a model) to go from a task to a result that can be scored. For example, there may be different ways to prompt a model to answer a multiple-choice question (e.g. looking at logits, few-shot prompting).

# Running your first eval

The executors (the workers that actually run tasks) can operate in two modes:
1. **In-process:** The executor is just an async task running in the same process as the main eval script. The default.
2. **Multiprocessing:** Starts a pool of executor processes that all poll the db. Use this via `spec.runner.experimental_use_multiprocessing=True`.
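For example, if your eval script exposes a chz entrypoint (the module path below is hypothetical), the flag can be set straight from the CLI:

```bash
# Hypothetical entrypoint module; the flag itself is documented above.
python3 -m my_project.my_eval spec.runner.experimental_use_multiprocessing=True
```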

## Performance

nanoeval has been tested up to ~5,000 concurrent rollouts. It is likely that it can go higher.

For highest performance, use multiprocessing with as many processes as your system memory + core count allows. See `RunnerArgs` for documentation.
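As a rough code sketch: `EvalSpec` and `RunnerArgs` are referenced in this README, but the import path and the `concurrency` field below are assumptions, not verified API.

```python
# Sketch only: apart from experimental_use_multiprocessing (documented above),
# the import path and the concurrency field are assumptions.
from nanoeval.eval import EvalSpec, RunnerArgs

def build_spec(my_eval):  # my_eval: a configured chz Eval instance
    return EvalSpec(
        eval=my_eval,
        runner=RunnerArgs(
            concurrency=1024,                       # assumed field name
            experimental_use_multiprocessing=True,  # documented above
        ),
    )
```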

## Monitoring

nanoeval has a tiny built-in monitor to track ongoing evals. It's a Streamlit app that visualizes the state of the internal run-state database, which can be helpful for diagnosing hangs on specific tasks. To use it:

```bash
# either set spec.runner.use_monitor=True OR run this command:
python3 -m nanoeval.bin.mon
```

## Resumption

Because nanoeval uses a persistent database to track the state of individual tasks in a run, you can restart an in-progress eval if it crashes. (In-progress rollouts will be restarted from scratch, but completed rollouts will be saved.) To do this:

```bash
# Restarts the eval in a new process
python3 -m nanoeval.extras.resume run_set_id=...
```

You can list all run sets (databases) using the following command:

```bash
ls -lh "$(python3 -c "from nanoeval.fs_paths import database_dir; print(database_dir())")"
```

The run set ID for each database is simply the filename, without the `.db*` suffix.

# Writing your first eval

An eval is just a `chz` class that defines `get_name()`, `get_tasks()`, `evaluate()` and `get_summary()`. Start with `gpqa_simple.py`; copy it and modify it to suit your needs. If necessary, drop down to the base `nanoeval.Eval` class instead of using `MCQEval`.
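Below is a schematic skeleton of such a class. This is a sketch rather than the real API: the import path, the method signatures, and the placeholder bodies are assumptions based only on the method names listed above.

```python
# Schematic skeleton only: import path, signatures, and bodies are assumptions
# based on the method names above, not the verified nanoeval API.
import chz
from nanoeval import Eval

@chz.chz
class MyEval(Eval):
    def get_name(self) -> str:
        return "my_eval"

    async def get_tasks(self):
        # Return every Task (one scoreable unit of work) for this run.
        ...

    async def evaluate(self, task):
        # Solve one task, typically by sampling a model, and score the result.
        ...

    def get_summary(self, results):
        # Aggregate per-task results into summary metrics, e.g. accuracy.
        ...
```

Because the class is chz-based, its fields (and those of `EvalSpec`) can also be configured from the CLI via a chz entrypoint.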

The following sections describe common needs and how to address them.

## Public API

You may import code from any `nanoeval.*` package that does not start with an underscore. Functions and classes that start with an underscore are considered private.
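To illustrate the convention (the module path below is a placeholder, not a promise that this exact module exists):

```python
# Placeholder module path, shown only to illustrate the naming convention.
from nanoeval.eval import EvalSpec    # public: no leading underscore
# from nanoeval.eval import _connect  # private: leading underscore, do not import
```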


# Debugging


## Kill dangling executors

nanoeval uses `multiprocessing` to execute rollouts in parallel. Sometimes, if you ctrl-c the main job, the multiprocessing executors don’t have time to exit. A quick fix:

```bash
pkill -f multiprocessing.spawn
```

## Debugging stuck runs

`py-spy` is an excellent tool to figure out where processes are stuck if progress isn’t happening. You can check the monitor to find the PIDs of all the executors and py-spy them one by one. The executors also run `aiomonitor`, so you can connect to them via `python3 -m aiomonitor.cli ...` to inspect async tasks.
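For example, assuming the monitor reports an executor PID of 12345 (`py-spy dump` prints a one-shot stack trace of every thread):

```bash
# One-shot stack dump of a stuck executor
py-spy dump --pid 12345
# Inspect live asyncio tasks in that executor; see aiomonitor's docs for host/port flags
python3 -m aiomonitor.cli
```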

## Diagnosing main thread stalls

nanoeval relies heavily on Python asyncio for concurrency within each executor process, so any code that blocks the main thread stalls every in-flight rollout in that process and hurts throughput. A common footgun is making a synchronous LLM or HTTP call, which can block the event loop for tens of seconds.

Tracking down blocking calls can be annoying, so nanoeval comes with some built-in features to diagnose these.

1. Blocking synchronous calls will trigger a stacktrace dump to a temporary directory. You can see them by running `open "$(python3 -c "from nanoeval.fs_paths import stacktrace_root_dir; print(stacktrace_root_dir())")"`.
2. Blocking synchronous calls will also trigger a console warning.
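The usual fix is to move the blocking call off the event loop. A minimal sketch, using `requests` as a stand-in for any synchronous client:

```python
import asyncio
import requests

async def fetch(url: str) -> str:
    # BAD: calling requests.get(url) directly here would block the event loop
    # for the entire request. GOOD: run it in a worker thread instead.
    response = await asyncio.to_thread(requests.get, url, timeout=30)
    return response.text
```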
