Commit 909aa12: scoring workflow docs
jjallaire committed Feb 12, 2025 (1 parent: f3c3170)
Showing 1 changed file with 110 additions and 53 deletions: docs/scorers.qmd

Note that `score()` above is declared as an `async` function. When creating custom scorers, declare them as `async` as well.

The components of `Score` include:

| Field | Type | Description |
|-------------------|-------------------|----------------------------------|
| `value` | `Value` | Value assigned to the sample (e.g. "C" or "I", or a raw numeric value). |
| `answer` | `str` | Text extracted from model output for comparison (optional). |
| `explanation` | `str` | Explanation of score, e.g. full model output or grader model output (optional). |
| `metadata` | `dict[str,Any]` | Additional metadata about the score to record in the log file (optional). |

: {tbl-colwidths=\[20,20,60\]}
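
For example, a scorer might construct a `Score` like this (a minimal sketch with made-up values):

``` python
from inspect_ai.scorer import CORRECT, Score

score = Score(
    value=CORRECT,                    # or INCORRECT, or a raw numeric value
    answer="Paris",                   # text extracted from the model output
    explanation="Model output: 'The capital of France is Paris.'",
    metadata={"confidence": "high"},  # extra data to record in the log file
)
```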

Next, we'll take a look at the source code for a couple of the built-in scorers.

You'll often want to use models in the implementation of scorers. Use the `get_model()` function to get either the currently evaluated model or another model interface. For example:

``` python
# use the model being evaluated for grading
grader_model = get_model()


# use another model for grading
grader_model = get_model("google/gemini-1.5-pro")
```

Use the `config` parameter of `get_model()` to override default generation options:

``` python
grader_model = get_model(
"google/gemini-1.5-pro",
config = GenerateConfig(temperature = 0.9, max_connections = 10)
)
```


### Example: Includes

Here is the source code for the built-in `includes()` scorer:
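
The version below is a lightly simplified sketch (the actual library implementation adds a few more options, but the overall shape is similar): it checks whether the target text appears in the model's completion and returns a `CORRECT`/`INCORRECT` score along with the answer text.

``` python
from inspect_ai.scorer import (
    CORRECT, INCORRECT, Score, Target, accuracy, scorer, stderr
)
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy(), stderr()])
def includes(ignore_case: bool = True):
    async def score(state: TaskState, target: Target):
        # compare the model's completion against the target text
        answer = state.output.completion
        text = target.text
        if ignore_case:
            correct = text.lower() in answer.lower()
        else:
            correct = text in answer
        return Score(
            value=CORRECT if correct else INCORRECT,
            answer=answer,
        )

    return score
```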

### Scorer with Complex Metrics

Sometimes it is useful for a scorer to compute multiple values (returning a dictionary as the score value) and to have metrics computed both for each key in the score dictionary and for the dictionary as a whole. For example:

``` python
@scorer(
metrics=[{ # <1>
task = Task(
scorer=letter_count()
)
```

1. The metrics for this scorer are a list: one element is a dictionary, which defines metrics to be applied to scores by key name; the other element is a Metric that will receive the entire score dictionary.
2. The score value itself is a dictionary, with keys corresponding to the keys defined in the metrics on the `@scorer` decorator.
3. The `total_count` metric will compute a metric based upon the entire score dictionary (since it isn't being mapped onto the dictionary by key).

### Reducing Multiple Scores

Use of `multi_scorer()` requires both a list of scorers and a *reducer*, which determines how the list of scores will be turned into a single score. For example, the "mode" reducer returns the score that appears most frequently in the answers.
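
A sketch of what this looks like (assuming `models` is a list of grader model names defined elsewhere):

``` python
multi_scorer(
    scorers=[model_graded_qa(model=model) for model in models],
    reducer="mode"
)
```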


### Sandbox Access

If your Solver is an [Agent](agents.qmd) with tool use, you might want to inspect the contents of the tool sandbox to score the task.
The contents of the sandbox for the Sample are available to the scorer; simply call `sandbox()` within your scorer to access it.

For example:


``` python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import Score, Target, accuracy, scorer
def challenge() -> Task:
sandbox="local",
scorer=check_file_exists(),
)

```
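
A sketch of what the `check_file_exists()` scorer referenced above might look like (an approximation of the pattern rather than the exact example code): it uses `sandbox()` from `inspect_ai.util` to try to read the target path, scoring `CORRECT` if the file is present. The `scorer`, `accuracy`, `Score`, and `Target` names come from the imports shown above.

``` python
from inspect_ai.scorer import CORRECT, INCORRECT
from inspect_ai.solver import TaskState
from inspect_ai.util import sandbox

@scorer(metrics=[accuracy()])
def check_file_exists():
    async def score(state: TaskState, target: Target):
        try:
            # read_file() raises if the file isn't present in the sandbox
            await sandbox().read_file(target.text)
            exists = True
        except FileNotFoundError:
            exists = False
        return Score(value=CORRECT if exists else INCORRECT)

    return score
```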



## Scoring Metrics

Each scorer provides one or more built-in metrics (typically `accuracy` and `stderr`) corresponding to the most typically useful metrics for that scorer.
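
For example, you can specify alternative or additional metrics via the `metrics` parameter when constructing the task (a sketch: `dataset` is a placeholder and `custom_metric()` stands in for your own `@metric` definition):

``` python
Task(
    dataset=dataset,
    solver=generate(),
    scorer=model_graded_qa(),
    metrics=[custom_metric()]
)
```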

If you still want to compute the built-in metrics, re-specify them along with the custom metrics:

``` python
metrics=[accuracy(), stderr(), custom_metric()]
```

Inspect includes some simple built-in metrics for calculating accuracy, mean, etc. These include:

- `stderr()`

Standard error of the mean.

- `bootstrap_stderr()`

::: {.callout-note appearance="simple"}
To install the development version of Inspect from GitHub:

``` bash
pip install git+https://github.com/UKGovernmentBEIS/inspect_ai
```
:::

The `stderr()` metric supports computing [clustered standard errors](https://en.wikipedia.org/wiki/Clustered_standard_errors) via the `cluster` parameter. Most scorers already include `stderr()` as a built-in metric, so to compute clustered standard errors you'll want to specify custom `metrics` for your task (which will override the scorer's built-in metrics).

For example, let's say you wanted to cluster on a "category" variable defined in `Sample` metadata:
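
A sketch of what this could look like (dataset, solver, and scorer are illustrative placeholders, and this assumes the `cluster` parameter takes the name of a sample metadata field):

``` python
from inspect_ai import Task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import accuracy, match, stderr
from inspect_ai.solver import generate

Task(
    dataset=[
        Sample(input="...", target="...", metadata={"category": "history"}),
        Sample(input="...", target="...", metadata={"category": "science"}),
    ],
    solver=generate(),
    scorer=match(),
    metrics=[accuracy(), stderr(cluster="category")],
)
```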

Note that the `Score` class contains a `Value` that is a union over several scalar types (as well as lists and dicts of those types).

## Reducing Epochs {#reducing-epochs}

If a task is run over more than one `epoch`, multiple scores will be generated for each sample. These scores are then *reduced* to a single score representing the score for the sample across all the epochs.

By default, this is done by taking the mean of all sample scores, but you may specify other strategies for reducing the samples by passing an `Epochs`, which includes both a count and one or more reducers to combine sample scores with. For example:
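
The sketch below (dataset, solver, and scorer details are illustrative placeholders) runs 5 epochs and reduces the per-epoch scores with the "mode" reducer:

``` python
from inspect_ai import Epochs, Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice, system_message

@task
def gpqa():
    return Task(
        dataset=csv_dataset("gpqa.csv"),
        solver=[system_message("Answer with a single letter."), multiple_choice()],
        scorer=choice(),
        epochs=Epochs(5, "mode"),
    )
```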


Inspect includes several built-in reducers which are summarised below.

| Reducer | Description |
|------------------|------------------------------------------------------|
| mean | Reduce to the average of all scores. |
| median | Reduce to the median of all scores. |
| mode | Reduce to the most common score. |
| max | Reduce to the maximum of all scores. |
| pass_at\_{k} | Probability of at least 1 correct sample given `k` epochs (<https://arxiv.org/pdf/2107.03374>). |
| at_least\_{k} | `1` if at least `k` samples are correct, else `0`. |

: {tbl-colwidths="\[30,70\]"}


::: callout-note
The built-in reducers will compute a reduced `value` for the score, and will populate the `answer` and `explanation` fields only if their values are equal across all epochs. The `metadata` field will always be reduced to the value of `metadata` in the first epoch. If your custom metrics need different behavior for these fields, implement your own custom reducer that merges or preserves them as required.

:::

### Custom Reducers
A custom reducer is simply a function that takes the list of per-epoch `Score` objects for a sample and returns a single combined `Score` (`Score` and `ScoreReducer` are imported from `inspect_ai.scorer`). For example, a reducer that averages numeric score values:

``` python
def mean_score() -> ScoreReducer:
    def reduce(scores: list[Score]) -> Score:
        # average the value of the per-epoch scores (assumes numeric score values)
        values = [float(score.value) for score in scores]
        return Score(value=sum(values) / len(values))

    return reduce
```


## Workflow {#sec-scorer-workflow}

::: {.callout-note appearance="simple"}
The `inspect score` command and `score()` function as described below are currently available only in the development version of Inspect. To install the development version from GitHub:

``` bash
pip install git+https://github.com/UKGovernmentBEIS/inspect_ai
```
:::

### Unscored Evals

By default, model output in evaluations is automatically scored. However, you can defer scoring by using the `--no-score` option. For example:

``` bash
inspect eval popularity.py --model openai/gpt-4 --no-score
```

This will produce a log with samples that have not yet been scored and with no evaluation metrics.

::: {.callout-tip appearance="simple"}
Using a distinct scoring step is particularly useful during scorer development, as it bypasses the entire generation phase, saving lots of time and inference costs.
:::

### Score Command

You can score an evaluation previously run this way using the `inspect score` command:

``` bash
# score an unscored eval
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval
```

This will use the scorers and metrics that were declared when the evaluation was run, applying them to score each sample and generate metrics for the evaluation.

You may choose to use a different scorer than the task scorer to score a log file. In this case, you can use the `--scorer` option to pass the name of a scorer (including one in a package) or the path to a source code file containing a scorer to use. For example:

``` bash
# use built in match scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer match

# use scorer in a package
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer scorertools/custom_scorer

# use scorer in a file
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorer.py

# use a custom scorer named 'classify' in a file with more than one scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorers.py@classify
```

If you need to pass arguments to the scorer, you can do so using scorer args (`-S`) like so:

``` bash
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer match -S location=end
```

#### Overwriting Logs

When you use the `inspect score` command, you will be prompted whether you'd like to overwrite the existing log file (with the scores added) or create a new scored log file. By default, the command will create a new log file with a `-scored` suffix to distinguish it from the original file. You can also control this using the `--overwrite` flag, as follows:

``` bash
# overwrite the log with scores from the task defined scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --overwrite
```

#### Overwriting Scores

When rescoring a previously scored log file, you have two options:

1) Append Mode (Default): The new scores will be added alongside the existing scores in the log file, keeping both the old and new results.
2) Overwrite Mode: The new scores will replace the existing scores in the log file, removing the old results.

You can choose which mode to use based on whether you want to preserve or discard the previous scoring data. To control this, use the `--action` arg:

``` bash
# append scores from custom scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorer.py --action append

# overwrite scores with new scores from custom scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorer.py --action overwrite
```

### Score Function

You can also use the `score()` function in your Python code to score evaluation logs. For example, if you are exploring the performance of different scorers, you can call `score()` with varying scorers or scorer options:

``` python
log = eval(popularity, model="openai/gpt-4")[0]

grader_models = [
    "openai/gpt-4",
    "anthropic/claude-3-opus-20240229",
    "google/gemini-1.0-pro",
    "mistral/mistral-large-latest"
]

# score the log with each grader model
scoring_logs = [score(log, model_graded_qa(model=model))
                for model in grader_models]

plot_results(scoring_logs)
```

You can also use this function to score an existing log file (appending or overwriting results) like so:

``` python
import os

from inspect_ai.log import read_eval_log, write_eval_log

# read the log
input_log_path = "./logs/2025-02-11T15-17-00-05-00_popularity_dPiJifoWeEQBrfWsAopzWr.eval"
log = read_eval_log(input_log_path)

grader_models = [
"openai/gpt-4",
"anthropic/claude-3-opus-20240229",
"google/gemini-1.0-pro",
"mistral/mistral-large-latest"
]

# perform the scoring using various models
scoring_logs = [score(log, model_graded_qa(model=model), action="append")
for model in grader_models]

# write log files with the model name as a suffix
for model, scored_log in zip(grader_models, scoring_logs):
base, ext = os.path.splitext(input_log_path)
output_file = f"{base}_{model.replace('/', '_')}{ext}"
write_eval_log(scored_log, output_file)
```
