Normalization configuration docs #270

docs/normalization/0_basic_info.md (60 additions, 0 deletions)
# LM-Polygraph Normalization Methods

LM-Polygraph implements several uncertainty normalization methods to convert raw uncertainty scores into more interpretable confidence values bounded between 0 and 1. Here are the key normalization approaches:

## Basic Normalization Methods

### MinMax Normalization (MinMaxNormalizer in `minmax.py`)

- Takes raw uncertainty scores and linearly scales them to [0,1] range.
- Flips the sign since uncertainty scores should be negatively correlated with confidence.
- Uses scikit-learn's `MinMaxScaler` internally.
- Simple but doesn't maintain a direct connection to output quality.
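
A minimal sketch of this scheme, assuming scikit-learn's `MinMaxScaler` as described above (the array values and variable names are illustrative, not taken from the library):

```python
# Minimal sketch of MinMax confidence normalization: flip the sign so that
# higher values mean more confidence, then scale to [0, 1] on calibration data.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

calibration_ue = np.array([2.3, 0.7, 5.1, 1.2])      # raw uncertainty scores
scaler = MinMaxScaler().fit(-calibration_ue.reshape(-1, 1))

new_ue = np.array([0.9, 4.0])
confidence = scaler.transform(-new_ue.reshape(-1, 1)).ravel()
confidence = np.clip(confidence, 0.0, 1.0)            # clip scores outside the calibration range
print(confidence)                                      # higher uncertainty -> lower confidence
```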

### Quantile Normalization (QuantileNormalizer in `quantile.py`)

- Transforms uncertainty scores into their corresponding percentile ranks.
- Uses empirical CDF to map scores to [0,1] range.
- Provides uniformly distributed confidence scores.
- May lose some granularity of original uncertainty estimates.
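
The empirical-CDF mapping can be sketched as follows; the helper function name and example values are invented for illustration:

```python
# Sketch of quantile normalization via the empirical CDF of calibration uncertainties.
import numpy as np

calibration_ue = np.sort(np.array([2.3, 0.7, 5.1, 1.2, 3.3]))

def quantile_confidence(ue):
    # Confidence = fraction of calibration samples that are MORE uncertain.
    ranks = np.searchsorted(calibration_ue, ue, side="right")
    return 1.0 - ranks / len(calibration_ue)

print(quantile_confidence(np.array([0.5, 2.5, 6.0])))  # -> [1.0, 0.4, 0.0]
```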

## Performance-Calibrated Confidence (PCC) Methods

### Binned PCC (BinnedPCCNormalizer in `binned_pcc.py`)

- Splits calibration data into bins based on uncertainty values.
- Each bin contains approximately the same number of samples.
- Confidence score is the mean output quality of samples in the corresponding bin.
- Provides an interpretable connection between confidence and expected quality.
- Drawback: Can change ordering of samples compared to raw uncertainty.
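
A simplified sketch of the binned approach, assuming NumPy arrays of calibration uncertainties and matching quality scores (the helper names `fit_binned_pcc`/`transform_binned_pcc` are invented for this example):

```python
# Simplified sketch of binned PCC. Assumes enough calibration samples that
# every bin is non-empty.
import numpy as np

def fit_binned_pcc(cal_ue, cal_quality, num_bins=10):
    # Bin edges at uncertainty quantiles -> roughly equal samples per bin.
    edges = np.quantile(cal_ue, np.linspace(0, 1, num_bins + 1))
    bin_ids = np.clip(np.digitize(cal_ue, edges[1:-1]), 0, num_bins - 1)
    # Confidence of a bin = mean output quality of its calibration samples.
    bin_quality = np.array([cal_quality[bin_ids == b].mean() for b in range(num_bins)])
    return edges, bin_quality

def transform_binned_pcc(ue, edges, bin_quality):
    bin_ids = np.clip(np.digitize(ue, edges[1:-1]), 0, len(bin_quality) - 1)
    return bin_quality[bin_ids]
```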

### Isotonic PCC (IsotonicPCCNormalizer in `isotonic_pcc.py`)

- Uses Centered Isotonic Regression (CIR) to fit a monotonic relationship.
- Maps uncertainty scores to output quality while preserving ordering.
- Enforces monotonicity constraint to maintain uncertainty ranking.
- More robust than the binned approach while maintaining interpretability.
- Implementation based on CIR algorithm from Oron & Flournoy (2017).
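
As a rough stand-in, scikit-learn's standard `IsotonicRegression` illustrates the idea (the library itself implements CIR, which additionally centers the fitted steps):

```python
# Stand-in sketch using scikit-learn's standard IsotonicRegression
# (the library itself uses centered isotonic regression, CIR).
import numpy as np
from sklearn.isotonic import IsotonicRegression

cal_ue = np.array([0.2, 0.8, 1.5, 2.7, 3.9, 5.0])             # calibration uncertainties
cal_quality = np.array([0.95, 0.90, 0.70, 0.55, 0.30, 0.20])  # matching quality scores

iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=False, out_of_bounds="clip")
iso.fit(cal_ue, cal_quality)

# Confidences decrease monotonically with uncertainty and stay in [0, 1].
print(iso.predict(np.array([0.5, 3.0, 10.0])))
```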

## Common Interface: `BaseUENormalizer`

All normalizers follow a common interface defined in `BaseUENormalizer`:

- `fit()`: Learns normalization parameters from calibration data.
- `transform()`: Applies normalization to new uncertainty scores.
- `dumps()/loads()`: Serialization support for fitted normalizers.
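
A rough sketch of what such an interface can look like; the method names come from the list above, while the signatures and serialization details are assumptions:

```python
# Rough sketch of the normalizer interface (method names from the docs above;
# signatures and pickle-based serialization are assumptions).
from abc import ABC, abstractmethod
import pickle
import numpy as np

class BaseUENormalizer(ABC):
    @abstractmethod
    def fit(self, cal_ue: np.ndarray, cal_quality: np.ndarray) -> None:
        """Learn normalization parameters from calibration data."""

    @abstractmethod
    def transform(self, ue: np.ndarray) -> np.ndarray:
        """Map raw uncertainty scores to confidences in [0, 1]."""

    def dumps(self) -> bytes:
        """Serialize the fitted normalizer state."""
        return pickle.dumps(self.__dict__)

    def loads(self, data: bytes) -> None:
        """Restore a previously fitted normalizer state."""
        self.__dict__.update(pickle.loads(data))
```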

## Key Benefits of PCC Methods

- Direct connection to output quality metrics.
- Bounded interpretable range [0,1].
- Maintained correlation with generation quality.
- Easy to explain meaning to end users.

## Highlight: Isotonic PCC

The Isotonic PCC approach provides the best balance between:

- Maintaining the original uncertainty ranking.
- Providing interpretable confidence scores.
- Establishing a clear connection to expected output quality.

When using normalized scores, users can interpret them as estimates of relative output quality, making them more useful for downstream applications and human understanding.
docs/normalization/1_core_normalization_configuration.md (141 additions, 0 deletions)
# Core Normalization Configuration

## Overview
Core normalization configuration in LM-Polygraph defines how uncertainty scores are transformed into interpretable confidence values. These configurations control the fundamental behavior of all normalization methods across the system.

## Base Configuration Location
Core normalization configurations are located in:
```
/examples/configs/normalization/fit/default.yaml
```

## Available Normalization Methods

### 1. MinMax Normalization
Linearly scales uncertainty scores to [0,1] range.

```yaml
normalization:
type: "minmax"
clip: true # Whether to clip values outside [0,1] range
```

### 2. Quantile Normalization
Transforms scores into percentile ranks using empirical CDF.

```yaml
normalization:
type: "quantile"
```

### 3. Binned Performance-Calibrated Confidence (Binned PCC)
Maps uncertainty scores to confidence bins based on output quality.

```yaml
normalization:
type: "binned_pcc"
params:
num_bins: 10 # Number of bins for mapping
```

### 4. Isotonic Performance-Calibrated Confidence (Isotonic PCC)
Uses monotonic regression to map uncertainty to confidence while preserving ordering.

```yaml
normalization:
type: "isotonic_pcc"
params:
y_min: 0.0 # Minimum confidence value
y_max: 1.0 # Maximum confidence value
increasing: false # Whether mapping should be increasing
out_of_bounds: "clip" # How to handle out-of-range values
```

## Common Parameters

### Calibration Strategy
```yaml
normalization:
  calibration:
    strategy: "dataset_specific"  # or "global"
    background_dataset: null      # Optional background dataset for global calibration
```

### Data Processing
```yaml
normalization:
  processing:
    ignore_nans: true        # Whether to ignore NaN values in calibration
    normalize_metrics: true  # Whether to normalize quality metrics
```

### Caching
```yaml
normalization:
  cache:
    enabled: true
    path: "${cache_path}/normalization"
    version: "v1"
```

## Usage Examples

### Basic MinMax Normalization
```yaml
normalization:
type: "minmax"
clip: true
calibration:
strategy: "dataset_specific"
```

### Global Isotonic PCC
```yaml
normalization:
type: "isotonic_pcc"
params:
y_min: 0.0
y_max: 1.0
increasing: false
calibration:
strategy: "global"
background_dataset: "allenai/c4"
```

### Binned PCC with Custom Settings
```yaml
normalization:
type: "binned_pcc"
params:
num_bins: 20
processing:
ignore_nans: false
normalize_metrics: true
cache:
enabled: true
```

## Best Practices

1. **Method Selection**
- Use MinMax/Quantile for simple scaling needs
- Use PCC methods when interpretability is crucial
- Prefer Isotonic PCC when preserving score ordering is important

2. **Calibration Strategy**
- Use dataset-specific calibration when possible
- Use global calibration when consistency across tasks is needed
- Consider using background dataset for robust global calibration

3. **Performance Considerations**
- Enable caching for large datasets
- Adjust bin count based on dataset size
- Monitor memory usage with large calibration sets

## Integration with Other Configs
Core normalization settings can be overridden by:
- Task-specific configs
- Model-specific configs
- Instruction-tuned model configs

Core settings serve as defaults when not specified in other configuration layers.
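
A simplified illustration of this layering, assuming OmegaConf-style dictionaries (the actual composition is handled by Hydra config groups):

```python
# Simplified sketch of config layering with OmegaConf (an assumption about
# the mechanics; in practice Hydra composes the configs). Later layers
# override the core defaults, which fill in anything left unspecified.
from omegaconf import OmegaConf

core = OmegaConf.create({"normalization": {"type": "minmax", "clip": True}})
task_override = OmegaConf.create({"normalization": {"type": "isotonic_pcc"}})

merged = OmegaConf.merge(core, task_override)
print(merged.normalization.type)  # isotonic_pcc  (from the task-specific layer)
print(merged.normalization.clip)  # True          (inherited from the core default)
```
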
docs/normalization/2_dataset_specific_configuration.md (156 additions, 0 deletions)
# Dataset-Specific Normalization Configurations in LM-Polygraph

## Overview
Dataset-specific normalization configurations in LM-Polygraph allow fine-tuning how uncertainty scores are normalized for different tasks and data types. These configurations can be found in the evaluation config files under `/examples/configs/` and its subfolders.

## Configuration Structure

### 1. Common Parameters

Every dataset-specific configuration includes these core normalization parameters:

```yaml
# Dataset sampling configuration
subsample_background_train_dataset: 1000 # Size of background dataset for normalization
subsample_train_dataset: 1000 # Size of task-specific calibration dataset
subsample_eval_dataset: -1 # Size of evaluation dataset (-1 = full)

# Training data settings
train_dataset: null # Optional separate training dataset
train_test_split: false # Whether to split data for calibration
test_split_size: 1 # Test split ratio if splitting enabled

# Background dataset configuration
background_train_dataset: allenai/c4 # Default background dataset
background_train_dataset_text_column: text # Text column name
background_train_dataset_label_column: url # Label column name
background_load_from_disk: false # Loading mode
```

### 2. Task-Specific Configurations

#### Question-Answering Tasks (TriviaQA, MMLU, CoQA)
```yaml
# Additional QA-specific settings
process_output_fn:
  path: output_processing_scripts/qa_normalize.py
  fn_name: normalize_qa
normalize: true
normalize_metrics: true
target_ignore_regex: null
```
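
The `path` and `fn_name` above point at a task-specific processing script; the body below is only a plausible guess at what such a QA answer normalizer might do, not the actual file contents:

```python
# Hypothetical sketch of output_processing_scripts/qa_normalize.py.
# The body is an assumption: lowercase, drop articles and punctuation, and
# collapse whitespace, as is common for QA answer matching.
import re
import string

def normalize_qa(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r"\b(a|an|the)\b", " ", text)               # drop English articles
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())                              # collapse whitespace
```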

#### Translation Tasks (WMT)
```yaml
# Translation-specific normalization
source_ignore_regex: "^.*?: " # Regex to clean source text
target_ignore_regex: null # Regex to clean target text
normalize_translations: true
```
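
As an illustration of how an ignore regex like the one above could strip a prompt prefix before scoring (the example strings are invented):

```python
# Illustration only: applying a source_ignore_regex such as "^.*?: " to strip
# a prompt prefix from the source text before computing quality metrics.
import re

source_ignore_regex = r"^.*?: "
raw_source = "Translate German to English: Guten Morgen!"
cleaned = re.sub(source_ignore_regex, "", raw_source, count=1)
print(cleaned)  # -> "Guten Morgen!"
```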

#### Summarization Tasks (XSum, AESLC)
```yaml
# Summarization normalization
normalize_summaries: true
output_ignore_regex: null
processing:
  trim_outputs: true
  lowercase: true
```

### 3. Language-Specific Settings

For multilingual tasks (especially in claim-level fact-checking):

```yaml
# Language-specific normalization
language: "en" # Options: en, zh, ar, ru
multilingual_normalization:
  enabled: true
  use_language_specific_bins: true
  combine_language_statistics: false
```

## Usage Examples

### 1. Basic QA Task Configuration
```yaml
hydra:
  run:
    dir: ${cache_path}/${task}/${model.path}/${dataset}/${now:%Y-%m-%d}

defaults:
  - model: default
  - _self_

dataset: triviaqa
subsample_train_dataset: 1000
normalize: true
process_output_fn:
  path: output_processing_scripts/triviaqa.py
  fn_name: normalize_qa
```

### 2. Translation Task Setup
```yaml
dataset: wmt14_deen
subsample_train_dataset: 2000
source_ignore_regex: "^Translation: "
normalize_translations: true
background_train_dataset: null
```

### 3. Multilingual Configuration
```yaml
dataset: person_bio
language: zh
multilingual_normalization:
  enabled: true
  use_language_specific_bins: true
subsample_train_dataset: 1000
background_train_dataset: allenai/c4
```

## Key Considerations

### 1. Dataset Size and Sampling
- Use `subsample_train_dataset` to control calibration dataset size
- Larger values provide better calibration but increase compute time
- The default value of 1000 works well for most tasks

### 2. Background Dataset Usage
- The background dataset provides additional calibration data
- Useful for tasks with limited in-domain data
- The C4 dataset is the default choice for English tasks

### 3. Processing and Cleaning
- Task-specific normalization functions handle special cases
- Regular expressions clean input/output texts
- Language-specific processing for multilingual tasks

### 4. Performance Impact
- Larger sample sizes improve normalization quality but increase computational cost
- Using a background dataset adds overhead
- Consider caching normalized values for repeated evaluations

## Best Practices

1. **Dataset Size Selection**
- Use at least 1000 samples for calibration
- Increase for complex tasks or when accuracy is critical
- Consider computational resources available

2. **Background Dataset Usage**
- Use for tasks with limited training data
- Ensure background data distribution matches task
- Consider language and domain compatibility

3. **Processing Configuration**
- Configure task-specific normalization functions
- Use appropriate regex patterns for cleaning
- Enable language-specific processing for multilingual tasks

4. **Optimization Tips**
- Cache normalized values when possible
- Use smaller sample sizes during development
- Enable background dataset loading from disk for large datasets