Normalization configuration docs #270

docs/normalization/0_basic_info.md (60 additions, 0 deletions)
# LM-Polygraph Normalization Methods

LM-Polygraph implements several uncertainty normalization methods to convert raw uncertainty scores into more interpretable confidence values bounded between 0 and 1. Here are the key normalization approaches:

## Basic Normalization Methods

### MinMax Normalization (MinMaxNormalizer in `minmax.py`)

- Takes raw uncertainty scores and linearly scales them to [0,1] range.
- Flips the sign since uncertainty scores should be negatively correlated with confidence.
- Uses scikit-learn's `MinMaxScaler` internally.
- Simple but doesn't maintain a direct connection to output quality.
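
A minimal sketch of this scheme, assuming scikit-learn's `MinMaxScaler` as described above (the array values and variable names are illustrative, not taken from the library):

```python
# Minimal sketch of MinMax confidence normalization: flip the sign so that
# higher values mean more confidence, then scale to [0, 1] on calibration data.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

calibration_ue = np.array([2.3, 0.7, 5.1, 1.2])      # raw uncertainty scores
scaler = MinMaxScaler().fit(-calibration_ue.reshape(-1, 1))

new_ue = np.array([0.9, 4.0])
confidence = scaler.transform(-new_ue.reshape(-1, 1)).ravel()
confidence = np.clip(confidence, 0.0, 1.0)            # clip scores outside the calibration range
print(confidence)                                      # higher uncertainty -> lower confidence
```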

### Quantile Normalization (QuantileNormalizer in `quantile.py`)

- Transforms uncertainty scores into their corresponding percentile ranks.
- Uses empirical CDF to map scores to [0,1] range.
- Provides uniformly distributed confidence scores.
- May lose some granularity of original uncertainty estimates.
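
The empirical-CDF mapping can be sketched as follows; the helper function name and example values are invented for illustration:

```python
# Sketch of quantile normalization via the empirical CDF of calibration uncertainties.
import numpy as np

calibration_ue = np.sort(np.array([2.3, 0.7, 5.1, 1.2, 3.3]))

def quantile_confidence(ue):
    # Confidence = fraction of calibration samples that are MORE uncertain.
    ranks = np.searchsorted(calibration_ue, ue, side="right")
    return 1.0 - ranks / len(calibration_ue)

print(quantile_confidence(np.array([0.5, 2.5, 6.0])))  # -> [1.0, 0.4, 0.0]
```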

## Performance-Calibrated Confidence (PCC) Methods

### Binned PCC (BinnedPCCNormalizer in `binned_pcc.py`)

- Splits calibration data into bins based on uncertainty values.
- Each bin contains approximately the same number of samples.
- Confidence score is the mean output quality of samples in the corresponding bin.
- Provides an interpretable connection between confidence and expected quality.
- Drawback: Can change ordering of samples compared to raw uncertainty.
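
A simplified sketch of the binned approach, assuming NumPy arrays of calibration uncertainties and matching quality scores (the helper names `fit_binned_pcc`/`transform_binned_pcc` are invented for this example):

```python
# Simplified sketch of binned PCC. Assumes enough calibration samples that
# every bin is non-empty.
import numpy as np

def fit_binned_pcc(cal_ue, cal_quality, num_bins=10):
    # Bin edges at uncertainty quantiles -> roughly equal samples per bin.
    edges = np.quantile(cal_ue, np.linspace(0, 1, num_bins + 1))
    bin_ids = np.clip(np.digitize(cal_ue, edges[1:-1]), 0, num_bins - 1)
    # Confidence of a bin = mean output quality of its calibration samples.
    bin_quality = np.array([cal_quality[bin_ids == b].mean() for b in range(num_bins)])
    return edges, bin_quality

def transform_binned_pcc(ue, edges, bin_quality):
    bin_ids = np.clip(np.digitize(ue, edges[1:-1]), 0, len(bin_quality) - 1)
    return bin_quality[bin_ids]
```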

### Isotonic PCC (IsotonicPCCNormalizer in `isotonic_pcc.py`)

- Uses Centered Isotonic Regression (CIR) to fit a monotonic relationship.
- Maps uncertainty scores to output quality while preserving ordering.
- Enforces monotonicity constraint to maintain uncertainty ranking.
- More robust than the binned approach while maintaining interpretability.
- Implementation based on CIR algorithm from Oron & Flournoy (2017).
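
As a rough stand-in, scikit-learn's standard `IsotonicRegression` illustrates the idea (the library itself implements CIR, which additionally centers the fitted steps):

```python
# Stand-in sketch using scikit-learn's standard IsotonicRegression
# (the library itself uses centered isotonic regression, CIR).
import numpy as np
from sklearn.isotonic import IsotonicRegression

cal_ue = np.array([0.2, 0.8, 1.5, 2.7, 3.9, 5.0])             # calibration uncertainties
cal_quality = np.array([0.95, 0.90, 0.70, 0.55, 0.30, 0.20])  # matching quality scores

iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=False, out_of_bounds="clip")
iso.fit(cal_ue, cal_quality)

# Confidences decrease monotonically with uncertainty and stay in [0, 1].
print(iso.predict(np.array([0.5, 3.0, 10.0])))
```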

## Common Interface: `BaseUENormalizer`

All normalizers follow a common interface defined in `BaseUENormalizer`:

- `fit()`: Learns normalization parameters from calibration data.
- `transform()`: Applies normalization to new uncertainty scores.
- `dumps()/loads()`: Serialization support for fitted normalizers.
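
A rough sketch of what such an interface can look like; the method names come from the list above, while the signatures and serialization details are assumptions:

```python
# Rough sketch of the normalizer interface (method names from the docs above;
# signatures and pickle-based serialization are assumptions).
from abc import ABC, abstractmethod
import pickle
import numpy as np

class BaseUENormalizer(ABC):
    @abstractmethod
    def fit(self, cal_ue: np.ndarray, cal_quality: np.ndarray) -> None:
        """Learn normalization parameters from calibration data."""

    @abstractmethod
    def transform(self, ue: np.ndarray) -> np.ndarray:
        """Map raw uncertainty scores to confidences in [0, 1]."""

    def dumps(self) -> bytes:
        """Serialize the fitted normalizer state."""
        return pickle.dumps(self.__dict__)

    def loads(self, data: bytes) -> None:
        """Restore a previously fitted normalizer state."""
        self.__dict__.update(pickle.loads(data))
```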

## Key Benefits of PCC Methods

- Direct connection to output quality metrics.
- Bounded interpretable range [0,1].
- Maintained correlation with generation quality.
- Easy to explain meaning to end users.

## Highlight: Isotonic PCC

The Isotonic PCC approach provides the best balance between:

- Maintaining the original uncertainty ranking.
- Providing interpretable confidence scores.
- Establishing a clear connection to expected output quality.

When using normalized scores, users can interpret them as estimates of relative output quality, making them more useful for downstream applications and human understanding.
docs/normalization/1_core_normalization_configuration.md (141 additions, 0 deletions)
# Core Normalization Configuration

## Overview
Core normalization configuration in LM-Polygraph defines how uncertainty scores are transformed into interpretable confidence values. These configurations control the fundamental behavior of all normalization methods across the system.

## Base Configuration Location
Core normalization configurations are located in:
```
/examples/configs/normalization/fit/default.yaml
```

## Available Normalization Methods

### 1. MinMax Normalization
Linearly scales uncertainty scores to [0,1] range.

```yaml
normalization:
type: "minmax"
clip: true # Whether to clip values outside [0,1] range
```

### 2. Quantile Normalization
Transforms scores into percentile ranks using empirical CDF.

```yaml
normalization:
type: "quantile"
```

### 3. Binned Performance-Calibrated Confidence (Binned PCC)
Maps uncertainty scores to confidence bins based on output quality.

```yaml
normalization:
type: "binned_pcc"
params:
num_bins: 10 # Number of bins for mapping
```

### 4. Isotonic Performance-Calibrated Confidence (Isotonic PCC)
Uses monotonic regression to map uncertainty to confidence while preserving ordering.

```yaml
normalization:
type: "isotonic_pcc"
params:
y_min: 0.0 # Minimum confidence value
y_max: 1.0 # Maximum confidence value
increasing: false # Whether mapping should be increasing
out_of_bounds: "clip" # How to handle out-of-range values
```

## Common Parameters

### Calibration Strategy
```yaml
normalization:
  calibration:
    strategy: "dataset_specific"  # or "global"
    background_dataset: null      # Optional background dataset for global calibration
```

### Data Processing
```yaml
normalization:
  processing:
    ignore_nans: true        # Whether to ignore NaN values in calibration
    normalize_metrics: true  # Whether to normalize quality metrics
```

### Caching
```yaml
normalization:
  cache:
    enabled: true
    path: "${cache_path}/normalization"
    version: "v1"
```

## Usage Examples

### Basic MinMax Normalization
```yaml
normalization:
type: "minmax"
clip: true
calibration:
strategy: "dataset_specific"
```

### Global Isotonic PCC
```yaml
normalization:
type: "isotonic_pcc"
params:
y_min: 0.0
y_max: 1.0
increasing: false
calibration:
strategy: "global"
background_dataset: "allenai/c4"
```

### Binned PCC with Custom Settings
```yaml
normalization:
type: "binned_pcc"
params:
num_bins: 20
processing:
ignore_nans: false
normalize_metrics: true
cache:
enabled: true
```

## Best Practices

1. **Method Selection**
- Use MinMax/Quantile for simple scaling needs
- Use PCC methods when interpretability is crucial
- Prefer Isotonic PCC when preserving score ordering is important

2. **Calibration Strategy**
- Use dataset-specific calibration when possible
- Use global calibration when consistency across tasks is needed
- Consider using background dataset for robust global calibration

3. **Performance Considerations**
- Enable caching for large datasets
- Adjust bin count based on dataset size
- Monitor memory usage with large calibration sets

## Integration with Other Configs
Core normalization settings can be overridden by:
- Task-specific configs
- Model-specific configs
- Instruction-tuned model configs

Core settings serve as defaults when not specified in other configuration layers.
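
A simplified illustration of this layering, assuming OmegaConf-style dictionaries (the actual composition is handled by Hydra config groups):

```python
# Simplified sketch of config layering with OmegaConf (an assumption about
# the mechanics; in practice Hydra composes the configs). Later layers
# override the core defaults, which fill in anything left unspecified.
from omegaconf import OmegaConf

core = OmegaConf.create({"normalization": {"type": "minmax", "clip": True}})
task_override = OmegaConf.create({"normalization": {"type": "isotonic_pcc"}})

merged = OmegaConf.merge(core, task_override)
print(merged.normalization.type)  # isotonic_pcc  (from the task-specific layer)
print(merged.normalization.clip)  # True          (inherited from the core default)
```
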
docs/normalization/2_dataset_specific_configuration.md (156 additions, 0 deletions)
# Dataset-Specific Normalization Configurations in LM-Polygraph

## Overview
Dataset-specific normalization configurations in LM-Polygraph allow fine-tuning how uncertainty scores are normalized for different tasks and data types. These configurations can be found in the evaluation config files under `/examples/configs/` and its subfolders.

## Configuration Structure

### 1. Common Parameters

Every dataset-specific configuration includes these core normalization parameters:

```yaml
# Dataset sampling configuration
subsample_background_train_dataset: 1000 # Size of background dataset for normalization
subsample_train_dataset: 1000 # Size of task-specific calibration dataset
subsample_eval_dataset: -1 # Size of evaluation dataset (-1 = full)

# Training data settings
train_dataset: null # Optional separate training dataset
train_test_split: false # Whether to split data for calibration
test_split_size: 1 # Test split ratio if splitting enabled

# Background dataset configuration
background_train_dataset: allenai/c4 # Default background dataset
background_train_dataset_text_column: text # Text column name
background_train_dataset_label_column: url # Label column name
background_load_from_disk: false # Loading mode
```

### 2. Task-Specific Configurations

#### Question-Answering Tasks (TriviaQA, MMLU, CoQA)
```yaml
# Additional QA-specific settings
process_output_fn:
  path: output_processing_scripts/qa_normalize.py
  fn_name: normalize_qa
normalize: true
normalize_metrics: true
target_ignore_regex: null
```
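
The `path` and `fn_name` above point at a task-specific processing script; the body below is only a plausible guess at what such a QA answer normalizer might do, not the actual file contents:

```python
# Hypothetical sketch of output_processing_scripts/qa_normalize.py.
# The body is an assumption: lowercase, drop articles and punctuation, and
# collapse whitespace, as is common for QA answer matching.
import re
import string

def normalize_qa(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r"\b(a|an|the)\b", " ", text)               # drop English articles
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())                              # collapse whitespace
```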

#### Translation Tasks (WMT)
```yaml
# Translation-specific normalization
source_ignore_regex: "^.*?: " # Regex to clean source text
target_ignore_regex: null # Regex to clean target text
normalize_translations: true
```
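
As an illustration of how an ignore regex like the one above could strip a prompt prefix before scoring (the example strings are invented):

```python
# Illustration only: applying a source_ignore_regex such as "^.*?: " to strip
# a prompt prefix from the source text before computing quality metrics.
import re

source_ignore_regex = r"^.*?: "
raw_source = "Translate German to English: Guten Morgen!"
cleaned = re.sub(source_ignore_regex, "", raw_source, count=1)
print(cleaned)  # -> "Guten Morgen!"
```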

#### Summarization Tasks (XSum, AESLC)
```yaml
# Summarization normalization
normalize_summaries: true
output_ignore_regex: null
processing:
  trim_outputs: true
  lowercase: true
```

### 3. Language-Specific Settings

For multilingual tasks (especially in claim-level fact-checking):

```yaml
# Language-specific normalization
language: "en" # Options: en, zh, ar, ru
multilingual_normalization:
  enabled: true
  use_language_specific_bins: true
  combine_language_statistics: false
```

## Usage Examples

### 1. Basic QA Task Configuration
```yaml
hydra:
  run:
    dir: ${cache_path}/${task}/${model.path}/${dataset}/${now:%Y-%m-%d}

defaults:
  - model: default
  - _self_

dataset: triviaqa
subsample_train_dataset: 1000
normalize: true
process_output_fn:
  path: output_processing_scripts/triviaqa.py
  fn_name: normalize_qa
```

### 2. Translation Task Setup
```yaml
dataset: wmt14_deen
subsample_train_dataset: 2000
source_ignore_regex: "^Translation: "
normalize_translations: true
background_train_dataset: null
```

### 3. Multilingual Configuration
```yaml
dataset: person_bio
language: zh
multilingual_normalization:
  enabled: true
  use_language_specific_bins: true
subsample_train_dataset: 1000
background_train_dataset: allenai/c4
```

## Key Considerations

### 1. Dataset Size and Sampling
- Use `subsample_train_dataset` to control calibration dataset size
- Larger values provide better calibration but increase compute time
- The default value of 1000 works well for most tasks

### 2. Background Dataset Usage
- The background dataset provides additional calibration data
- Useful for tasks with limited in-domain data
- The C4 dataset is the default choice for English tasks

### 3. Processing and Cleaning
- Task-specific normalization functions handle special cases
- Regular expressions clean input/output texts
- Language-specific processing for multilingual tasks

### 4. Performance Impact
- Larger sample sizes improve normalization quality but increase computational cost
- Using a background dataset adds overhead
- Consider caching normalized values for repeated evaluations

## Best Practices

1. **Dataset Size Selection**
- Use at least 1000 samples for calibration
- Increase for complex tasks or when accuracy is critical
- Consider computational resources available

2. **Background Dataset Usage**
- Use for tasks with limited training data
- Ensure background data distribution matches task
- Consider language and domain compatibility

3. **Processing Configuration**
- Configure task-specific normalization functions
- Use appropriate regex patterns for cleaning
- Enable language-specific processing for multilingual tasks

4. **Optimization Tips**
- Cache normalized values when possible
- Use smaller sample sizes during development
- Enable background dataset loading from disk for large datasets