diff --git a/docs/normalization/0_basic_info.md b/docs/normalization/0_basic_info.md
new file mode 100644
index 00000000..e99e55b7
--- /dev/null
+++ b/docs/normalization/0_basic_info.md
@@ -0,0 +1,60 @@
+# LM-Polygraph Normalization Methods
+
+LM-Polygraph implements several uncertainty normalization methods to convert raw uncertainty scores into more interpretable confidence values bounded between 0 and 1. Here are the key normalization approaches:
+
+## Simple Scaling Methods
+
+### MinMax Normalization (MinMaxNormalizer in `minmax.py`)
+
+- Takes raw uncertainty scores and linearly scales them to the [0,1] range.
+- Flips the sign, since uncertainty scores are negatively correlated with confidence.
+- Uses scikit-learn's `MinMaxScaler` internally.
+- Simple, but does not maintain a direct connection to output quality.
+
+### Quantile Normalization (QuantileNormalizer in `quantile.py`)
+
+- Transforms uncertainty scores into their corresponding percentile ranks.
+- Uses the empirical CDF to map scores to the [0,1] range.
+- Provides uniformly distributed confidence scores.
+- May lose some granularity of the original uncertainty estimates.
+
+## Performance-Calibrated Confidence (PCC) Methods
+
+### Binned PCC (BinnedPCCNormalizer in `binned_pcc.py`)
+
+- Splits calibration data into bins based on uncertainty values.
+- Each bin contains approximately the same number of samples.
+- The confidence score is the mean output quality of samples in the corresponding bin.
+- Provides an interpretable connection between confidence and expected quality.
+- Drawback: can change the ordering of samples relative to the raw uncertainty scores.
+
+### Isotonic PCC (IsotonicPCCNormalizer in `isotonic_pcc.py`)
+
+- Uses Centered Isotonic Regression (CIR) to fit a monotonic relationship.
+- Maps uncertainty scores to output quality while preserving ordering.
+- Enforces a monotonicity constraint to maintain the uncertainty ranking.
+- More robust than the binned approach while maintaining interpretability.
+- Implementation based on the CIR algorithm of Oron & Flournoy (2017).
+
+## Common Interface: `BaseUENormalizer`
+
+All normalizers follow a common interface defined in `BaseUENormalizer`:
+
+- `fit()`: Learns normalization parameters from calibration data.
+- `transform()`: Applies normalization to new uncertainty scores.
+- `dumps()/loads()`: Serialization support for fitted normalizers.
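+
+Below is a minimal, self-contained sketch of this contract. The class name and internals are illustrative only: they mirror the documented behavior of `MinMaxNormalizer`, not the library's actual code.
+
+```python
+# Illustrative sketch of the BaseUENormalizer interface, not the real implementation.
+import pickle
+
+import numpy as np
+from sklearn.preprocessing import MinMaxScaler
+
+
+class SketchMinMaxNormalizer:
+    def fit(self, uncertainties):
+        # Flip the sign first: high uncertainty should map to low confidence.
+        scores = -np.asarray(uncertainties, dtype=float).reshape(-1, 1)
+        self.scaler = MinMaxScaler().fit(scores)
+        return self
+
+    def transform(self, uncertainties):
+        scores = -np.asarray(uncertainties, dtype=float).reshape(-1, 1)
+        # Clip, since new scores may fall outside the calibration range.
+        return np.clip(self.scaler.transform(scores), 0.0, 1.0).ravel()
+
+    def dumps(self):
+        return pickle.dumps(self.scaler)
+
+    @classmethod
+    def loads(cls, data):
+        normalizer = cls()
+        normalizer.scaler = pickle.loads(data)
+        return normalizer
+
+
+normalizer = SketchMinMaxNormalizer().fit([-3.2, -1.7, 0.4, 2.1, 5.8])
+print(normalizer.transform([-2.0, 0.3, 4.9]))  # confidences in [0, 1]
+```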
+
+## Key Benefits of PCC Methods
+
+- Direct connection to output quality metrics.
+- Bounded, interpretable range [0,1].
+- Maintains correlation with generation quality.
+- Easy to explain to end users.
+
+## Highlight: Isotonic PCC
+
+The Isotonic PCC approach provides the best balance between:
+
+- Maintaining the original uncertainty ranking.
+- Providing interpretable confidence scores.
+- Establishing a clear connection to expected output quality.
+
+When using normalized scores, users can interpret them as estimates of relative output quality, making them more useful for downstream applications and human understanding.
\ No newline at end of file
diff --git a/docs/normalization/1_core_normalization_configuration.md b/docs/normalization/1_core_normalization_configuration.md
new file mode 100644
index 00000000..5632391a
--- /dev/null
+++ b/docs/normalization/1_core_normalization_configuration.md
@@ -0,0 +1,141 @@
+# Core Normalization Configuration
+
+## Overview
+Core normalization configuration in LM-Polygraph defines how uncertainty scores are transformed into interpretable confidence values. These configurations control the fundamental behavior of all normalization methods across the system.
+
+## Base Configuration Location
+Core normalization configurations are located in:
+```
+/examples/configs/normalization/fit/default.yaml
+```
+
+## Available Normalization Methods
+
+### 1. MinMax Normalization
+Linearly scales uncertainty scores to the [0,1] range.
+
+```yaml
+normalization:
+  type: "minmax"
+  clip: true  # Whether to clip values outside [0,1] range
+```
+
+### 2. Quantile Normalization
+Transforms scores into percentile ranks using the empirical CDF.
+
+```yaml
+normalization:
+  type: "quantile"
+```
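+
+For intuition, quantile normalization amounts to an empirical-CDF lookup. The function below is a sketch of the idea, not the library's code; the sign flip assumes higher raw scores mean higher uncertainty:
+
+```python
+# Empirical-CDF sketch of quantile normalization.
+import numpy as np
+
+
+def fit_quantile_normalizer(calibration_scores):
+    sorted_scores = np.sort(np.asarray(calibration_scores, dtype=float))
+
+    def transform(uncertainties):
+        # Percentile rank among the calibration scores, flipped so that
+        # high uncertainty yields low confidence.
+        ranks = np.searchsorted(sorted_scores, uncertainties, side="right")
+        return 1.0 - ranks / len(sorted_scores)
+
+    return transform
+
+
+transform = fit_quantile_normalizer([-3.2, -1.7, 0.4, 2.1, 5.8])
+print(transform([-2.0, 0.3, 4.9]))  # approximately uniform confidences in [0, 1]
+```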
+
+### 3. Binned Performance-Calibrated Confidence (Binned PCC)
+Maps uncertainty scores to confidence bins based on output quality.
+
+```yaml
+normalization:
+  type: "binned_pcc"
+  params:
+    num_bins: 10  # Number of bins for mapping
+```
+
+### 4. Isotonic Performance-Calibrated Confidence (Isotonic PCC)
+Uses monotonic regression to map uncertainty to confidence while preserving ordering.
+
+```yaml
+normalization:
+  type: "isotonic_pcc"
+  params:
+    y_min: 0.0  # Minimum confidence value
+    y_max: 1.0  # Maximum confidence value
+    increasing: false  # Decreasing fit: higher uncertainty, lower confidence
+    out_of_bounds: "clip"  # How to handle out-of-range values
+```
+
+## Common Parameters
+
+### Calibration Strategy
+```yaml
+normalization:
+  calibration:
+    strategy: "dataset_specific"  # or "global"
+    background_dataset: null  # Optional background dataset for global calibration
+```
+
+### Data Processing
+```yaml
+normalization:
+  processing:
+    ignore_nans: true  # Whether to ignore NaN values in calibration
+    normalize_metrics: true  # Whether to normalize quality metrics
+```
+
+### Caching
+```yaml
+normalization:
+  cache:
+    enabled: true
+    path: "${cache_path}/normalization"
+    version: "v1"
+```
+
+## Usage Examples
+
+### Basic MinMax Normalization
+```yaml
+normalization:
+  type: "minmax"
+  clip: true
+  calibration:
+    strategy: "dataset_specific"
+```
+
+### Global Isotonic PCC
+```yaml
+normalization:
+  type: "isotonic_pcc"
+  params:
+    y_min: 0.0
+    y_max: 1.0
+    increasing: false
+  calibration:
+    strategy: "global"
+    background_dataset: "allenai/c4"
+```
+
+### Binned PCC with Custom Settings
+```yaml
+normalization:
+  type: "binned_pcc"
+  params:
+    num_bins: 20
+  processing:
+    ignore_nans: false
+    normalize_metrics: true
+  cache:
+    enabled: true
+```
+
+## Best Practices
+
+1. **Method Selection**
+   - Use MinMax/Quantile for simple scaling needs
+   - Use PCC methods when interpretability is crucial
+   - Prefer Isotonic PCC when preserving score ordering is important
+
+2. **Calibration Strategy**
+   - Use dataset-specific calibration when possible
+   - Use global calibration when consistency across tasks is needed
+   - Consider using a background dataset for robust global calibration
+
+3. **Performance Considerations**
+   - Enable caching for large datasets
+   - Adjust bin count based on dataset size
+   - Monitor memory usage with large calibration sets
+
+## Integration with Other Configs
+Core normalization settings can be overridden by:
+- Task-specific configs
+- Model-specific configs
+- Instruction-tuned model configs
+
+Core settings serve as defaults when not specified in other configuration layers.
\ No newline at end of file
diff --git a/docs/normalization/2_dataset_specific_configuration.md b/docs/normalization/2_dataset_specific_configuration.md
new file mode 100644
index 00000000..566eedba
--- /dev/null
+++ b/docs/normalization/2_dataset_specific_configuration.md
@@ -0,0 +1,156 @@
+# Dataset-Specific Normalization Configurations in LM-Polygraph
+
+## Overview
+Dataset-specific normalization configurations in LM-Polygraph allow fine-tuning how uncertainty scores are normalized for different tasks and data types. These configurations can be found in the evaluation config files under `/examples/configs/` and its subfolders.
+
+## Configuration Structure
+
+### 1. Common Parameters
+
+Every dataset-specific configuration includes these core normalization parameters:
+
+```yaml
+# Dataset sampling configuration
+subsample_background_train_dataset: 1000  # Size of background dataset for normalization
+subsample_train_dataset: 1000  # Size of task-specific calibration dataset
+subsample_eval_dataset: -1  # Size of evaluation dataset (-1 = full)
+
+# Training data settings
+train_dataset: null  # Optional separate training dataset
+train_test_split: false  # Whether to split data for calibration
+test_split_size: 1  # Test split ratio if splitting enabled
+
+# Background dataset configuration
+background_train_dataset: allenai/c4  # Default background dataset
+background_train_dataset_text_column: text  # Text column name
+background_train_dataset_label_column: url  # Label column name
+background_load_from_disk: false  # Loading mode
+```
+
+### 2. Task-Specific Configurations
+
+#### Question-Answering Tasks (TriviaQA, MMLU, CoQA)
+```yaml
+# Additional QA-specific settings
+process_output_fn:
+  path: output_processing_scripts/qa_normalize.py
+  fn_name: normalize_qa
+normalize: true
+normalize_metrics: true
+target_ignore_regex: null
+```
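+
+A function like `normalize_qa` typically performs SQuAD-style answer normalization before metrics are computed. The body below is an illustrative guess at such a script, not the repository's actual code:
+
+```python
+# Sketch of a QA output-processing function (SQuAD-style normalization).
+import re
+import string
+
+
+def normalize_qa(text: str) -> str:
+    """Normalize a generated answer before exact-match comparison."""
+    text = text.lower()
+    text = "".join(ch for ch in text if ch not in string.punctuation)
+    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
+    return " ".join(text.split())                # collapse whitespace
+
+
+print(normalize_qa("The Eiffel Tower!"))  # -> "eiffel tower"
+```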
+
+#### Translation Tasks (WMT)
+```yaml
+# Translation-specific normalization
+source_ignore_regex: "^.*?: "  # Regex to clean source text
+target_ignore_regex: null  # Regex to clean target text
+normalize_translations: true
+```
+
+#### Summarization Tasks (XSum, AESLC)
+```yaml
+# Summarization normalization
+normalize_summaries: true
+output_ignore_regex: null
+processing:
+  trim_outputs: true
+  lowercase: true
+```
+
+### 3. Language-Specific Settings
+
+For multilingual tasks (especially in claim-level fact-checking):
+
+```yaml
+# Language-specific normalization
+language: "en"  # Options: en, zh, ar, ru
+multilingual_normalization:
+  enabled: true
+  use_language_specific_bins: true
+  combine_language_statistics: false
+```
+
+## Usage Examples
+
+### 1. Basic QA Task Configuration
+```yaml
+hydra:
+  run:
+    dir: ${cache_path}/${task}/${model.path}/${dataset}/${now:%Y-%m-%d}
+
+defaults:
+  - model: default
+  - _self_
+
+dataset: triviaqa
+subsample_train_dataset: 1000
+normalize: true
+process_output_fn:
+  path: output_processing_scripts/triviaqa.py
+  fn_name: normalize_qa
+```
+
+### 2. Translation Task Setup
+```yaml
+dataset: wmt14_deen
+subsample_train_dataset: 2000
+source_ignore_regex: "^Translation: "
+normalize_translations: true
+background_train_dataset: null
+```
+
+### 3. Multilingual Configuration
+```yaml
+dataset: person_bio
+language: zh
+multilingual_normalization:
+  enabled: true
+  use_language_specific_bins: true
+subsample_train_dataset: 1000
+background_train_dataset: allenai/c4
+```
+
+## Key Considerations
+
+### 1. Dataset Size and Sampling
+- Use `subsample_train_dataset` to control the calibration dataset size
+- Larger values provide better calibration but increase compute time
+- The default value of 1000 works well for most tasks
+
+### 2. Background Dataset Usage
+- The background dataset provides additional calibration data
+- Useful for tasks with limited in-domain data
+- The C4 dataset is the default choice for English tasks
+
+### 3. Processing and Cleaning
+- Task-specific normalization functions handle special cases
+- Regular expressions clean input/output texts
+- Language-specific processing for multilingual tasks
+
+### 4. Performance Impact
+- Larger sample sizes improve normalization quality but increase computational cost
+- Background dataset usage adds overhead
+- Consider caching normalized values for repeated evaluations
+
+## Best Practices
+
+1. **Dataset Size Selection**
+   - Use at least 1000 samples for calibration
+   - Increase for complex tasks or when accuracy is critical
+   - Consider the computational resources available
+
+2. **Background Dataset Usage**
+   - Use for tasks with limited training data
+   - Ensure the background data distribution matches the task
+   - Consider language and domain compatibility
+
+3. **Processing Configuration**
+   - Configure task-specific normalization functions
+   - Use appropriate regex patterns for cleaning
+   - Enable language-specific processing for multilingual tasks
+
+4. **Optimization Tips**
+   - Cache normalized values when possible
+   - Use smaller sample sizes during development
+   - Enable background dataset loading from disk for large datasets
\ No newline at end of file
diff --git a/docs/normalization/3_instruction_tuned_model.md b/docs/normalization/3_instruction_tuned_model.md
new file mode 100644
index 00000000..06327a59
--- /dev/null
+++ b/docs/normalization/3_instruction_tuned_model.md
@@ -0,0 +1,223 @@
+# Instruction-Tuned Model Normalization Configurations in LM-Polygraph
+
+## Overview
+Instruction-tuned model configurations in LM-Polygraph provide specialized normalization settings for models that have been fine-tuned on instruction data. These configurations are located in `/examples/configs/instruct/` and include specific processing scripts and parameters for handling instruction-formatted inputs and outputs.
+
+## Configuration Structure
+
+### 1. Base Processing Configuration
+Located in `/examples/configs/instruct/`, base processing configs define foundational normalization settings:
+
+```yaml
+# Base processing for instruction-tuned models
+process_output_fn:
+  path: instruct/output_processing_scripts/default.py
+  fn_name: normalize_em
+process_target_fn:
+  path: instruct/output_processing_scripts/default.py
+  fn_name: normalize_em
+```
+
+### 2. Task-Specific Processing
+
+#### CoQA Processing
+```yaml
+# CoQA-specific instruction normalization
+process_output_fn:
+  path: instruct/output_processing_scripts/coqa.py
+  fn_name: normalize_em_coqa
+process_target_fn:
+  path: instruct/output_processing_scripts/coqa.py
+  fn_name: normalize_em_coqa
+```
+
+#### TriviaQA Processing
+```yaml
+# TriviaQA-specific instruction normalization
+process_output_fn:
+  path: instruct/output_processing_scripts/triviaqa.py
+  fn_name: normalize_em_triviaqa
+process_target_fn:
+  path: instruct/output_processing_scripts/triviaqa.py
+  fn_name: normalize_em_triviaqa
+```
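+
+Each `path`/`fn_name` pair points at a Python script and a function inside it. Resolving such a reference looks roughly like the sketch below; the mechanism is standard `importlib`, and LM-Polygraph's actual loader may differ in detail:
+
+```python
+# Sketch of loading a processing function referenced by path and fn_name.
+import importlib.util
+
+
+def load_processing_fn(path: str, fn_name: str):
+    spec = importlib.util.spec_from_file_location("processing_module", path)
+    module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(module)  # execute the script once
+    return getattr(module, fn_name)
+
+
+# Usage mirroring the TriviaQA config above (paths are illustrative):
+# normalize = load_processing_fn(
+#     "instruct/output_processing_scripts/triviaqa.py", "normalize_em_triviaqa")
+# print(normalize("Answer: Paris."))
+```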
+
+### 3. Processing Types
+
+#### Chain-of-Thought (CoT) Processing
+```yaml
+# CoT processing settings
+cot_processing:
+  enabled: true
+  extract_final_answer: true
+  normalize_reasoning: false
+  ignore_intermediate_steps: true
+```
+
+#### Top-K Processing
+```yaml
+# Top-K response processing
+topk_processing:
+  enabled: true
+  k: 4  # Number of alternatives to consider
+  aggregate_method: "max"  # How to combine multiple predictions
+```
+
+#### Top-1 Processing
+```yaml
+# Top-1 response processing
+top1_processing:
+  enabled: true
+  normalize_confidence: true
+  extract_probability: true
+```
+
+## Model-Specific Configurations
+
+### 1. Model Type Settings
+```yaml
+defaults:
+  - model: default_causal
+  - _self_
+
+model:
+  type: "CausalLM"
+  path_to_load_script: model/default_causal.py
+  generation_params:
+    do_sample: false
+    num_beams: 1
+    temperature: 1.0
+```
+
+### 2. Specialized Model Examples
+
+#### Mistral Configuration
+```yaml
+model:
+  path: mistral-7b-instruct-v0.2
+  type: "CausalLM"
+  load_model_args:
+    device_map: auto
+    trust_remote_code: true
+  load_tokenizer_args:
+    trust_remote_code: true
+```
+
+#### StableLM Configuration
+```yaml
+model:
+  path: stablelm-2-12b-chat
+  type: "CausalLM"
+  load_model_args:
+    device_map: auto
+    use_flash_attention: true
+```
+
+## Integration Features
+
+### 1. Processing Pipeline Integration
+- Custom normalization functions for instruction-formatted outputs
+- Task-specific answer extraction
+- Confidence score normalization
+
+### 2. Model Output Processing
+- Handling of structured instruction outputs
+- Extraction of final answers from reasoning chains
+- Normalization of multiple response formats
+
+### 3. Configuration Inheritance
+- Base processing settings inheritance
+- Task-specific overrides
+- Model-specific adaptations
+
+## Best Practices
+
+### 1. Processing Function Selection
+- Use task-specific normalizers when available
+- Fall back to default processors for general cases
+- Consider the instruction format when selecting processors
+
+### 2. Confidence Handling
+- Enable confidence normalization for compatible models
+- Configure appropriate aggregation methods for multiple outputs
+- Consider model-specific confidence scales
+
+### 3. Chain-of-Thought Processing
+- Enable for models trained with CoT
+- Configure appropriate answer extraction
+- Consider preserving the reasoning steps
+
+### 4. Performance Optimization
+- Enable caching for processed outputs
+- Configure batch processing when possible
+- Balance processing complexity with performance needs
+
+## Example Configurations
+
+### 1. Basic Instruction Model Setup
+```yaml
+defaults:
+  - model: default_causal
+  - _self_
+
+process_output_fn:
+  path: instruct/output_processing_scripts/default.py
+  fn_name: normalize_em
+
+top1_processing:
+  enabled: true
+  normalize_confidence: true
+```
+
+### 2. CoT Model Configuration
+```yaml
+defaults:
+  - model: mistral-instruct
+  - _self_
+
+cot_processing:
+  enabled: true
+  extract_final_answer: true
+
+process_output_fn:
+  path: instruct/output_processing_scripts/cot.py
+  fn_name: normalize_cot
+```
+
+### 3. Multi-Task Model Setup
+```yaml
+defaults:
+  - model: stablelm-chat
+  - _self_
+
+process_output_fn:
+  path: instruct/output_processing_scripts/multi_task.py
+  fn_name: normalize_mt
+
+topk_processing:
+  enabled: true
+  k: 4
+  aggregate_method: "max"
+```
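+
+To make `aggregate_method` concrete, the sketch below shows how k alternative confidence scores might be combined. The function and behavior are illustrative assumptions, not the library's aggregation code:
+
+```python
+# Sketch of aggregating top-k per-alternative confidence scores.
+import statistics
+
+
+def aggregate_confidences(confidences, method="max"):
+    if method == "max":
+        return max(confidences)
+    if method == "mean":
+        return statistics.fmean(confidences)
+    raise ValueError(f"Unknown aggregate_method: {method}")
+
+
+print(aggregate_confidences([0.81, 0.64, 0.52, 0.40], method="max"))  # 0.81
+```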
+
+## Common Issues and Solutions
+
+### 1. Output Format Mismatches
+- Problem: Model outputs don't match the expected instruction format
+- Solution: Configure custom processing functions
+- Example: Use task-specific normalizers
+
+### 2. Confidence Scale Differences
+- Problem: Different models use different confidence scales
+- Solution: Enable confidence normalization
+- Example: Configure model-specific scaling
+
+### 3. Processing Pipeline Conflicts
+- Problem: Multiple processing steps interfere with each other
+- Solution: Configure the processing order
+- Example: Set priorities for different normalizers
+
+### 4. Performance Bottlenecks
+- Problem: Slow processing of instruction outputs
+- Solution: Enable caching and batch processing
+- Example: Configure appropriate batch sizes
\ No newline at end of file
diff --git a/docs/normalization/4_impact_areas_and_default_behaviors.md b/docs/normalization/4_impact_areas_and_default_behaviors.md
new file mode 100644
index 00000000..0564855a
--- /dev/null
+++ b/docs/normalization/4_impact_areas_and_default_behaviors.md
@@ -0,0 +1,164 @@
+# LM-Polygraph Normalization: Impact Areas and Default Behaviors
+
+## Normalization Impact Areas
+
+### 1. Score Transformation
+- **Raw Uncertainty Scores**
+  - Original uncertainty estimates in unbounded ranges
+  - Higher values indicate more uncertainty
+  - Various scales depending on the estimation method
+
+- **Normalized Confidence Values**
+  - Bounded in the [0,1] range
+  - Higher values indicate more confidence
+  - Directly interpretable probabilities
+  - Preserves the relative ordering of original scores (for Isotonic PCC)
+
+### 2. Evaluation Pipeline
+- **Calibration Stage**
+  - Uses a calibration dataset to learn normalization parameters
+  - Requires generation quality metrics for Performance-Calibrated Confidence
+  - Can use either task-specific or background data
+  - Parameters are saved for reuse
+
+- **Inference Stage**
+  - Applies the learned normalization to new uncertainty estimates
+  - No additional model inference required
+  - Fast transformation using stored parameters
+  - Can be applied to any compatible uncertainty estimator
+
+### 3. Quality Metrics Integration
+- **Metric Normalization**
+  - Quality metrics are normalized to the [0,1] range
+  - Enables consistent calibration across different metrics
+  - Handles various metric types (ROUGE, BLEU, accuracy, etc.)
+  - Supports both bounded and unbounded metrics
+
+- **Metric Selection**
+  - Different metrics for different tasks
+  - Task-specific normalization of quality scores
+  - Multiple metrics can be used simultaneously
+  - Quality metrics guide confidence calibration
+
+### 4. Model Type Support
+- **White-box Models**
+  - Access to internal probabilities and logits
+  - Can normalize token-level uncertainties
+  - Supports both sequence- and token-level calibration
+  - Works with HuggingFace models
+
+- **Black-box Models**
+  - Limited to output-based uncertainty estimation
+  - Only sequence-level normalization
+  - Compatible with API-based models (OpenAI, etc.)
+  - No access to internal model states
+
+## Default Behaviors
+
+### 1. Score Processing
+
+```python
+# Default score processing behavior
+normalize_scores = {
+    'clip_values': True,        # Clip to [0,1] range
+    'flip_uncertainty': True,   # Convert uncertainty to confidence
+    'preserve_order': True,     # Maintain sample ordering
+    'handle_nans': 'ignore'     # Skip NaN values in calibration
+}
+```
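+
+Applied to an array of raw scores, these defaults correspond to roughly the following transformation. This is a sketch that assumes simple min-max scaling behind `clip_values` and `flip_uncertainty`:
+
+```python
+# Sketch of the default score-processing pipeline.
+import numpy as np
+
+
+def process_scores(raw_uncertainty):
+    scores = np.asarray(raw_uncertainty, dtype=float)
+    scores = scores[~np.isnan(scores)]      # handle_nans: 'ignore'
+    scores = -scores                        # flip_uncertainty: True
+    low, high = scores.min(), scores.max()
+    scores = (scores - low) / (high - low)  # monotonic, preserves ordering
+    return np.clip(scores, 0.0, 1.0)        # clip_values: True
+
+
+print(process_scores([-3.2, np.nan, 0.4, 2.1, 5.8]))
+```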
+
+### 2. Calibration Settings
+
+```python
+# Default calibration configuration
+calibration_defaults = {
+    'strategy': 'dataset_specific',  # Use task-specific calibration
+    'num_samples': 1000,             # Default calibration set size
+    'background_data': None,         # No background data by default
+    'split_ratio': None,             # No train/test split
+    'cache_enabled': True            # Cache calibration parameters
+}
+```
+
+### 3. Method Selection
+
+```python
+# Default normalization method selection
+method_defaults = {
+    'primary_method': 'isotonic_pcc',  # Default to Isotonic PCC
+    'fallback_method': 'minmax',       # Use MinMax as fallback
+    'combine_methods': False,          # Don't combine multiple methods
+    'quality_metric': 'auto'           # Auto-select appropriate metric
+}
+```
+
+### 4. Task-Specific Defaults
+
+```yaml
+# Task-type specific defaults
+task_defaults:
+  qa:
+    metric: 'accuracy'
+    normalize_answers: true
+    ignore_case: true
+
+  translation:
+    metric: 'bleu'
+    normalize_translations: true
+    source_cleaning: true
+
+  summarization:
+    metric: 'rouge'
+    normalize_summaries: true
+    trim_outputs: true
+```
+
+### 5. Error Handling
+
+```python
+# Default error handling behavior
+error_handling = {
+    'invalid_scores': 'skip',         # Skip invalid uncertainty scores
+    'missing_metrics': 'error',       # Raise an error for missing metrics
+    'calibration_fails': 'fallback',  # Use fallback method if calibration fails
+    'out_of_bounds': 'clip'           # Clip out-of-bounds values
+}
+```
+
+### 6. Memory Management
+
+```python
+# Default memory management settings
+memory_settings = {
+    'cache_location': '~/.cache/lm-polygraph/norm',
+    'max_cache_size': '1GB',
+    'clear_cache_on_exit': False,
+    'compression': True
+}
+```
+
+## Usage Guidelines
+
+1. **Choosing Calibration Data**
+   - Use task-specific data when available
+   - Ensure the calibration set is representative
+   - Consider using background data for sparse tasks
+   - Monitor calibration set size vs. performance
+
+2. **Method Selection**
+   - Start with Isotonic PCC for the best balance (see the sketch at the end of this page)
+   - Use MinMax for simple scaling needs
+   - Consider Binned PCC for interpretability
+   - Evaluate multiple methods if uncertain
+
+3. **Error Handling**
+   - Monitor normalization failures
+   - Validate calibration success
+   - Check normalized score distributions
+   - Verify quality metric calculations
+
+4. **Performance Optimization**
+   - Enable caching for repeated use
+   - Adjust calibration set size as needed
+   - Use appropriate quality metrics
+   - Monitor memory usage
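+
+As a closing illustration, a PCC-style calibration can be sketched with scikit-learn's `IsotonicRegression`. LM-Polygraph's Isotonic PCC uses Centered Isotonic Regression, so treat this only as an approximation of the real method; the calibration data below is made up:
+
+```python
+# End-to-end sketch: calibrate uncertainty against quality, then normalize.
+import numpy as np
+from sklearn.isotonic import IsotonicRegression
+
+# Calibration data: raw uncertainty scores with matching quality metrics in [0, 1].
+uncertainty = np.array([-2.5, -1.0, 0.3, 1.8, 3.9, 5.2])
+quality = np.array([0.95, 0.90, 0.70, 0.55, 0.30, 0.10])
+
+# Decreasing fit: higher uncertainty maps to lower expected quality.
+calibrator = IsotonicRegression(
+    y_min=0.0, y_max=1.0, increasing=False, out_of_bounds="clip")
+calibrator.fit(uncertainty, quality)
+
+# Normalized confidence for new uncertainty estimates.
+print(calibrator.predict(np.array([-3.0, 0.0, 6.0])))
+```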