diff --git a/docs/normalization/0_basic_info.md b/docs/normalization/0_basic_info.md
new file mode 100644
index 00000000..e99e55b7
--- /dev/null
+++ b/docs/normalization/0_basic_info.md
@@ -0,0 +1,60 @@
+# LM-Polygraph Normalization Methods
+
+LM-Polygraph implements several uncertainty normalization methods to convert raw uncertainty scores into more interpretable confidence values bounded between 0 and 1. Here are the key normalization approaches:
+
+## Simple Scaling Methods
+
+### MinMax Normalization (MinMaxNormalizer in `minmax.py`)
+
+- Takes raw uncertainty scores and linearly scales them to the [0,1] range.
+- Flips the sign, since uncertainty scores are negatively correlated with confidence.
+- Uses scikit-learn's `MinMaxScaler` internally.
+- Simple, but does not maintain a direct connection to output quality.
+
+### Quantile Normalization (QuantileNormalizer in `quantile.py`)
+
+- Transforms uncertainty scores into their corresponding percentile ranks.
+- Uses the empirical CDF to map scores to the [0,1] range.
+- Provides uniformly distributed confidence scores.
+- May lose some granularity of the original uncertainty estimates.
+
+## Performance-Calibrated Confidence (PCC) Methods
+
+### Binned PCC (BinnedPCCNormalizer in `binned_pcc.py`)
+
+- Splits calibration data into bins based on uncertainty values.
+- Each bin contains approximately the same number of samples.
+- The confidence score is the mean output quality of samples in the corresponding bin.
+- Provides an interpretable connection between confidence and expected quality.
+- Drawback: can change the ordering of samples relative to the raw uncertainty scores.
+
+### Isotonic PCC (IsotonicPCCNormalizer in `isotonic_pcc.py`)
+
+- Uses Centered Isotonic Regression (CIR) to fit a monotonic relationship.
+- Maps uncertainty scores to output quality while preserving ordering.
+- Enforces a monotonicity constraint to maintain the uncertainty ranking.
+- More robust than the binned approach while maintaining interpretability.
+- Implementation based on the CIR algorithm of Oron & Flournoy (2017).
+
+## Common Interface: `BaseUENormalizer`
+
+All normalizers follow a common interface defined in `BaseUENormalizer`:
+
+- `fit()`: Learns normalization parameters from calibration data.
+- `transform()`: Applies normalization to new uncertainty scores.
+- `dumps()/loads()`: Serialization support for fitted normalizers.
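+
+Below is a minimal, self-contained sketch of this contract. The class name and internals are illustrative only: they mirror the documented behavior of `MinMaxNormalizer`, not the library's actual code.
+
+```python
+# Illustrative sketch of the BaseUENormalizer interface, not the real implementation.
+import pickle
+
+import numpy as np
+from sklearn.preprocessing import MinMaxScaler
+
+
+class SketchMinMaxNormalizer:
+    def fit(self, uncertainties):
+        # Flip the sign first: high uncertainty should map to low confidence.
+        scores = -np.asarray(uncertainties, dtype=float).reshape(-1, 1)
+        self.scaler = MinMaxScaler().fit(scores)
+        return self
+
+    def transform(self, uncertainties):
+        scores = -np.asarray(uncertainties, dtype=float).reshape(-1, 1)
+        # Clip, since new scores may fall outside the calibration range.
+        return np.clip(self.scaler.transform(scores), 0.0, 1.0).ravel()
+
+    def dumps(self):
+        return pickle.dumps(self.scaler)
+
+    @classmethod
+    def loads(cls, data):
+        normalizer = cls()
+        normalizer.scaler = pickle.loads(data)
+        return normalizer
+
+
+normalizer = SketchMinMaxNormalizer().fit([-3.2, -1.7, 0.4, 2.1, 5.8])
+print(normalizer.transform([-2.0, 0.3, 4.9]))  # confidences in [0, 1]
+```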
+
+## Key Benefits of PCC Methods
+
+- Direct connection to output quality metrics.
+- Bounded, interpretable range [0,1].
+- Maintains correlation with generation quality.
+- Easy to explain to end users.
+
+## Highlight: Isotonic PCC
+
+The Isotonic PCC approach provides the best balance between:
+
+- Maintaining the original uncertainty ranking.
+- Providing interpretable confidence scores.
+- Establishing a clear connection to expected output quality.
+
+When using normalized scores, users can interpret them as estimates of relative output quality, making them more useful for downstream applications and human understanding.
\ No newline at end of file
diff --git a/docs/normalization/1_core_normalization_configuration.md b/docs/normalization/1_core_normalization_configuration.md
new file mode 100644
index 00000000..5632391a
--- /dev/null
+++ b/docs/normalization/1_core_normalization_configuration.md
@@ -0,0 +1,141 @@
+# Core Normalization Configuration
+
+## Overview
+Core normalization configuration in LM-Polygraph defines how uncertainty scores are transformed into interpretable confidence values. These configurations control the fundamental behavior of all normalization methods across the system.
+
+## Base Configuration Location
+Core normalization configurations are located in:
+```
+/examples/configs/normalization/fit/default.yaml
+```
+
+## Available Normalization Methods
+
+### 1. MinMax Normalization
+Linearly scales uncertainty scores to the [0,1] range.
+
+```yaml
+normalization:
+  type: "minmax"
+  clip: true  # Whether to clip values outside [0,1] range
+```
+
+### 2. Quantile Normalization
+Transforms scores into percentile ranks using the empirical CDF.
+
+```yaml
+normalization:
+  type: "quantile"
+```
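+
+For intuition, quantile normalization amounts to an empirical-CDF lookup. The function below is a sketch of the idea, not the library's code; the sign flip assumes higher raw scores mean higher uncertainty:
+
+```python
+# Empirical-CDF sketch of quantile normalization.
+import numpy as np
+
+
+def fit_quantile_normalizer(calibration_scores):
+    sorted_scores = np.sort(np.asarray(calibration_scores, dtype=float))
+
+    def transform(uncertainties):
+        # Percentile rank among the calibration scores, flipped so that
+        # high uncertainty yields low confidence.
+        ranks = np.searchsorted(sorted_scores, uncertainties, side="right")
+        return 1.0 - ranks / len(sorted_scores)
+
+    return transform
+
+
+transform = fit_quantile_normalizer([-3.2, -1.7, 0.4, 2.1, 5.8])
+print(transform([-2.0, 0.3, 4.9]))  # approximately uniform confidences in [0, 1]
+```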
+
+### 3. Binned Performance-Calibrated Confidence (Binned PCC)
+Maps uncertainty scores to confidence bins based on output quality.
+
+```yaml
+normalization:
+  type: "binned_pcc"
+  params:
+    num_bins: 10  # Number of bins for mapping
+```
+
+### 4. Isotonic Performance-Calibrated Confidence (Isotonic PCC)
+Uses monotonic regression to map uncertainty to confidence while preserving ordering.
+
+```yaml
+normalization:
+  type: "isotonic_pcc"
+  params:
+    y_min: 0.0  # Minimum confidence value
+    y_max: 1.0  # Maximum confidence value
+    increasing: false  # Decreasing fit: higher uncertainty, lower confidence
+    out_of_bounds: "clip"  # How to handle out-of-range values
+```
+
+## Common Parameters
+
+### Calibration Strategy
+```yaml
+normalization:
+  calibration:
+    strategy: "dataset_specific"  # or "global"
+    background_dataset: null  # Optional background dataset for global calibration
+```
+
+### Data Processing
+```yaml
+normalization:
+  processing:
+    ignore_nans: true  # Whether to ignore NaN values in calibration
+    normalize_metrics: true  # Whether to normalize quality metrics
+```
+
+### Caching
+```yaml
+normalization:
+  cache:
+    enabled: true
+    path: "${cache_path}/normalization"
+    version: "v1"
+```
+
+## Usage Examples
+
+### Basic MinMax Normalization
+```yaml
+normalization:
+  type: "minmax"
+  clip: true
+  calibration:
+    strategy: "dataset_specific"
+```
+
+### Global Isotonic PCC
+```yaml
+normalization:
+  type: "isotonic_pcc"
+  params:
+    y_min: 0.0
+    y_max: 1.0
+    increasing: false
+  calibration:
+    strategy: "global"
+    background_dataset: "allenai/c4"
+```
+
+### Binned PCC with Custom Settings
+```yaml
+normalization:
+  type: "binned_pcc"
+  params:
+    num_bins: 20
+  processing:
+    ignore_nans: false
+    normalize_metrics: true
+  cache:
+    enabled: true
+```
+
+## Best Practices
+
+1. **Method Selection**
+   - Use MinMax/Quantile for simple scaling needs
+   - Use PCC methods when interpretability is crucial
+   - Prefer Isotonic PCC when preserving score ordering is important
+
+2. **Calibration Strategy**
+   - Use dataset-specific calibration when possible
+   - Use global calibration when consistency across tasks is needed
+   - Consider using a background dataset for robust global calibration
+
+3. **Performance Considerations**
+   - Enable caching for large datasets
+   - Adjust bin count based on dataset size
+   - Monitor memory usage with large calibration sets
+
+## Integration with Other Configs
+Core normalization settings can be overridden by:
+- Task-specific configs
+- Model-specific configs
+- Instruction-tuned model configs
+
+Core settings serve as defaults when not specified in other configuration layers.
\ No newline at end of file
diff --git a/docs/normalization/2_dataset_specific_configuration.md b/docs/normalization/2_dataset_specific_configuration.md
new file mode 100644
index 00000000..566eedba
--- /dev/null
+++ b/docs/normalization/2_dataset_specific_configuration.md
@@ -0,0 +1,156 @@
+# Dataset-Specific Normalization Configurations in LM-Polygraph
+
+## Overview
+Dataset-specific normalization configurations in LM-Polygraph allow fine-tuning how uncertainty scores are normalized for different tasks and data types. These configurations can be found in the evaluation config files under `/examples/configs/` and its subfolders.
+
+## Configuration Structure
+
+### 1. Common Parameters
+
+Every dataset-specific configuration includes these core normalization parameters:
+
+```yaml
+# Dataset sampling configuration
+subsample_background_train_dataset: 1000  # Size of background dataset for normalization
+subsample_train_dataset: 1000  # Size of task-specific calibration dataset
+subsample_eval_dataset: -1  # Size of evaluation dataset (-1 = full)
+
+# Training data settings
+train_dataset: null  # Optional separate training dataset
+train_test_split: false  # Whether to split data for calibration
+test_split_size: 1  # Test split ratio if splitting enabled
+
+# Background dataset configuration
+background_train_dataset: allenai/c4  # Default background dataset
+background_train_dataset_text_column: text  # Text column name
+background_train_dataset_label_column: url  # Label column name
+background_load_from_disk: false  # Loading mode
+```
+
+### 2. Task-Specific Configurations
+
+#### Question-Answering Tasks (TriviaQA, MMLU, CoQA)
+```yaml
+# Additional QA-specific settings
+process_output_fn:
+  path: output_processing_scripts/qa_normalize.py
+  fn_name: normalize_qa
+normalize: true
+normalize_metrics: true
+target_ignore_regex: null
+```
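+
+A function like `normalize_qa` typically performs SQuAD-style answer normalization before metrics are computed. The body below is an illustrative guess at such a script, not the repository's actual code:
+
+```python
+# Sketch of a QA output-processing function (SQuAD-style normalization).
+import re
+import string
+
+
+def normalize_qa(text: str) -> str:
+    """Normalize a generated answer before exact-match comparison."""
+    text = text.lower()
+    text = "".join(ch for ch in text if ch not in string.punctuation)
+    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
+    return " ".join(text.split())                # collapse whitespace
+
+
+print(normalize_qa("The Eiffel Tower!"))  # -> "eiffel tower"
+```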
+
+#### Translation Tasks (WMT)
+```yaml
+# Translation-specific normalization
+source_ignore_regex: "^.*?: "  # Regex to clean source text
+target_ignore_regex: null  # Regex to clean target text
+normalize_translations: true
+```
+
+#### Summarization Tasks (XSum, AESLC)
+```yaml
+# Summarization normalization
+normalize_summaries: true
+output_ignore_regex: null
+processing:
+  trim_outputs: true
+  lowercase: true
+```
+
+### 3. Language-Specific Settings
+
+For multilingual tasks (especially in claim-level fact-checking):
+
+```yaml
+# Language-specific normalization
+language: "en"  # Options: en, zh, ar, ru
+multilingual_normalization:
+  enabled: true
+  use_language_specific_bins: true
+  combine_language_statistics: false
+```
+
+## Usage Examples
+
+### 1. Basic QA Task Configuration
+```yaml
+hydra:
+  run:
+    dir: ${cache_path}/${task}/${model.path}/${dataset}/${now:%Y-%m-%d}
+
+defaults:
+  - model: default
+  - _self_
+
+dataset: triviaqa
+subsample_train_dataset: 1000
+normalize: true
+process_output_fn:
+  path: output_processing_scripts/triviaqa.py
+  fn_name: normalize_qa
+```
+
+### 2. Translation Task Setup
+```yaml
+dataset: wmt14_deen
+subsample_train_dataset: 2000
+source_ignore_regex: "^Translation: "
+normalize_translations: true
+background_train_dataset: null
+```
+
+### 3. Multilingual Configuration
+```yaml
+dataset: person_bio
+language: zh
+multilingual_normalization:
+  enabled: true
+  use_language_specific_bins: true
+subsample_train_dataset: 1000
+background_train_dataset: allenai/c4
+```
+
+## Key Considerations
+
+### 1. Dataset Size and Sampling
+- Use `subsample_train_dataset` to control the calibration dataset size
+- Larger values provide better calibration but increase compute time
+- The default value of 1000 works well for most tasks
+
+### 2. Background Dataset Usage
+- The background dataset provides additional calibration data
+- Useful for tasks with limited in-domain data
+- The C4 dataset is the default choice for English tasks
+
+### 3. Processing and Cleaning
+- Task-specific normalization functions handle special cases
+- Regular expressions clean input/output texts
+- Language-specific processing for multilingual tasks
+
+### 4. Performance Impact
+- Larger sample sizes improve normalization quality but increase computational cost
+- Background dataset usage adds overhead
+- Consider caching normalized values for repeated evaluations
+
+## Best Practices
+
+1. **Dataset Size Selection**
+   - Use at least 1000 samples for calibration
+   - Increase for complex tasks or when accuracy is critical
+   - Consider the computational resources available
+
+2. **Background Dataset Usage**
+   - Use for tasks with limited training data
+   - Ensure the background data distribution matches the task
+   - Consider language and domain compatibility
+
+3. **Processing Configuration**
+   - Configure task-specific normalization functions
+   - Use appropriate regex patterns for cleaning
+   - Enable language-specific processing for multilingual tasks
+
+4. **Optimization Tips**
+   - Cache normalized values when possible
+   - Use smaller sample sizes during development
+   - Enable background dataset loading from disk for large datasets
\ No newline at end of file
diff --git a/docs/normalization/3_instruction_tuned_model.md b/docs/normalization/3_instruction_tuned_model.md
new file mode 100644
index 00000000..06327a59
--- /dev/null
+++ b/docs/normalization/3_instruction_tuned_model.md
@@ -0,0 +1,223 @@
+# Instruction-Tuned Model Normalization Configurations in LM-Polygraph
+
+## Overview
+Instruction-tuned model configurations in LM-Polygraph provide specialized normalization settings for models that have been fine-tuned on instruction data. These configurations are located in `/examples/configs/instruct/` and include specific processing scripts and parameters for handling instruction-formatted inputs and outputs.
+
+## Configuration Structure
+
+### 1. Base Processing Configuration
+Located in `/examples/configs/instruct/`, base processing configs define foundational normalization settings:
+
+```yaml
+# Base processing for instruction-tuned models
+process_output_fn:
+  path: instruct/output_processing_scripts/default.py
+  fn_name: normalize_em
+process_target_fn:
+  path: instruct/output_processing_scripts/default.py
+  fn_name: normalize_em
+```
+
+### 2. Task-Specific Processing
+
+#### CoQA Processing
+```yaml
+# CoQA-specific instruction normalization
+process_output_fn:
+  path: instruct/output_processing_scripts/coqa.py
+  fn_name: normalize_em_coqa
+process_target_fn:
+  path: instruct/output_processing_scripts/coqa.py
+  fn_name: normalize_em_coqa
+```
+
+#### TriviaQA Processing
+```yaml
+# TriviaQA-specific instruction normalization
+process_output_fn:
+  path: instruct/output_processing_scripts/triviaqa.py
+  fn_name: normalize_em_triviaqa
+process_target_fn:
+  path: instruct/output_processing_scripts/triviaqa.py
+  fn_name: normalize_em_triviaqa
+```
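+
+Each `path`/`fn_name` pair points at a Python script and a function inside it. Resolving such a reference looks roughly like the sketch below; the mechanism is standard `importlib`, and LM-Polygraph's actual loader may differ in detail:
+
+```python
+# Sketch of loading a processing function referenced by path and fn_name.
+import importlib.util
+
+
+def load_processing_fn(path: str, fn_name: str):
+    spec = importlib.util.spec_from_file_location("processing_module", path)
+    module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(module)  # execute the script once
+    return getattr(module, fn_name)
+
+
+# Usage mirroring the TriviaQA config above (paths are illustrative):
+# normalize = load_processing_fn(
+#     "instruct/output_processing_scripts/triviaqa.py", "normalize_em_triviaqa")
+# print(normalize("Answer: Paris."))
+```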
+
+### 3. Processing Types
+
+#### Chain-of-Thought (CoT) Processing
+```yaml
+# CoT processing settings
+cot_processing:
+  enabled: true
+  extract_final_answer: true
+  normalize_reasoning: false
+  ignore_intermediate_steps: true
+```
+
+#### Top-K Processing
+```yaml
+# Top-K response processing
+topk_processing:
+  enabled: true
+  k: 4  # Number of alternatives to consider
+  aggregate_method: "max"  # How to combine multiple predictions
+```
+
+#### Top-1 Processing
+```yaml
+# Top-1 response processing
+top1_processing:
+  enabled: true
+  normalize_confidence: true
+  extract_probability: true
+```
+
+## Model-Specific Configurations
+
+### 1. Model Type Settings
+```yaml
+defaults:
+  - model: default_causal
+  - _self_
+
+model:
+  type: "CausalLM"
+  path_to_load_script: model/default_causal.py
+  generation_params:
+    do_sample: false
+    num_beams: 1
+    temperature: 1.0
+```
+
+### 2. Specialized Model Examples
+
+#### Mistral Configuration
+```yaml
+model:
+  path: mistral-7b-instruct-v0.2
+  type: "CausalLM"
+  load_model_args:
+    device_map: auto
+    trust_remote_code: true
+  load_tokenizer_args:
+    trust_remote_code: true
+```
+
+#### StableLM Configuration
+```yaml
+model:
+  path: stablelm-2-12b-chat
+  type: "CausalLM"
+  load_model_args:
+    device_map: auto
+    use_flash_attention: true
+```
+
+## Integration Features
+
+### 1. Processing Pipeline Integration
+- Custom normalization functions for instruction-formatted outputs
+- Task-specific answer extraction
+- Confidence score normalization
+
+### 2. Model Output Processing
+- Handling of structured instruction outputs
+- Extraction of final answers from reasoning chains
+- Normalization of multiple response formats
+
+### 3. Configuration Inheritance
+- Base processing settings inheritance
+- Task-specific overrides
+- Model-specific adaptations
+
+## Best Practices
+
+### 1. Processing Function Selection
+- Use task-specific normalizers when available
+- Fall back to default processors for general cases
+- Consider the instruction format when selecting processors
+
+### 2. Confidence Handling
+- Enable confidence normalization for compatible models
+- Configure appropriate aggregation methods for multiple outputs
+- Consider model-specific confidence scales
+
+### 3. Chain-of-Thought Processing
+- Enable for models trained with CoT
+- Configure appropriate answer extraction
+- Consider preserving the reasoning steps
+
+### 4. Performance Optimization
+- Enable caching for processed outputs
+- Configure batch processing when possible
+- Balance processing complexity with performance needs
+
+## Example Configurations
+
+### 1. Basic Instruction Model Setup
+```yaml
+defaults:
+  - model: default_causal
+  - _self_
+
+process_output_fn:
+  path: instruct/output_processing_scripts/default.py
+  fn_name: normalize_em
+
+top1_processing:
+  enabled: true
+  normalize_confidence: true
+```
+
+### 2. CoT Model Configuration
+```yaml
+defaults:
+  - model: mistral-instruct
+  - _self_
+
+cot_processing:
+  enabled: true
+  extract_final_answer: true
+
+process_output_fn:
+  path: instruct/output_processing_scripts/cot.py
+  fn_name: normalize_cot
+```
+
+### 3. Multi-Task Model Setup
+```yaml
+defaults:
+  - model: stablelm-chat
+  - _self_
+
+process_output_fn:
+  path: instruct/output_processing_scripts/multi_task.py
+  fn_name: normalize_mt
+
+topk_processing:
+  enabled: true
+  k: 4
+  aggregate_method: "max"
+```
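+
+To make `aggregate_method` concrete, the sketch below shows how k alternative confidence scores might be combined. The function and behavior are illustrative assumptions, not the library's aggregation code:
+
+```python
+# Sketch of aggregating top-k per-alternative confidence scores.
+import statistics
+
+
+def aggregate_confidences(confidences, method="max"):
+    if method == "max":
+        return max(confidences)
+    if method == "mean":
+        return statistics.fmean(confidences)
+    raise ValueError(f"Unknown aggregate_method: {method}")
+
+
+print(aggregate_confidences([0.81, 0.64, 0.52, 0.40], method="max"))  # 0.81
+```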
+
+## Common Issues and Solutions
+
+### 1. Output Format Mismatches
+- Problem: Model outputs don't match the expected instruction format
+- Solution: Configure custom processing functions
+- Example: Use task-specific normalizers
+
+### 2. Confidence Scale Differences
+- Problem: Different models use different confidence scales
+- Solution: Enable confidence normalization
+- Example: Configure model-specific scaling
+
+### 3. Processing Pipeline Conflicts
+- Problem: Multiple processing steps interfere with each other
+- Solution: Configure the processing order
+- Example: Set priorities for different normalizers
+
+### 4. Performance Bottlenecks
+- Problem: Slow processing of instruction outputs
+- Solution: Enable caching and batch processing
+- Example: Configure appropriate batch sizes
\ No newline at end of file
diff --git a/docs/normalization/4_impact_areas_and_default_behaviors.md b/docs/normalization/4_impact_areas_and_default_behaviors.md
new file mode 100644
index 00000000..0564855a
--- /dev/null
+++ b/docs/normalization/4_impact_areas_and_default_behaviors.md
@@ -0,0 +1,164 @@
+# LM-Polygraph Normalization: Impact Areas and Default Behaviors
+
+## Normalization Impact Areas
+
+### 1. Score Transformation
+- **Raw Uncertainty Scores**
+  - Original uncertainty estimates in unbounded ranges
+  - Higher values indicate more uncertainty
+  - Various scales depending on the estimation method
+
+- **Normalized Confidence Values**
+  - Bounded in the [0,1] range
+  - Higher values indicate more confidence
+  - Directly interpretable probabilities
+  - Preserves the relative ordering of original scores (for Isotonic PCC)
+
+### 2. Evaluation Pipeline
+- **Calibration Stage**
+  - Uses a calibration dataset to learn normalization parameters
+  - Requires generation quality metrics for Performance-Calibrated Confidence
+  - Can use either task-specific or background data
+  - Parameters are saved for reuse
+
+- **Inference Stage**
+  - Applies the learned normalization to new uncertainty estimates
+  - No additional model inference required
+  - Fast transformation using stored parameters
+  - Can be applied to any compatible uncertainty estimator
+
+### 3. Quality Metrics Integration
+- **Metric Normalization**
+  - Quality metrics are normalized to the [0,1] range
+  - Enables consistent calibration across different metrics
+  - Handles various metric types (ROUGE, BLEU, accuracy, etc.)
+  - Supports both bounded and unbounded metrics
+
+- **Metric Selection**
+  - Different metrics for different tasks
+  - Task-specific normalization of quality scores
+  - Multiple metrics can be used simultaneously
+  - Quality metrics guide confidence calibration
+
+### 4. Model Type Support
+- **White-box Models**
+  - Access to internal probabilities and logits
+  - Can normalize token-level uncertainties
+  - Supports both sequence- and token-level calibration
+  - Works with HuggingFace models
+
+- **Black-box Models**
+  - Limited to output-based uncertainty estimation
+  - Only sequence-level normalization
+  - Compatible with API-based models (OpenAI, etc.)
+  - No access to internal model states
+
+## Default Behaviors
+
+### 1. Score Processing
+
+```python
+# Default score processing behavior
+normalize_scores = {
+    'clip_values': True,        # Clip to [0,1] range
+    'flip_uncertainty': True,   # Convert uncertainty to confidence
+    'preserve_order': True,     # Maintain sample ordering
+    'handle_nans': 'ignore'     # Skip NaN values in calibration
+}
+```
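+
+Applied to an array of raw scores, these defaults correspond to roughly the following transformation. This is a sketch that assumes simple min-max scaling behind `clip_values` and `flip_uncertainty`:
+
+```python
+# Sketch of the default score-processing pipeline.
+import numpy as np
+
+
+def process_scores(raw_uncertainty):
+    scores = np.asarray(raw_uncertainty, dtype=float)
+    scores = scores[~np.isnan(scores)]      # handle_nans: 'ignore'
+    scores = -scores                        # flip_uncertainty: True
+    low, high = scores.min(), scores.max()
+    scores = (scores - low) / (high - low)  # monotonic, preserves ordering
+    return np.clip(scores, 0.0, 1.0)        # clip_values: True
+
+
+print(process_scores([-3.2, np.nan, 0.4, 2.1, 5.8]))
+```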
+
+### 2. Calibration Settings
+
+```python
+# Default calibration configuration
+calibration_defaults = {
+    'strategy': 'dataset_specific',  # Use task-specific calibration
+    'num_samples': 1000,             # Default calibration set size
+    'background_data': None,         # No background data by default
+    'split_ratio': None,             # No train/test split
+    'cache_enabled': True            # Cache calibration parameters
+}
+```
+
+### 3. Method Selection
+
+```python
+# Default normalization method selection
+method_defaults = {
+    'primary_method': 'isotonic_pcc',  # Default to Isotonic PCC
+    'fallback_method': 'minmax',       # Use MinMax as fallback
+    'combine_methods': False,          # Don't combine multiple methods
+    'quality_metric': 'auto'           # Auto-select appropriate metric
+}
+```
+
+### 4. Task-Specific Defaults
+
+```yaml
+# Task-type specific defaults
+task_defaults:
+  qa:
+    metric: 'accuracy'
+    normalize_answers: true
+    ignore_case: true
+
+  translation:
+    metric: 'bleu'
+    normalize_translations: true
+    source_cleaning: true
+
+  summarization:
+    metric: 'rouge'
+    normalize_summaries: true
+    trim_outputs: true
+```
+
+### 5. Error Handling
+
+```python
+# Default error handling behavior
+error_handling = {
+    'invalid_scores': 'skip',         # Skip invalid uncertainty scores
+    'missing_metrics': 'error',       # Raise an error for missing metrics
+    'calibration_fails': 'fallback',  # Use fallback method if calibration fails
+    'out_of_bounds': 'clip'           # Clip out-of-bounds values
+}
+```
+
+### 6. Memory Management
+
+```python
+# Default memory management settings
+memory_settings = {
+    'cache_location': '~/.cache/lm-polygraph/norm',
+    'max_cache_size': '1GB',
+    'clear_cache_on_exit': False,
+    'compression': True
+}
+```
+
+## Usage Guidelines
+
+1. **Choosing Calibration Data**
+   - Use task-specific data when available
+   - Ensure the calibration set is representative
+   - Consider using background data for sparse tasks
+   - Monitor calibration set size vs. performance
+
+2. **Method Selection**
+   - Start with Isotonic PCC for the best balance (see the sketch at the end of this page)
+   - Use MinMax for simple scaling needs
+   - Consider Binned PCC for interpretability
+   - Evaluate multiple methods if uncertain
+
+3. **Error Handling**
+   - Monitor normalization failures
+   - Validate calibration success
+   - Check normalized score distributions
+   - Verify quality metric calculations
+
+4. **Performance Optimization**
+   - Enable caching for repeated use
+   - Adjust calibration set size as needed
+   - Use appropriate quality metrics
+   - Monitor memory usage
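+
+As a closing illustration, a PCC-style calibration can be sketched with scikit-learn's `IsotonicRegression`. LM-Polygraph's Isotonic PCC uses Centered Isotonic Regression, so treat this only as an approximation of the real method; the calibration data below is made up:
+
+```python
+# End-to-end sketch: calibrate uncertainty against quality, then normalize.
+import numpy as np
+from sklearn.isotonic import IsotonicRegression
+
+# Calibration data: raw uncertainty scores with matching quality metrics in [0, 1].
+uncertainty = np.array([-2.5, -1.0, 0.3, 1.8, 3.9, 5.2])
+quality = np.array([0.95, 0.90, 0.70, 0.55, 0.30, 0.10])
+
+# Decreasing fit: higher uncertainty maps to lower expected quality.
+calibrator = IsotonicRegression(
+    y_min=0.0, y_max=1.0, increasing=False, out_of_bounds="clip")
+calibrator.fit(uncertainty, quality)
+
+# Normalized confidence for new uncertainty estimates.
+print(calibrator.predict(np.array([-3.0, 0.0, 6.0])))
+```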