From 3120f1187b6aa2840b31b074df0b474524b9037d Mon Sep 17 00:00:00 2001
From: arva_puls8 <125344600+Atharva-developer@users.noreply.github.com>
Date: Thu, 2 Jan 2025 20:06:48 +0530
Subject: [PATCH] Elaborate steps for handling lagging finality (#556)

---
 docs/recovery-steps.md | 47 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
 create mode 100644 docs/recovery-steps.md

diff --git a/docs/recovery-steps.md b/docs/recovery-steps.md
new file mode 100644
index 000000000..93fc246bc
--- /dev/null
+++ b/docs/recovery-steps.md
@@ -0,0 +1,47 @@
+# Recovery Steps and Best Practices for Handling Lagging Finality
+
+## Recovery Steps
+### 1. **Detect Lagging Finality**
+- **Check Logs**:
+  - Scan the logs for messages such as `"finality is lagging"`, or for a growing discrepancy between the reported `best block` and `finalized block` heights (see the log-scan sketch at the end of this document).
+- **Monitor Metrics**:
+  - Compare `best_block_height` and `finalized_block_height`. If the gap exceeds a predefined threshold, finality is lagging (see the metrics sketch below).
+
+### 2. **Diagnose the Issue**
+- **Validator Status**:
+  - Identify offline validators by querying validator status.
+  - Check whether validators are unreachable or experiencing issues.
+- **Network Health**:
+  - Detect network partitions or connectivity issues by examining peer counts and latency metrics (see the health-check sketch below).
+
+### 3. **Take Corrective Actions**
+#### a. Restart Offline Validators
+1. Identify the validators that are down.
+2. Restart them using the appropriate system or orchestration tool (e.g., Docker, Kubernetes; see the restart sketch below).
+3. Verify that the validators have rejoined the network.
+
+#### b. Resynchronize Nodes
+1. Identify out-of-sync nodes.
+2. Resynchronize them by resetting their database or reinitializing their state, then let them re-download state from healthy peers (see the resync sketch below).
+
+#### c. Resolve Network Partitions
+1. Investigate and fix the connectivity issues causing the partition.
+2. Ensure that nodes can communicate across subnets and regions.
+
+### 4. **Verify Recovery**
+- Compare `best_block_height` and `finalized_block_height` again. The finalized height normally trails the best height by a small, steady margin; once the gap returns to within that normal threshold, finality has been restored (see the verification sketch below).
+- Confirm that validators are online and producing blocks.
+
+## Best Practices
+1. **Validator Setup**:
+   - Use reliable hardware with redundancy.
+   - Maintain backup nodes for failover.
+2. **Network Configuration**:
+   - Ensure low-latency, high-bandwidth connections.
+   - Distribute nodes geographically to avoid single points of failure.
+3. **Monitoring and Alerts**:
+   - Use tools like Prometheus and Grafana to monitor metrics.
+   - Set alerts for abnormal block height discrepancies or validator downtime.
+4. **Disaster Recovery Plan**:
+   - Maintain and test recovery scripts regularly.
+   - Document procedures for common failure scenarios.
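+
+## Example Sketches
+The sketches below illustrate the steps above. They are minimal and make assumptions: endpoint URLs, file paths, thresholds, and container names are placeholders to adapt to your deployment, and the metric and message names are simply the ones used in this document. All sketches use Python 3 with only the standard library.
+
+### Log-scan sketch (step 1)
+A minimal sketch for spotting finality warnings in node logs. The log path and the exact message text are assumptions; match them to what your node actually emits.
+
+```python
+# Log-scan sketch: flags log lines that mention lagging finality.
+# LOG_PATH is a hypothetical location; point it at your node's log file.
+from pathlib import Path
+
+LOG_PATH = Path("/var/log/node/node.log")
+
+def lagging_finality_lines(path: Path) -> list[str]:
+    """Return every log line containing the lagging-finality warning."""
+    return [line for line in path.read_text().splitlines()
+            if "finality is lagging" in line.lower()]
+
+if __name__ == "__main__":
+    # Print only the most recent hits to keep the output readable.
+    for hit in lagging_finality_lines(LOG_PATH)[-5:]:
+        print(hit)
+```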
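+
+### Metrics sketch (step 1)
+A sketch of the metrics comparison, assuming the node exposes a Prometheus-style text endpoint. `NODE_METRICS_URL` and `LAG_THRESHOLD` are assumptions; the metric names are the ones used in this document.
+
+```python
+# Metrics sketch: computes the gap between best and finalized heights from
+# a Prometheus-style text endpoint.
+import urllib.request
+
+NODE_METRICS_URL = "http://localhost:9615/metrics"  # hypothetical endpoint
+LAG_THRESHOLD = 20  # blocks; tune to your chain's normal finality delay
+
+def scrape_metric(text: str, name: str) -> float:
+    """Return the value of the first sample whose name starts with `name`."""
+    for line in text.splitlines():
+        if line.startswith(name):
+            return float(line.split()[-1])
+    raise KeyError(f"metric {name!r} not found")
+
+def finality_gap() -> int:
+    body = urllib.request.urlopen(NODE_METRICS_URL, timeout=5).read().decode()
+    best = scrape_metric(body, "best_block_height")
+    finalized = scrape_metric(body, "finalized_block_height")
+    return int(best - finalized)
+
+if __name__ == "__main__":
+    gap = finality_gap()
+    if gap > LAG_THRESHOLD:
+        print(f"finality is lagging: gap of {gap} blocks exceeds {LAG_THRESHOLD}")
+    else:
+        print(f"finality healthy: gap of {gap} blocks")
+```
+
+The same check can run on a schedule (e.g., from cron or an alerting pipeline) to cover the monitoring best practice above.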
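+
+### Health-check sketch (step 2)
+A sketch of the peer-count check over JSON-RPC. `RPC_URL` is a placeholder, and `system_health` is the Substrate-style method name used here purely as an example; substitute whatever health or status RPC your node exposes.
+
+```python
+# Health-check sketch: queries peer count over JSON-RPC to spot partitions.
+import json
+import urllib.request
+
+RPC_URL = "http://localhost:9944"  # hypothetical RPC endpoint
+MIN_PEERS = 3  # below this, suspect a partition or connectivity problem
+
+def rpc(method: str, params: list | None = None) -> dict:
+    payload = json.dumps({"jsonrpc": "2.0", "id": 1,
+                          "method": method, "params": params or []}).encode()
+    req = urllib.request.Request(RPC_URL, data=payload,
+                                 headers={"Content-Type": "application/json"})
+    return json.loads(urllib.request.urlopen(req, timeout=5).read())["result"]
+
+health = rpc("system_health")
+if health["peers"] < MIN_PEERS:
+    print(f"possible network partition: only {health['peers']} peers connected")
+if health.get("isSyncing"):
+    print("node is still syncing; finality will trail until sync completes")
+```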
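+
+### Restart sketch (step 3a)
+A sketch of restarting containerized validators with the Docker CLI. The container names are hypothetical; a Kubernetes deployment would instead run `kubectl rollout restart` on the affected validator's workload.
+
+```python
+# Restart sketch: restarts offline validators via the Docker CLI.
+import subprocess
+
+OFFLINE_VALIDATORS = ["validator-2", "validator-5"]  # from the diagnosis step
+
+for name in OFFLINE_VALIDATORS:
+    # `docker restart` stops and starts the container in one step.
+    result = subprocess.run(["docker", "restart", name],
+                            capture_output=True, text=True)
+    if result.returncode == 0:
+        print(f"{name}: restarted")
+    else:
+        print(f"{name}: restart failed: {result.stderr.strip()}")
+```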
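+
+### Resync sketch (step 3b)
+A sketch of the database-reset approach for a containerized node. The container name and `DB_DIR` are assumptions; never delete a directory you have not confirmed can be rebuilt from the network.
+
+```python
+# Resync sketch: stops a containerized node, clears its database directory,
+# and restarts it so it rebuilds state from healthy peers.
+import shutil
+import subprocess
+from pathlib import Path
+
+NODE_CONTAINER = "fullnode-3"         # hypothetical container name
+DB_DIR = Path("/data/fullnode-3/db")  # hypothetical database path
+
+subprocess.run(["docker", "stop", NODE_CONTAINER], check=True)
+if DB_DIR.exists():
+    shutil.rmtree(DB_DIR)  # reset the node's database
+subprocess.run(["docker", "start", NODE_CONTAINER], check=True)
+print(f"{NODE_CONTAINER} restarted with a fresh database; resync in progress")
+```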
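+
+### Verification sketch (step 4)
+A sketch that polls until the finality gap returns to the normal threshold or a timeout expires. It takes a callable returning the current gap in blocks, such as `finality_gap` from the metrics sketch above; the threshold and timeout values are assumptions.
+
+```python
+# Verification sketch: waits for the finality gap to close after recovery.
+import time
+
+LAG_THRESHOLD = 20   # blocks; keep consistent with the detection threshold
+TIMEOUT_S = 600      # give up after ten minutes
+POLL_INTERVAL_S = 15
+
+def wait_for_finality(finality_gap, threshold: int = LAG_THRESHOLD) -> bool:
+    """Return True once the gap closes, False if the timeout expires."""
+    deadline = time.monotonic() + TIMEOUT_S
+    while time.monotonic() < deadline:
+        gap = finality_gap()
+        print(f"current finality gap: {gap} blocks")
+        if gap <= threshold:
+            return True
+        time.sleep(POLL_INTERVAL_S)
+    return False
+```
+
+A `False` return is the cue to go back to step 2 and keep diagnosing rather than assume recovery succeeded.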