Update support tiers and incident response #32

Merged
merged 2 commits into from
Jan 20, 2025
4 changes: 2 additions & 2 deletions content/channels.md
@@ -1,9 +1,9 @@
# Support Channels

- Stakater provides support via these channels depending on the type of request:
+ Stakater provides support via these channels depending on the support tier and the severity of the request:

* Email: For documenting and providing detailed updates and progress reports
- * Phone calls and video meetings: When immediate or more personal communication is necessary
+ * Video and phone: When immediate or more personal communication is necessary
* Chat: For quick, real-time interactions during troubleshooting, foremost Slack
* [Service desk portal](https://stakater-cloud.atlassian.net/servicedesk/customer/portals): Where you can log in at any time to view the latest updates, add information or ask questions
* Onsite visit: When physical presence is required
30 changes: 30 additions & 0 deletions content/index.md
@@ -9,3 +9,33 @@ Responses are provided on a best-effort basis during the same or next business d
## Our Promise

We are here to support you, not just until the problem is solved, but to ensure that your experience is as seamless as possible. Your success is our success and through our ongoing response efforts, we pledge to uphold the highest standards of customer service and satisfaction.

## Priorities

You as a customer can set the initial priority for a Request by specifying the appropriate priority: `Critical`, `High`, `Medium` or `Low`. The Engineer on Duty has the right to adjust it at their own discretion based on the rules below:

Request Priority | Description of the Request Priority
--- | ---
`Critical` | Large-scale failure or complete unavailability of OpenShift or Customer's business application deployed on OpenShift. The `Critical` priority will be lowered to `High` if there is a workaround for the problem. Example: Router availability issues, synthetic monitoring availability issues.
`High` | Partial degradation of OpenShift core functionality or Customer's business application functionality with potential adverse impact on long-term performance. The `High` priority will be lowered to `Medium` if there is a workaround for the problem. Example: Node Group and Control Plane availability problems.
`Medium` | Partial, non-critical loss of functionality of OpenShift or the Customer's business application. This category also includes major bugs in OpenShift that affect some aspects of the Customer's operations and have no known solutions. The `Medium` priority will be lowered to `Low` if there is a workaround for the problem. If the Request does not have a priority set by the Customer, it will be assigned the default priority `Medium`. Example: Problems with monitoring availability and Pod autoscaling.
`Low` | This category includes: Requests for information and other matters, requests regarding extending the functionality of the Kubernetes Platform, performance issues that have no effect on functionality, Kubernetes platform flaws with known solutions or moderate impact on functionality. Example: Issues with extension availability.
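
The adjustment rules above (the `Medium` default, and lowering a priority one level when a workaround exists) can be sketched as follows. This is an illustrative model only, not Stakater's actual ticketing logic; the names are hypothetical:

```python
from enum import IntEnum
from typing import Optional

class Priority(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

def effective_priority(requested: Optional[Priority], has_workaround: bool) -> Priority:
    # A Request without a customer-set priority defaults to Medium
    priority = requested if requested is not None else Priority.MEDIUM
    # Each priority is lowered one level when a workaround exists
    if has_workaround and priority > Priority.LOW:
        priority = Priority(priority - 1)
    return priority
```

For example, a `Critical` request with a known workaround is handled as `High`, not lowered further.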

## Support Tiers

Stakater offers three support tiers, as described in the table below.

| | Essential | Advanced | Premium |
| - | - | - | - |
| Use case | Basic minimum support | Development support | Production and critical workload support |
| Support hours | 24x5 | 24x7 | 24x7 |
| Modes of support | Ticket | Ticket, Video | Ticket, Video, Chat, Phone |
| Support response team | Regular | Specialized | Dedicated |
| Recommendations for improvements | No | No | Yes |
| Training and enablement sessions | No | No | Yes |
| Technical Account Manager (TAM) | No | No | Yes |
| Key Account Manager (KAM) | No | No | Yes |
| Ticket response times - Critical | 12h | 2h | 1h |
| Ticket response times - High | 12h | 4h | 2h |
| Ticket response times - Medium | 24h | 8h | 4h |
| Ticket response times - Low | 24h | 24h | 24h |
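
The ticket response times in the table amount to a lookup by tier and priority; a minimal sketch, transcribing the table above (function and constant names are hypothetical):

```python
# Ticket response times in hours, by support tier and request priority
RESPONSE_TIME_HOURS = {
    "Essential": {"Critical": 12, "High": 12, "Medium": 24, "Low": 24},
    "Advanced":  {"Critical": 2,  "High": 4,  "Medium": 8,  "Low": 24},
    "Premium":   {"Critical": 1,  "High": 2,  "Medium": 4,  "Low": 24},
}

def response_deadline_hours(tier: str, priority: str) -> int:
    """Maximum initial ticket response time for a given tier and priority."""
    return RESPONSE_TIME_HOURS[tier][priority]
```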
87 changes: 87 additions & 0 deletions content/irm.md
@@ -0,0 +1,87 @@
# Incident Response Management

The SRE team at Stakater has both the responsibility and the authority to resolve incidents.

Incidents are anomalous conditions that result in — or may lead to — service degradation or outages. These events may require human intervention to avert disruptions or restore service to operational status. Incidents should always be given immediate attention.

Stakater's incident management system (IMS) is based on [Google's IMS](https://sre.google/sre-book/managing-incidents/) which in turn is based on the [Incident Command System](https://www.fema.gov/emergency-managers/nims).

The goal of incident management is to organize chaos into swift incident resolution. To that end, incident management provides:

1. Well-defined roles and responsibilities and workflow for members of the incident team
1. Control points to manage the flow of information and the resolution path
1. A root cause analysis where follow-up actions, lessons, and techniques are extracted and shared

## Tools

Tools used to facilitate incident management at Stakater:

* `Alertmanager` - for creating alerts from Prometheus
* `Grafana OnCall` - for paging of alerts
* `Slack` - for asynchronous communication
* `Google Meet` - for synchronous communication

## Incident Ownership

By default, the SRE on-call is the owner of the incident.

## Roles and Responsibilities

Clear role responsibilities are important during an incident. Quick resolution requires focus and a clear hierarchy for delegating tasks. The focus of incident response should be on resolving the incident, not on resolving confusion about who should do what: clear roles and responsibilities prevent confusion around accountability when an incident actually happens.

The three main roles in incident response are:

1. Incident Commander (IC) - leads the incident response
* Commands and coordinates the incident response
* Assumes all roles that have not been delegated yet
* Communicates effectively
* Escalates alerts: Notifies the team until someone acknowledges the alert and takes on the CL role
1. Communications Lead (CL) - reports to the IC
* Public face of the incident response team
* Provides periodic updates to customers and the incident response team
* Manages inquiries about the incident
1. Operations or Ops Lead (OL) - reports to the IC
* Responds to the incident by applying operational tools to mitigate or resolve the incident

One person can hold one or multiple roles. What matters most is that every role is filled, since all of them are needed to deal with an incident effectively.

```mermaid
flowchart TD
CL --> |Gathers incident response status|IC
CL --> |Updates customer|Customer
CL --> |Updates internal team|Team
OL --> |Assists in the incident response|IC
classDef incident fill:#f00,color:white
IC --> |Leads the incident response|Incident:::incident
```

## SOP (Standard Operating Procedure) for an Incident

An incident should be declared if any of the following is true:

* The incident affects customers
* The incident affects the customer SLA

To resolve an incident:

1. Make the SRE on-call aware of the incident
1. Assign incident management roles
1. IC defines the incident in terms of:
* Impact
* Frequency
* Severity
1. IC creates an `Incident` ticket for the incident in the Stakater ticket system
1. CL informs the customer and provides hourly updates on the progress
* Inform customer in external customer Slack channel
* Inform customer via email and add their manager on CC
1. IC and OL begin investigating why it happened
* Always replicate issues with an incognito user to avoid using cached content
1. IC and OL begin addressing it by involving other teams
1. IC and OL hand over ownership if their shifts end
1. CL creates a document to start analyzing the root cause
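
The declaration criteria and role assignment steps above can be sketched in a few lines. This is a hypothetical illustration, not Stakater's actual incident tooling, and all names are invented:

```python
from dataclasses import dataclass, field

def should_declare_incident(affects_customers: bool, affects_sla: bool) -> bool:
    """Declare an incident if either declaration criterion holds."""
    return affects_customers or affects_sla

@dataclass
class Incident:
    description: str
    # The IC defines the incident in these terms
    impact: str = ""
    frequency: str = ""
    severity: str = ""
    roles: dict = field(default_factory=dict)  # e.g. {"IC": "alice"}

ALL_ROLES = ("IC", "CL", "OL")

def unassigned_roles(incident: Incident) -> list:
    # Per the roles section, the IC assumes any role not yet delegated
    return [r for r in ALL_ROLES if r not in incident.roles]
```

For example, once an IC is assigned, the CL and OL roles remain with the IC until delegated.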

To do a post-mortem of an incident:

1. CL informs the customer that the incident is resolved
1. IC schedules a root cause analysis meeting, where everyone involved attends and collaboratively fills out the incident document
1. IC creates sub-tasks in the incident ticket for follow-up actions
20 changes: 0 additions & 20 deletions content/responsetimes.md
@@ -1,25 +1,5 @@
# Response Times

## Priorities

You as a customer can set the initial priority for a Request by specifying the appropriate priority: `Critical`, `High`, `Medium` or `Low`. The Engineer on Duty has the right to adjust it at their own discretion based on the rules below:

Request Priority | Description of the Request Priority
--- | ---
`Critical` | Large-scale failure or complete unavailability of OpenShift or Customer's business application deployed on OpenShift. The `Critical` priority will be lowered to `High` if there is a workaround for the problem. Example: Router availability issues, synthetic monitoring availability issues.
`High` | Partial degradation of OpenShift core functionality or Customer's business application functionality with potential adverse impact on long-term performance. The `High` priority will be lowered to `Medium` if there is a workaround for the problem. Example: Node Group and Control Plane availability problems.
`Medium` | Partial, non-critical loss of functionality of OpenShift or the Customer's business application. This category also includes major bugs in OpenShift that affect some aspects of the Customer's operations and have no known solutions. The `Medium` priority will be lowered to `Low` if there is a workaround for the problem. This priority is assigned to Requests by default. If the Request does not have an priority set by the Customer, it will be assigned the default priority `Medium`. Example: Problems with the monitoring availability and Pod autoscaling.
`Low` | This category includes: Requests for information and other matters, requests regarding extending the functionality of the Kubernetes Platform, performance issues that have no effect on functionality, Kubernetes platform flaws with known solutions or moderate impact on functionality. Example: Issues with extension availability.

## Production Support Terms of Service

Request Priority | Initial Response Time | Ongoing Response
--- | --- | ---
`Critical` | 2 business hours | 2 business hours or as agreed
`High` | 4 business hours | 4 business hours or as agreed
`Medium` | 2 business day | 2 business days or as agreed
`Low` | 5 business days | 5 business days or as agreed

## Initial Response Time and Our Commitment to You

Once we've addressed your initial support request, our commitment doesn't end there. We believe in providing continuous support to ensure that your issue is resolved satisfactorily. Our ongoing response framework is designed to keep you informed and confident that we're working diligently to address your needs.
1 change: 1 addition & 0 deletions theme_override/mkdocs.yml
@@ -12,3 +12,4 @@ nav:
- signup.md
- channels.md
- responsetimes.md
- irm.md