Update support tiers and incident response #32

Merged
merged 2 commits into from
Jan 20, 2025
4 changes: 2 additions & 2 deletions content/channels.md
@@ -1,9 +1,9 @@
# Support Channels

- Stakater provides support via these channels depending on the type of request:
+ Stakater provides support via these channels depending on the support tier and the severity of the request:

* Email: For documenting and providing detailed updates and progress reports
- * Phone calls and video meetings: When immediate or more personal communication is necessary
+ * Video and phone: When immediate or more personal communication is necessary
* Chat: For quick, real-time interactions during troubleshooting, foremost Slack
* [Service desk portal](https://stakater-cloud.atlassian.net/servicedesk/customer/portals): Where you can log in at any time to view the latest updates, add information or ask questions
* Onsite visit: When physical presence is required
30 changes: 30 additions & 0 deletions content/index.md
@@ -9,3 +9,33 @@ Responses are provided on a best-effort basis during the same or next business d
## Our Promise

We are here to support you, not just until the problem is solved, but to ensure that your experience is as seamless as possible. Your success is our success and through our ongoing response efforts, we pledge to uphold the highest standards of customer service and satisfaction.

## Priorities

You as a customer can set the initial priority for a Request by specifying the appropriate priority: `Critical`, `High`, `Medium` or `Low`. The Engineer on Duty has the right to adjust it at their own discretion based on the rules below:

Request Priority | Description of the Request Priority
--- | ---
`Critical` | Large-scale failure or complete unavailability of OpenShift or Customer's business application deployed on OpenShift. The `Critical` priority will be lowered to `High` if there is a workaround for the problem. Example: Router availability issues, synthetic monitoring availability issues.
`High` | Partial degradation of OpenShift core functionality or Customer's business application functionality with potential adverse impact on long-term performance. The `High` priority will be lowered to `Medium` if there is a workaround for the problem. Example: Node Group and Control Plane availability problems.
`Medium` | Partial, non-critical loss of functionality of OpenShift or the Customer's business application. This category also includes major bugs in OpenShift that affect some aspects of the Customer's operations and have no known solutions. The `Medium` priority will be lowered to `Low` if there is a workaround for the problem. If the Request does not have a priority set by the Customer, it will be assigned the default priority `Medium`. Example: Problems with monitoring availability and Pod autoscaling.
`Low` | This category includes: Requests for information and other matters, requests regarding extending the functionality of the Kubernetes Platform, performance issues that have no effect on functionality, Kubernetes platform flaws with known solutions or moderate impact on functionality. Example: Issues with extension availability.
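
The adjustment rules above (the `Medium` default, and lowering a priority one level when a workaround exists) can be sketched as follows. This is an illustrative model only, not Stakater's actual ticketing logic; the names are hypothetical:

```python
from enum import IntEnum
from typing import Optional

class Priority(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

def effective_priority(requested: Optional[Priority], has_workaround: bool) -> Priority:
    # A Request without a customer-set priority defaults to Medium
    priority = requested if requested is not None else Priority.MEDIUM
    # Each priority is lowered one level when a workaround exists
    if has_workaround and priority > Priority.LOW:
        priority = Priority(priority - 1)
    return priority
```

For example, a `Critical` request with a known workaround is handled as `High`, not lowered further.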

## Support Tiers

Stakater offers three support tiers, as described in the table below.

| | Essential | Advanced | Premium |
| - | - | - | - |
| Use case | Basic minimum support | Development support | Production and critical workload support |
| Support hours | 24x5 | 24x7 | 24x7 |
| Modes of support | Ticket | Ticket, Video | Ticket, Video, Chat, Phone |
| Support response team | Regular | Specialized | Dedicated |
| Recommendations for improvements | No | No | Yes |
| Training and enablement sessions | No | No | Yes |
| Technical Account Manager (TAM) | No | No | Yes |
| Key Account Manager (KAM) | No | No | Yes |
| Ticket response times - Critical | 12h | 2h | 1h |
| Ticket response times - High | 12h | 4h | 2h |
| Ticket response times - Medium | 24h | 8h | 4h |
| Ticket response times - Low | 24h | 24h | 24h |
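
The ticket response times in the table amount to a lookup by tier and priority; a minimal sketch, transcribing the table above (function and constant names are hypothetical):

```python
# Ticket response times in hours, by support tier and request priority
RESPONSE_TIME_HOURS = {
    "Essential": {"Critical": 12, "High": 12, "Medium": 24, "Low": 24},
    "Advanced":  {"Critical": 2,  "High": 4,  "Medium": 8,  "Low": 24},
    "Premium":   {"Critical": 1,  "High": 2,  "Medium": 4,  "Low": 24},
}

def response_deadline_hours(tier: str, priority: str) -> int:
    """Maximum initial ticket response time for a given tier and priority."""
    return RESPONSE_TIME_HOURS[tier][priority]
```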
87 changes: 87 additions & 0 deletions content/irm.md
@@ -0,0 +1,87 @@
# Incident Response Management

The SRE team at Stakater has both the responsibility and the authority to resolve incidents.

Incidents are anomalous conditions that result in — or may lead to — service degradation or outages. These events may require human intervention to avert disruptions or restore service to operational status. Incidents should always be given immediate attention.

Stakater's incident management system (IMS) is based on [Google's IMS](https://sre.google/sre-book/managing-incidents/) which in turn is based on the [Incident Command System](https://www.fema.gov/emergency-managers/nims).

The goal of incident management is to organize chaos into swift incident resolution. To that end, incident management provides:

1. Well-defined roles and responsibilities and workflow for members of the incident team
1. Control points to manage the flow of information and the resolution path
1. A root cause analysis where follow-up actions, lessons, and techniques are extracted and shared

## Tools

Tools used to facilitate incident management at Stakater:

* `Alertmanager` - for creating alerts from Prometheus
* `Grafana OnCall` - for paging of alerts
* `Slack` - for asynchronous communication
* `Google Meet` - for synchronous communication

## Incident Ownership

By default, the SRE on-call is the owner of the incident.

## Roles and Responsibilities

Clear role responsibilities are important during an incident. Quick resolution requires focus and a clear hierarchy for delegating tasks. The focus of incident response should be on resolving the incident, not on resolving confusion about who should do what: clear roles and responsibilities prevent confusion around accountability when an incident actually happens.

The three main roles in incident response are:

1. Incident Commander (IC) - leads the incident response
* Commands and coordinates the incident response
* Assumes all roles that have not been delegated yet
* Communicates effectively
* Escalates alerts: Notifies the team until someone acknowledges the alert and takes on the CL role
1. Communications Lead (CL) - reports to the IC
* Public face of the incident response team
* Provides periodic updates to customers and the incident response team
* Manages inquiries about the incident
1. Operations or Ops Lead (OL) - reports to the IC
* Responds to the incident by applying operational tools to mitigate or resolve the incident

One person can hold one or multiple roles. What matters most is that every role is filled, since all of them are needed to deal with an incident effectively.

```mermaid
flowchart TD
CL --> |Gathers incident response status|IC
CL --> |Updates customer|Customer
CL --> |Updates internal team|Team
OL --> |Assists in the incident response|IC
classDef incident fill:#f00,color:white
IC --> |Leads the incident response|Incident:::incident
```

## SOP (Standard Operating Procedure) for an Incident

An incident should be declared if any of the following is true:

* The incident affects customers
* The incident affects the customer SLA

To resolve an incident:

1. Make the SRE on-call aware of the incident
1. Assign incident management roles
1. IC defines the incident in terms of:
* Impact
* Frequency
* Severity
1. IC creates an `Incident` ticket for the incident in the Stakater ticket system
1. CL informs the customer and provides hourly updates on the progress
* Inform customer in external customer Slack channel
* Inform customer via email and add their manager on CC
1. IC and OL begin investigating why it happened
* Always replicate issues with an incognito user to avoid using cached content
1. IC and OL begin addressing it by involving other teams
1. IC and OL hand over ownership if their shifts end
1. CL creates a document to start analyzing the root cause
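
The declaration criteria and role assignment steps above can be sketched in a few lines. This is a hypothetical illustration, not Stakater's actual incident tooling, and all names are invented:

```python
from dataclasses import dataclass, field

def should_declare_incident(affects_customers: bool, affects_sla: bool) -> bool:
    """Declare an incident if either declaration criterion holds."""
    return affects_customers or affects_sla

@dataclass
class Incident:
    description: str
    # The IC defines the incident in these terms
    impact: str = ""
    frequency: str = ""
    severity: str = ""
    roles: dict = field(default_factory=dict)  # e.g. {"IC": "alice"}

ALL_ROLES = ("IC", "CL", "OL")

def unassigned_roles(incident: Incident) -> list:
    # Per the roles section, the IC assumes any role not yet delegated
    return [r for r in ALL_ROLES if r not in incident.roles]
```

For example, once an IC is assigned, the CL and OL roles remain with the IC until delegated.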

To do a post-mortem of an incident:

1. CL informs the customer that the incident is resolved
1. IC schedules a root cause analysis meeting, where everyone involved attends and collaboratively fills out the incident document
1. IC creates sub-tasks in the incident ticket for follow-up actions
20 changes: 0 additions & 20 deletions content/responsetimes.md
@@ -1,25 +1,5 @@
# Response Times

## Priorities

You as a customer can set the initial priority for a Request by specifying the appropriate priority: `Critical`, `High`, `Medium` or `Low`. The Engineer on Duty has the right to adjust it at their own discretion based on the rules below:

Request Priority | Description of the Request Priority
--- | ---
`Critical` | Large-scale failure or complete unavailability of OpenShift or Customer's business application deployed on OpenShift. The `Critical` priority will be lowered to `High` if there is a workaround for the problem. Example: Router availability issues, synthetic monitoring availability issues.
`High` | Partial degradation of OpenShift core functionality or Customer's business application functionality with potential adverse impact on long-term performance. The `High` priority will be lowered to `Medium` if there is a workaround for the problem. Example: Node Group and Control Plane availability problems.
`Medium` | Partial, non-critical loss of functionality of OpenShift or the Customer's business application. This category also includes major bugs in OpenShift that affect some aspects of the Customer's operations and have no known solutions. The `Medium` priority will be lowered to `Low` if there is a workaround for the problem. This priority is assigned to Requests by default. If the Request does not have an priority set by the Customer, it will be assigned the default priority `Medium`. Example: Problems with the monitoring availability and Pod autoscaling.
`Low` | This category includes: Requests for information and other matters, requests regarding extending the functionality of the Kubernetes Platform, performance issues that have no effect on functionality, Kubernetes platform flaws with known solutions or moderate impact on functionality. Example: Issues with extension availability.

## Production Support Terms of Service

Request Priority | Initial Response Time | Ongoing Response
--- | --- | ---
`Critical` | 2 business hours | 2 business hours or as agreed
`High` | 4 business hours | 4 business hours or as agreed
`Medium` | 2 business day | 2 business days or as agreed
`Low` | 5 business days | 5 business days or as agreed

## Initial Response Time and Our Commitment to You

Once we've addressed your initial support request, our commitment doesn't end there. We believe in providing continuous support to ensure that your issue is resolved satisfactorily. Our ongoing response framework is designed to keep you informed and confident that we're working diligently to address your needs.
1 change: 1 addition & 0 deletions theme_override/mkdocs.yml
@@ -12,3 +12,4 @@ nav:
- signup.md
- channels.md
- responsetimes.md
- irm.md