<!DOCTYPE html>
<html>
<head>
<title>Site Reliability Engineering (SRE)</title>
<meta charset="utf-8">
<style>
@import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
@import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
@import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
body { font-family: 'Droid Serif';font-size: 2em; }
h1, h2, h3 {
font-family: 'Yanone Kaffeesatz';
font-weight: normal;
}
.remark-code { font-size: 20px; }
.remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; }
.remark-slide-content { font-size: 35px; padding-left: 50px; padding-right: 50px; }
.remark-slide-content-small { font-size: 25px; }
.p-margin-small p { margin-bottom: 0.6em; margin-top: 0.6em; }
</style>
</head>
<body>
<textarea id="source">
class: center, middle
# Site Reliability Engineering
# SRE
---
# Agenda
1. The mental models of software operations
1. SRE Principles
1. SLI and SLO
1. Alert on SLO
1. The Calculus of Service Availability
1. Non-Abstract Large System Design
---
class: p-margin-small
# The Mental Models of Software Operations
Hard Boundary (ITIL)
You build it, you run it aka DevOps
Site Reliability Engineering SRE
---
class: p-margin-small
# Model: Hard Boundary (ITIL)
We accept that things like incidents and technical debt exist, but we try to get someone else to handle them. This is the classical Dev Department / Ops Department model.
**Strengths**: Sharp lines lead to clarity.
**Weaknesses**: Broken Feedback Loop, finger-pointing, diverged Culture, Cost Centric, poor Automation, limited Incident Response Capability, long Lead Time.
---
class: p-margin-small
# Model: You build it, you run it aka DevOps
If a team builds a thing, they run the thing. Keep all responsibility in-house.
**Strengths**: Production Ownership by Devs and Ops, good Feedback Loop, short Lead Time, end-to-end Automation.
**Weaknesses**: Poor alignment between Teams on Ops practices, Teams reinvent things, not all Solutions fit this Model, lots of skills needed in one Team (Development, Reliability, Security, Operation).
---
class: p-margin-small
# Model: Site Reliability Engineering SRE
A dedicated Role with expertise in maintaining Reliability (Design for Reliability, Implement Reliability, Change Management, Monitoring, Incident Response)
**Strengths**: Aligns with DevOps, Reliability is a feature, clear contract when to work on Reliability, one language for Reliability, specific role which owns the Knowledge, can scale.
**Weaknesses**: Hard to apply without governance on tools and practices. SRE practices do not come for free.
---
class: center, middle
**Site Reliability Engineering is an engineering discipline devoted to helping an organization sustainably achieve the appropriate level of reliability in their systems, services and products**
---
class: p-margin-small
# Is SRE kind of DevOps?
SRE provides the practices for addressing Reliability.
DevOps focuses on Lead Time and Automation, but gives no guidance on how to address Reliability.
And what is the Ops aspect? It means to **reliably** run the product and introduce change.
SRE is not a competitor to DevOps; it provides the practices to implement the Reliability aspect.
---
class: p-margin-small
# Why does Reliability matter?
If your product is down, your product has no value.
If your product misbehaves, you can lose your whole business.
---
class: p-margin-small
# Most Important SRE Principles/Practices
1. **SLI, SLO and Error Budget**
1. **Alert on SLO**
1. ~~Eliminating Toil~~
1. *Incident Response*
1. ~~Postmortem Culture~~
1. *Canarying Releases*
1. *Non-Abstract Large System Design*
---
class: center, middle
# SLI/SLO
**Service Level Indicator**
A quantifiable measure of service reliability.
With focus on how the user experiences reliability.
---
# SLI and the user happiness
.center[![Default-aligned image](images/sli_happy_unhappy.png)]
---
# SLI and the user happiness
.center[![Default-aligned image](images/sli_happy_unhappy_metric.png)]
---
# SLI and the user happiness
.center[![Default-aligned image](images/sli_happy_unhappy_metric_noise_ratio.png)]
---
class: p-margin-small
# Possible metrics for SLI
* http response code
* response time
* response data quality
---
# Definition of SLI
.center[![Default-aligned image](images/good_events_valid_events.png)]
---
# Definition of SLI: HTTP response code example
.center[![Default-aligned image](images/good_events_valid_events.png)]
**valid events** = `100` HTTP responses
**good events** = `95` HTTP responses with `code != 5xx`
**SLI** = `95 / 100` = `95%`
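---
class: p-margin-small, remark-slide-content-small
# Sketch: SLI as good events / valid events
A minimal sketch of the ratio above, assuming a response counts as good when its status code is not a 5xx. The function name `sli` and the sample status codes are illustrative only.

```python
def sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as the percentage of valid events that were good."""
    if valid_events == 0:
        return 100.0  # assumption: no traffic counts as fully reliable
    return 100.0 * good_events / valid_events


# 100 hypothetical responses: 95 good ones and 5 server errors.
status_codes = [200] * 95 + [500] * 3 + [503] * 2
good = sum(1 for code in status_codes if not 500 <= code <= 599)
print(sli(good, len(status_codes)))  # 95.0
```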
---
class: center, middle
# SLI/SLO
**Service Level Objective**
Set a reliability target for an SLI
---
# Definition of SLO
.center[![Default-aligned image](images/happy_unhappy_slo.png)]
---
class: center, middle
# Definition of SLO
The SLI HTTP response code must be at least 99%
over a rolling window of 30 days
---
class: p-margin-small
# We got an error budget
Over a rolling window of 30 days, at most `100% - SLO (99%) = 1%` of the responses may be bad.
This means that with 1'000'000 requests we can afford 10'000 bad responses.
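---
class: p-margin-small, remark-slide-content-small
# Sketch: computing the error budget
The arithmetic above can be sketched as follows. The helper names `error_budget` and `budget_remaining` and the figure of 2'500 bad requests are assumptions made for illustration.

```python
def error_budget(slo_percent: float, valid_events: int) -> float:
    """Number of bad events the SLO tolerates within the window."""
    return valid_events * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, valid_events: int, bad_events: int) -> float:
    """Share of the error budget still unspent; negative means the SLO is missed."""
    budget = error_budget(slo_percent, valid_events)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (budget - bad_events) / budget


print(error_budget(99.0, 1_000_000))             # 10000.0
print(budget_remaining(99.0, 1_000_000, 2_500))  # 0.75
```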
---
class: p-margin-small
# Act on the error budget
The error budget is actionable
When the error budget has declined to a certain point, we can prioritize reliability improvements over new features
**We just added reliability as a feature to our value stream**
The PO can treat new functionality and reliability head to head in the same model
---
class: p-margin-small
# Alert on SLI
As a consequence, we should also base our alerting on the SLI metric
But when should we page the on-call engineer?
And when should we just create an issue that can be worked on during working hours?
---
# Alert on SLO Burndown Rate
.center[![Default-aligned image](images/slo_burndown_rate.png)]
---
class: p-margin-small
# Alert on SLO Burndown Rate
If we burn 14% of the overall error budget within the last 5 min, we page somebody.
If we burn 1% over the last 6 hours, we create a ticket or notify over another channel.
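---
class: p-margin-small, remark-slide-content-small
# Sketch: page or ticket?
A rough sketch of the two rules above, assuming the error budget is counted in bad events (10'000 in the earlier example). The function names and the sample counts are hypothetical.

```python
PAGE_THRESHOLD = 14.0    # percent of the budget burned within 5 minutes
TICKET_THRESHOLD = 1.0   # percent of the budget burned within 6 hours


def budget_burned(bad_events: int, budget_events: float) -> float:
    """Percentage of the whole error budget used up by bad_events."""
    return 100.0 * bad_events / budget_events


def decide(bad_last_5min: int, bad_last_6h: int, budget_events: float) -> str:
    """Return 'page', 'ticket' or 'ok' for the two observation windows."""
    if budget_burned(bad_last_5min, budget_events) >= PAGE_THRESHOLD:
        return "page"
    if budget_burned(bad_last_6h, budget_events) >= TICKET_THRESHOLD:
        return "ticket"
    return "ok"


print(decide(1_500, 1_600, 10_000))  # page
print(decide(10, 120, 10_000))       # ticket
```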
---
class: p-margin-small
# Alert on SLO Burndown Rate
Not so simple because you have several factors:
* Detection Time (System down should alert immediately)
* Reset Time (When the SLI is back to 100%, the alert should not fire anymore)
The problem can be solved with multi-window alerting, expressed in PromQL for example.
For an example check [Alerting on SLOs like Pros](https://developers.soundcloud.com/blog/alerting-on-slos)
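---
class: p-margin-small, remark-slide-content-small
# Sketch: multi-window alerting
One possible shape of the multi-window idea, written in Python rather than PromQL: the alert fires only while both a long and a short window burn the budget too fast, so it resets soon after recovery. The threshold of 14.4 is an assumption borrowed from the multi-window examples in The SRE Workbook; the function names are illustrative.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def should_alert(long_ratio: float, short_ratio: float,
                 slo: float, threshold: float) -> bool:
    """Fire only if the long AND the short window exceed the threshold."""
    return (burn_rate(long_ratio, slo) >= threshold
            and burn_rate(short_ratio, slo) >= threshold)


# Ongoing outage: both windows see 20% errors against a 99% SLO.
print(should_alert(0.2, 0.2, slo=0.99, threshold=14.4))  # True
# Recovered: the short window is clean again, so the alert resets.
print(should_alert(0.2, 0.0, slo=0.99, threshold=14.4))  # False
```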
---
class: p-margin-small
# The Calculus of Service Availability
What is the impact of nines on our Incident Response Time?
Availability and the resulting downtime per year:
* 99% -> 87.6 h
* 99.9% -> 8.76 h
* 99.99% -> 52.6 min
* 99.999% -> 5.26 min
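---
class: p-margin-small, remark-slide-content-small
# Sketch: downtime per year
The downtime figures can be reproduced with a few lines, assuming a year of 365 days. The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


for nines in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(nines)
    print(f"{nines}% -> {hours:.2f} h ({hours * 60:.1f} min)")
```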
---
class: p-margin-small
# SLO Dependencies
A service can only be as reliable as the combined SLOs of its critical dependencies.
A critical dependency is one that brings your service down when it is down:
* network
* run infrastructure (*aaS)
* *database?*
* ~~another service~~
Critical dependencies need an extra 9
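---
class: p-margin-small, remark-slide-content-small
# Sketch: why dependencies need an extra 9
A small sketch of how critical dependencies eat into the budget, assuming they fail independently so that their availabilities multiply. The numbers (a 99.9% service with five dependencies at 99.99% each) are chosen for illustration.

```python
import math


def combined_availability(availabilities):
    """Availability of a chain of critical dependencies that fail independently."""
    return math.prod(availabilities)


service_slo = 0.999            # the service itself targets three nines
dependencies = [0.9999] * 5    # five critical dependencies with an extra 9 each

chain = combined_availability(dependencies)
budget_share = (1.0 - chain) / (1.0 - service_slo)
print(f"dependencies alone: {chain:.4%}")
print(f"share of the error budget they consume: {budget_share:.0%}")
```

Even with the extra nine, the dependencies use up roughly half of the service's error budget under these assumptions.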
---
class: p-margin-small
# Example 99.9
* 5 Critical Dependencies
* Error Budget is 526m Downtime, Dependencies eat 263m, **263m** is left
* Expect 2 Outages, so the MTTR budget per outage is 132m
* Detection Time for alert 2m
* Until oncall is ready to fix 20m
* **Remaining time for fix 110m**
---
class: p-margin-small
# Example 99.99
* 5 Critical Dependencies
* Error Budget is 52.6m Downtime, Dependencies eat 26.3m, **26.3m** is left
* Expect 2 Outages, so the MTTR budget per outage is 13.2m
* Detection Time for alert 2m
* Until oncall is ready to fix 20m
* --> **Game Over, we missed our four nines**
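---
class: p-margin-small, remark-slide-content-small
# Sketch: how much time is left for the fix?
Both examples follow the same arithmetic, sketched here under the assumptions stated on the slides: dependencies eat half of the budget, two outages per year, 2 minutes to detect and 20 minutes until on-call is ready. The function name and parameter names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, ignoring leap years


def fix_time_per_outage(slo_percent: float, dependency_share: float = 0.5,
                        outages: int = 2, detection_min: float = 2.0,
                        oncall_ready_min: float = 20.0) -> float:
    """Minutes left per outage for the actual fix; negative means SLO missed."""
    budget = (100.0 - slo_percent) / 100.0 * MINUTES_PER_YEAR
    own_budget = budget * (1.0 - dependency_share)
    per_outage = own_budget / outages
    return per_outage - detection_min - oncall_ready_min


for slo in (99.9, 99.99):
    print(f"{slo}%: {fix_time_per_outage(slo):.1f} min left for the fix")
```

With unrounded inputs the 99.9% case leaves about 109 minutes (the slide rounds up to 110), and the 99.99% case comes out negative.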
---
class: center, middle
# Calculus of Nines conclusion
Relying on humans to fix production issues cannot lead to high reliability
We need other measures
---
class: p-margin-small
# Have multiple instances of your service
First of all, you need multiple instances. The more instances you have, the less your SLI drops when one instance fails.
If 1 of your 10 instances fails, your SLI is down to 90%.
If 1 of your 100 instances fails, your SLI is down to 99%.
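---
class: p-margin-small, remark-slide-content-small
# Sketch: SLI with failed instances
The effect can be sketched in a few lines, assuming traffic is balanced evenly and every request that hits a failed instance is bad. The function name is illustrative.

```python
def sli_with_failed_instances(total: int, failed: int) -> float:
    """SLI in percent when `failed` out of `total` instances serve only errors."""
    if total <= 0 or not 0 <= failed <= total:
        raise ValueError("need total > 0 and 0 <= failed <= total")
    return 100.0 * (total - failed) / total


print(sli_with_failed_instances(10, 1))   # 90.0
print(sli_with_failed_instances(100, 1))  # 99.0
```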
---
class: p-margin-small
# Distribute your System, no Single Point of Failure
**Multi Instance, Multi Node, Multi Datacenter**
The more nines you need, the more distributed your System needs to be
But the cost of being (eventually) consistent also increases (CAP)
---
class: p-margin-small
# Roll out your changes gradually
The well-known method for rolling out changes gradually is canarying.
This can be done with everything.
Config change, code change, BIOS update ... you name it.
Again, if 1 of your 100 instances fails on your change, your SLI is down to 99%.
Automate the Rollback, based on metrics.
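---
class: p-margin-small, remark-slide-content-small
# Sketch: canary rollout with automated rollback
A simplified sketch of a metric-driven rollout, not a real deployment tool. The stage sizes, the `error_ratio_at` callback and the threshold are hypothetical; in practice the callback would query your monitoring system.

```python
def rollout(stages, error_ratio_at, max_error_ratio):
    """Widen the rollout stage by stage; roll back once the metric degrades.

    stages: increasing shares of instances running the change
    error_ratio_at: callback returning the observed error ratio at a stage
    """
    for share in stages:
        if error_ratio_at(share) > max_error_ratio:
            return ("rolled back", share)
    return ("completed", stages[-1])


stages = [0.01, 0.1, 1.0]
healthy = lambda share: 0.001
breaks_at_10_percent = lambda share: 0.05 if share >= 0.1 else 0.0

print(rollout(stages, healthy, max_error_ratio=0.01))
print(rollout(stages, breaks_at_10_percent, max_error_ratio=0.01))
```

In the second run the change is stopped at the 10% stage, so 90% of the instances never see the bad change.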
---
class: p-margin-small
# Non-Abstract Large System Design
NALSD is the ability to assess, design, and evaluate (large) systems.
Practically, NALSD combines elements of:
* capacity planning
* component isolation
* instance, data distribution
* high availability
* clustering
* graceful system degradation
---
class: p-margin-small
# Non-Abstract Large System Design
Non-Abstract, because the White Board Design has to be translated into Response Times, Network Bandwidth, IOPS, RAM Access Time, Failure Modes etc.
These numbers are then measured to verify the Design.
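---
class: p-margin-small, remark-slide-content-small
# Sketch: from whiteboard to numbers
A back-of-the-envelope sketch of what "non-abstract" can look like. All inputs (5'000 requests per second at peak, 400 requests per second per instance, 20 kB per response, one spare instance) are made-up assumptions, to be replaced by measured values.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float, spare: int = 1) -> int:
    """Instances required to serve the peak load plus spare capacity."""
    return math.ceil(peak_rps / rps_per_instance) + spare


def bandwidth_mbit_per_s(peak_rps: float, response_kb: float) -> float:
    """Outgoing bandwidth at peak: kB per response -> kbit -> Mbit per second."""
    return peak_rps * response_kb * 8 / 1000


print(instances_needed(5_000, 400))     # 14
print(bandwidth_mbit_per_s(5_000, 20))  # 800.0
```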
---
class: center, middle
# Should you do SRE?
As always, it depends.
Is reliability a problem for you?
If you are a 2-person startup, SRE is most probably not the solution.
If you are a large company with high reliability requirements for some of your services, SRE can help you.
---
class: p-margin-small
# How can you start?
You don't have to start with a full blown SRE department.
Start small and iterate:
* Try to define your first SLIs/SLOs
* Get in contact with Business/Users to align on SLO
* Start to monitor the error budget
* Alert on SLO
* Assess your Services with NALSD
---
class: center, middle
Much more to say, but not enough time
---
class: p-margin-small
# Conclusion
SRE and Reliability don't come for free
If the built-in Reliability of your Application and run Environment is sufficient, go with it.
If you need to care about Reliability, SRE gives you a good framework to treat Reliability in a predictable model.
---
class: p-margin-small
# Resources
* [The SRE Workbook](https://landing.google.com/sre/workbook/toc/)
* [The Mental Models for Reliable Software Engineering](https://www.youtube.com/watch?v=qD_qBUFkKMQ&feature=youtu.be&t=1050)
* [DevOps Versus SRE](https://devops.com/site-reliability-engineering-101-devops-versus-sre/)
* [The Calculus of Service Availability](https://queue.acm.org/detail.cfm?id=3096459)
* [Best Practices for Setting SLOs and SLIs For Modern, Complex Systems](https://blog.newrelic.com/engineering/best-practices-for-setting-slos-and-slis-for-modern-complex-systems/)
* [SLOs Are the API for Your Engineering Team](https://www.infoq.com/articles/slos-engineering-team-API/)
* [Stop Talking & Listen; Practices for Creating Effective Customer SLOs](https://www.infoq.com/presentations/slo-pitfalls-2019/)
---
class: p-margin-small
# Resources
* [Canary Analysis Service](https://queue.acm.org/detail.cfm?id=3194655)
* [Why SRE Documents Matter](https://queue.acm.org/detail.cfm?id=3283589)
* [Alerting on SLOs like Pros](https://developers.soundcloud.com/blog/alerting-on-slos)
* [https://github.com/floriankammermann/sre-presentation](https://github.com/floriankammermann/sre-presentation)
---
class: center, middle
# keep the evolution going
Get better at Reliability without compromising the Value Stream
</textarea>
<script src="https://gnab.github.io/remark/downloads/remark-latest.min.js">
</script>
<script>
var slideshow = remark.create();
</script>
</body>
</html>