monorepo/docs/internal/runbooks/_template.md

2024-07-04 18:49:39 +00:00
---
# front-matter
---
# Runbook Template
Template credit goes to [CaitieM20][]. If you haven't seen their talk on [Tackling Alert Fatigue][] yet, go give it a
watch. I've made a handful of modifications to include some content geared toward site reliability engineering.

[CaitieM20]: https://github.com/CaitieM20/Talks/blob/master/TacklingAlertFatigue/runbook.md

## General
A quick description of the service, 1 to 2 sentences max. Why does this service matter? What is its core
functionality? What features does it provide to users?
## Failure Mode and Effects Analysis
[FMEA][] is a method of failure analysis that helps teams create reliable systems and develop comprehensive on-call
response patterns.

[FMEA]: https://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis

| Service | Failure Mode | Possible Cause | Effects | Probability (P) | Severity (S) | Detection (D) | Risk |
| :-------- | :------------------- | :-------------- | :-------------------------------- | :-------------- | :------------ | :------------ | :--- |
| DockerHub | Outage / Unreachable | DockerHub DDoS attack | Cannot update or deploy extractor | remote (B) | no effect (I) | high | low |
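
To keep the Risk column consistent across rows, it can help to derive it from the other three ratings. The following is a minimal sketch, not a standard formula: the ordinal scales (`A` to `E`, `I` to `V`, `high`/`medium`/`low`), the multiplication, and the thresholds are all assumptions chosen for illustration, so replace them with whatever your team agrees on.

```python
import re

# Assumed ordinal scales; replace with your team's own definitions.
PROBABILITY = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
SEVERITY = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5}
# Detection describes how easy the failure is to notice, so "high" lowers risk.
DETECTION = {"high": 1, "medium": 2, "low": 3}


def level_code(cell: str) -> str:
    """Extract the code in parentheses, e.g. 'remote (B)' -> 'B'."""
    match = re.search(r"\(([^)]+)\)", cell)
    if match is None:
        raise ValueError(f"no level code found in {cell!r}")
    return match.group(1).strip()


def risk(probability: str, severity: str, detection: str) -> str:
    """Combine the three ratings into a coarse risk label (assumed thresholds)."""
    score = (
        PROBABILITY[level_code(probability)]
        * SEVERITY[level_code(severity)]
        * DETECTION[detection.strip().lower()]
    )
    if score <= 10:
        return "low"
    if score <= 40:
        return "moderate"
    return "high"


# The DockerHub row from the table above.
print(risk("remote (B)", "no effect (I)", "high"))  # prints low
```
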
### Production Outage Scenarios
- [Example scenario](_outage.md)
## Dashboards
Links to the Dashboards for this service.
## Alerts
Links to the alerts for this service.
Every alert should have a corresponding section below, listed in alphabetical order.
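
The alphabetical-order rule is easy to check automatically. Below is a minimal sketch of such a check, assuming alert sections are `###` headings under the `## Alerts` heading as in this template; the function names and the sample alert titles are hypothetical.

```python
def alert_titles(markdown: str) -> list[str]:
    """Collect the '###' headings that sit under the '## Alerts' section."""
    titles = []
    in_alerts = False
    for line in markdown.splitlines():
        if line.startswith("## "):
            # A new top-level section starts; track whether it is Alerts.
            in_alerts = line.strip() == "## Alerts"
        elif in_alerts and line.startswith("### "):
            titles.append(line[4:].strip())
    return titles


def out_of_order(titles: list[str]) -> list[tuple[str, str]]:
    """Return adjacent (previous, current) pairs that break alphabetical order."""
    return [
        (prev, cur)
        for prev, cur in zip(titles, titles[1:])
        if prev.casefold() > cur.casefold()
    ]


# Hypothetical runbook excerpt with two alert sections in the wrong order.
runbook = "\n".join([
    "## Alerts",
    "### High Error Rate",
    "#### Remediation Steps:",
    "### Disk Full",
    "## Deployment",
    "### Canary Deploy",
])

print(out_of_order(alert_titles(runbook)))
# prints [('High Error Rate', 'Disk Full')]
```

An empty list means the alert sections are already sorted, which makes the check easy to wire into a docs lint step.
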
### Alert Title
Alert Description: Why do we have this alert? What does it mean? What is typically the cause of this alert?
#### Impact to Customers:
How does this situation impact our customers? If the customers are not being impacted, this is a good indicator that the alert can be deleted.
#### Remediation Steps:
Checklist Manifesto-style steps for how to resolve this alert. A person who has never worked on our stack should be able to follow these steps and remediate the incident. If it cannot be remediated, include escalation steps here.
1. Do this
2. Check this graph
3. Do this thing
4. Do this other thing
5. Verify service has recovered
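
The final verification step is a good candidate for a small script, so that "recovered" means the same thing for every responder. This is a minimal sketch, assuming recovery is defined as a run of consecutive healthy probes; `wait_for_recovery`, its parameters, and the simulated probe are hypothetical, and a real probe would call your service's own health check.

```python
import time
from typing import Callable


def wait_for_recovery(
    probe: Callable[[], bool],
    required_successes: int = 3,
    max_attempts: int = 20,
    delay_seconds: float = 5.0,
) -> bool:
    """Poll `probe` until it succeeds `required_successes` times in a row."""
    streak = 0
    for attempt in range(max_attempts):
        try:
            healthy = probe()
        except Exception:
            # A probe that raises counts as an unhealthy result.
            healthy = False
        streak = streak + 1 if healthy else 0
        if streak >= required_successes:
            return True
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return False


# Simulated probe results standing in for a real health check.
simulated = iter([False, False, True, True, True])
print(wait_for_recovery(lambda: next(simulated), delay_seconds=0))  # prints True
```

Requiring several successes in a row avoids declaring recovery on a single lucky response from a service that is still flapping.
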
## Deployment
How do you deploy this service? Favor Checklist Manifesto-style lists here as well.
1. Do this thing
2. Do this other thing
3. Finally do this thing
### Canary Deploy
Instructions on how to do a canary deployment.
1. Do this canary thing
2. Do another canary task
### Rollback Deploy
Instructions on how to roll back a deploy.
1. Get the rollback build here
2. Do this thing
3. Do this other thing