Runbook Template
Template credit goes to CatieM20. If you haven't seen their talk on [Tackling Alert Fatigue][] yet, go give it a watch. I've made a handful of modifications to include some content geared toward site reliability engineering.
General
A quick description of the service, 1 to 2 sentences max. Why does this service matter? What is its core functionality? What features does it provide to users?
Failure Mode and Effect Analysis
FMEA is a method of failure analysis that helps teams create reliable systems and develop comprehensive on-call response patterns.
Service | Failure Mode | Possible Cause | Effects | Probability (P) | Severity (S) | Detection (D) | Risk |
---|---|---|---|---|---|---|---|
DockerHub | Outage / Unreachable | DockerHub hit by a DDoS attack | Cannot update or deploy extractor | remote (B) | no effect (I) | high | low |
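The Risk column can be derived from the three ratings instead of being assigned by feel. Below is a minimal sketch of one way to do that, not a standard formula. The numeric scales, the extra rating labels beyond those in the example row, and the `low`/`moderate`/`high` thresholds are all assumptions to replace with your team's own conventions.

```python
from dataclasses import dataclass

# Assumed ordinal scales. Only "remote (B)", "no effect (I)" and "high"
# appear in the example row; every other label is a placeholder.
PROBABILITY = {"extremely unlikely (A)": 1, "remote (B)": 2, "occasional (C)": 3,
               "reasonably possible (D)": 4, "frequent (E)": 5}
SEVERITY = {"no effect (I)": 1, "very minor (II)": 2, "minor (III)": 3,
            "critical (IV)": 4, "catastrophic (V)": 5}
# Detection is inverted: a failure that is easy to detect carries less risk.
DETECTION = {"high": 1, "moderate": 2, "low": 3}


@dataclass
class FailureMode:
    service: str
    failure_mode: str
    probability: str
    severity: str
    detection: str

    def score(self) -> int:
        """Multiply the three ordinal ratings into a single number."""
        return (PROBABILITY[self.probability]
                * SEVERITY[self.severity]
                * DETECTION[self.detection])

    def risk(self) -> str:
        """Bucket the score using hypothetical thresholds (maximum score is 75)."""
        value = self.score()
        if value <= 6:
            return "low"
        if value <= 20:
            return "moderate"
        return "high"


# The DockerHub row from the table above.
dockerhub = FailureMode("DockerHub", "Outage / Unreachable",
                        "remote (B)", "no effect (I)", "high")
print(dockerhub.score(), dockerhub.risk())
```

Keeping the scales in plain dictionaries makes it easy to review them alongside the table and to adjust the thresholds as the team calibrates its ratings.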
Production Outage Scenarios
Dashboards
Links to the Dashboards for this service.
Alerts
Links to the alerts for this service.
Every alert should have a corresponding section below, listed in alphabetical order.
Alert Title
Alert Description: Why do we have this alert? What does it mean? What is typically the cause of this alert?
Impact to Customers:
How does this situation impact our customers? If the customers are not being impacted, this is a good indicator that the alert can be deleted.
Remediation Steps:
Checklist Manifesto-style steps for resolving this alert. A person who has never worked on our stack should be able to follow these steps and remediate the incident. If the incident cannot be remediated, include escalation steps here.
- Do this
- Check this graph
- Do this thing
- Do this other thing
- Verify service has recovered
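The final step, verifying recovery, is easier to follow when it is a concrete check instead of a judgment call. The sketch below polls a health check until it passes several times in a row. It assumes the service exposes an HTTP health endpoint; the URL in the usage comment, the function names, and the default retry values are all hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and 4xx/5xx responses count as failures.
        return False


def wait_for_recovery(check: Callable[[], bool],
                      required_successes: int = 3,
                      max_attempts: int = 20,
                      delay_seconds: float = 5.0) -> bool:
    """Poll check() until it passes required_successes times in a row."""
    streak = 0
    for attempt in range(max_attempts):
        if check():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # a single failure resets the streak
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return False


# Example usage with a hypothetical endpoint:
# recovered = wait_for_recovery(lambda: http_ok("https://service.example.com/healthz"))
```

Requiring consecutive successes guards against declaring recovery on a single lucky response while the service is still flapping.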
Deployment
How do you deploy this service? Favor Checklist Manifesto-style lists here as well.
- Do this thing
- Do this other thing
- Finally do this thing
Canary Deploy
Instructions on how to do a canary deployment.
- Do this canary thing
- Do another canary task
Rollback Deploy
Instructions on how to roll back a deploy.
- Get the rollback build here
- Do this thing
- Do this other thing