RB-001
Failed Production Deployment
Health checks fail or key endpoint error rate spikes after release.
First 5 Minutes
- - Freeze further deployments and mark incident channel active.
- - Confirm blast radius with dashboards and synthetic checks.
- - Rollback to last known healthy release artifact.
- Escalation
- Escalate to release owner + on-call backend engineer.
- Recovery SLO
- Restore stable service within 20 minutes.
- Automation
- Automatic rollback when readiness probes fail for 3 consecutive checks.