SRE Practices: Scenario-Based Questions
11. A critical service went down unexpectedly. What steps would you take during and after the incident as an SRE?
In SRE (Site Reliability Engineering), handling incidents requires a blend of technical troubleshooting and structured coordination. A well-managed response minimizes downtime and prevents future recurrences.
🚨 During the Incident
- Declare the Incident: Assign a severity level (e.g., SEV-1 or SEV-2) and notify stakeholders.
- Assemble the Response Team: Assign roles such as Incident Commander, Communications Lead, and Subject Matter Experts.
- Start a War Room or Slack Channel: Coordinate using a dedicated, time-stamped communication channel.
- Mitigate Impact: Roll back, fail over, scale up, or serve cached content.
- Capture Timeline: Log timestamps, actions, and decisions in real time.
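The timeline step can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for an incident management tool. The names `IncidentTimeline`, `TimelineEntry`, and `log` are hypothetical, and storing timestamps in UTC is an assumption made so that entries from responders in different time zones sort consistently.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime   # when the action or decision happened (UTC)
    actor: str     # who did it, e.g. "incident-commander"
    note: str      # what was done or decided


@dataclass
class IncidentTimeline:
    incident_id: str
    severity: str
    entries: List[TimelineEntry] = field(default_factory=list)

    def log(self, actor: str, note: str, at: Optional[datetime] = None) -> TimelineEntry:
        # Default to the current UTC time so the entry is stamped in real time.
        entry = TimelineEntry(at or datetime.now(timezone.utc), actor, note)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        # Sort by timestamp so notes entered late still appear in order.
        lines = [f"{self.incident_id} ({self.severity})"]
        for e in sorted(self.entries, key=lambda e: e.at):
            stamp = e.at.strftime("%Y-%m-%dT%H:%M:%SZ")
            lines.append(f"{stamp} [{e.actor}] {e.note}")
        return "\n".join(lines)
```

Calling `log` for each action during the incident and `render` afterwards yields a chronological record that can be pasted directly into the postmortem.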
🔍 Root Cause Identification
- Use logs, metrics, and traces to identify the failure point (e.g., upstream API outage, DB saturation).
- Run kubectl, top, ps, or netstat for node-level issues.
- Check CI/CD history and audit logs for recent changes.
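Checking for recent changes usually means asking "what landed shortly before the incident started?" The sketch below shows one way to filter change records by time. It assumes the records have already been exported from CI/CD history or audit logs into memory. The `Change` type, the `changes_before` function, and the two-hour default window are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Change:
    at: datetime       # when the change landed
    source: str        # hypothetical label, e.g. "ci/cd" or "audit-log"
    description: str


def changes_before(changes: List[Change], incident_start: datetime,
                   window: timedelta = timedelta(hours=2)) -> List[Change]:
    """Return changes that landed within `window` before the incident, newest first."""
    earliest = incident_start - window
    suspects = [c for c in changes if earliest <= c.at <= incident_start]
    # Newest first: the most recent change is often the first one worth checking.
    return sorted(suspects, key=lambda c: c.at, reverse=True)
```

A change appearing in this list is only a lead to investigate, not proof of cause. Logs, metrics, and traces still need to confirm it.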
📝 After the Incident: Postmortem
- Write a Blameless Postmortem: Capture what happened, why it happened, and how to prevent recurrence.
- Include: timeline, impact, root cause, contributing factors, remediation, and follow-up tasks.
- Assign Action Items: Fix broken alerting, improve dashboards, automate failover.
- Review Openly: Share the postmortem internally with relevant teams and add it to the knowledge base.
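A small completeness check can catch a postmortem that is missing one of the sections listed above before it is shared. The sketch below assumes the postmortem is represented as a dictionary keyed by section name. That representation and the `missing_sections` helper are hypothetical.

```python
# Section names taken from the list above.
REQUIRED_SECTIONS = (
    "timeline",
    "impact",
    "root cause",
    "contributing factors",
    "remediation",
    "follow-up tasks",
)


def missing_sections(postmortem: dict) -> list:
    """Return the required sections that are absent or empty."""
    missing = []
    for section in REQUIRED_SECTIONS:
        value = postmortem.get(section)
        # Treat whitespace-only text the same as an empty section.
        if isinstance(value, str):
            value = value.strip()
        if not value:
            missing.append(section)
    return missing
```

An empty result means every required section has some content. It says nothing about the quality of that content, which still needs human review.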
✅ Best Practices
- Use an incident management tool like PagerDuty, Opsgenie, or FireHydrant.
- Automate runbooks and playbooks for common failure modes.
- Ensure alerts are actionable and prioritized (reduce alert fatigue).
🚫 Common Mistakes
- Skipping the retrospective or blaming individuals.
- Focusing on root cause analysis (RCA) before service has been restored.
- Letting temporary fixes persist without root remediation.
📌 Real-World Insight
SRE-driven organizations prioritize Mean Time to Recovery (MTTR), not just uptime. Structured incident handling builds operational resilience and trust between engineering and business stakeholders.
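MTTR is simply the average time from detection to recovery across incidents. A minimal sketch of the calculation follows, assuming each incident is recorded as a (detected, resolved) pair of timestamps. The function name `mean_time_to_recovery` and that input shape are hypothetical. Teams differ on whether the clock starts at detection or at the start of the outage, so treat the choice here as an assumption.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_recovery(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over all incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the individual recovery durations, starting from a zero timedelta.
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)
```

For example, two incidents that took 30 and 90 minutes to recover give an MTTR of 60 minutes.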