SRE Practices: Scenario-Based Questions

11. A critical service went down unexpectedly. What steps would you take during and after the incident as an SRE?

In Site Reliability Engineering (SRE), incident handling combines technical troubleshooting with structured coordination. A well-managed response shortens downtime and reduces the chance of recurrence.

🚨 During the Incident

Declare the Incident: Assign a severity level (for example, SEV-1 or SEV-2) and notify stakeholders.
Assemble the Response Team: Assign an Incident Commander, a Communications Lead, and Subject Matter Experts.
Start a War Room or Slack Channel: Coordinate in a single dedicated channel where every message is timestamped.
Mitigate Impact: Roll back the latest change, fail over, scale up, or serve cached content.
Capture Timeline: Log timestamps, actions, and decisions in real time.
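
The declaration, role assignment, and timeline steps above can be sketched as a small data structure. This is a minimal illustration, not a real tool's API: the `Incident` and `TimelineEntry` names and their fields are assumptions. In practice an incident management tool usually records this for you.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    timestamp: datetime
    actor: str
    action: str


@dataclass
class Incident:
    """Hypothetical in-memory incident record with a timestamped timeline."""
    title: str
    severity: str   # e.g. "SEV-1"
    commander: str  # the Incident Commander
    entries: List[TimelineEntry] = field(default_factory=list)

    def log(self, actor: str, action: str, at: Optional[datetime] = None) -> TimelineEntry:
        # Default to the current UTC time so entries from different
        # responders are comparable regardless of their local timezone.
        entry = TimelineEntry(at or datetime.now(timezone.utc), actor, action)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        # Sort by timestamp so late-logged entries still appear in order.
        lines = [f"[{self.severity}] {self.title} (IC: {self.commander})"]
        for e in sorted(self.entries, key=lambda e: e.timestamp):
            lines.append(f"{e.timestamp.isoformat(timespec='seconds')} {e.actor}: {e.action}")
        return "\n".join(lines)
```

Calling `incident.log("bob", "rolled back the latest deploy")` during the response produces a record that can later be pasted into the postmortem timeline.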

🔍 Root Cause Identification

Use logs, metrics, and traces to identify the failure point (e.g., an upstream API outage or database saturation).
Use kubectl to inspect pod and cluster state, and top, ps, or netstat to investigate node-level issues.
Check CI/CD history and audit logs for recent changes.
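
Checking for recent changes usually means asking which deploys landed shortly before the incident began. A minimal sketch, assuming the change history has already been exported from CI/CD or audit logs into simple records (the `Change` type, `suspect_changes` function, and two-hour lookback are illustrative choices, not a standard):

```python
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple


class Change(NamedTuple):
    deployed_at: datetime
    service: str
    description: str


def suspect_changes(
    changes: List[Change],
    incident_start: datetime,
    lookback: timedelta = timedelta(hours=2),
) -> List[Change]:
    """Return changes deployed within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.deployed_at <= incident_start]
    # Newest first: the most recent change is often the first rollback candidate.
    return sorted(recent, key=lambda c: c.deployed_at, reverse=True)
```

A change inside the window is only a suspect, not a confirmed cause; it still needs to be checked against logs, metrics, and traces.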

📝 After the Incident: Postmortem

Write a Blameless Postmortem: Capture what happened, why it happened, and how to prevent recurrence.
Include: timeline, impact, root cause, contributing factors, remediation, and follow-up tasks.
Assign Action Items: For example, fix broken alerting, improve dashboards, or automate failover.
Review Openly: Share the postmortem internally with relevant teams and add it to the knowledge base.
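
One way to keep postmortems consistent is to check each draft against the list of sections above before review. The sketch below assumes a postmortem is represented as a plain dictionary; the key names and the `missing_sections` helper are hypothetical.

```python
from typing import Any, Dict, List

# Section keys mirror the list above; the exact names are an assumption.
REQUIRED_SECTIONS = (
    "timeline",
    "impact",
    "root_cause",
    "contributing_factors",
    "remediation",
    "follow_up_tasks",
)


def missing_sections(postmortem: Dict[str, Any]) -> List[str]:
    """Return the required sections that are absent or empty in the draft."""
    missing = []
    for section in REQUIRED_SECTIONS:
        value = postmortem.get(section)
        if isinstance(value, str):
            value = value.strip()  # treat whitespace-only text as empty
        if not value:
            missing.append(section)
    return missing
```

An empty result means the draft covers every section; it says nothing about the quality of what was written.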

✅ Best Practices

Use an incident management tool like PagerDuty, Opsgenie, or FireHydrant.
Automate runbooks and playbooks for common failure modes.
Ensure alerts are actionable and prioritized (reduce alert fatigue).

🚫 Common Mistakes

Skipping the retrospective, or using it to blame individuals.
Focusing on root cause analysis (RCA) before service is restored.
Letting temporary fixes persist without addressing the root cause.

📌 Real-World Insight

SRE-driven organizations prioritize reducing Mean Time to Recovery (MTTR) rather than focusing on uptime alone. Structured incident handling builds operational resilience and trust between engineering and business stakeholders.
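
As a rough illustration of how MTTR is computed, the sketch below averages the time from detection to resolution across incidents. The function name and the choice of detection time as the starting point are assumptions; teams define the start of the recovery clock differently.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_recovery(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so sum() works on timedeltas
    )
    return total / len(incidents)
```

For example, one incident resolved in 30 minutes and another in 90 minutes give an MTTR of 60 minutes.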
