SRE Practices: Scenario-Based Questions

11. A critical service went down unexpectedly. What steps would you take during and after the incident as an SRE?

In Site Reliability Engineering (SRE), handling incidents requires both technical troubleshooting and structured coordination. A well-managed response minimizes downtime and reduces the chance of recurrence.

🚨 During the Incident

  • Declare the Incident: Assign a severity level (e.g., SEV-1 or SEV-2) and notify stakeholders.
  • Assemble the Response Team: Assign roles such as Incident Commander, Communications Lead, and Subject Matter Experts.
  • Start a War Room or Slack Channel: Coordinate using a dedicated, time-stamped communication channel.
  • Mitigate Impact: Roll back, fail over, scale up, or serve cached content.
  • Capture Timeline: Log timestamps, actions, and decisions in real time.
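
The real-time timeline described above can be kept in something as simple as an append-only log. The following is a minimal Python sketch, not a prescribed tool: the `Incident` and `TimelineEntry` classes, their fields, and the sample entries are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime   # timezone-aware timestamp
    actor: str     # who acted or decided
    note: str      # what was done or decided


@dataclass
class Incident:
    title: str
    severity: str      # e.g. "SEV-1"
    commander: str     # the Incident Commander
    timeline: List[TimelineEntry] = field(default_factory=list)

    def log(self, actor: str, note: str, at: Optional[datetime] = None) -> TimelineEntry:
        """Append an entry; the timestamp defaults to the current UTC time."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), actor, note)
        self.timeline.append(entry)
        return entry

    def render(self) -> str:
        """Return the timeline as plain text, oldest entry first."""
        lines = [f"[{self.severity}] {self.title} (IC: {self.commander})"]
        for e in sorted(self.timeline, key=lambda e: e.at):
            stamp = e.at.astimezone(timezone.utc).strftime("%H:%M:%SZ")
            lines.append(f"{stamp} {e.actor}: {e.note}")
        return "\n".join(lines)


# Hypothetical usage during an incident
incident = Incident("checkout API returning 5xx", "SEV-1", commander="alice")
incident.log("alice", "declared SEV-1, paged database on-call")
incident.log("bob", "rolled back the latest release")
print(incident.render())
```

Recording entries as structured data rather than free text means the same log can later be pasted into the postmortem timeline without reconstruction from memory.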

🔍 Root Cause Identification

  • Use logs, metrics, and traces to identify the failure point (e.g., upstream API outage, DB saturation).
  • Use kubectl to inspect cluster and pod state, and top, ps, or netstat to investigate node-level issues.
  • Check CI/CD history and audit logs for recent changes.
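
Recent changes are often a useful first suspect, so one way to work through CI/CD history and audit logs is to filter them to a window just before the incident began. This is a rough sketch that assumes the change events have already been exported into records; the `Change` type, the `suspect_changes` function, the two-hour default window, and the sample data are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Change:
    at: datetime   # when the change was applied
    kind: str      # e.g. "deploy", "config", "feature-flag"
    summary: str


def suspect_changes(changes, incident_start, lookback=timedelta(hours=2)):
    """Return changes made within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.at <= incident_start]
    return sorted(recent, key=lambda c: c.at, reverse=True)


# Hypothetical data: the incident started at 12:00 UTC
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
history = [
    Change(start - timedelta(hours=5), "deploy", "payments v41"),
    Change(start - timedelta(minutes=90), "config", "raised DB pool size"),
    Change(start - timedelta(minutes=10), "deploy", "payments v42"),
]
for change in suspect_changes(history, start):
    print(change.at.isoformat(), change.kind, change.summary)
```

A change that falls inside the window is only a candidate, not a confirmed cause; it still has to be checked against logs, metrics, and traces.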

📝 After the Incident: Postmortem

  • Write a Blameless Postmortem: Capture what happened, why it happened, and how to prevent recurrence.
  • Include: timeline, impact, root cause, contributing factors, remediation, and follow-up tasks.
  • Assign Action Items: Fix broken alerting, improve dashboards, automate failover.
  • Review Openly: Share internally with relevant teams and add it to the knowledge base.
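
The postmortem contents listed above can be treated as a checklist that is verified before the document goes out for review. The sketch below is one possible shape, assuming a simple in-memory record; `Postmortem`, `ActionItem`, and `gaps` are hypothetical names, and the rule that every action item needs an owner and a due date is an added assumption rather than something the list above requires.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    owner: str = ""
    due: str = ""   # ISO date string, e.g. "2024-07-01"


@dataclass
class Postmortem:
    title: str
    timeline: str = ""
    impact: str = ""
    root_cause: str = ""
    contributing_factors: str = ""
    remediation: str = ""
    action_items: list = field(default_factory=list)

    def gaps(self):
        """List what is still missing before the document is ready for review."""
        problems = []
        for name in ("timeline", "impact", "root_cause",
                     "contributing_factors", "remediation"):
            if not getattr(self, name).strip():
                problems.append(f"missing section: {name}")
        if not self.action_items:
            problems.append("no follow-up tasks")
        for item in self.action_items:
            if not item.owner or not item.due:
                problems.append(f"unowned or undated: {item.description}")
        return problems


# Hypothetical draft with two sections and one unassigned task
draft = Postmortem(title="Checkout outage", timeline="12:00 alert fired",
                   impact="checkout unavailable for some users")
draft.action_items.append(ActionItem("Fix broken alerting"))
for problem in draft.gaps():
    print(problem)
```

Note that the record has no field for naming a responsible person, which keeps the structure consistent with a blameless write-up.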

✅ Best Practices

  • Use an incident management tool like PagerDuty, Opsgenie, or FireHydrant.
  • Automate runbooks and playbooks for common failure modes.
  • Ensure alerts are actionable and prioritized (reduce alert fatigue).
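
One way to keep alerts actionable and prioritized is to lint alert definitions before they ship. The sketch below assumes alert rules are available as plain dictionaries; the `lint_alerts` function, the `name`/`severity`/`runbook` keys, the two-level `page`/`ticket` scheme, and the example URL are illustrative assumptions and do not reflect the schema of PagerDuty, Opsgenie, or FireHydrant.

```python
def lint_alerts(rules):
    """Return (alert name, problem) pairs for rules that are hard to act on."""
    allowed = {"page", "ticket"}   # assumed priority levels
    findings = []
    for rule in rules:
        name = rule.get("name", "<unnamed>")
        if rule.get("severity") not in allowed:
            findings.append((name, "severity must be 'page' or 'ticket'"))
        if not rule.get("runbook"):
            findings.append((name, "no runbook link"))
    return findings


# Hypothetical rule set: the second rule has no clear priority or runbook
rules = [
    {"name": "HighErrorRate", "severity": "page",
     "runbook": "https://example.com/runbooks/high-error-rate"},
    {"name": "DiskNoise", "severity": "info"},
]
for name, problem in lint_alerts(rules):
    print(f"{name}: {problem}")
```

An alert that cannot be mapped to a priority and a runbook is a candidate for removal or demotion to a dashboard, which is one practical way to reduce alert fatigue.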

🚫 Common Mistakes

  • Skipping the retrospective or blaming individuals.
  • Focusing on root cause analysis (RCA) before service has been restored.
  • Letting temporary fixes persist without root remediation.

📌 Real-World Insight

SRE-driven organizations prioritize Mean Time to Recovery (MTTR) rather than uptime alone. Structured incident handling builds operational resilience and trust between engineering and business stakeholders.
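
As a worked example of the metric, MTTR can be computed as the average time between detection and resolution across incidents. This sketch assumes recovery time is measured from detection to resolution (teams differ on where the clock starts); the `mean_time_to_recovery` function and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents lasting 30 and 90 minutes
day = datetime(2024, 1, 1)
incidents = [
    (day.replace(hour=9, minute=0), day.replace(hour=9, minute=30)),
    (day.replace(hour=14, minute=0), day.replace(hour=15, minute=30)),
]
print(mean_time_to_recovery(incidents))  # prints 1:00:00
```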