System Reliability: Scenario-Based Questions
94. What are key patterns for designing resilient distributed systems?
Distributed systems are prone to partial failures, timeouts, and unpredictable latency. Designing for resilience means planning for failure and recovery, not trying to avoid failure altogether.
Core Resilience Patterns
- Retries with Backoff: Retry failed calls with exponentially increasing delays, plus jitter so clients do not retry in lockstep
- Circuit Breakers: Stop sending traffic to broken services
- Bulkheads: Isolate resources (e.g., thread pools, containers) to limit blast radius
- Timeouts: Prevent hanging dependencies from stalling upstream systems
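The first two patterns above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the names `retry_with_backoff` and `CircuitBreaker`, the default thresholds, and the injectable `sleep` and `clock` parameters are all assumptions made for the example, and the breaker is not thread-safe.

```python
import random
import time


def retry_with_backoff(call, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds or the attempts run out."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the exponential cap.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then rejects
    calls until `reset_timeout` seconds have passed (hypothetical defaults)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call rejected")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In practice the two are usually layered: the retry loop wraps a call that goes through the breaker, so that retries stop as soon as the breaker opens.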
System-Level Strategies
- Redundancy: Multi-zone or multi-region replicas
- Health Checks & Auto-Healing: Replace unhealthy instances
- Failover Routing: Route traffic to backup services
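As a rough sketch of how health checks and failover routing fit together, the example below marks a backend unhealthy after several consecutive failed probes and routes to the first healthy backend in priority order. `HealthTracker`, `route`, and the threshold of three failures are hypothetical names and values chosen for illustration; real load balancers and orchestrators provide this behavior themselves.

```python
class HealthTracker:
    """Tracks consecutive failed health probes per backend."""

    def __init__(self, unhealthy_after=3):
        self.unhealthy_after = unhealthy_after  # assumed threshold
        self.consecutive_failures = {}

    def record(self, backend, probe_ok):
        if probe_ok:
            # One successful probe restores the backend.
            self.consecutive_failures[backend] = 0
        else:
            count = self.consecutive_failures.get(backend, 0)
            self.consecutive_failures[backend] = count + 1

    def is_healthy(self, backend):
        return self.consecutive_failures.get(backend, 0) < self.unhealthy_after


def route(backends, tracker):
    """Return the first healthy backend, in priority order."""
    for backend in backends:
        if tracker.is_healthy(backend):
            return backend
    raise RuntimeError("no healthy backend available")
```

Requiring several consecutive failures before failing over is a common way to avoid flapping on a single dropped probe.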
Test & Validation
- Inject failure via chaos engineering tools
- Define SLOs and SLAs, and alert on error budget consumption
- Run game days simulating cascading failures
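Two of the ideas above lend themselves to a short illustration: injecting faults into a call path, and computing how much error budget remains under an SLO. The helpers `with_fault_injection` and `error_budget_remaining` are hypothetical, written only for this sketch; dedicated chaos engineering tools operate at the infrastructure level rather than as in-process wrappers.

```python
import random


def with_fault_injection(fn, failure_rate, rng=random.random):
    """Wrap `fn` so that a fraction of calls fail with an injected error."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapped


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent).

    With a 99.9% SLO over 1,000,000 requests, about 1,000 failures are allowed.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else 0.0
    return 1 - failed_requests / allowed_failures
```

An alert would typically fire when the remaining budget drops below a chosen threshold, or when it is being consumed faster than the SLO window allows.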
Best Practices
- Use idempotent APIs where retries are involved
- Track latency percentiles, not just averages
- Use observability tools to understand propagation of failures
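The first two practices can be made concrete with a small sketch. `IdempotentHandler` replays the stored result when a request is retried with the same key, and `percentile` uses the nearest-rank method to show how a tail latency can differ from the mean. Both names are hypothetical, and the in-memory dictionary stands in for whatever durable store a real service would use.

```python
import math


class IdempotentHandler:
    """Runs `handler` at most once per idempotency key."""

    def __init__(self, handler):
        self.handler = handler
        self.results = {}  # in-memory only; an assumption for this sketch

    def handle(self, idempotency_key, request):
        if idempotency_key not in self.results:
            self.results[idempotency_key] = self.handler(request)
        # A retried request gets the original result, with no repeated side effect.
        return self.results[idempotency_key]


def percentile(samples, p):
    """Nearest-rank percentile of `samples` for 0 <= p <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, with 98 requests at 10 ms and two slow requests at 1000 ms and 2000 ms, the mean is about 40 ms and the median is 10 ms, while the 99th percentile is 1000 ms. The average alone would hide the slow tail.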
Common Pitfalls
- Retry storms: retries amplify traffic and cause overload
- No timeouts: callers hang indefinitely on bad dependencies
- Monolithic retry logic: hard to tune or observe per service
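One common guard against retry storms is a retry budget, which caps retries at a fixed percentage of regular request volume. The sketch below is a simplified illustration under stated assumptions: `RetryBudget` and its 10% default are hypothetical, and the counters never expire, whereas a real implementation would track them over a sliding time window.

```python
class RetryBudget:
    """Allows retries only while they stay under a share of total requests."""

    def __init__(self, retry_percent=10):
        self.retry_percent = retry_percent  # assumed default: 10% of requests
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def allow_retry(self):
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if (self.retries + 1) * 100 <= self.requests * self.retry_percent:
            self.retries += 1
            return True
        return False
```

Keeping one budget per downstream service, rather than one shared retry policy, also makes retry behavior easier to tune and observe per dependency.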
Final Insight
Resilience is an architecture choice. By combining failure isolation, retries, circuit breakers, and observability, you build systems that degrade gracefully and recover quickly.
