Resilience Engineering: Scenario-Based Questions
92. How do you apply fault injection and chaos engineering in production-grade systems?
Fault injection and chaos engineering help validate that your systems fail gracefully and recover predictably under stress. The goal is not to break things; it's to discover weaknesses before users do.
Fault Injection Techniques
- Kill processes (e.g., terminate pods, simulate crashes)
- Introduce latency, packet loss, or DNS failures
- Throttle resources (CPU, memory, disk I/O)
- Expire secrets or rotate credentials mid-run
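At the application level, the latency and crash faults listed above can be approximated with a small wrapper around a dependency call. The following is a minimal sketch, not a production tool: `inject_faults`, `InjectedFault`, and the parameter names are hypothetical and chosen only for illustration.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real failure when a fault is injected."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a callable so it can be delayed or made to fail on purpose.

    latency_s:  extra delay (seconds) added before every call
    error_rate: probability (0.0 to 1.0) that a call raises InjectedFault
    rng, sleep: injectable so tests can run deterministically
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulate a slow dependency
            if rng.random() < error_rate:
                # simulate a crashed or unreachable dependency
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical dependency call: 200 ms of added latency, 10% injected errors.
@inject_faults(latency_s=0.2, error_rate=0.1)
def fetch_profile(user_id):
    return {"id": user_id}
```

Callers of `fetch_profile` should already handle timeouts and errors; the wrapper only makes those paths easy to exercise on demand.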
Tooling
- ChaosMesh: Kubernetes-native fault injection
- Gremlin: SaaS chaos platform for controlled experiments
- LitmusChaos: Declarative fault workflows
- Toxiproxy: Inject latency/packet drops between services
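As one illustration of the tooling above, Toxiproxy is driven through an HTTP API, and a latency fault is a "toxic" attached to a proxy. The sketch below only builds the request so it can be inspected without a running server. The API address `http://localhost:8474` is assumed to be the default, and the proxy name `orders_db` is hypothetical; check both against your own setup.

```python
import json
import urllib.request

# Assumption: Toxiproxy's HTTP API is listening on its default address.
TOXIPROXY_API = "http://localhost:8474"


def latency_toxic_request(proxy_name, latency_ms, jitter_ms=0, api=TOXIPROXY_API):
    """Build (but do not send) a request that adds a latency toxic to a proxy."""
    payload = {
        "type": "latency",
        "stream": "downstream",  # delay data flowing back to the client
        "toxicity": 1.0,         # apply to all connections through the proxy
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }
    return urllib.request.Request(
        url=f"{api}/proxies/{proxy_name}/toxics",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "orders_db" is a hypothetical proxy that must already exist in Toxiproxy.
request = latency_toxic_request("orders_db", latency_ms=500, jitter_ms=50)
```

Sending it with `urllib.request.urlopen(request)` requires a reachable Toxiproxy instance; keeping construction and sending separate makes the fault definition easy to review and log before it is applied.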
Controlled Experimentation
- Run in staging first with full observability
- Use blast radius controls: one instance at a time
- Define SLO-based abort conditions
- Coordinate with incident response and rollback plans
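The blast radius and abort rules above can be sketched as a small experiment loop. This is an illustrative outline under stated assumptions: `run_experiment`, its callback parameters, and the 1% `max_error_rate` threshold are hypothetical, and a real setup would read the error rate from its monitoring system.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentResult:
    targets_hit: list = field(default_factory=list)
    aborted: bool = False
    reason: str = ""


def run_experiment(targets, inject, rollback, error_rate, max_error_rate=0.01):
    """Inject a fault into one target at a time, aborting on an SLO breach.

    targets:        instances eligible for the experiment
    inject(t):      applies the fault to a single target
    rollback(t):    removes the fault from a single target
    error_rate():   returns the current error ratio (e.g. 5xx / total)
    max_error_rate: abort threshold derived from the SLO (assumed value)
    """
    result = ExperimentResult()
    for target in targets:  # blast radius: never more than one instance
        inject(target)
        result.targets_hit.append(target)
        try:
            observed = error_rate()
        finally:
            rollback(target)  # always undo the fault before moving on
        if observed > max_error_rate:
            result.aborted = True
            result.reason = (
                f"error rate {observed:.3f} exceeded "
                f"{max_error_rate:.3f} on {target}"
            )
            break  # stop the experiment; remaining targets are untouched
    return result
```

Rolling back in a `finally` block means a failing metrics query cannot leave a fault in place, which is the property an incident responder cares about most.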
Best Practices
- Log and tag all experiments (who, what, when)
- Correlate faults with downstream impact (e.g., 5xx rate)
- Automate recurring tests as part of CI/CD or weekly checks
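Tagging each experiment and correlating it with the 5xx rate can be as simple as a structured record plus a windowed calculation. The sketch below assumes request data is available as `(timestamp, status_code)` pairs; the function names and record fields are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def experiment_record(who, what, target, started_at, ended_at):
    """Describe an experiment (who, what, when) in a log-friendly form."""
    return {
        "who": who,
        "what": what,
        "target": target,
        "started_at": started_at.isoformat(),
        "ended_at": ended_at.isoformat(),
    }


def error_rate_in_window(requests, start, end):
    """Return the 5xx ratio among requests with a timestamp in [start, end].

    requests: iterable of (timestamp, status_code) pairs
    """
    statuses = [status for ts, status in requests if start <= ts <= end]
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if 500 <= s <= 599) / len(statuses)


# Illustrative data only: one 503 among four requests during the experiment.
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
end = start + timedelta(minutes=5)
sample_requests = [
    (start + timedelta(seconds=10), 200),
    (start + timedelta(seconds=20), 503),
    (start + timedelta(seconds=30), 200),
    (start + timedelta(seconds=40), 200),
    (end + timedelta(minutes=1), 500),  # outside the window, ignored
]
record = experiment_record("alice", "pod-kill", "checkout-1", start, end)
rate = error_rate_in_window(sample_requests, start, end)
```

Comparing `rate` against the same calculation for a window before the experiment gives a rough estimate of the impact attributable to the fault.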
Common Pitfalls
- Running chaos tests without monitoring, which leads to silent breakage
- No stakeholder buy-in, so experiments are seen as risky or unhelpful
- Injecting faults during active incidents or rollouts
Final Insight
Chaos engineering is not about recklessness; it's about confidence. Inject faults in a safe, observable, and measurable way to build trust in your systems' ability to weather real-world stress.
