Resilience Engineering: Scenario-Based Questions

92. How do you apply fault injection and chaos engineering in production-grade systems?

Fault injection and chaos engineering help validate that your systems fail gracefully and recover predictably under stress. The goal is not to break things; it is to discover weaknesses before users do.

🔁 Fault Injection Techniques

Kill processes (e.g., terminate pods, simulate crashes)
Introduce latency, packet loss, or DNS failures
Throttle resources (CPU, memory, disk I/O)
Expire secrets or rotate credentials mid-run
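
As a minimal in-process sketch of the latency and failure techniques above, a decorator can delay a call and make it fail with a configurable probability. The names `inject_faults`, `InjectedFault`, and `fetch_profile` are hypothetical and exist only for illustration; real experiments usually inject faults at the network or platform layer rather than in application code.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real failure when a fault is injected."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a callable so each call is delayed and may fail.

    latency_s  -- fixed delay added before every call
    error_rate -- probability (0.0 to 1.0) that a call raises InjectedFault
    rng, sleep -- injectable so tests can be deterministic and fast
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def fetch_profile(user_id):
    # Hypothetical downstream call, used only for illustration.
    return {"id": user_id, "name": "example"}


# Roughly 20% of calls fail, and every call is 50 ms slower.
flaky_fetch = inject_faults(latency_s=0.05, error_rate=0.2)(fetch_profile)
```

Calling `flaky_fetch` from the code under test shows whether its timeouts, retries, and fallbacks behave as intended when the dependency degrades.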

⚙️ Tooling

Chaos Mesh: Kubernetes-native fault injection
Gremlin: SaaS chaos platform for controlled experiments
LitmusChaos: Declarative fault workflows
Toxiproxy: TCP proxy that injects latency, timeouts, and connection resets between services

🧪 Controlled Experimentation

Run in staging first with full observability
Use blast radius controls — one instance at a time
Define SLO-based abort conditions
Coordinate with incident response and rollback plans
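
The blast radius and abort rules above can be sketched as a small experiment loop. This is an illustration under stated assumptions, not a definitive runner: `AbortCondition`, `run_experiment`, the metric keys `error_rate` and `p99_latency_ms`, and the `inject`, `rollback`, and `read_metrics` callbacks are all hypothetical stand-ins for whatever your chaos tool and monitoring stack provide.

```python
from dataclasses import dataclass


@dataclass
class AbortCondition:
    max_error_rate: float       # e.g. 0.02 aborts above 2% failed requests
    max_p99_latency_ms: float   # abort when p99 latency exceeds this value


def should_abort(metrics, cond):
    """Return True when any SLO-based threshold is breached."""
    return (metrics["error_rate"] > cond.max_error_rate
            or metrics["p99_latency_ms"] > cond.max_p99_latency_ms)


def run_experiment(targets, inject, rollback, read_metrics, cond):
    """Fault one target at a time and stop at the first SLO breach.

    Returns (status, touched), where touched lists every target that
    received a fault, including the one that triggered an abort.
    """
    touched = []
    for target in targets:
        inject(target)
        touched.append(target)
        try:
            breached = should_abort(read_metrics(), cond)
        finally:
            rollback(target)  # always restore the target, even on errors
        if breached:
            return "aborted", touched
    return "completed", touched
```

Because each target is rolled back before the next one is touched, at most one instance is degraded at any moment, which keeps the blast radius at a single instance.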

✅ Best Practices

Log and tag all experiments (who, what, when)
Correlate faults with downstream impact (e.g., 5xx rate)
Automate recurring tests as part of CI/CD or weekly checks
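
One way to make the logging and correlation practices concrete is to record each experiment as structured data and compare the 5xx rate inside its window against the rate outside it. The sketch below assumes ISO 8601 timestamps with an explicit UTC offset and a list of `(timestamp, rate)` samples exported from your metrics system; `ExperimentRecord`, `to_log_line`, and `error_rate_delta` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime


@dataclass
class ExperimentRecord:
    experiment_id: str
    owner: str        # who ran it
    fault: str        # what was injected
    target: str
    started_at: str   # when, ISO 8601 with offset, e.g. 2024-01-01T10:00:00+00:00
    ended_at: str


def to_log_line(record):
    """Serialize the record as one JSON line for the experiment log."""
    return json.dumps(asdict(record), sort_keys=True)


def error_rate_delta(samples, record):
    """Mean 5xx rate inside the experiment window minus the mean outside it.

    samples is a list of (iso_timestamp, rate) tuples. Returns None when
    either side of the comparison has no samples.
    """
    start = datetime.fromisoformat(record.started_at)
    end = datetime.fromisoformat(record.ended_at)
    inside, outside = [], []
    for ts, rate in samples:
        t = datetime.fromisoformat(ts)
        (inside if start <= t <= end else outside).append(rate)
    if not inside or not outside:
        return None
    return sum(inside) / len(inside) - sum(outside) / len(outside)
```

A clearly positive delta suggests the fault had visible downstream impact. It is only a rough signal, since other changes during the same window can also move the error rate.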

🚫 Common Pitfalls

Running chaos tests without monitoring → silent breakage
No stakeholder buy-in — seen as risky or unhelpful
Injecting faults during active incidents or rollouts
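
These pitfalls can be guarded against with a preflight check that refuses to start an experiment when the preconditions are not met. The function below is a hypothetical sketch; in practice its inputs would come from your monitoring, incident management, and deployment systems.

```python
def preflight_blockers(monitoring_healthy, active_incidents,
                       rollout_in_progress, approved_by):
    """Return the reasons an experiment must not start (empty list = go)."""
    blockers = []
    if not monitoring_healthy:
        blockers.append("monitoring is not healthy")
    if active_incidents:
        blockers.append(f"{len(active_incidents)} active incident(s)")
    if rollout_in_progress:
        blockers.append("a rollout is in progress")
    if not approved_by:
        blockers.append("no stakeholder approval recorded")
    return blockers
```

Returning every blocker at once, rather than stopping at the first, gives the person scheduling the experiment the full picture in a single run.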

📌 Final Insight

Chaos engineering is not about recklessness — it's about confidence. Inject faults in a safe, observable, and measurable way to build trust in your systems' ability to weather real-world stress.
