Chaos Engineering for Resilient AWS Architectures
A guide to implementing chaos engineering on AWS with AWS Fault Injection Simulator (FIS), Chaos Monkey, and custom scripts to test and improve system resilience and high availability.
1) Why Chaos Engineering?
Chaos engineering proactively tests system resilience by injecting controlled failures, identifying weaknesses before they cause outages. In AWS, this ensures high availability (HA) and disaster recovery (DR) systems perform under stress. Key benefits include:
- Resilience: Uncover and fix failure points in production-like environments.
- Confidence: Validate HA/DR mechanisms under real-world conditions.
- Proactivity: Address issues before they impact users.
- Compliance: Meet SLAs by proving system reliability.
This guide covers implementing chaos engineering using AWS Fault Injection Simulator (FIS), Chaos Monkey, and custom scripts, with practical examples for production systems.
2) Architecture: Chaos-Ready AWS Design
A chaos-ready architecture incorporates redundancy, monitoring, and automated recovery to withstand injected failures.
Client
 └─> Route 53 (DNS failover)
      ├─ Application Load Balancer (ALB)
      ├─ Auto Scaling Group (EC2/ECS Fargate)
      └─ RDS/DynamoDB (multi-AZ)

Chaos Engineering Tools
 ├─ AWS Fault Injection Simulator (FIS)
 ├─ Chaos Monkey (open-source failure injection)
 └─ CloudWatch (monitoring and alarms)
(health checks, auto-recovery, and chaos experiments applied)
Rule of thumb: Design for failure with multi-AZ redundancy, auto-scaling, and robust monitoring before running chaos experiments.
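Before starting experiments, it helps to confirm the redundancy you expect actually exists. Below is a minimal pre-flight sketch in Python (boto3), assuming an Auto Scaling group named my-asg; the group name is a placeholder.

import boto3

# Pre-flight sketch: verify the Auto Scaling group spans multiple AZs and is
# at its desired capacity before any failures are injected.
ASG_NAME = 'my-asg'  # placeholder group name

autoscaling = boto3.client('autoscaling')
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)['AutoScalingGroups'][0]

healthy = [i for i in group['Instances'] if i['HealthStatus'] == 'Healthy']
assert len(group['AvailabilityZones']) >= 2, 'group must span at least two AZs'
assert len(healthy) >= group['DesiredCapacity'], 'group is below desired capacity'
print(f"{ASG_NAME}: {len(healthy)} healthy instances across {group['AvailabilityZones']}")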
3) Core Chaos Engineering Tools and Techniques
3.1 AWS Fault Injection Simulator (FIS)
FIS injects controlled failures such as instance termination, network latency, or API throttling to test resilience. The template below (an input for aws fis create-experiment-template) terminates one tagged EC2 instance and stops automatically if a CloudWatch alarm fires.
{
  "description": "Terminate one tagged EC2 instance to test recovery",
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
  "tags": { "Name": "my-fis-experiment" },
  "actions": {
    "terminate-instance": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": {
        "Instances": "my-target-instances"
      }
    }
  },
  "targets": {
    "my-target-instances": {
      "resourceType": "aws:ec2:instance",
      "selectionMode": "COUNT(1)",
      "resourceTags": {
        "environment": "prod"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:my-alarm"
    }
  ]
}
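Once the template is created, experiments are started by template ID. A minimal boto3 sketch, assuming the template above has been created, with my-fis-experiment standing in for the generated template ID:

import time
import uuid
import boto3

fis = boto3.client('fis')

# Start the experiment; FIS evaluates the stop conditions while it runs.
experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),
    experimentTemplateId='my-fis-experiment',  # placeholder for the real template ID
)['experiment']

# Poll until the experiment reaches a terminal state.
while True:
    state = fis.get_experiment(id=experiment['id'])['experiment']['state']
    print(state['status'])
    if state['status'] in ('completed', 'stopped', 'failed'):
        break
    time.sleep(30)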
3.2 Chaos Monkey for Random Failures
Chaos Monkey, part of the Netflix Simian Army, randomly terminates instances to simulate failures. The illustrative configuration below (formats vary by Chaos Monkey deployment) schedules random terminations against one Auto Scaling group on weekday mornings.
# chaosmonkey.yml
schedule:
  enabled: true
  cron: "0 9 * * MON-FRI"
chaosmonkey:
  enabled: true
  termination:
    strategy: "random"
    groups:
      - name: "my-asg"
        region: "us-east-1"
        probability: 0.1
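If you are not running Chaos Monkey itself, the same random-termination idea can be sketched with boto3. The group name and probability below mirror the configuration above and are placeholders.

import random
import boto3

ASG_NAME = 'my-asg'   # placeholder group name
PROBABILITY = 0.1     # chance of terminating an instance per run

autoscaling = boto3.client('autoscaling')
ec2 = boto3.client('ec2')

if random.random() < PROBABILITY:
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )['AutoScalingGroups'][0]
    # Pick a random victim; the Auto Scaling group should replace it.
    victim = random.choice(group['Instances'])['InstanceId']
    ec2.terminate_instances(InstanceIds=[victim])
    print(f'Terminated {victim} from {ASG_NAME}')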
3.3 Custom Lambda Chaos Scripts
Custom Lambda functions can inject failures like stopping services or simulating latency.
import boto3

def lambda_handler(event, context):
    """Chaos action: stop target EC2 instances so recovery can be observed."""
    ec2 = boto3.client('ec2')
    # Allow the event payload to override the default target instance.
    instance_ids = event.get('instance_ids', ['i-1234567890abcdef0'])
    ec2.stop_instances(InstanceIds=instance_ids)
    return {
        'statusCode': 200,
        'body': f'Stopped {instance_ids} for chaos testing'
    }
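To run the function as a recurring chaos source, one option is an EventBridge schedule. A sketch, assuming the function above is deployed under the hypothetical name chaos-stop-instance:

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

FUNCTION_NAME = 'chaos-stop-instance'  # hypothetical function name
FUNCTION_ARN = 'arn:aws:lambda:us-east-1:123456789012:function:chaos-stop-instance'

# Trigger the chaos Lambda every Monday at 09:00 UTC.
rule = events.put_rule(
    Name='weekly-chaos-stop-instance',
    ScheduleExpression='cron(0 9 ? * MON *)',
    State='ENABLED',
)
events.put_targets(
    Rule='weekly-chaos-stop-instance',
    Targets=[{'Id': 'chaos-lambda', 'Arn': FUNCTION_ARN}],
)
# Allow EventBridge to invoke the function.
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId='allow-eventbridge-chaos',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'],
)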
3.4 Network Stress with FIS
FIS can simulate network latency or packet loss to test application behavior under degraded conditions. For EC2 targets it does this by running AWS-managed SSM documents such as AWSFIS-Run-Network-Latency, so the SSM agent must be installed on the instances.
{
  "description": "Inject 500 ms of network latency on 10% of tagged instances",
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
  "actions": {
    "inject-latency": {
      "actionId": "aws:ssm:send-command",
      "parameters": {
        "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-Network-Latency",
        "documentParameters": "{\"DurationSeconds\": \"300\", \"DelayMilliseconds\": \"500\", \"Interface\": \"eth0\"}",
        "duration": "PT5M"
      },
      "targets": {
        "Instances": "my-target-instances"
      }
    }
  },
  "targets": {
    "my-target-instances": {
      "resourceType": "aws:ec2:instance",
      "selectionMode": "PERCENT(10)",
      "resourceTags": {
        "environment": "prod"
      }
    }
  },
  "stopConditions": [
    { "source": "none" }
  ]
}
4) Designing Chaos Experiments
Effective chaos experiments follow a structured approach to ensure safety and value.
- Hypothesis: Define what you’re testing (e.g., “ALB failover works if an AZ fails”).
- Scope: Limit impact to specific resources or environments.
- Monitoring: Use CloudWatch to track metrics during experiments.
- Rollback: Set stop conditions to halt experiments if critical issues arise.
{
  "description": "Test ALB failover on instance termination",
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
  "actions": {
    "terminate-asg-instances": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "my-asg-instances" }
    }
  },
  "targets": {
    "my-asg-instances": {
      "resourceType": "aws:ec2:instance",
      "selectionMode": "COUNT(1)",
      "resourceTags": { "aws:autoscaling:groupName": "my-asg" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:high-error-rate"
    }
  ]
}
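After the experiment, check the hypothesis against metrics rather than intuition. A sketch, assuming the ALB publishes HealthyHostCount to AWS/ApplicationELB; the LoadBalancer and TargetGroup dimension values are placeholders for your resources.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client('cloudwatch')

# Hypothesis check: the ALB should have kept at least one healthy target
# throughout the last 10 minutes of the experiment window.
end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/ApplicationELB',
    MetricName='HealthyHostCount',
    Dimensions=[
        {'Name': 'LoadBalancer', 'Value': 'app/my-alb/1234567890abcdef'},
        {'Name': 'TargetGroup', 'Value': 'targetgroup/my-targets/1234567890abcdef'},
    ],
    StartTime=end - timedelta(minutes=10),
    EndTime=end,
    Period=60,
    Statistics=['Minimum'],
)
datapoints = stats['Datapoints']
assert datapoints, 'no HealthyHostCount data found for the experiment window'
assert min(d['Minimum'] for d in datapoints) >= 1, 'ALB lost all healthy targets'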
5) Monitoring and Observability
CloudWatch is critical for monitoring chaos experiments and validating system resilience.
- Metrics: Track latency, error rates, and instance health.
- Alarms: Set thresholds for experiment termination or alerts.
- Logs: Aggregate logs from ECS, ALB, and FIS for analysis.
{
  "AlarmName": "HighErrorRateDuringChaos",
  "MetricName": "HTTPCode_Target_5XX_Count",
  "Namespace": "AWS/ApplicationELB",
  "Dimensions": [
    { "Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef" }
  ],
  "Statistic": "Sum",
  "Period": 60,
  "EvaluationPeriods": 2,
  "Threshold": 5.0,
  "ComparisonOperator": "GreaterThanThreshold",
  "TreatMissingData": "notBreaching",
  "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:my-sns-topic"]
}
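Because the keys above match boto3's put_metric_alarm parameters, the alarm can be created directly from the file. A short sketch, assuming the JSON is saved as alarm.json (a hypothetical filename):

import json
import boto3

cloudwatch = boto3.client('cloudwatch')

# Create (or update) the alarm used as the FIS stop condition.
with open('alarm.json') as f:
    alarm = json.load(f)

cloudwatch.put_metric_alarm(**alarm)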
6) Security and Governance
Secure chaos experiments to prevent unintended impacts.
- IAM Permissions: Restrict FIS and Chaos Monkey to specific resources.
- Guardrails: Use stop conditions and experiment scopes to limit blast radius.
- Audit Logging: Track chaos experiment actions with CloudTrail.
{
  "PolicyName": "ChaosExperimentPolicy",
  "PolicyDocument": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "fis:StartExperiment",
          "ec2:TerminateInstances",
          "ssm:SendCommand"
        ],
        "Resource": [
          "arn:aws:fis:us-east-1:123456789012:experiment-template/*",
          "arn:aws:ec2:us-east-1:123456789012:instance/*",
          "arn:aws:ssm:us-east-1:123456789012:document/*"
        ],
        "Condition": {
          "StringEquals": { "aws:ResourceTag/environment": "chaos-test" }
        }
      }
    ]
  }
}
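A sketch for creating this policy and attaching it to the role that FIS assumes; the filename and role name are placeholders.

import json
import boto3

iam = boto3.client('iam')

# Load the wrapper JSON shown above and extract the policy document.
with open('chaos-experiment-policy.json') as f:
    document = json.load(f)['PolicyDocument']

policy = iam.create_policy(
    PolicyName='ChaosExperimentPolicy',
    PolicyDocument=json.dumps(document),
)
iam.attach_role_policy(
    RoleName='fis-experiment-role',  # placeholder role assumed by FIS
    PolicyArn=policy['Policy']['Arn'],
)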
7) CI/CD for Chaos Engineering
Integrate chaos experiments into CI/CD pipelines to automate resilience testing.
- Experiment Automation: Trigger FIS experiments post-deployment.
- Validation: Verify system recovery with automated tests.
- Security Scans: Ensure experiment configurations are secure.
name: chaos-engineering-pipeline
on: [push]
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy application
        run: aws ecs update-service --cluster my-ecs-cluster --service my-service --force-new-deployment
      - name: Run FIS experiment
        run: aws fis start-experiment --experiment-template-id my-fis-experiment
      - name: Validate recovery
        run: npm run test:resilience
      - name: Security scan
        run: checkov -f fis-template.json --skip-check CKV_AWS_145
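The "Validate recovery" step above is application-specific; the sketch below shows the kind of check such a test might perform (not the npm suite itself), waiting for the ECS service to return to its desired task count after the experiment.

import time
import boto3

ecs = boto3.client('ecs')

deadline = time.time() + 600  # allow up to 10 minutes for recovery
while time.time() < deadline:
    service = ecs.describe_services(
        cluster='my-ecs-cluster', services=['my-service']
    )['services'][0]
    if service['runningCount'] >= service['desiredCount']:
        print('Service recovered')
        break
    time.sleep(30)
else:
    raise RuntimeError('service did not recover within 10 minutes')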
8) Example: E-Commerce Platform Resilience
An e-commerce platform requires HA for flash sales. Chaos engineering tests include:
- FIS terminates EC2 instances in one AZ to validate ALB failover.
- Chaos Monkey randomly stops ECS tasks to test auto-scaling.
- Network latency injection to confirm reads from DynamoDB and RDS read replicas tolerate degraded conditions.
- CloudWatch alarms to monitor and halt experiments if error rates spike.
These experiments build confidence that the platform stays available during unexpected failures.
9) 30–60–90 Roadmap
Days 0–30:
• Set up AWS FIS and Chaos Monkey for basic instance termination tests.
• Configure CloudWatch alarms for experiment monitoring.
• Run initial chaos experiment in a non-production environment.
Days 31–60:
• Add network latency and API throttling experiments with FIS.
• Integrate chaos tests into CI/CD pipelines.
• Test failover and recovery in production-like staging.
Days 61–90:
• Automate weekly chaos experiments for critical services.
• Document findings and remediate identified weaknesses.
• Train team on chaos engineering best practices.
10) FAQ
Q: How do I start with chaos engineering safely?
A: Begin in a non-production environment, limit experiment scope, and use stop conditions.
Q: Can I use chaos engineering with serverless?
A: Yes, FIS supports Lambda and other serverless resources for failure injection.
Q: How often should I run chaos experiments?
A: Weekly for critical systems; monthly for stable environments.
Takeaway: Chaos engineering with AWS FIS and Chaos Monkey builds resilient architectures by proactively testing failure scenarios. Combine it with robust monitoring, automation, and guardrails to maintain high availability and reliability.