Cloud Architecture: Scenario-Based Questions
35. How do you design a highly available architecture in the cloud?
High availability (HA) means your system keeps serving requests even when individual components fail, and it is usually expressed as an uptime target such as 99.9% or 99.99%. In cloud environments, HA relies on redundancy, fault tolerance, and distributing resources across independent failure domains.
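To make an uptime target concrete, it helps to convert it into a downtime budget. The sketch below is only illustrative arithmetic; the function name and the 365-day year are assumptions, not part of any cloud provider's SLA definition.

```python
# Convert an availability target into an allowed-downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes per year a service may be down at the given target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% -> {minutes:.1f} min/year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why higher targets usually require automated failover rather than manual recovery.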
🧱 Core Principles
- Redundancy: Duplicate critical components (e.g., multiple web servers, DB replicas).
- Failover: Automatic switching to backup systems during failure (e.g., multi-AZ DB failover).
- Load Balancing: Distribute requests evenly to avoid overloading any node.
- Health Checks: Continuously monitor component status and remove unhealthy ones.
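The load balancing and health check principles above can be sketched together as a round-robin balancer that skips unhealthy backends. This is a minimal in-memory illustration, not how any managed load balancer is implemented; the class name, the backend labels, and the threshold of 3 consecutive failures are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HealthAwareBalancer:
    """Round-robin balancer that skips backends marked unhealthy."""
    backends: List[str]
    failure_threshold: int = 3  # consecutive failed checks before removal (assumed)
    _failures: Dict[str, int] = field(default_factory=dict)
    _next: int = 0

    def record_check(self, backend: str, ok: bool) -> None:
        """Record one health-check result; a success resets the failure count."""
        self._failures[backend] = 0 if ok else self._failures.get(backend, 0) + 1

    def is_healthy(self, backend: str) -> bool:
        return self._failures.get(backend, 0) < self.failure_threshold

    def pick(self) -> str:
        """Return the next healthy backend in round-robin order."""
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next % len(self.backends)]
            self._next += 1
            if self.is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy backends available")
```

A backend returns to rotation as soon as one check succeeds. Real load balancers typically also require several consecutive successes before re-adding a node, which this sketch omits.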
🏗️ Cloud Design Patterns
- Deploy across multiple Availability Zones (AZs) within a region.
- Use managed services with built-in HA (e.g., RDS Multi-AZ, DynamoDB, Cloud Spanner).
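- Design stateless services that run in Auto Scaling groups behind ALBs, keeping session state in a shared store.
- Use regional or global load balancers (e.g., AWS Application Load Balancer for regional traffic; AWS Global Accelerator or Google Cloud Load Balancing for global traffic).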
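Deploying across multiple AZs only helps if the surviving zones can carry the full load. A rough capacity sketch, assuming instances are interchangeable and that you want to tolerate the loss of exactly one AZ (the function names and AZ labels are hypothetical):

```python
import math
from typing import Dict, List


def instances_per_az(required: int, az_count: int) -> int:
    """Instances to run in each AZ so losing any one AZ still leaves `required`."""
    if az_count < 2:
        raise ValueError("need at least 2 AZs to survive the loss of one")
    return math.ceil(required / (az_count - 1))


def placement(required: int, azs: List[str]) -> Dict[str, int]:
    """Spread capacity evenly across the given AZs with one-AZ headroom."""
    per_az = instances_per_az(required, len(azs))
    return {az: per_az for az in azs}


if __name__ == "__main__":
    # Peak load needs 6 instances; compare 2-AZ and 3-AZ layouts.
    print(placement(6, ["az-a", "az-b"]))
    print(placement(6, ["az-a", "az-b", "az-c"]))
```

With two AZs you run double the required capacity; with three, only 1.5 times. That is one reason three-AZ layouts are often cheaper for the same fault tolerance.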
📦 Example Architecture
- Frontend in AWS behind an ALB, running on EC2 instances or Fargate tasks in 2+ AZs.
- RDS or Aurora with Multi-AZ failover and read replicas.
- Redis with replication and automatic failover. Memcached has no built-in replication, so spread its nodes across AZs and treat cached data as disposable.
- CI/CD to roll out updates gradually and avoid downtime.
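A quick way to reason about an architecture like this is to estimate its composite availability: redundant copies within a tier combine in parallel, and tiers on the request path combine in series. The sketch below assumes failures are independent, which real systems only approximate, and every per-node figure in it is a made-up placeholder. It needs Python 3.8+ for `math.prod`.

```python
from math import prod
from typing import Iterable


def redundant(availability: float, copies: int) -> float:
    """Availability of `copies` independent replicas where any one suffices."""
    return 1 - (1 - availability) ** copies


def serial(tiers: Iterable[float]) -> float:
    """Availability of a request path that needs every tier to be up."""
    return prod(tiers)


if __name__ == "__main__":
    # Hypothetical per-node availabilities, not vendor figures.
    web = redundant(0.99, copies=2)     # two web nodes in different AZs
    db = redundant(0.995, copies=2)     # primary plus Multi-AZ standby
    cache = redundant(0.99, copies=2)   # primary plus replica
    print(f"single-node path: {serial([0.99, 0.995, 0.99]):.4%}")
    print(f"redundant path:   {serial([web, db, cache]):.4%}")
```

The serial product also shows why every hard dependency on the request path lowers the ceiling for the whole system.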
🧰 Supporting Tools
- Route 53 / GCP Cloud DNS: For DNS-based failover across regions.
- Terraform: Codify HA setups with reusable modules.
- Monitoring: Use CloudWatch, Google Cloud Monitoring (formerly Stackdriver), or Datadog to detect and alert on downtime.
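DNS-based failover boils down to answering with the most-preferred endpoint that currently passes health checks. The following is a simplified model of that decision, not the Route 53 or Cloud DNS API; the `Endpoint` type, region labels, and fallback rule are assumptions, and the addresses come from the reserved documentation ranges.

```python
from typing import Dict, List, NamedTuple


class Endpoint(NamedTuple):
    region: str    # hypothetical region label
    address: str   # placeholder IP from a documentation range
    priority: int  # lower number = preferred


def resolve(endpoints: List[Endpoint], healthy: Dict[str, bool]) -> str:
    """Return the address of the most-preferred healthy endpoint."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    ordered = sorted(endpoints, key=lambda e: e.priority)
    for endpoint in ordered:
        if healthy.get(endpoint.region, False):
            return endpoint.address
    # Nothing reports healthy: answer with the primary rather than nothing.
    return ordered[0].address


if __name__ == "__main__":
    endpoints = [
        Endpoint("primary", "192.0.2.10", priority=1),
        Endpoint("secondary", "198.51.100.10", priority=2),
    ]
    print(resolve(endpoints, {"primary": False, "secondary": True}))
```

Keep in mind that clients cache DNS answers for the record's TTL, so failover time is bounded below by the TTL plus the health-check detection time.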
✅ Best Practices
- Test failover regularly with chaos testing or scheduled failover drills, and use blue/green deployments to release without downtime.
- Keep services loosely coupled and resilient to downstream outages.
- Use SLA-backed managed services where HA is mission-critical.
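One common way to stay resilient to downstream outages is a circuit breaker: after repeated failures, stop calling the dependency for a cool-down period and serve a fallback instead. This is a minimal single-threaded sketch under assumed defaults (3 failures, 30 seconds); production code would more likely use an established resilience library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period, then retry."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds the breaker stays open
        self._clock = clock             # injectable so tests can fake time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Run func; use fallback while the breaker is open or when func fails."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback()       # open: skip the dependency entirely
            self._opened_at = None      # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0              # a success closes the breaker
        return result
```

Typical usage wraps a remote call and returns cached or default data as the fallback, so a slow or failing dependency degrades one feature instead of the whole request.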
🚫 Common Pitfalls
- Single-AZ deployments: one zone or data center outage takes the whole system down.
- Hard dependencies on stateful services that don’t replicate well.
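- No clear RTO/RPO definitions: without an agreed Recovery Time Objective and Recovery Point Objective, acceptable downtime and data loss are only discovered during an incident.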
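Once RTO and RPO are written down, each failover drill can be checked against them. The sketch below assumes the worst-case data loss equals one full backup or replication interval; the type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RecoveryTargets:
    rto_minutes: float  # Recovery Time Objective: max time to restore service
    rpo_minutes: float  # Recovery Point Objective: max window of lost data


def evaluate_drill(targets: RecoveryTargets,
                   measured_recovery_minutes: float,
                   backup_interval_minutes: float) -> List[str]:
    """Return the objectives violated by one failover drill."""
    violations = []
    if measured_recovery_minutes > targets.rto_minutes:
        violations.append("RTO")
    # Worst case, the failure hits just before the next backup or sync,
    # so up to one full interval of writes can be lost.
    if backup_interval_minutes > targets.rpo_minutes:
        violations.append("RPO")
    return violations


if __name__ == "__main__":
    targets = RecoveryTargets(rto_minutes=15, rpo_minutes=5)
    # Drill restored service in 22 minutes with hourly backups.
    print(evaluate_drill(targets, 22, 60))
```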
📌 Real-World Insight
High availability is not just about “uptime.” It’s about designing for failure: assuming components will break, and preparing your system to recover automatically and gracefully.