AWS Observability Stack

Introduction to AWS Observability Stack

The AWS Observability Stack delivers end-to-end monitoring for distributed AWS-native systems, leveraging CloudWatch for metrics, CloudWatch Logs for centralized logging, and AWS X-Ray with OpenTelemetry for distributed tracing. This stack enables real-time insights into application performance, rapid troubleshooting, and optimization across services like Lambda, EC2, ECS, EKS, and API Gateway. It supports diverse workloads, from serverless APIs to containerized microservices, ensuring visibility into system health, latency, and errors in complex architectures.

AWS observability tools provide metrics, logs, and traces to proactively monitor and debug distributed systems.

Observability Stack Architecture Diagram

The diagram illustrates the observability workflow: AWS services emit metrics to CloudWatch, logs to CloudWatch Logs, and traces to AWS X-Ray via OpenTelemetry. CloudWatch Alarms trigger notifications via SNS or automated actions via Lambda. Data can also be exported to third-party tools like Grafana. Arrows are color-coded: blue for metrics, green for logs, orange for traces, purple for alerts, and dashed gray for external integrations.

graph TD
    %% Styling for nodes
    classDef service fill:#ff6f61,stroke:#c62828,stroke-width:2px,color:#ffffff,rx:5,ry:5;
    classDef metrics fill:#42a5f5,stroke:#1e88e5,stroke-width:2px,rx:5,ry:5;
    classDef logs fill:#2ecc71,stroke:#1b5e20,stroke-width:2px,rx:5,ry:5;
    classDef tracing fill:#fbc02d,stroke:#f9a825,stroke-width:2px,rx:5,ry:5;
    classDef alerts fill:#9b59b6,stroke:#6a1b9a,stroke-width:2px,rx:5,ry:5;
    classDef external fill:#78909c,stroke:#455a64,stroke-width:2px,rx:5,ry:5;

    %% Flow
    A[Lambda] -->|Metrics| B[(CloudWatch)]
    A -->|Logs| C[(CloudWatch Logs)]
    A -->|Traces| D[OpenTelemetry]
    E[EC2] -->|Metrics| B
    E -->|Logs| C
    E -->|Traces| D
    F[ECS] -->|Metrics| B
    F -->|Logs| C
    F -->|Traces| D
    G[API Gateway] -->|Metrics| B
    G -->|Logs| C
    G -->|Traces| D
    D -->|Sends| H[(AWS X-Ray)]
    B -->|Triggers| I[CloudWatch Alarms]
    I -->|Notifies| J[SNS/Email]
    I -->|Triggers| K[Lambda]
    C -->|Exports| L[Third-Party Tools]
    H -->|Exports| L

    %% Subgraphs for grouping
    subgraph AWS Services
        A
        E
        F
        G
    end
    subgraph Observability
        B
        C
        H
        I
        K
    end
    subgraph Tracing
        D
    end
    subgraph External
        L
    end

    %% Apply styles
    class A,E,F,G service;
    class B metrics;
    class C logs;
    class D,H tracing;
    class I,J,K alerts;
    class L external;

    %% Link styles (indices follow the order of the links in the Flow section)
    linkStyle 0,3,6,9 stroke:#405de6,stroke-width:2.5px;
    linkStyle 1,4,7,10 stroke:#2ecc71,stroke-width:2.5px;
    linkStyle 2,5,8,11,12 stroke:#ff6f61,stroke-width:2.5px;
    linkStyle 13,14,15 stroke:#9b59b6,stroke-width:2.5px;
    linkStyle 16,17 stroke:#78909c,stroke-width:2.5px,stroke-dasharray:4 4;
The observability stack integrates AWS services with external tools for comprehensive system insights.

Use Cases

The AWS Observability Stack supports various scenarios:

  • Serverless API Monitoring: Track latency and errors in API Gateway and Lambda with CloudWatch metrics and X-Ray traces.
  • Containerized Workloads: Monitor CPU/memory usage of ECS tasks or EKS pods, along with application logs for microservices.
  • Batch Processing: Analyze Step Functions workflows with CloudWatch Logs Insights for job failures.
  • Hybrid Systems: Use OpenTelemetry to trace requests across AWS and on-premises services.
  • Incident Response: Configure CloudWatch Alarms to trigger SNS notifications and Lambda for auto-remediation.
Tailored observability configurations address diverse AWS workloads and hybrid environments.
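
The incident-response case above can be sketched as a Lambda handler that reads a CloudWatch alarm notification delivered through SNS. CloudWatch places the alarm details as a JSON string in the SNS message body; the `remediate` function and the alarm names used here are hypothetical placeholders, not part of any AWS API.

```python
import json


def extract_alarms(event):
    """Return (alarm_name, new_state, reason) for each SNS record in the event."""
    results = []
    for record in event.get("Records", []):
        # CloudWatch publishes the alarm details as a JSON string in Sns.Message.
        message = json.loads(record["Sns"]["Message"])
        results.append(
            (
                message["AlarmName"],
                message["NewStateValue"],
                message.get("NewStateReason", ""),
            )
        )
    return results


def remediate(alarm_name):
    """Hypothetical remediation hook; replace with a real action."""
    return f"remediation requested for {alarm_name}"


def handler(event, context):
    """Lambda entry point: act only on alarms that entered the ALARM state."""
    actions = []
    for name, state, _reason in extract_alarms(event):
        if state == "ALARM":
            actions.append(remediate(name))
    return {"actions": actions}
```

Filtering on the ALARM state keeps the function from acting again when the same alarm later returns to OK.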

Key Components

The observability stack relies on the following AWS components:

  • CloudWatch: Collects metrics (e.g., CPU, latency, request count) and visualizes them via dashboards.
  • CloudWatch Logs: Aggregates logs from AWS services, applications, and custom sources for querying.
  • AWS X-Ray: Traces requests across distributed systems, generating service maps and latency insights.
  • OpenTelemetry: Collects traces and metrics with SDKs, supporting X-Ray and third-party tools.
  • CloudWatch Alarms: Monitors metrics against thresholds, triggering notifications or actions.
  • SNS: Delivers alerts from CloudWatch Alarms to email, SMS, or other endpoints.
  • Lambda: Processes logs, triggers actions, or enriches observability data.
  • IAM: Secures access to observability tools with granular permissions.
  • CloudTrail: Logs API calls for auditing, integrated with CloudWatch Logs for monitoring.
  • CloudWatch Logs Insights: Runs advanced queries on logs for rapid troubleshooting.
  • CloudWatch Synthetics: Simulates user interactions to monitor application availability.

Benefits of AWS Observability Stack

The observability stack provides significant advantages:

  • Holistic Visibility: Combines metrics, logs, and traces for complete system insights.
  • Proactive Issue Detection: CloudWatch Alarms and Synthetics identify problems before user impact.
  • Fast Troubleshooting: X-Ray service maps and Logs Insights pinpoint root causes.
  • Scalable Monitoring: Handles high-volume data from large-scale, distributed systems.
  • Automated Responses: Integrates with Lambda and SNS for real-time remediation.
  • Standards Compliance: OpenTelemetry's vendor-neutral APIs and export protocol keep instrumentation portable across clouds and tracing backends.
  • Cost Efficiency: Pay-per-use pricing with options to optimize sampling and retention.
  • Customizability: Supports custom metrics, annotations, and third-party integrations.

Implementation Considerations

Implementing the observability stack requires addressing key considerations:

  • Metric Selection: Choose relevant metrics (e.g., error rate, p99 latency) for each service.
  • Log Retention: Set CloudWatch Logs retention (e.g., 7 days, 30 days) to balance cost and compliance.
  • Tracing Setup: Instrument applications with OpenTelemetry or X-Ray SDKs for complete trace coverage.
  • Alarm Tuning: Configure thresholds and evaluation periods to minimize false positives (e.g., alarm only after 2 consecutive periods above 80% CPU).
  • Security Practices: Encrypt logs/traces with KMS, use least-privilege IAM roles, and restrict access.
  • Cost Optimization: Use X-Ray sampling rules, reduce log ingestion (e.g., lower log verbosity or shorter retention), and track observability spend in Cost Explorer.
  • Query Optimization: Write efficient Logs Insights queries (e.g., filter by error codes) for performance.
  • Testing Observability: Simulate failures (e.g., Lambda timeouts) to validate metrics, logs, and traces.
  • Dashboard Design: Build CloudWatch dashboards with widgets for KPIs like latency and error counts.
  • Compliance Requirements: Enable CloudTrail, encrypt data, and retain logs for audits (e.g., SOC 2, HIPAA).
  • High-Volume Systems: Use CloudWatch Contributor Insights for pattern detection in large datasets.
Optimized instrumentation and cost management ensure scalable, secure observability.
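
As a rough illustration of the alarm-tuning item (2 periods above 80% CPU), the sketch below builds the parameters for CloudWatch's PutMetricAlarm API and approximates the evaluation locally. The function names, the alarm naming scheme, and the 5-minute period are assumptions made for this example; real alarm evaluation also accounts for missing data, which this approximation ignores.

```python
def build_cpu_alarm(instance_id, topic_arn, threshold=80.0, periods=2):
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,  # seconds per evaluation period (assumed)
        "EvaluationPeriods": periods,
        "DatapointsToAlarm": periods,  # every evaluated period must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic notified on ALARM
    }


def would_alarm(datapoints, threshold, periods):
    """Approximation: True if the most recent `periods` datapoints all exceed the threshold."""
    recent = datapoints[-periods:]
    return len(recent) == periods and all(value > threshold for value in recent)
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. Requiring two breaching periods means a single CPU spike does not page anyone.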

Advanced Tracing with X-Ray and OpenTelemetry

AWS X-Ray and OpenTelemetry enable detailed tracing for distributed systems:

  • Service Maps: X-Ray visualizes service dependencies and latency, highlighting slow components.
  • Custom Subsegments: Add annotations (e.g., user ID, query type) to traces for context-aware debugging.
  • Sampling Rules: Configure sampling rules (e.g., a per-second reservoir plus 5% of additional requests) to reduce costs while keeping a representative set of traces.
  • OpenTelemetry SDKs: Use language-specific SDKs (e.g., Python, Java) to instrument applications consistently.
  • Third-Party Integration: Export OpenTelemetry data to tools like Jaeger or Grafana Tempo.
# Example: X-Ray Sampling Rule in CloudFormation
Resources:
  SamplingRule:
    Type: AWS::XRay::SamplingRule
    Properties:
      SamplingRule:
        RuleName: MySamplingRule
        Priority: 10
        FixedRate: 0.05
        ReservoirSize: 10
        ResourceARN: "*"
        ServiceName: "*"
        ServiceType: "*"
        Host: "*"
        HTTPMethod: "*"
        URLPath: "*"
        Version: 1
                
Advanced tracing with X-Ray and OpenTelemetry provides granular insights into system performance.
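
To make the reservoir and fixed-rate settings in the sampling rule more concrete, here is a minimal local simulation of that decision logic. It is a sketch of the general idea (a per-second reservoir, then a percentage of the remainder), not the X-Ray SDK's actual implementation; the class name and its interface are invented for this example.

```python
import random


class ReservoirSampler:
    """Sample up to `reservoir_size` requests per second, then `fixed_rate` of the rest."""

    def __init__(self, reservoir_size, fixed_rate, rng=None):
        self.reservoir_size = reservoir_size
        self.fixed_rate = fixed_rate
        self.rng = rng or random.Random()
        self._second = None  # wall-clock second the reservoir belongs to
        self._used = 0       # reservoir slots consumed in that second

    def should_sample(self, now):
        """Decide whether the request arriving at time `now` (epoch seconds) is traced."""
        second = int(now)
        if second != self._second:
            # A new second started: refill the reservoir.
            self._second = second
            self._used = 0
        if self._used < self.reservoir_size:
            self._used += 1
            return True
        # Reservoir exhausted: fall back to probabilistic sampling.
        return self.rng.random() < self.fixed_rate


# Mirror the rule above: ReservoirSize 10, FixedRate 0.05
sampler = ReservoirSampler(reservoir_size=10, fixed_rate=0.05)
sampled = sum(sampler.should_sample(100.0 + i / 1000) for i in range(1000))
print(f"sampled {sampled} of 1000 requests in one second")
```

For 1,000 requests arriving within one second, the expected number of traces is 10 + 0.05 × 990 ≈ 60, so low-traffic services are still traced while high-traffic ones stay inexpensive.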

CI/CD Integration for Observability

Automating observability setup with CI/CD pipelines ensures consistency:

  • IaC for Observability: Use CloudFormation, Terraform, or CDK to provision CloudWatch dashboards and alarms.
  • Pipeline Stages:
      • Code Validation: Validate IaC templates with tools like cfn-lint or tflint in CodeBuild.
      • Monitoring Deployment: Use CodePipeline to push X-Ray sampling rules or Lambda log processors.
  • Rollback Safety: Include observability checks (e.g., alarm status) in canary deployments.
# Example: CodePipeline Stage for Observability Setup
Resources:
  ObservabilityPipeline:
    Type: AWS::CodePipeline::Pipeline
    Properties:
      Name: Observability-Pipeline
      RoleArn: arn:aws:iam::123456789012:role/CodePipelineRole
      ArtifactStore:
        Type: S3
        Location: my-pipeline-bucket
      Stages:
        - Name: Source
          Actions:
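            - Name: Source
              ActionTypeId:
                Category: Source
                Owner: AWS
                Provider: CodeCommit
                Version: '1'
              Configuration:
                RepositoryName: observability-repo
                BranchName: main
              # Expose the checked-out source to later stages
              OutputArtifacts:
                - Name: SourceArtifact
        - Name: Deploy
          Actions:
            - Name: DeployCloudWatch
              ActionTypeId:
                Category: Deploy
                Owner: AWS
                Provider: CloudFormation
                Version: '1'
              Configuration:
                ActionMode: CREATE_UPDATE
                StackName: ObservabilityStack
                TemplatePath: SourceArtifact::cloudwatch-template.yaml
                # Placeholder role that CloudFormation assumes to create the stack
                RoleArn: arn:aws:iam::123456789012:role/CloudFormationDeployRole
              InputArtifacts:
                - Name: SourceArtifact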
                
CI/CD pipelines streamline observability setup with automated IaC deployments.
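
The rollback-safety check could look roughly like the following gate, run after a canary deployment. The input mirrors the `MetricAlarms` list returned by CloudWatch's DescribeAlarms API (each entry has `AlarmName` and `StateValue`); the function name and the policy for `INSUFFICIENT_DATA` are assumptions for illustration.

```python
def failing_alarms(metric_alarms, treat_insufficient_as_failure=False):
    """Return the sorted names of alarms that should block or roll back a deployment."""
    failing_states = {"ALARM"}
    if treat_insufficient_as_failure:
        # Stricter policy: a canary with no data is treated as unhealthy.
        failing_states.add("INSUFFICIENT_DATA")
    return sorted(
        alarm["AlarmName"]
        for alarm in metric_alarms
        if alarm["StateValue"] in failing_states
    )


# Example input shaped like the MetricAlarms list from DescribeAlarms
alarms = [
    {"AlarmName": "api-5xx", "StateValue": "ALARM"},
    {"AlarmName": "lambda-duration", "StateValue": "OK"},
    {"AlarmName": "queue-depth", "StateValue": "INSUFFICIENT_DATA"},
]
blocked = failing_alarms(alarms)
print("roll back" if blocked else "promote", blocked)
```

A pipeline step can exit non-zero when the returned list is non-empty, which lets the deployment tool handle the actual rollback.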

Example Configuration: CloudWatch Dashboard

Below is a CloudFormation template for a CloudWatch dashboard monitoring Lambda and API Gateway.

AWSTemplateFormatVersion: '2010-09-09'
Description: CloudWatch dashboard for Lambda and API Gateway
Resources:
  MultiServiceDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: MultiService-Monitoring
      # Literal block scalar keeps the JSON body intact as a single string
      DashboardBody: |
        {
          "widgets": [
            {
              "type": "metric",
              "x": 0,
              "y": 0,
              "width": 12,
              "height": 6,
              "properties": {
                "metrics": [
                  [ "AWS/Lambda", "Invocations", "FunctionName", "MyFunction" ],
                  [ ".", "Errors", ".", "." ],
                  [ "AWS/ApiGateway", "5XXError", "ApiName", "MyApi" ]
                ],
                "period": 300,
                "stat": "Sum",
                "region": "us-west-2",
                "title": "Lambda and API Gateway Errors"
              }
            },
            {
              "type": "metric",
              "x": 12,
              "y": 0,
              "width": 12,
              "height": 6,
              "properties": {
                "metrics": [
                  [ "AWS/Lambda", "Duration", "FunctionName", "MyFunction", { "stat": "Average" } ],
                  [ "AWS/ApiGateway", "Latency", "ApiName", "MyApi", { "stat": "Average" } ]
                ],
                "period": 300,
                "region": "us-west-2",
                "title": "Latency Metrics"
              }
            },
            {
              "type": "text",
              "x": 0,
              "y": 6,
              "width": 24,
              "height": 3,
              "properties": {
                "markdown": "## Monitoring Notes\nTrack Lambda errors and API Gateway latency for performance insights."
              }
            }
          ]
        }
Outputs:
  DashboardName:
    Value: !Ref MultiServiceDashboard
                
This dashboard monitors Lambda and API Gateway metrics with a notes widget.
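
When several dashboards share the same layout, the JSON body can be generated rather than hand-written. The sketch below assembles a body with the same widget structure as the template and checks that each widget fits the 24-column dashboard grid; the helper names and default values are assumptions for this example.

```python
import json

GRID_WIDTH = 24  # CloudWatch dashboards are laid out on a 24-column grid


def metric_widget(title, metrics, x, y, width=12, height=6,
                  region="us-west-2", stat="Sum"):
    """Return one metric widget in the dashboard body format."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": width,
        "height": height,
        "properties": {
            "metrics": metrics,
            "period": 300,
            "stat": stat,
            "region": region,
            "title": title,
        },
    }


def build_dashboard_body(widgets):
    """Validate horizontal placement and serialize the widgets to JSON."""
    for widget in widgets:
        if widget["x"] + widget["width"] > GRID_WIDTH:
            raise ValueError(f"widget at x={widget['x']} overflows the grid")
    return json.dumps({"widgets": widgets}, indent=2)


body = build_dashboard_body([
    metric_widget(
        "Lambda Errors",
        [["AWS/Lambda", "Errors", "FunctionName", "MyFunction"]],
        x=0, y=0,
    ),
    metric_widget(
        "API Latency",
        [["AWS/ApiGateway", "Latency", "ApiName", "MyApi"]],
        x=12, y=0, stat="Average",
    ),
])
print(body)
```

The resulting string can be used as the `DashboardBody` value in an IaC template or passed to the PutDashboard API.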

Example Configuration: OpenTelemetry with ECS

Below is a Terraform configuration for an ECS task with an OpenTelemetry sidecar for tracing.

provider "aws" {
  region = "us-west-2"
}

resource "aws_ecs_cluster" "my_cluster" {
  name = "my-cluster"
}

resource "aws_ecs_task_definition" "my_task" {
  family                   = "my-task"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "512"
  memory                   = "1024"
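  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn
  # Containers call X-Ray with the task role; the same role is reused here for brevity
  task_role_arn            = aws_iam_role.ecs_task_execution_role.arn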
  container_definitions    = jsonencode([
    {
      name  = "app-container"
      image = "my-app:latest"
      essential = true
      portMappings = [
        {
          containerPort = 8080
          hostPort      = 8080
        }
      ]
    },
    {
      name  = "otel-collector"
      image = "public.ecr.aws/aws-observability/aws-otel-collector:latest"
      essential = true
      environment = [
        {
          name  = "AWS_REGION"
          value = "us-west-2"
        }
      ]
    }
  ])
}

resource "aws_ecs_service" "my_service" {
  name            = "my-service"
  cluster         = aws_ecs_cluster.my_cluster.id
  task_definition = aws_ecs_task_definition.my_task.arn
  desired_count   = 1
  launch_type     = "FARGATE"
  network_configuration {
    subnets         = ["subnet-12345678"]
    security_groups = ["sg-12345678"]
  }
}

resource "aws_iam_role" "ecs_task_execution_role" {
  name = "ecs-task-execution-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ecs-tasks.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy" "ecs_task_policy" {
  name = "ecs-task-policy"
  role = aws_iam_role.ecs_task_execution_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "xray:PutTraceSegments",
          "xray:PutTelemetryRecords",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "*"
      }
    ]
  })
}
                
This ECS task includes an OpenTelemetry sidecar for tracing to X-Ray.
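
When traces flow from OpenTelemetry into X-Ray, it helps to know how the two trace ID formats relate, for example when looking up a trace ID from an application log in the X-Ray console. An OpenTelemetry trace ID is 32 hexadecimal characters, while the X-Ray format is `1-<8 hex>-<24 hex>`. The sketch below converts between the two; the function names are made up for this example. Note that X-Ray has traditionally expected the first 8 hex characters to encode the trace start time in epoch seconds, which is why an X-Ray-compatible ID generator is commonly configured in the OpenTelemetry SDK.

```python
import re

_OTEL_TRACE_ID = re.compile(r"^[0-9a-f]{32}$")
_XRAY_TRACE_ID = re.compile(r"^1-([0-9a-f]{8})-([0-9a-f]{24})$")


def otel_to_xray_trace_id(otel_trace_id):
    """Convert a 32-hex-character OpenTelemetry trace ID to X-Ray format."""
    trace_id = otel_trace_id.lower()
    if not _OTEL_TRACE_ID.match(trace_id):
        raise ValueError("expected 32 hexadecimal characters")
    # First 8 characters become the time part, the remaining 24 the unique part.
    return f"1-{trace_id[:8]}-{trace_id[8:]}"


def xray_to_otel_trace_id(xray_trace_id):
    """Convert an X-Ray trace ID back to the 32-character OpenTelemetry form."""
    match = _XRAY_TRACE_ID.match(xray_trace_id.lower())
    if not match:
        raise ValueError("expected the form 1-<8 hex>-<24 hex>")
    return match.group(1) + match.group(2)


print(otel_to_xray_trace_id("5759e988bd862e3fe1be46a994272793"))
# → 1-5759e988-bd862e3fe1be46a994272793
```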

Example Configuration: CloudWatch Logs Insights Query

Below is a sample CloudWatch Logs Insights query to analyze Lambda errors.

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50
                
This query retrieves recent Lambda error logs for troubleshooting.
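
To run the same query from a script, the parameters for the CloudWatch Logs StartQuery API can be assembled as shown in this sketch. The log group name, the helper name, and the default time window are assumptions for illustration; the API expects the start and end times as epoch seconds.

```python
import time

ERROR_QUERY = "\n".join([
    "fields @timestamp, @message",
    "| filter @message like /ERROR/",
    "| sort @timestamp desc",
    "| limit 50",
])


def build_start_query(log_group, minutes=60, now=None):
    """Build StartQuery parameters covering the last `minutes` minutes."""
    end_time = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end_time - minutes * 60,  # epoch seconds
        "endTime": end_time,
        "queryString": ERROR_QUERY,
    }


params = build_start_query("/aws/lambda/MyFunction", minutes=15)
print(params["queryString"])
```

With boto3, these parameters could be passed as `boto3.client("logs").start_query(**params)`, and the returned query ID polled with `get_query_results` until the query completes.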