
System Design FAQ: Top Questions

65. How would you design a Metrics Aggregation System (like Prometheus)?

A Metrics Aggregation System collects, stores, and visualizes numerical performance data over time from services and infrastructure. It powers dashboards, alerting, and historical analysis for SRE and engineering teams.

📋 Functional Requirements

  • Support high-frequency metric collection (CPU, latency, errors)
  • Queryable via a time-series database
  • Tag-based filtering (labels)
  • Basic math functions (rate, avg, sum)
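As a rough illustration of tag-based filtering and simple aggregation, the sketch below keeps one sample list per unique label set and selects series by exact label match. The `SeriesStore` class and the `latest_sum` / `latest_avg` helpers are hypothetical names invented for this example, not part of any real TSDB.

```python
from typing import Dict, List, Tuple

Labels = Tuple[Tuple[str, str], ...]   # sorted (name, value) pairs
Samples = List[Tuple[float, float]]    # (timestamp, value)


class SeriesStore:
    """Toy in-memory store: one sample list per (metric name, label set)."""

    def __init__(self) -> None:
        self.series: Dict[Tuple[str, Labels], Samples] = {}

    def append(self, name: str, labels: Dict[str, str], ts: float, value: float) -> None:
        key = (name, tuple(sorted(labels.items())))
        self.series.setdefault(key, []).append((ts, value))

    def select(self, name: str, matchers: Dict[str, str]) -> List[Tuple[Dict[str, str], Samples]]:
        """Return every series whose labels equal all matchers (exact match only)."""
        selected = []
        for (metric, labels), samples in self.series.items():
            label_map = dict(labels)
            if metric == name and all(label_map.get(k) == v for k, v in matchers.items()):
                selected.append((label_map, samples))
        return selected


def latest_sum(selected: List[Tuple[Dict[str, str], Samples]]) -> float:
    """Sum the most recent value of each selected series."""
    return sum(samples[-1][1] for _, samples in selected)


def latest_avg(selected: List[Tuple[Dict[str, str], Samples]]) -> float:
    """Average the most recent value of each selected series."""
    return latest_sum(selected) / len(selected) if selected else float("nan")
```

Because every distinct label combination becomes its own key, this layout also shows why adding labels multiplies the number of stored series.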

📦 Non-Functional Requirements

  • Efficient compression and storage
  • Low-latency queries (even across 30-day spans)
  • Horizontal scalability
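One reason time-series data compresses well is that scrape timestamps arrive at nearly regular intervals. The sketch below shows delta-of-delta encoding, the idea behind Gorilla-style timestamp compression. It is a simplified illustration: production encoders bit-pack the resulting small integers, which this example leaves out, and the function names are made up here.

```python
from typing import List


def encode_timestamps(timestamps: List[int]) -> List[int]:
    """Encode as [first timestamp, first delta, then delta-of-delta values]."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        encoded.append(delta - prev_delta)  # near zero when the interval is steady
        prev_delta = delta
    return encoded


def decode_timestamps(encoded: List[int]) -> List[int]:
    """Invert encode_timestamps by re-accumulating the deltas."""
    if not encoded:
        return []
    decoded = [encoded[0]]
    delta = 0
    for delta_of_delta in encoded[1:]:
        delta += delta_of_delta
        decoded.append(decoded[-1] + delta)
    return decoded
```

With a steady 15-second scrape interval, almost every encoded value after the second is 0, and a run of zeros costs very few bits once packed.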

🏗️ Architecture Overview

  • Instrumented Service: Exposes metrics on an HTTP endpoint (typically /metrics); push-based setups may send them over gRPC instead
  • Scraper (e.g., Prometheus): Pulls metrics periodically
  • TSDB: Stores metrics with timestamps and tags
  • Query Layer: Provides a query interface (PromQL in Prometheus-compatible systems)
  • Dashboards: Grafana/React charts show live metrics
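To make the first component concrete, here is a minimal sketch of what an instrumented service keeps in memory and how it could render that state as text when scraped. The `Counter` class is hypothetical and written for this example only. A real service would normally use an official client library, which also handles label escaping and thread safety.

```python
from typing import Dict, Tuple


class Counter:
    """Hypothetical minimal counter, keyed by label set (no escaping, not thread-safe)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self.values: Dict[Tuple[Tuple[str, str], ...], float] = {}

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(sorted(labels.items()))
        self.values[key] = self.values.get(key, 0.0) + amount

    def render(self) -> str:
        """Render the current state as text the scraper can pull."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self.values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            series = f"{self.name}{{{label_str}}}" if label_str else self.name
            lines.append(f"{series} {value}")
        return "\n".join(lines) + "\n"
```

The service would serve the output of `render()` from its metrics endpoint, and the scraper would fetch it on every scrape interval.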

🧪 Example Metric Format (Prometheus Exposition)


# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET", path="/", status="200"} 3258
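On the scraper side, each sample line has to be split into a metric name, a label map, and a value. The following is a deliberately simplified parser sketch. It assumes one sample per line, ignores optional trailing timestamps, and does not unescape label values. `parse_sample` and `parse_exposition` are names invented for this example.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# metric_name, optional {label body}, whitespace, value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Sample:
    """Parse one sample line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, label_body, value = match.groups()
    labels = dict(LABEL_RE.findall(label_body or ""))
    return name, labels, float(value)


def parse_exposition(text: str) -> List[Sample]:
    """Parse a scrape body, skipping blank lines and # HELP / # TYPE comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            samples.append(parse_sample(line))
    return samples
```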
        

⚙️ Prometheus Config Snippet


global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['localhost:8080']
        labels:
          instance: 'app-1'
        

🔍 PromQL Query Example


rate(http_requests_total{status="500"}[5m])
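To show roughly what a rate calculation does with counter samples inside the 5-minute window, the sketch below computes the per-second increase and treats any drop in value as a counter reset. It is a simplification: PromQL's `rate()` also extrapolates to the window boundaries, which this hypothetical `simple_rate` function does not.

```python
from typing import List, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> float:
    """Per-second increase of a counter across (timestamp, value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value counts as new increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed
```

For the samples `[(0, 100), (15, 130), (30, 20), (60, 60)]`, the increases are 30, 20 (after a reset), and 40, giving 90 over 60 seconds, or 1.5 per second.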
        

📈 Retention & Storage Optimization

  • Retain high-res data for 15 days, downsample after
  • Use object storage for long-term retention (e.g., Thanos or Cortex backed by S3/GCS)
  • Chunked time blocks keep compaction and retention deletes cheap; they do not limit cardinality, which has to be controlled through label design
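Downsampling can be sketched as grouping raw samples into fixed-width time buckets and keeping one aggregate per bucket. The example below averages gauge-style values. The `downsample` function is hypothetical, and real systems typically keep several aggregates per bucket (such as min, max, sum, and count) rather than the mean alone.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def downsample(samples: List[Tuple[float, float]], bucket_seconds: float) -> List[Tuple[float, float]]:
    """Average raw (timestamp, value) samples into fixed-width buckets.

    Each output timestamp is the start of its bucket.
    """
    buckets: Dict[float, List[float]] = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]
```

Going from 15-second to 5-minute resolution this way replaces 20 raw points with one, which is where the long-term storage savings come from.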

🔔 Alerting Example


groups:
  - name: high-latency
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.2
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "P95 latency too high"
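The alert expression estimates a percentile from cumulative histogram buckets. The sketch below shows the general idea using linear interpolation inside the bucket that contains the target rank. It is a simplified approximation written for this example, not Prometheus's actual implementation, and it assumes the lowest bucket starts at 0. The function name `estimate_quantile` is made up here.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into the +Inf bucket; fall back to the
                # highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan
```

With buckets `[(0.5, 80), (1.0, 90), (2.0, 98), (inf, 100)]` and `q = 0.95`, the target rank of 95 falls in the 1.0 to 2.0 second bucket, so the estimate is interpolated within that range and exceeds the 1.2 second threshold. The result is only as precise as the bucket boundaries allow.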
        

🧰 Tools & Stack

  • TSDB: Prometheus, VictoriaMetrics, Mimir
  • Storage: S3, GCS for long-term
  • Dashboards: Grafana
  • Agent: Node Exporter, custom /metrics endpoints

📌 Final Insight

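Use Prometheus-style pull collection for control and reliability: the collector sets the scrape rate, and a failed scrape doubles as a health signal. Apply retention and downsampling to control cost. Tag (label) cardinality is your scalability bottleneck, so use labels judiciously.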
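The cardinality warning can be made concrete with a quick upper-bound estimate: in the worst case, one metric produces a series for every combination of label values. The helper below is a hypothetical back-of-the-envelope calculation, and the label counts in the usage note are illustrative assumptions, not measurements.

```python
from math import prod
from typing import Dict


def worst_case_series(distinct_values_per_label: Dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of distinct label values."""
    return prod(distinct_values_per_label.values())
```

For example, 5 methods, 50 paths, and 10 status codes give at most 2,500 series. Adding a `user_id` label with 100,000 distinct values raises that bound to 250,000,000, which is why unbounded identifiers should stay out of labels.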