System Design FAQ: Top Questions

65. How would you design a Metrics Aggregation System (like Prometheus)?

A Metrics Aggregation System collects, stores, and visualizes numerical performance data over time from services and infrastructure. It powers dashboards, alerting, and historical analysis for SRE and engineering teams.

📋 Functional Requirements

Support high-frequency metric collection (CPU, latency, errors)
Queryable via a time-series database
Tag-based filtering (labels)
Basic math functions (rate, avg, sum)

📦 Non-Functional Requirements

Efficient compression and storage
Low-latency queries (even across 30-day spans)
Horizontal scalability

🏗️ Architecture Overview

Instrumented Service: Emits metrics via HTTP or gRPC
Scraper (e.g., Prometheus): Pulls metrics periodically
TSDB: Stores metrics with timestamps and tags
Query Layer: Provides PromQL or GraphQL interface
Dashboards: Grafana/React charts show live metrics

🧪 Example Metric Format (Prometheus Exposition)


# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET", path="/", status="200"} 3258

⚙️ Prometheus Config Snippet


global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['localhost:8080']
        labels:
          instance: 'app-1'

🔍 PromQL Query Example


rate(http_requests_total{status="500"}[5m])

📈 Retention & Storage Optimization

Retain high-res data for 15 days, downsample after
Use block storage (e.g., Thanos, Cortex with S3/GCS)
Chunked time blocks reduce cardinality explosion

🔔 Alerting Example


groups:
  - name: high-latency
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.2
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "P95 latency too high"

🧰 Tools & Stack

TSDB: Prometheus, VictoriaMetrics, Mimir
Storage: S3, GCS for long-term
Dashboards: Grafana
Agent: Node Exporter, custom /metrics endpoints

📌 Final Insight

Use Prometheus-style pull collection for control and reliability. Apply retention/downsampling for cost. Tag cardinality is your scalability bottleneck — use labels judiciously.

←→