System Design FAQ: Top Questions
65. How would you design a Metrics Aggregation System (like Prometheus)?
A Metrics Aggregation System collects, stores, and visualizes numerical performance data over time from services and infrastructure. It powers dashboards, alerting, and historical analysis for SRE and engineering teams.
๐ Functional Requirements
- Support high-frequency metric collection (CPU, latency, errors)
- Queryable via a time-series database
- Tag-based filtering (labels)
- Basic math functions (rate, avg, sum)
๐ฆ Non-Functional Requirements
- Efficient compression and storage
- Low-latency queries (even across 30-day spans)
- Horizontal scalability
๐๏ธ Architecture Overview
- Instrumented Service: Emits metrics via HTTP or gRPC
- Scraper (e.g., Prometheus): Pulls metrics periodically
- TSDB: Stores metrics with timestamps and tags
- Query Layer: Provides PromQL or GraphQL interface
- Dashboards: Grafana/React charts show live metrics
๐งช Example Metric Format (Prometheus Exposition)
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET", path="/", status="200"} 3258
โ๏ธ Prometheus Config Snippet
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'my-service'
static_configs:
- targets: ['localhost:8080']
labels:
instance: 'app-1'
๐ PromQL Query Example
rate(http_requests_total{status="500"}[5m])
๐ Retention & Storage Optimization
- Retain high-res data for 15 days, downsample after
- Use block storage (e.g., Thanos, Cortex with S3/GCS)
- Chunked time blocks reduce cardinality explosion
๐ Alerting Example
groups:
- name: high-latency
rules:
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.2
for: 2m
labels:
severity: page
annotations:
summary: "P95 latency too high"
๐งฐ Tools & Stack
- TSDB: Prometheus, VictoriaMetrics, Mimir
- Storage: S3, GCS for long-term
- Dashboards: Grafana
- Agent: Node Exporter, custom /metrics endpoints
๐ Final Insight
Use Prometheus-style pull collection for control and reliability. Apply retention/downsampling for cost. Tag cardinality is your scalability bottleneck โ use labels judiciously.