Performance Engineering: Scenario-Based Questions

27. Your distributed system is experiencing intermittent latency spikes. How do you investigate and resolve the issue?

Latency spikes in distributed systems are often due to unpredictable interactions between services, network hiccups, or load imbalances. Investigating requires tracing end-to-end behavior and correlating system metrics.

🔍 Investigation Workflow

Reproduce: Identify specific times, endpoints, or users affected by high latency.
Trace Calls: Use distributed tracing tools (e.g., OpenTelemetry, Jaeger) to identify slow segments.
Correlate Metrics: Examine latency, throughput, and saturation using Prometheus, Datadog, or Cloud Monitoring.
Inspect Logs: Review structured logs for errors, timeouts, or retries in the latency window.
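
As a rough illustration of the first and last steps, the sketch below buckets request log records by minute and endpoint and flags buckets where a large share of requests were slow. The record fields (`ts`, `endpoint`, `latency_ms`), the thresholds, and the sample data are assumptions made for the example, not a real log schema.

```python
from collections import defaultdict
from datetime import datetime

def find_spike_windows(records, slow_ms=500.0, min_slow_ratio=0.2):
    """Group requests by (minute, endpoint) and return the buckets where at
    least `min_slow_ratio` of the requests took longer than `slow_ms`."""
    buckets = defaultdict(list)
    for rec in records:
        ts = datetime.fromisoformat(rec["ts"])
        minute = ts.replace(second=0, microsecond=0).isoformat()
        buckets[(minute, rec["endpoint"])].append(rec["latency_ms"])

    spikes = []
    for (minute, endpoint), latencies in sorted(buckets.items()):
        slow = sum(1 for ms in latencies if ms > slow_ms)
        if slow / len(latencies) >= min_slow_ratio:
            # (window start, endpoint, slow request count, total request count)
            spikes.append((minute, endpoint, slow, len(latencies)))
    return spikes

# Hypothetical sample records, only to show the input shape.
records = [
    {"ts": "2024-05-01T10:00:05", "endpoint": "/checkout", "latency_ms": 120.0},
    {"ts": "2024-05-01T10:00:40", "endpoint": "/checkout", "latency_ms": 135.0},
    {"ts": "2024-05-01T10:01:02", "endpoint": "/checkout", "latency_ms": 1800.0},
    {"ts": "2024-05-01T10:01:30", "endpoint": "/checkout", "latency_ms": 140.0},
    {"ts": "2024-05-01T10:01:45", "endpoint": "/search", "latency_ms": 90.0},
]
spikes = find_spike_windows(records)
```

The flagged windows give a narrow time range and endpoint to look up in traces and logs, rather than searching the whole day.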

🧪 Common Root Causes

N+1 Queries: One query to fetch a list, then one more per item; the sequential round trips add up in response time.
Cold Starts: New instances or containers initializing slowly.
GC Pauses: Garbage collection (e.g., in the JVM or CPython) pausing the application and stalling in-flight requests.
Queue Bottlenecks: Backpressure in Kafka, SQS, or RabbitMQ.
Network Issues: High latency between services across regions or AZs.
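
The N+1 pattern is the easiest of these to show in code. The sketch below uses a hypothetical in-memory `FakeOrderStore` that counts round trips in place of a real database client; `fetch_items` and `fetch_items_bulk` are invented names for the example. In a real system the bulk call would typically be a single query with an `IN (...)` clause or a join.

```python
class FakeOrderStore:
    """Hypothetical stand-in for a database client that counts round trips."""

    def __init__(self, items_by_order):
        self._items = items_by_order
        self.round_trips = 0

    def fetch_items(self, order_id):
        # One round trip per order.
        self.round_trips += 1
        return self._items.get(order_id, [])

    def fetch_items_bulk(self, order_ids):
        # One round trip for all orders.
        self.round_trips += 1
        return {oid: self._items.get(oid, []) for oid in order_ids}

def load_n_plus_one(store, order_ids):
    """Anti-pattern: issues one call per order, so latency grows with N."""
    return {oid: store.fetch_items(oid) for oid in order_ids}

def load_batched(store, order_ids):
    """Batched alternative: one call regardless of how many orders there are."""
    return store.fetch_items_bulk(order_ids)
```

With three orders, the first function makes three round trips and the second makes one, while both return the same data.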

🧰 Diagnostic Tools

Distributed Tracing: Jaeger, Zipkin, AWS X-Ray.
Profiling: Pyroscope, flame graphs, eBPF tools (bpftrace, Pixie).
Network Inspection: mtr, tcpdump, VPC flow logs.
Latency SLO Dashboards: Track P50, P95, and P99 latency with alerts.
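
To make the percentile tracking concrete, here is a minimal sketch that computes the mean alongside P50, P95, and P99 using the nearest-rank method. The sample latencies are invented, and production dashboards usually derive percentiles from histograms rather than raw samples, so treat this only as an illustration of the definitions.

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: 97 fast requests and 3 slow ones, in milliseconds.
latencies_ms = [100.0] * 97 + [2000.0] * 3

summary = {
    "mean": statistics.fmean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
```

For this data the mean is 157 ms and P50 and P95 are both 100 ms, while P99 is 2000 ms: only the highest percentile exposes the slow requests.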

✅ Resolution Techniques

Introduce caching (local or distributed) to reduce load.
Batch or debounce requests to limit overhead.
Optimize DB queries and index usage.
Use autoscaling and connection pool tuning to handle burst traffic.
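
As one example of the caching technique, the sketch below shows a small local cache with a time-to-live. `TTLCache` and `get_or_load` are names made up for this illustration; the class is not thread-safe and does nothing to prevent many callers from reloading the same expired key at once, which a production cache would need to handle.

```python
import time

class TTLCache:
    """Minimal in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value for key, calling loader(key) on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

A short TTL bounds staleness while still absorbing repeated reads of hot keys during a burst.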

🚫 Common Mistakes

Focusing on average latency only; intermittent spikes show up in P95/P99, which an average hides.
Ignoring upstream or downstream service health.
Adding retries blindly, compounding latency under load.
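
On the retry point, a bounded alternative is to cap the number of attempts, back off exponentially with jitter, and stop once an overall deadline would be exceeded. The sketch below is one way to do that; `call_with_retries`, its parameters, and the default values are assumptions for the example, and it retries only on `TimeoutError`.

```python
import random
import time

def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      deadline_seconds=5.0, sleep=time.sleep,
                      clock=time.monotonic, rng=random.random):
    """Call operation(), retrying on TimeoutError with capped exponential
    backoff and full jitter, within an overall time budget."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            # Full jitter: random delay between 0 and the capped backoff.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = rng() * backoff
            if clock() - start + delay > deadline_seconds:
                raise  # waiting again would exceed the overall deadline
            sleep(delay)
```

The jitter spreads retries from many clients over time, and the deadline keeps retries from stacking more latency onto a request the caller has likely already given up on.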

📌 Real-World Insight

Latency in distributed systems rarely stems from a single service. High-maturity teams trace entire request paths, simulate spikes in staging, and design for graceful degradation during transient delays.
