System Design FAQ: Top Questions
38. How would you design a Distributed Cache System like Memcached or Redis?
A Distributed Cache System speeds up data access by storing frequently read data in memory. It reduces backend load, improves latency, and is crucial in high-traffic applications.
Functional Requirements
- Read/write to cache with defined TTL (time-to-live)
- Auto-evict stale or least-used items (LRU)
- Consistent key placement across distributed nodes
- Support for pub/sub, atomic ops, and eviction policies
Non-Functional Requirements
- Low latency (<1 ms reads, <5 ms writes)
- High availability and data partitioning
- Horizontal scalability
Core Components
- Client SDK: Encodes keys, sends requests, applies consistent hashing
- Shard Map / Ring: Maps keys to node clusters
- Cache Nodes: Store K/V pairs in RAM, run eviction strategies (see the sketch after this list)
- Failover Logic: Detects down nodes and reroutes
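A minimal sketch of what a single cache node does, assuming a hypothetical in-process store with per-key TTL and LRU eviction (an illustration, not how Redis or Memcached is implemented internally):

import time
from collections import OrderedDict

class CacheNode:
    """Hypothetical single-node K/V store with TTL and LRU eviction."""
    def __init__(self, max_items=1024):
        self.store = OrderedDict()  # key -> (value, expires_at)
        self.max_items = max_items

    def set(self, key, value, ttl=None):
        expires_at = time.time() + ttl if ttl else None
        self.store[key] = (value, expires_at)
        self.store.move_to_end(key)           # mark as most recently used
        if len(self.store) > self.max_items:  # over budget: evict LRU entry
            self.store.popitem(last=False)

    def get(self, key):
        item = self.store.get(key)
        if item is None:
            return None                       # miss
        value, expires_at = item
        if expires_at and time.time() > expires_at:
            del self.store[key]               # lazily expire stale entry
            return None
        self.store.move_to_end(key)           # refresh recency on hit
        return value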
Consistent Hashing with Virtual Nodes (Python)
import hashlib

def get_node(key, nodes, vnodes=3):
    """Map a key to a node on a hash ring with virtual nodes."""
    # Each physical node appears `vnodes` times on the ring under a
    # distinct virtual label, which spreads keys more evenly.
    ring = sorted(
        (hashlib.md5(f"{n}#{v}".encode()).hexdigest(), n)
        for n in nodes
        for v in range(vnodes)
    )
    key_hash = hashlib.md5(key.encode()).hexdigest()
    # Walk clockwise: the first position at or past the key owns it.
    for h, n in ring:
        if key_hash <= h:
            return n
    # Wrapped past the end of the ring; fall back to the first position.
    return ring[0][1]
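For example, routing a few keys across three illustrative node names:

nodes = ["cache-a", "cache-b", "cache-c"]
for key in ["user:42", "page:/docs/home", "session:abc"]:
    print(key, "->", get_node(key, nodes))

In production the ring is built once and reused; rebuilding it on every lookup, as the function above does, is only acceptable for illustration.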
Redis Caching with TTL
SET page:/docs/home "html" EX 300
GET page:/docs/home
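The same calls from application code, sketched with the redis-py client (assumes a Redis server on localhost:6379):

import redis

r = redis.Redis(host="localhost", port=6379)
r.set("page:/docs/home", "html", ex=300)  # EX 300 -> expire after 5 minutes
html = r.get("page:/docs/home")           # b"html" until the TTL lapses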
Lazy vs Write-Through
- Lazy (cache-aside): a read miss fetches from the backend and populates the cache
- Write-through: every DB write updates the cache immediately (both sketched below)
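A minimal sketch of both patterns; cache, db, and their fetch_user/save_user methods are hypothetical interfaces standing in for your cache client and database layer:

def get_user_lazy(cache, db, user_id):
    # Lazy (cache-aside): only a read miss touches the backend.
    user = cache.get(f"user:{user_id}")
    if user is None:
        user = db.fetch_user(user_id)               # hypothetical DB call
        cache.set(f"user:{user_id}", user, ttl=300)
    return user

def save_user_write_through(cache, db, user):
    # Write-through: the write path updates the DB and cache together.
    db.save_user(user)                              # hypothetical DB call
    cache.set(f"user:{user['id']}", user, ttl=300)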
Cache Invalidation Patterns
- On write/update, evict relevant keys
- Use pub/sub to notify other nodes (example after this list)
- Include an etag or version tag in each cache entry
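Redis pub/sub can carry those notifications; a sketch with redis-py (the cache-invalidate channel name is an assumption, and the publisher and subscriber would run in separate processes):

import redis

r = redis.Redis()

# Publisher side: after a write, announce which key went stale.
r.publish("cache-invalidate", "page:/docs/home")

# Subscriber side: each app instance listens and evicts locally.
p = r.pubsub()
p.subscribe("cache-invalidate")
for message in p.listen():  # blocks, yielding messages as they arrive
    if message["type"] == "message":
        r.delete(message["data"].decode())  # or drop from a local cache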
Eviction Strategy
- LRU (Least Recently Used) or LFU (Least Frequently Used)
- Evict on TTL expiry or when a max-memory limit is reached (Redis settings below)
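Redis exposes these knobs directly; for example, capping memory and evicting with its approximate LRU across all keys:

CONFIG SET maxmemory 256mb
CONFIG SET maxmemory-policy allkeys-lru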
Observability
- Cache hit/miss ratio (a quick check follows this list)
- Eviction and memory usage metrics
- Node latency and failover events
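Redis tracks hits and misses in INFO stats; the hit ratio is keyspace_hits / (keyspace_hits + keyspace_misses). A quick check with redis-py:

import redis

stats = redis.Redis().info("stats")  # the "stats" section of INFO
hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
print("hit ratio:", hits / max(hits + misses, 1))  # guard divide-by-zero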
Tools/Infra Used
- Cache Nodes: Redis, Memcached
- Coordination: Consul, ZooKeeper
- Monitoring: Datadog / Prometheus
Final Insight
Caching is a backbone of scalable systems. To avoid stale data and hotspot nodes, design with deliberate key naming, explicit invalidation, and failover-aware partitioning.