System Design FAQ: Top Questions
55. How would you design a Scalable Data Ingestion Pipeline?
A Scalable Data Ingestion Pipeline collects, validates, transforms, and routes incoming data streams from diverse sources like web apps, IoT devices, and APIs into storage, processing, and analytics systems.
Functional Requirements
- Accept high-throughput event ingestion via APIs, agents, SDKs
- Support batch and streaming modes
- Transform, enrich, and route events to downstream systems
- Ensure durability, ordering, and observability
Non-Functional Requirements
- Horizontal scalability (millions of events/sec)
- Data loss protection and replay support
- Monitoring and schema enforcement
High-Level Architecture
- Edge Collector: Accepts HTTP/gRPC input, buffers to queue
- Queue/Buffer: Kafka, Kinesis, or Pub/Sub
- Transformer: Cleanses and enriches (e.g., with IP lookup)
- Sink Router: Routes to DBs, lakes, warehouses, etc.
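To make the Sink Router stage concrete, here is a minimal sketch in plain Java. The class name, the "type:payload" event format, and the in-memory lists standing in for databases and lakes are all illustrative assumptions, not a real connector API:

```java
import java.util.*;
import java.util.function.Consumer;

// Minimal sink router sketch: dispatches each event to a downstream sink
// based on its type prefix. Real routers would key on schema or topic.
public class SinkRouter {
    private final Map<String, Consumer<String>> routes = new HashMap<>();
    private Consumer<String> fallback = e -> {};  // default: drop silently

    public SinkRouter route(String eventType, Consumer<String> sink) {
        routes.put(eventType, sink);
        return this;
    }

    public SinkRouter otherwise(Consumer<String> sink) {
        this.fallback = sink;  // e.g., a dead-letter queue
        return this;
    }

    // Routes an event of the (assumed) form "type:payload".
    public void dispatch(String event) {
        String type = event.split(":", 2)[0];
        routes.getOrDefault(type, fallback).accept(event);
    }

    public static void main(String[] args) {
        List<String> warehouse = new ArrayList<>();
        List<String> lake = new ArrayList<>();
        SinkRouter router = new SinkRouter()
            .route("click", warehouse::add)
            .route("sensor", lake::add)
            .otherwise(e -> System.out.println("dead-letter: " + e));

        router.dispatch("click:user42-homepage");
        router.dispatch("sensor:temp=21.5");
        System.out.println(warehouse.size() + " to warehouse, " + lake.size() + " to lake");
    }
}
```

In a real deployment each `Consumer` would wrap a DB writer, a lake file sink, or a warehouse loader; routing to a dead-letter sink for unknown types preserves the durability requirement above.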
Config Example: Kafka + Flink + S3
# Kafka topic creation
kafka-topics.sh --create --topic user-events --partitions 12 --replication-factor 3 --bootstrap-server broker:9092
// Flink job (pseudocode)
DataStream<String> stream = env.addSource(new FlinkKafkaConsumer<>("user-events", ...));
BucketingSink<String> sink = new BucketingSink<>("/mnt/data/s3");
sink.setBucketer(new DateTimeBucketer<>());
stream.map(event -> enrich(event))
      .addSink(sink);
Schema Validation Example
{
  "type": "record",
  "name": "UserEvent",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "action", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
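In practice an Avro library plus a schema registry enforces this schema; purely as an illustration, the dependency-free Java sketch below checks an event map against the three fields and types of the UserEvent record above (the class and method names are hypothetical):

```java
import java.util.*;

// Toy validator mirroring the UserEvent record:
// user_id (string), action (string), timestamp (long).
// A real pipeline would use Avro-generated classes or a schema registry.
public class UserEventValidator {
    public static List<String> validate(Map<String, Object> event) {
        List<String> errors = new ArrayList<>();
        requireType(event, "user_id", String.class, errors);
        requireType(event, "action", String.class, errors);
        requireType(event, "timestamp", Long.class, errors);
        return errors;
    }

    private static void requireType(Map<String, Object> event, String field,
                                    Class<?> type, List<String> errors) {
        Object value = event.get(field);
        if (value == null) {
            errors.add("missing field: " + field);
        } else if (!type.isInstance(value)) {
            errors.add(field + " has wrong type: " + value.getClass().getSimpleName());
        }
    }

    public static void main(String[] args) {
        Map<String, Object> good = Map.of(
            "user_id", "u42", "action", "click", "timestamp", 1700000000000L);
        Map<String, Object> bad = Map.of("user_id", "u42", "timestamp", "not-a-long");
        System.out.println(validate(good));  // empty list: event conforms
        System.out.println(validate(bad));   // missing action, wrong timestamp type
    }
}
```

Rejecting malformed events at the Transformer stage, before they reach a sink, keeps bad data out of downstream warehouses.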
Replay and Durability
- Kafka enables log retention + offset re-consumption
- S3 or HDFS for cold storage backup
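Replay works because the broker retains the log and each consumer tracks its own read offset, which it can rewind at any time (Kafka exposes this as seek). The toy in-memory class below illustrates that offset mechanics; it is a sketch, not the Kafka consumer API:

```java
import java.util.*;

// In-memory stand-in for a partition log with offset-based replay.
public class ReplayableLog {
    private final List<String> log = new ArrayList<>();  // retained records
    private int offset = 0;  // next record this consumer will read

    public void append(String record) { log.add(record); }

    // Read up to max records from the current offset, then advance (commit).
    public List<String> poll(int max) {
        int end = Math.min(offset + max, log.size());
        List<String> batch = new ArrayList<>(log.subList(offset, end));
        offset = end;
        return batch;
    }

    // Replay: rewind to an earlier offset, analogous to Kafka's seek().
    public void seek(int newOffset) { offset = newOffset; }

    public static void main(String[] args) {
        ReplayableLog log = new ReplayableLog();
        log.append("e1");
        log.append("e2");
        System.out.println(log.poll(10));  // first pass over the log
        log.seek(0);                       // rewind for replay
        System.out.println(log.poll(10));  // same records, re-consumed
    }
}
```

Because records stay in the retained log after consumption, a downstream bug can be fixed and the affected offset range simply re-read, which is why log retention plus cold storage in S3/HDFS gives strong data-loss protection.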
Observability
- Lag metrics per topic/partition
- Error rates by transformation stage
- End-to-end latency (ms)
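Consumer lag, the first metric above, is simply the gap between a partition's log-end offset and the consumer's committed offset. A small sketch of computing it per partition (the offset values are hypothetical; in practice they come from the broker's admin API):

```java
import java.util.*;

// Lag per partition = log-end offset minus committed consumer offset.
public class LagMonitor {
    public static Map<Integer, Long> lag(Map<Integer, Long> endOffsets,
                                         Map<Integer, Long> committed) {
        Map<Integer, Long> result = new TreeMap<>();
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            // A partition with no committed offset has consumed nothing yet.
            long consumed = committed.getOrDefault(e.getKey(), 0L);
            result.put(e.getKey(), e.getValue() - consumed);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Long> end = Map.of(0, 1500L, 1, 900L);
        Map<Integer, Long> committedOffsets = Map.of(0, 1400L, 1, 900L);
        System.out.println(lag(end, committedOffsets));  // {0=100, 1=0}
    }
}
```

A lag that grows steadily on one partition usually means a hot key or an undersized consumer group, so alerting on per-partition lag rather than a topic-wide average catches skew early.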
Security and Governance
- SSL/TLS for data in motion
- Schemas via Avro/Protobuf + registry
- Access control via IAM or ACLs
Infra and Tools
- Collector: NGINX, Fluent Bit, Fluentd
- Queue: Kafka, Pub/Sub, Kinesis
- Transform: Flink, Spark Streaming
- Storage: S3, Redshift, BigQuery, HDFS
Final Insight
A robust data ingestion pipeline relies on separation of concerns (ingestion, transformation, storage) with idempotent processing, schema validation, and durability built in. Message brokers like Kafka enable decoupled, scalable real-time pipelines.
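To make "idempotent processing" concrete: under at-least-once delivery, replays and retries deliver some events twice, so consumers deduplicate on a stable event ID before applying side effects. The sketch below uses an in-memory seen-set purely for illustration; production systems persist this state in a keyed state store or rely on a unique-key upsert in the sink:

```java
import java.util.*;

// Idempotent consumer sketch: applies each event ID at most once, so a
// replay after a crash or offset rewind does not double-count.
public class IdempotentConsumer {
    private final Set<String> seen = new HashSet<>();  // illustrative; not durable
    private long applied = 0;

    // Returns true if the event was applied, false if it was a duplicate.
    public boolean process(String eventId) {
        if (!seen.add(eventId)) return false;  // already applied: skip
        applied++;  // stand-in for the real side effect (write, count, etc.)
        return true;
    }

    public long appliedCount() { return applied; }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer();
        consumer.process("evt-1");
        consumer.process("evt-1");  // replayed duplicate: no effect
        consumer.process("evt-2");
        System.out.println("applied: " + consumer.appliedCount());
    }
}
```

With deduplication at the consumer, the broker can safely favor at-least-once delivery, which is what makes the replay mechanisms described earlier usable without corrupting downstream counts.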
