System Design FAQ: Top Questions
55. How would you design a Scalable Data Ingestion Pipeline?
A Scalable Data Ingestion Pipeline collects, validates, transforms, and routes incoming data streams from diverse sources like web apps, IoT devices, and APIs into storage, processing, and analytics systems.
Functional Requirements
- Accept high-throughput event ingestion via APIs, agents, SDKs
- Support batch and streaming modes
- Transform, enrich, and route events to downstream systems
- Ensure durability, ordering, and observability
Non-Functional Requirements
- Horizontal scalability (millions of events/sec)
- Data loss protection and replay support
- Monitoring and schema enforcement
High-Level Architecture
- Edge Collector: Accepts HTTP/gRPC input, buffers to queue
- Queue/Buffer: Kafka, Kinesis, or Pub/Sub
- Transformer: Cleanses and enriches (e.g., with IP lookup)
- Sink Router: Routes to DBs, lakes, warehouses, etc.
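To make the Sink Router stage concrete, here is a minimal sketch in plain Java. The class name, the "type:payload" event format, and the in-memory lists standing in for databases and lakes are all illustrative assumptions, not a real connector API:

```java
import java.util.*;
import java.util.function.Consumer;

// Minimal sink router sketch: dispatches each event to a downstream sink
// based on its type prefix. Real routers would key on schema or topic.
public class SinkRouter {
    private final Map<String, Consumer<String>> routes = new HashMap<>();
    private Consumer<String> fallback = e -> {};  // default: drop silently

    public SinkRouter route(String eventType, Consumer<String> sink) {
        routes.put(eventType, sink);
        return this;
    }

    public SinkRouter otherwise(Consumer<String> sink) {
        this.fallback = sink;  // e.g., a dead-letter queue
        return this;
    }

    // Routes an event of the (assumed) form "type:payload".
    public void dispatch(String event) {
        String type = event.split(":", 2)[0];
        routes.getOrDefault(type, fallback).accept(event);
    }

    public static void main(String[] args) {
        List<String> warehouse = new ArrayList<>();
        List<String> lake = new ArrayList<>();
        SinkRouter router = new SinkRouter()
            .route("click", warehouse::add)
            .route("sensor", lake::add)
            .otherwise(e -> System.out.println("dead-letter: " + e));

        router.dispatch("click:user42-homepage");
        router.dispatch("sensor:temp=21.5");
        System.out.println(warehouse.size() + " to warehouse, " + lake.size() + " to lake");
    }
}
```

In a real deployment each `Consumer` would wrap a DB writer, a lake file sink, or a warehouse loader; routing to a dead-letter sink for unknown types preserves the durability requirement above.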
Config Example: Kafka + Flink + S3
# Kafka topic creation
kafka-topics.sh --create --topic user-events --partitions 12 --replication-factor 3 --bootstrap-server broker:9092
// Flink job (pseudocode)
DataStream<String> stream = env.addSource(new FlinkKafkaConsumer<>("user-events", ...));
BucketingSink<String> sink = new BucketingSink<>("/mnt/data/s3");
sink.setBucketer(new DateTimeBucketer<>());
stream.map(event -> enrich(event))
      .addSink(sink);
Schema Validation Example
{
  "type": "record",
  "name": "UserEvent",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "action", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
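In practice an Avro library plus a schema registry enforces this schema; purely as an illustration, the dependency-free Java sketch below checks an event map against the three fields and types of the UserEvent record above (the class and method names are hypothetical):

```java
import java.util.*;

// Toy validator mirroring the UserEvent record:
// user_id (string), action (string), timestamp (long).
// A real pipeline would use Avro-generated classes or a schema registry.
public class UserEventValidator {
    public static List<String> validate(Map<String, Object> event) {
        List<String> errors = new ArrayList<>();
        requireType(event, "user_id", String.class, errors);
        requireType(event, "action", String.class, errors);
        requireType(event, "timestamp", Long.class, errors);
        return errors;
    }

    private static void requireType(Map<String, Object> event, String field,
                                    Class<?> type, List<String> errors) {
        Object value = event.get(field);
        if (value == null) {
            errors.add("missing field: " + field);
        } else if (!type.isInstance(value)) {
            errors.add(field + " has wrong type: " + value.getClass().getSimpleName());
        }
    }

    public static void main(String[] args) {
        Map<String, Object> good = Map.of(
            "user_id", "u42", "action", "click", "timestamp", 1700000000000L);
        Map<String, Object> bad = Map.of("user_id", "u42", "timestamp", "not-a-long");
        System.out.println(validate(good));  // empty list: event conforms
        System.out.println(validate(bad));   // missing action, wrong timestamp type
    }
}
```

Rejecting malformed events at the Transformer stage, before they reach a sink, keeps bad data out of downstream warehouses.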
Replay and Durability
- Kafka enables log retention + offset re-consumption
- S3 or HDFS for cold storage backup
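Replay works because the broker retains the log and each consumer tracks its own read offset, which it can rewind at any time (Kafka exposes this as seek). The toy in-memory class below illustrates that offset mechanics; it is a sketch, not the Kafka consumer API:

```java
import java.util.*;

// In-memory stand-in for a partition log with offset-based replay.
public class ReplayableLog {
    private final List<String> log = new ArrayList<>();  // retained records
    private int offset = 0;  // next record this consumer will read

    public void append(String record) { log.add(record); }

    // Read up to max records from the current offset, then advance (commit).
    public List<String> poll(int max) {
        int end = Math.min(offset + max, log.size());
        List<String> batch = new ArrayList<>(log.subList(offset, end));
        offset = end;
        return batch;
    }

    // Replay: rewind to an earlier offset, analogous to Kafka's seek().
    public void seek(int newOffset) { offset = newOffset; }

    public static void main(String[] args) {
        ReplayableLog log = new ReplayableLog();
        log.append("e1");
        log.append("e2");
        System.out.println(log.poll(10));  // first pass over the log
        log.seek(0);                       // rewind for replay
        System.out.println(log.poll(10));  // same records, re-consumed
    }
}
```

Because records stay in the retained log after consumption, a downstream bug can be fixed and the affected offset range simply re-read, which is why log retention plus cold storage in S3/HDFS gives strong data-loss protection.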
Observability
- Lag metrics per topic/partition
- Error rates by transformation stage
- End-to-end latency (ms)
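Consumer lag, the first metric above, is simply the gap between a partition's log-end offset and the consumer's committed offset. A small sketch of computing it per partition (the offset values are hypothetical; in practice they come from the broker's admin API):

```java
import java.util.*;

// Lag per partition = log-end offset minus committed consumer offset.
public class LagMonitor {
    public static Map<Integer, Long> lag(Map<Integer, Long> endOffsets,
                                         Map<Integer, Long> committed) {
        Map<Integer, Long> result = new TreeMap<>();
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            // A partition with no committed offset has consumed nothing yet.
            long consumed = committed.getOrDefault(e.getKey(), 0L);
            result.put(e.getKey(), e.getValue() - consumed);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Long> end = Map.of(0, 1500L, 1, 900L);
        Map<Integer, Long> committedOffsets = Map.of(0, 1400L, 1, 900L);
        System.out.println(lag(end, committedOffsets));  // {0=100, 1=0}
    }
}
```

A lag that grows steadily on one partition usually means a hot key or an undersized consumer group, so alerting on per-partition lag rather than a topic-wide average catches skew early.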
Security and Governance
- SSL/TLS for data in motion
- Schemas via Avro/Protobuf + registry
- Access control via IAM or ACLs
Infra and Tools
- Collector: NGINX, Fluent Bit, Fluentd
- Queue: Kafka, Pub/Sub, Kinesis
- Transform: Flink, Spark Streaming
- Storage: S3, Redshift, BigQuery, HDFS
Final Insight
A robust data ingestion pipeline relies on separation of concerns (ingestion, transformation, storage) with idempotent processing, schema validation, and durability built in. Message brokers like Kafka enable decoupled, scalable real-time pipelines.
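To make "idempotent processing" concrete: under at-least-once delivery, replays and retries deliver some events twice, so consumers deduplicate on a stable event ID before applying side effects. The sketch below uses an in-memory seen-set purely for illustration; production systems persist this state in a keyed state store or rely on a unique-key upsert in the sink:

```java
import java.util.*;

// Idempotent consumer sketch: applies each event ID at most once, so a
// replay after a crash or offset rewind does not double-count.
public class IdempotentConsumer {
    private final Set<String> seen = new HashSet<>();  // illustrative; not durable
    private long applied = 0;

    // Returns true if the event was applied, false if it was a duplicate.
    public boolean process(String eventId) {
        if (!seen.add(eventId)) return false;  // already applied: skip
        applied++;  // stand-in for the real side effect (write, count, etc.)
        return true;
    }

    public long appliedCount() { return applied; }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer();
        consumer.process("evt-1");
        consumer.process("evt-1");  // replayed duplicate: no effect
        consumer.process("evt-2");
        System.out.println("applied: " + consumer.appliedCount());
    }
}
```

With deduplication at the consumer, the broker can safely favor at-least-once delivery, which is what makes the replay mechanisms described earlier usable without corrupting downstream counts.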
