System Design FAQ: Top Questions
47. How would you design a Cron Scheduler System like Airflow or Kubernetes CronJobs?
A Cron Scheduler runs jobs at recurring intervals, e.g., every 5 minutes or at midnight UTC. It is commonly used for ETL pipelines, batch jobs, reporting, and maintenance tasks.
📋 Functional Requirements
- Register recurring jobs with cron expressions
- Trigger jobs accurately and reliably
- Track job history, retries, and status
- Prevent duplicate execution in distributed settings
📦 Non-Functional Requirements
- High availability
- Exactly-once or at-least-once execution guarantees (exactly-once typically requires idempotent jobs in practice)
- Alerting and observability
🏗️ Core Components
- Scheduler: Parses cron expressions and emits triggers when jobs are due (see the loop sketch after this list)
- Executor: Runs job in Docker/K8s/VM
- Metadata Store: Job config, logs, state
- Lock Manager: Ensures single execution per job
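A minimal sketch of how these components interact, assuming hypothetical fetch_due_jobs() and dispatch() helpers backed by the metadata store and executor; acquire_lock() is the Redis lock shown in the locking section below:

import time

def scheduler_loop(poll_interval=10):
    while True:
        # fetch_due_jobs() is a hypothetical query against the metadata store
        for job in fetch_due_jobs():
            # Skip the job if another scheduler instance already holds the lock
            if acquire_lock(job.id):
                dispatch(job)  # hypothetical hand-off to the executor
        time.sleep(poll_interval)  # polling granularity, not cron precision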
⏰ Cron Expression Example
# Run every hour at minute 0
0 * * * * /scripts/export.sh
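The same expression can be evaluated programmatically with croniter (listed under Tools below) to compute upcoming fire times; a minimal sketch:

from croniter import croniter
from datetime import datetime

# Starting from 10:30, the hourly expression fires at the top of each hour
itr = croniter("0 * * * *", datetime(2024, 1, 1, 10, 30))
print(itr.get_next(datetime))  # 2024-01-01 11:00:00
print(itr.get_next(datetime))  # 2024-01-01 12:00:00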
🗄️ PostgreSQL Schema Example
CREATE TABLE cron_jobs (
    id UUID PRIMARY KEY,
    name TEXT,
    cron_expr TEXT,
    command TEXT,
    last_run TIMESTAMP,
    status TEXT,
    retry_policy JSONB
);
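As a complement (or alternative) to the Redis lock below, the metadata store itself can arbitrate which scheduler instance claims a due job. A sketch using PostgreSQL's FOR UPDATE SKIP LOCKED via psycopg2, assuming a next_run TIMESTAMP column added to the table above:

import psycopg2

def claim_due_job(conn):
    # SKIP LOCKED makes concurrent schedulers pass over rows already
    # claimed in another transaction instead of blocking on them
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT id, command FROM cron_jobs
            WHERE next_run <= now() AND status != 'running'
            ORDER BY next_run
            LIMIT 1
            FOR UPDATE SKIP LOCKED
        """)
        row = cur.fetchone()
        if row:
            cur.execute("UPDATE cron_jobs SET status = 'running' WHERE id = %s",
                        (row[0],))
        return row  # (id, command) or None if nothing is due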
🔒 Locking with Redis SETNX
import redis

def acquire_lock(job_id):
    r = redis.Redis()
    # nx=True sets the key only if it doesn't already exist; ex=300 auto-expires
    # the lock after 5 minutes so a crashed worker can't hold it forever
    return r.set(f"lock:{job_id}", "1", nx=True, ex=300)
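The matching release should only delete a lock the worker still owns. A sketch using a per-run token; production code would do the compare-and-delete atomically in a Lua script:

import uuid
import redis

def run_with_lock(job_id, fn):
    r = redis.Redis()
    token = str(uuid.uuid4())  # unique per run, so we never delete another worker's lock
    if not r.set(f"lock:{job_id}", token, nx=True, ex=300):
        return  # another worker holds the lock; skip this run
    try:
        fn()
    finally:
        # Release only if we still own it (not atomic without a Lua script)
        if r.get(f"lock:{job_id}") == token.encode():
            r.delete(f"lock:{job_id}")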
⚙️ Airflow-style DAG Config (Python)
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

# Three-step ETL pipeline scheduled to run once per day
with DAG("daily_etl", schedule="@daily", start_date=datetime(2024, 1, 1)) as dag:
    t1 = BashOperator(task_id="extract", bash_command="python extract.py")
    t2 = BashOperator(task_id="transform", bash_command="python transform.py")
    t3 = BashOperator(task_id="load", bash_command="python load.py")
    t1 >> t2 >> t3  # run sequentially: extract, then transform, then load
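Retries from the functional requirements map onto Airflow's default_args; a minimal sketch (the DAG id is illustrative):

from airflow import DAG
from datetime import datetime, timedelta

default_args = {
    "retries": 3,                         # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}

with DAG(
    "daily_etl_with_retries",  # illustrative id
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    default_args=default_args,
    catchup=False,  # don't backfill every missed interval on first deploy
) as dag:
    ...  # same extract >> transform >> load tasks as above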
📈 Observability
- Success/failure rate over time
- Average execution duration
- Missed or overlapping runs
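These signals can be exported from the executor with the prometheus_client library; a minimal sketch (metric names are illustrative):

from prometheus_client import Counter, Histogram, start_http_server

job_runs = Counter("cron_job_runs_total", "Job runs by outcome", ["job", "status"])
job_duration = Histogram("cron_job_duration_seconds", "Job execution time", ["job"])

def record_run(job_name, fn):
    with job_duration.labels(job_name).time():  # records elapsed seconds
        try:
            fn()
            job_runs.labels(job_name, "success").inc()
        except Exception:
            job_runs.labels(job_name, "failure").inc()
            raise

start_http_server(8000)  # exposes /metrics for Prometheus to scrape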
🧰 Tools/Infra Used
- Scheduler: Quartz (Java), croniter (Python), K8s native
- Queue: Celery, RabbitMQ, Kubernetes Job CRDs
- Logs/metrics: ELK stack (logs), Prometheus + Grafana (metrics and dashboards)
📌 Final Insight
Cron scheduling must balance timing accuracy with job safety. Using a reliable lock mechanism and storing metadata for state tracking ensures safe concurrent job execution, especially in distributed environments.