System Design FAQ: Top Questions
20. How would you design a Web Crawler?
A Web Crawler is a distributed system that discovers and indexes web pages by fetching them, parsing out hyperlinks, and recursively following those links. It's the core of search engines and data aggregation platforms.
📋 Functional Requirements
- Start from seed URLs and recursively crawl
- Respect `robots.txt` rules and crawl rate limits
- Store and index fetched content
- Support URL deduplication
📦 Non-Functional Requirements
- Scalability to billions of pages
- Distributed and fault-tolerant crawling
- Politeness to target sites
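Politeness is usually enforced per domain: never hit the same host faster than some minimum delay. A minimal sketch of a per-domain delay tracker (the class name and the 1-second default are illustrative, not from the original):

```python
import time
from collections import defaultdict

class PolitenessLimiter:
    """Tracks the last fetch time per domain and enforces a minimum delay."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.last_fetch = defaultdict(float)  # domain -> last fetch timestamp

    def wait_time(self, domain):
        # Seconds the caller should sleep before fetching from this domain again
        elapsed = time.monotonic() - self.last_fetch[domain]
        return max(0.0, self.min_delay - elapsed)

    def record_fetch(self, domain):
        self.last_fetch[domain] = time.monotonic()
```

A worker would call `wait_time` before each fetch and `record_fetch` after; in a distributed setup this state would live in a shared store rather than process memory.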
🏗️ Core Components
- URL Frontier: Queue of discovered URLs to crawl
- Crawler Workers: Fetch and parse HTML from URLs
- Parser: Extracts links, metadata, content
- Deduplicator: Filters visited URLs via hash store
- Indexer: Extracts and stores text, links, and metadata
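One way to see how these components fit together is a single-threaded sketch of the crawl loop; `fetch`, `parse_links`, and `index` are hypothetical stand-ins for the real worker, parser, and indexer:

```python
from collections import deque

def crawl(seed_urls, fetch, parse_links, index, max_pages=100):
    frontier = deque(seed_urls)   # URL Frontier
    seen = set(seed_urls)         # Deduplicator
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()
        html = fetch(url)               # Crawler Worker
        if html is None:
            continue                    # fetch failed; skip this URL
        index(url, html)                # Indexer
        for link in parse_links(html):  # Parser
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        pages += 1
    return pages
```

The production version replaces the in-memory deque and set with the distributed structures described below (a Redis queue and a Bloom filter), and runs many workers concurrently.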
🔁 URL Queue with Redis
```python
import redis

r = redis.Redis()

def enqueue_url(url):
    r.lpush("url_queue", url)  # new URLs go on the left

def dequeue_url():
    # popped from the right (FIFO); returns None when the queue is empty
    return r.rpop("url_queue")
```
🛑 Respect robots.txt (Python)
```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse robots.txt

# crawl_page is the crawler's own fetch routine
if rp.can_fetch("*", "https://example.com/page"):
    crawl_page("https://example.com/page")
```
🧠 URL Deduplication with Bloom Filter
```python
from pybloom_live import BloomFilter

seen = BloomFilter(capacity=1_000_000, error_rate=0.001)

# `url` and `process` come from the crawl loop; Bloom filters can give
# false positives, so a URL may rarely be skipped, but never crawled twice.
if url not in seen:
    seen.add(url)
    process(url)
```
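Deduplication works best when URLs are canonicalized first, so trivially different spellings of the same page don't slip past the filter. A possible normalization step (the exact rules chosen here are an assumption, not from the original):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Canonicalize a URL before dedup: lowercase the host, drop fragments,
    strip default ports and trailing slashes."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    if (scheme, parts.port) in (("http", 80), ("https", 443)):
        netloc = parts.hostname  # drop the redundant default port
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```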
📦 Content Storage (ElasticSearch)
```
PUT /webpages/_doc/12345
{
  "url": "https://example.com/about",
  "title": "About Us",
  "text": "We are a company that...",
  "timestamp": "2025-06-11T00:00:00Z"
}
```
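The same PUT request can be issued from the crawler itself. A sketch using only the standard library (the host, index name, and document shape are assumptions matching the example above):

```python
import json
import urllib.request

def build_index_request(es_host, index, doc_id, doc):
    """Build the PUT request that stores one crawled page in Elasticsearch."""
    return urllib.request.Request(
        url=f"{es_host}/{index}/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

def index_page(es_host, index, doc_id, doc):
    # Send the request and return Elasticsearch's JSON response
    with urllib.request.urlopen(build_index_request(es_host, index, doc_id, doc)) as resp:
        return json.loads(resp.read())
```

In production you would more likely use the official Elasticsearch client with bulk indexing rather than one request per page.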
📈 Monitoring
- Crawl rate per domain
- Errors (timeouts, 404, throttling)
- CPU/network utilization of crawler nodes
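A minimal in-memory stand-in for these metrics (a real deployment would export them to a monitoring system such as Prometheus; the class and method names here are illustrative):

```python
from collections import Counter

class CrawlStats:
    """Counts fetches per domain and errors by kind."""

    def __init__(self):
        self.fetches_per_domain = Counter()
        self.errors = Counter()

    def record_fetch(self, domain):
        self.fetches_per_domain[domain] += 1

    def record_error(self, kind):
        # kind: e.g. "timeout", "http_404", "throttled"
        self.errors[kind] += 1
```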
📌 Final Insight
Building a scalable and polite web crawler requires asynchronous fetching, smart URL deduplication,
adherence to robots.txt, and distributed processing. Elastic queues, content-based indexing,
and graceful failure handling are key for production readiness.
