System Design FAQ: Top Questions
20. How would you design a Web Crawler?
A Web Crawler is a distributed system that discovers and indexes web pages by fetching them, parsing out hyperlinks, and recursively following those links. It's the core of search engines and data aggregation platforms.
📋 Functional Requirements
- Start from seed URLs and recursively crawl
- Respect `robots.txt` rules and crawl rate limits
- Store and index fetched content
- Support URL deduplication
📦 Non-Functional Requirements
- Scalability to billions of pages
- Distributed and fault-tolerant crawling
- Politeness to target sites
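Politeness is usually enforced per domain: never hit the same host faster than some minimum delay. A minimal sketch of a per-domain delay tracker (the class name and the 1-second default are illustrative, not from the original):

```python
import time
from collections import defaultdict

class PolitenessLimiter:
    """Tracks the last fetch time per domain and enforces a minimum delay."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.last_fetch = defaultdict(float)  # domain -> last fetch timestamp

    def wait_time(self, domain):
        # Seconds the caller should sleep before fetching from this domain again
        elapsed = time.monotonic() - self.last_fetch[domain]
        return max(0.0, self.min_delay - elapsed)

    def record_fetch(self, domain):
        self.last_fetch[domain] = time.monotonic()
```

A worker would call `wait_time` before each fetch and `record_fetch` after; in a distributed setup this state would live in a shared store rather than process memory.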
🏗️ Core Components
- URL Frontier: Queue of discovered URLs to crawl
- Crawler Workers: Fetch and parse HTML from URLs
- Parser: Extracts links, metadata, content
- Deduplicator: Filters visited URLs via hash store
- Indexer: Extracts and stores text, links, and metadata
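One way to see how these components fit together is a single-threaded sketch of the crawl loop; `fetch`, `parse_links`, and `index` are hypothetical stand-ins for the real worker, parser, and indexer:

```python
from collections import deque

def crawl(seed_urls, fetch, parse_links, index, max_pages=100):
    frontier = deque(seed_urls)   # URL Frontier
    seen = set(seed_urls)         # Deduplicator
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()
        html = fetch(url)               # Crawler Worker
        if html is None:
            continue                    # fetch failed; skip this URL
        index(url, html)                # Indexer
        for link in parse_links(html):  # Parser
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        pages += 1
    return pages
```

The production version replaces the in-memory deque and set with the distributed structures described below (a Redis queue and a Bloom filter), and runs many workers concurrently.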
🔁 URL Queue with Redis
```python
import redis

r = redis.Redis()

def enqueue_url(url):
    r.lpush("url_queue", url)  # new URLs go on the left

def dequeue_url():
    # popped from the right (FIFO); returns None when the queue is empty
    return r.rpop("url_queue")
```
🛑 Respect robots.txt (Python)
```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse robots.txt

# crawl_page is the crawler's own fetch routine
if rp.can_fetch("*", "https://example.com/page"):
    crawl_page("https://example.com/page")
```
🧠 URL Deduplication with Bloom Filter
```python
from pybloom_live import BloomFilter

seen = BloomFilter(capacity=1_000_000, error_rate=0.001)

# `url` and `process` come from the crawl loop; Bloom filters can give
# false positives, so a URL may rarely be skipped, but never crawled twice.
if url not in seen:
    seen.add(url)
    process(url)
```
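Deduplication works best when URLs are canonicalized first, so trivially different spellings of the same page don't slip past the filter. A possible normalization step (the exact rules chosen here are an assumption, not from the original):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Canonicalize a URL before dedup: lowercase the host, drop fragments,
    strip default ports and trailing slashes."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    if (scheme, parts.port) in (("http", 80), ("https", 443)):
        netloc = parts.hostname  # drop the redundant default port
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```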
📦 Content Storage (ElasticSearch)
```
PUT /webpages/_doc/12345
{
  "url": "https://example.com/about",
  "title": "About Us",
  "text": "We are a company that...",
  "timestamp": "2025-06-11T00:00:00Z"
}
```
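The same PUT request can be issued from the crawler itself. A sketch using only the standard library (the host, index name, and document shape are assumptions matching the example above):

```python
import json
import urllib.request

def build_index_request(es_host, index, doc_id, doc):
    """Build the PUT request that stores one crawled page in Elasticsearch."""
    return urllib.request.Request(
        url=f"{es_host}/{index}/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

def index_page(es_host, index, doc_id, doc):
    # Send the request and return Elasticsearch's JSON response
    with urllib.request.urlopen(build_index_request(es_host, index, doc_id, doc)) as resp:
        return json.loads(resp.read())
```

In production you would more likely use the official Elasticsearch client with bulk indexing rather than one request per page.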
📈 Monitoring
- Crawl rate per domain
- Errors (timeouts, 404, throttling)
- CPU/network utilization of crawler nodes
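A minimal in-memory stand-in for these metrics (a real deployment would export them to a monitoring system such as Prometheus; the class and method names here are illustrative):

```python
from collections import Counter

class CrawlStats:
    """Counts fetches per domain and errors by kind."""

    def __init__(self):
        self.fetches_per_domain = Counter()
        self.errors = Counter()

    def record_fetch(self, domain):
        self.fetches_per_domain[domain] += 1

    def record_error(self, kind):
        # kind: e.g. "timeout", "http_404", "throttled"
        self.errors[kind] += 1
```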
📌 Final Insight
Building a scalable and polite web crawler requires asynchronous fetching, smart URL deduplication,
adherence to robots.txt, and distributed processing. Elastic queues, content-based indexing,
and graceful failure handling are key for production readiness.
