System Design FAQ: Top Questions

20. How would you design a Web Crawler?

A Web Crawler is a distributed system that discovers and indexes web pages by fetching them, extracting their hyperlinks, and recursively following those links. It is the core of search engines and data aggregation platforms.

📋 Functional Requirements

  • Start from seed URLs and recursively crawl
  • Respect robots.txt rules and crawl rate limits
  • Store and index fetched content
  • Support URL deduplication

📦 Non-Functional Requirements

  • Scalability to billions of pages
  • Distributed and fault-tolerant crawling
  • Politeness to target sites
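
Politeness is usually enforced per domain: a crawler must leave a minimum gap between requests to the same host. A minimal in-process sketch (the class and delay value are illustrative, not part of any standard library):

```python
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_fetch = {}  # domain -> monotonic timestamp of last request

    def wait(self, url):
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_fetch.get(domain, 0.0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_fetch[domain] = time.monotonic()

limiter = DomainRateLimiter(min_delay=0.1)
limiter.wait("https://example.com/a")  # first hit on this domain: no wait
limiter.wait("https://example.com/b")  # same domain: sleeps ~0.1s
```

In a distributed crawler the same idea is typically implemented with per-domain queues so that one slow host never blocks workers fetching other domains.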

🏗️ Core Components

  • URL Frontier: Queue of discovered URLs to crawl
  • Crawler Workers: Fetch and parse HTML from URLs
  • Parser: Extracts links, metadata, content
  • Deduplicator: Filters visited URLs via hash store
  • Indexer: Extracts and stores text, links, and metadata
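
Tying these components together, one crawl iteration can be sketched as a single-process loop. Everything here is an in-memory stand-in for the distributed pieces above: a `deque` for the frontier, a `set` for the deduplicator, a dict for the indexer, and a pluggable `fetch` callable for the workers:

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Parser component: collects href attributes from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, fetch, max_pages=100):
    frontier = deque(seed_urls)   # URL Frontier (BFS order)
    seen = set(seed_urls)         # Deduplicator
    pages = {}                    # stand-in for the Indexer
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)         # Crawler Worker (real network I/O in production)
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

The production system replaces each of these with the distributed equivalents described in the rest of this answer: Redis for the frontier, a Bloom filter for `seen`, and Elasticsearch for `pages`.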

🔁 URL Queue with Redis


import redis

r = redis.Redis()

def enqueue_url(url):
    # Push newly discovered URLs onto the left of the list
    r.lpush("url_queue", url)

def dequeue_url():
    # Pop from the right for FIFO ordering; returns None when the queue is empty
    url = r.rpop("url_queue")
    return url.decode() if url else None

🛑 Respect robots.txt (Python)


from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses robots.txt

# Check permission for our user agent before fetching
if rp.can_fetch("*", "https://example.com/page"):
    crawl_page("https://example.com/page")  # crawl_page is your fetch routine

🧠 URL Deduplication with Bloom Filter


from pybloom_live import BloomFilter  # third-party: pip install pybloom-live

# Sized for ~1M URLs with a 0.1% false-positive rate; false positives
# mean a few new URLs may be skipped, which is acceptable for crawling
seen = BloomFilter(capacity=1_000_000, error_rate=0.001)

if url not in seen:   # url comes from the frontier
    seen.add(url)
    process(url)      # fetch and parse the page
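
The Bloom filter only helps if equivalent URLs hash to the same key, so crawlers normalize URLs before the dedup check. A minimal sketch using only the standard library (real crawlers apply more rules, e.g. sorting query parameters):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Canonicalize a URL so trivially different forms dedupe together."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Drop default ports
    if (scheme == "http" and netloc.endswith(":80")) or \
       (scheme == "https" and netloc.endswith(":443")):
        netloc = netloc.rsplit(":", 1)[0]
    path = parts.path or "/"   # empty path becomes "/"
    # Drop the fragment (never sent to the server); keep the query string
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

With this in place, `HTTP://Example.COM:80/About#team` and `http://example.com/About` dedupe to the same entry.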

📦 Content Storage (ElasticSearch)


PUT /webpages/_doc/12345
{
  "url": "https://example.com/about",
  "title": "About Us",
  "text": "We are a company that...",
  "timestamp": "2025-06-11T00:00:00Z"
}
        
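
The same document can be built and indexed from Python. A hedged sketch (the index name, document ID, and fields mirror the REST example above; the commented-out `requests` call assumes a locally running Elasticsearch node and omits error handling):

```python
import json
from datetime import datetime, timezone

def build_page_doc(url, title, text):
    """Build the JSON document for the /webpages index."""
    return {
        "url": url,
        "title": title,
        "text": text,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

doc = build_page_doc("https://example.com/about", "About Us", "We are a company that...")
body = json.dumps(doc)

# To actually index it (requires a running Elasticsearch node):
# import requests
# requests.put("http://localhost:9200/webpages/_doc/12345",
#              data=body, headers={"Content-Type": "application/json"})
```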

📈 Monitoring

  • Crawl rate per domain
  • Errors (timeouts, 404, throttling)
  • CPU/network utilization of crawler nodes
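
These metrics can start as simple in-process counters before graduating to a dedicated system such as Prometheus. An illustrative sketch of per-domain crawl and error counts (names are made up for this example):

```python
from collections import Counter
from urllib.parse import urlparse

crawl_counts = Counter()   # pages fetched per domain
error_counts = Counter()   # errors by HTTP status (404, 429 throttling, ...)

def record_fetch(url, status):
    domain = urlparse(url).netloc
    crawl_counts[domain] += 1
    if status >= 400:
        error_counts[status] += 1

record_fetch("https://example.com/a", 200)
record_fetch("https://example.com/b", 404)
```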

📌 Final Insight

Building a scalable and polite web crawler requires asynchronous fetching, smart URL deduplication, adherence to robots.txt, and distributed processing. Elastic queues, content-based indexing, and graceful failure handling are key for production readiness.
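
Asynchronous fetching is what lets a single worker node keep many connections in flight. A sketch of the worker-pool pattern with `asyncio` (the fetch is simulated with a sleep; in production it would be a real async HTTP client such as aiohttp):

```python
import asyncio

async def fetch(url):
    # Simulated network I/O; replace with a real async HTTP client
    await asyncio.sleep(0.01)
    return f"<html>content of {url}</html>"

async def worker(queue, results):
    while True:
        url = await queue.get()
        try:
            results[url] = await fetch(url)
        finally:
            queue.task_done()

async def crawl_async(urls, concurrency=10):
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = {}
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(concurrency)]
    await queue.join()   # wait until every URL has been processed
    for w in workers:
        w.cancel()       # workers loop forever; cancel once the queue drains
    await asyncio.gather(*workers, return_exceptions=True)
    return results

results = asyncio.run(crawl_async([f"https://example.com/{i}" for i in range(20)]))
```

With `concurrency=10`, the 20 simulated fetches complete in roughly two round-trips of wall-clock time instead of twenty, which is the whole point of the asynchronous design.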