Python Advanced - Web Scraping with Scrapy
Scraping websites using the Scrapy framework in Python
Scrapy is a powerful and versatile web scraping framework for Python. It provides all the tools you need to extract data from websites, process the data, and store it in your preferred format. This tutorial explores how to use Scrapy to scrape websites efficiently.
Key Points:
- Scrapy is a powerful web scraping framework for Python.
- It provides tools for extracting, processing, and storing data from websites.
- Scrapy is designed to be fast, simple, and extensible.
Installing Scrapy
To use Scrapy, you need to install it using pip:
pip install scrapy
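After installing, you can confirm that Scrapy is available by asking it for its version:
scrapy version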
Creating a Scrapy Project
You can create a new Scrapy project using the startproject command. Here is an example:
scrapy startproject myproject
This command creates a new directory called myproject with the basic structure of a Scrapy project.
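The generated layout typically looks like this (exact file names can vary slightly between Scrapy versions):
myproject/
    scrapy.cfg            # deploy/configuration file
    myproject/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spiders go here
            __init__.py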
Creating a Spider
A spider is a class that defines how Scrapy should crawl a website and extract data. Here is an example of a simple spider:
# myproject/spiders/example_spider.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Extract the text, author, and tags from each quote block
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow the pagination link, if there is one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
In this example, the spider starts at the quotes website and extracts the text, author, and tags of each quote. It also follows the next page link to continue scraping.
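Each yield produces a plain Python dictionary. For the first quote on the site it would be shaped roughly like this (the values shown are illustrative and depend on the page content):
{
    'text': '"The world as we have created it is a process of our thinking. ..."',
    'author': 'Albert Einstein',
    'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
}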
Running the Spider
You can run the spider using the scrapy crawl command. Here is an example:
scrapy crawl example
This command runs the example spider defined in the previous section.
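If you prefer to launch the spider from a Python script instead of the command line, Scrapy's CrawlerProcess supports that. A minimal sketch, assuming ExampleSpider is importable from the project created above:
# run_spider.py
from scrapy.crawler import CrawlerProcess
from myproject.spiders.example_spider import ExampleSpider

# FEEDS (Scrapy 2.1+) tells Scrapy where and how to export the scraped items
process = CrawlerProcess(settings={'FEEDS': {'quotes.json': {'format': 'json'}}})
process.crawl(ExampleSpider)   # schedule the spider
process.start()                # block until the crawl finishes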
Storing the Scraped Data
Scrapy allows you to store the scraped data in various formats, including JSON, CSV, and XML. Here is an example of storing the data in a JSON file:
scrapy crawl example -o quotes.json
This command runs the example spider and stores the scraped data in a file called quotes.json.
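The output format is inferred from the file extension, so the same pattern works for the other feed formats. In Scrapy 2.x, -o appends to an existing file while -O overwrites it:
scrapy crawl example -o quotes.csv
scrapy crawl example -o quotes.xml
scrapy crawl example -o quotes.jsonl
scrapy crawl example -O quotes.json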
Using Scrapy Shell
Scrapy Shell is an interactive shell for trying out XPath and CSS expressions to extract data from web pages. Here is an example of using Scrapy Shell:
scrapy shell 'http://quotes.toscrape.com'
This command opens Scrapy Shell with the specified URL loaded. You can then use it to experiment with different extraction techniques.
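Inside the shell, the fetched page is available as a response object, so you can try selectors interactively. For example:
response.css('title::text').get()                        # page title as a string
response.css('div.quote span.text::text').getall()       # all quote texts on the page
response.xpath('//small[@class="author"]/text()').get()  # first author, using XPath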
Handling Requests and Responses
Scrapy allows you to handle requests and responses easily. Here is an example of sending a POST request and handling the response:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://httpbin.org/post']

    def start_requests(self):
        # Send a POST request with form data instead of the default GET
        for url in self.start_urls:
            yield scrapy.FormRequest(url, formdata={'key': 'value'}, callback=self.parse)

    def parse(self, response):
        # httpbin echoes the request back, so the body shows the submitted form data
        self.log(response.text)
In this example, the spider sends a POST request to httpbin.org and logs the response text.
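Because httpbin.org returns a JSON body that echoes the request, the callback can also parse it rather than just logging raw text. A small sketch using the standard json module:
import json

# Inside ExampleSpider: parse the echoed JSON instead of logging the raw body
def parse(self, response):
    data = json.loads(response.text)
    # httpbin echoes submitted form fields under the 'form' key
    self.log(f"Submitted form data: {data['form']}")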
Handling Errors and Retries
Scrapy provides built-in support for handling errors and retries. Here is an example of handling errors and setting retries:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://httpbin.org/status/404']

    def start_requests(self):
        # Attach an errback so failed requests are routed to self.errback
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.errback)

    def parse(self, response):
        self.log('Response received.')

    def errback(self, failure):
        # Inspect the failure to find out what kind of error occurred
        if failure.check(HttpError):
            response = failure.value.response
            self.log(f'HTTP error occurred: {response}')
        elif failure.check(DNSLookupError):
            request = failure.request
            self.log(f'DNS lookup error occurred: {request}')
        elif failure.check(TimeoutError):
            request = failure.request
            self.log(f'Timeout error occurred: {request}')
In this example, the spider handles HTTP errors, DNS lookup errors, and timeout errors using the errback method.
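Retries themselves are handled by Scrapy's built-in RetryMiddleware and are controlled through settings. Typical values in settings.py look like this (the defaults vary between Scrapy versions):
# myproject/settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3                                     # extra attempts per failed request
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]   # status codes that trigger a retry
DOWNLOAD_TIMEOUT = 15                               # seconds before a request times out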
Using Scrapy Pipelines
Scrapy Pipelines allow you to process the scraped data before storing it. Here is an example of a simple pipeline that converts the text of quotes to uppercase:
# myproject/pipelines.py
class UppercasePipeline:
    def process_item(self, item, spider):
        # Convert the quote text to uppercase before the item is stored
        item['text'] = item['text'].upper()
        return item

# Enable the pipeline in settings.py
# myproject/settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.UppercasePipeline': 300,
}
In this example, the pipeline processes each item to convert the text to uppercase before storing it.
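Pipelines can also filter out bad items. As a sketch (not part of the generated project), a validation pipeline might drop quotes that have no text by raising scrapy.exceptions.DropItem:
# myproject/pipelines.py
from scrapy.exceptions import DropItem

class ValidateQuotePipeline:
    def process_item(self, item, spider):
        if not item.get('text'):
            # Discard items that are missing the quote text
            raise DropItem('Missing text in item')
        return item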
Summary
In this tutorial, you learned about scraping websites using the Scrapy framework in Python. Scrapy provides powerful tools for extracting, processing, and storing data from websites. Understanding how to create and run spiders, handle requests and responses, manage errors, and use pipelines is essential for efficient web scraping with Scrapy.