Python Advanced - Web Scraping with Scrapy
Scraping websites using the Scrapy framework in Python
Scrapy is a powerful and versatile web scraping framework for Python. It provides all the tools you need to extract data from websites, process the data, and store it in your preferred format. This tutorial explores how to use Scrapy to scrape websites efficiently.
Key Points:
- Scrapy is a powerful web scraping framework for Python.
- It provides tools for extracting, processing, and storing data from websites.
- Scrapy is designed to be fast, simple, and extensible.
Installing Scrapy
To use Scrapy, you need to install it using pip:
pip install scrapy
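After installing, you can confirm that Scrapy is available by asking it for its version:
scrapy version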
Creating a Scrapy Project
You can create a new Scrapy project using the startproject command. Here is an example:
scrapy startproject myproject
This command creates a new directory called myproject with the basic structure of a Scrapy project.
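The generated layout typically looks like this (exact file names can vary slightly between Scrapy versions):
myproject/
    scrapy.cfg            # deploy/configuration file
    myproject/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spiders go here
            __init__.py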
Creating a Spider
A spider is a class that defines how Scrapy should crawl a website and extract data. Here is an example of a simple spider:
# myproject/spiders/example_spider.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Extract the text, author, and tags from each quote block
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow the pagination link, if there is one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
In this example, the spider starts at the quotes website and extracts the text, author, and tags of each quote. It also follows the next page link to continue scraping.
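Each yield produces a plain Python dictionary. For the first quote on the site it would be shaped roughly like this (the values shown are illustrative and depend on the page content):
{
    'text': '"The world as we have created it is a process of our thinking. ..."',
    'author': 'Albert Einstein',
    'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
}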
Running the Spider
You can run the spider using the scrapy crawl command. Here is an example:
scrapy crawl example
This command runs the example spider defined in the previous section.
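If you prefer to launch the spider from a Python script instead of the command line, Scrapy's CrawlerProcess supports that. A minimal sketch, assuming ExampleSpider is importable from the project created above:
# run_spider.py
from scrapy.crawler import CrawlerProcess
from myproject.spiders.example_spider import ExampleSpider

# FEEDS (Scrapy 2.1+) tells Scrapy where and how to export the scraped items
process = CrawlerProcess(settings={'FEEDS': {'quotes.json': {'format': 'json'}}})
process.crawl(ExampleSpider)   # schedule the spider
process.start()                # block until the crawl finishes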
Storing the Scraped Data
Scrapy allows you to store the scraped data in various formats, including JSON, CSV, and XML. Here is an example of storing the data in a JSON file:
scrapy crawl example -o quotes.json
This command runs the example spider and stores the scraped data in a file called quotes.json.
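The output format is inferred from the file extension, so the same pattern works for the other feed formats. In Scrapy 2.x, -o appends to an existing file while -O overwrites it:
scrapy crawl example -o quotes.csv
scrapy crawl example -o quotes.xml
scrapy crawl example -o quotes.jsonl
scrapy crawl example -O quotes.json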
Using Scrapy Shell
Scrapy Shell is an interactive shell for trying out XPath and CSS expressions to extract data from web pages. Here is an example of using Scrapy Shell:
scrapy shell 'http://quotes.toscrape.com'
This command opens Scrapy Shell with the specified URL loaded. You can then use it to experiment with different extraction techniques.
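Inside the shell, the fetched page is available as a response object, so you can try selectors interactively. For example:
response.css('title::text').get()                        # page title as a string
response.css('div.quote span.text::text').getall()       # all quote texts on the page
response.xpath('//small[@class="author"]/text()').get()  # first author, using XPath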
Handling Requests and Responses
Scrapy allows you to handle requests and responses easily. Here is an example of sending a POST request and handling the response:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://httpbin.org/post']

    def start_requests(self):
        # Send a POST request with form data instead of the default GET
        for url in self.start_urls:
            yield scrapy.FormRequest(url, formdata={'key': 'value'}, callback=self.parse)

    def parse(self, response):
        # httpbin echoes the request back, so the body shows the submitted form data
        self.log(response.text)
In this example, the spider sends a POST request to httpbin.org and logs the response text.
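Because httpbin.org returns a JSON body that echoes the request, the callback can also parse it rather than just logging raw text. A small sketch using the standard json module:
import json

# Inside ExampleSpider: parse the echoed JSON instead of logging the raw body
def parse(self, response):
    data = json.loads(response.text)
    # httpbin echoes submitted form fields under the 'form' key
    self.log(f"Submitted form data: {data['form']}")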
Handling Errors and Retries
Scrapy provides built-in support for handling errors and retries. Here is an example of handling errors and setting retries:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://httpbin.org/status/404']

    def start_requests(self):
        # Attach an errback so failed requests are routed to self.errback
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.errback)

    def parse(self, response):
        self.log('Response received.')

    def errback(self, failure):
        # Inspect the failure to find out what kind of error occurred
        if failure.check(HttpError):
            response = failure.value.response
            self.log(f'HTTP error occurred: {response}')
        elif failure.check(DNSLookupError):
            request = failure.request
            self.log(f'DNS lookup error occurred: {request}')
        elif failure.check(TimeoutError):
            request = failure.request
            self.log(f'Timeout error occurred: {request}')
In this example, the spider handles HTTP errors, DNS lookup errors, and timeout errors using the errback method.
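Retries themselves are handled by Scrapy's built-in RetryMiddleware and are controlled through settings. Typical values in settings.py look like this (the defaults vary between Scrapy versions):
# myproject/settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3                                     # extra attempts per failed request
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]   # status codes that trigger a retry
DOWNLOAD_TIMEOUT = 15                               # seconds before a request times out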
Using Scrapy Pipelines
Scrapy Pipelines allow you to process the scraped data before storing it. Here is an example of a simple pipeline that converts the text of quotes to uppercase:
# myproject/pipelines.py
class UppercasePipeline:
    def process_item(self, item, spider):
        # Convert the quote text to uppercase before the item is stored
        item['text'] = item['text'].upper()
        return item

# Enable the pipeline in settings.py
# myproject/settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.UppercasePipeline': 300,
}
In this example, the pipeline processes each item to convert the text to uppercase before storing it.
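Pipelines can also filter out bad items. As a sketch (not part of the generated project), a validation pipeline might drop quotes that have no text by raising scrapy.exceptions.DropItem:
# myproject/pipelines.py
from scrapy.exceptions import DropItem

class ValidateQuotePipeline:
    def process_item(self, item, spider):
        if not item.get('text'):
            # Discard items that are missing the quote text
            raise DropItem('Missing text in item')
        return item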
Summary
In this tutorial, you learned about scraping websites using the Scrapy framework in Python. Scrapy provides powerful tools for extracting, processing, and storing data from websites. Understanding how to create and run spiders, handle requests and responses, manage errors, and use pipelines is essential for efficient web scraping with Scrapy.