Scrapy installation and basic usage


Module installation


To install Scrapy, you first need its dependency Twisted, which in turn needs a C++ build environment

If a Twisted error occurs during pip install scrapy:

Download the Twisted file matching your Python version at ~ gohlke/pythonlibs/ (cp36 stands for Python 3.6)

Then, in CMD, go to the directory containing the Twisted file and run pip install followed by the Twisted filename

Finally, run pip install scrapy



Ubuntu installation considerations

Do not use the python-scrapy package provided by Ubuntu; it is usually too old and slow to catch up with the latest Scrapy

To install Scrapy on an Ubuntu (or Ubuntu-based) system, you need to install these dependencies

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

If you want to install Scrapy on Python 3, you also need the Python 3 development headers

sudo apt-get install python3-dev

Inside a virtualenv, you can then install Scrapy with pip: pip install scrapy




Simple use

New project

scrapy startproject project_name



Writing crawler

First way: create a single file

Create a class that inherits from scrapy.Spider and defines three attributes

name: the name of the spider, which must be unique

start_urls: the list of initial URLs

parse(self, response) method: called with the response once each initial URL has been downloaded

The parse method has two jobs

1. Parse the response, encapsulate it as an item object and return it

2. Extract the new URL to download, create a new request, and return it

Command to run a single-file spider: scrapy runspider followed by the file name


Second way: create by command

scrapy genspider spider_name domain_name



Run crawler

scrapy list: view the spiders that can be run

scrapy crawl spider_name (the value of the name attribute)



Tracking links

Create a class variable, page_num, to record the page number currently being crawled. In the parse function, extract the information, then add 1 to page_num through the spider object (self), build the URL of the next page, and create and return a scrapy.Request object for it

If no information can be extracted from the response, we conclude that the last page has been passed, and the parse function simply returns


Define item pipeline

After parsing out the information we need, the parse function can package it into a dictionary or a scrapy.Item object and return it

The object is sent to the item pipeline, which processes it by running several components in sequence. Each item pipeline component is a Python class that implements a few simple methods

Each component receives an item, operates on it, and decides whether the item should continue through the pipeline or be dropped and processed no further


Typical use of item pipeline:

Clean up HTML data

Validate scraped data (check that the item contains certain fields)

Check for duplicates (and delete them)

Persist crawled items



Write pipeline class

def open_spider(self, spider): executed when the spider starts

def close_spider(self, spider): executed when the spider is closed

def process_item(self, item, spider): processes each item passed in and returns it

To activate a pipeline component, you must add it to the ITEM_PIPELINES setting in the settings file

The integer values assigned to classes in this setting determine the order in which they run: from the lower value to the higher value
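As an illustration of the three methods and the setting, here is a small sketch of a pipeline that persists items as JSON lines. The class name, the output path, and the module path in the setting are hypothetical.

```python
import json


class JsonLinesPipeline:
    """Persist each item as one JSON object per line."""

    # Hypothetical output location, relative to where the crawl is started.
    path = "items.jsonl"

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider closes: release the file.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item; return it so later components receive it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


# In the settings file: lower numbers run first.
# The module path assumes a project called project_name.
ITEM_PIPELINES = {
    "project_name.pipelines.JsonLinesPipeline": 300,
}
```

To discard an item instead of passing it on, process_item can raise scrapy.exceptions.DropItem.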



Define item

Scrapy provides the Item class

Edit the items.py file in the project directory

Import the item class we defined into the spider, instantiate it, and use the instance to structure the data




Operation process

Data flow

The engine first gets the initial requests from the spider

The engine puts the requests into the scheduler and asks for the next request to crawl

The scheduler returns the next request to crawl to the engine

The engine sends the request to the downloader, passing through all the downloader middlewares in turn

Once the page has been downloaded, the downloader returns a response containing the page data to the engine, again passing through all the downloader middlewares in turn

The engine receives the response from the downloader and sends it to the spider for parsing, passing through all the spider middlewares in turn

The spider processes the response, extracts items and generates new requests, and returns them to the engine

The engine sends the extracted items to the item pipeline components and the new requests to the scheduler, then asks the scheduler for the next request

This process repeats until the scheduler has no more requests
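The loop above can be modelled with a toy, synchronous sketch. This is not Scrapy's implementation: real Scrapy is asynchronous, and the middlewares are left out here. The function run_engine and all stand-in callables are hypothetical.

```python
from collections import deque


def run_engine(start_requests, download, parse, pipeline):
    """Toy model of the engine loop; requests are strings, items are dicts."""
    scheduler = deque(start_requests)     # initial requests go to the scheduler
    while scheduler:                      # repeat until no request is left
        request = scheduler.popleft()     # scheduler returns the next request
        response = download(request)      # downloader fetches the page
        for result in parse(response):    # spider parses the response
            if isinstance(result, dict):
                pipeline(result)          # items go to the item pipeline
            else:
                scheduler.append(result)  # new requests go back to the scheduler


# Stand-in site: page "/1" links to "/2", which links nowhere.
pages = {"/1": ["/2"], "/2": []}
stored = []
run_engine(
    ["/1"],
    download=lambda url: url,
    parse=lambda url: [{"url": url}] + pages[url],
    pipeline=stored.append,
)
print(stored)
```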




Spiders: process responses and extract the needed data or further requests to crawl

Engine: controls the data flow between all components of the system and triggers events when certain operations occur

Scheduler: receives requests from the engine and queues them

Downloader: downloads the pages for the requests sent by the engine

Item pipelines: process the items returned by the spider, for example by storing them




Downloader middleware

Downloader middleware is a hook between the engine and the downloader. It processes requests going from the engine to the downloader and responses going from the downloader back to the engine

Use downloader middleware to do the following

Process a request before it is sent to the downloader (that is, before Scrapy sends the request to the website)

Change a received response before it is passed to the spider

Send a new request instead of passing the received response to the spider

Pass a response to the spider without fetching the web page

Silently drop some requests



Spider middleware

Spider middleware is a hook between the engine and the spider. It can process the incoming responses and the outgoing items and requests

Use spider middleware to do the following

Post-process the requests or items output by spider callbacks

Post-process start_requests

Handle spider exceptions

Call errback instead of callback for some requests, based on the response content
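A sketch of a spider middleware that post-processes callback output and handles callback exceptions. The class name and the required field are hypothetical, and the sketch assumes items are plain dictionaries and callbacks are ordinary (non-async) generators.

```python
class RequiredFieldSpiderMiddleware:
    # Hypothetical field that every dict item must carry.
    required_field = "text"

    def process_spider_output(self, response, result, spider):
        # `result` is the iterable of items and requests from the callback.
        for element in result:
            if isinstance(element, dict) and not element.get(self.required_field):
                continue  # remove incomplete items
            yield element  # pass everything else, including requests, through

    def process_spider_exception(self, response, exception, spider):
        # Called when a callback raises; returning an iterable
        # stops the exception from propagating further.
        spider.logger.warning("callback failed on %s: %r", response.url, exception)
        return []
```

It is enabled through the SPIDER_MIDDLEWARES setting, again as a mapping from import path to order number.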



Event-driven networking

Scrapy is written with Twisted, a popular event-driven networking framework for Python. It uses non-blocking (also known as asynchronous) code for concurrency