Python Crawler Series 17: Item Pipeline and Middleware

Time: 2020-7-28

1、 Item Pipeline

1. After the crawler extracts data into an item, the data saved in the item needs further processing, such as cleaning, deduplication, and storage.

2. A pipeline needs a process_item function (a minimal pipeline sketch follows this list of methods):

(1) process_item: takes the item produced by the spider as a parameter, and the spider is passed in as well. This method must be implemented and must return an item object; a discarded item will not be processed by any further pipeline.

(2) __init__: constructor

Does any necessary parameter initialization.

(3) open_spider(spider)

Called when the spider object is opened.

(4) close_spider(spider)

Called when the spider object is closed.
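As referenced above, here is a minimal pipeline sketch implementing all four methods. The class name, the deduplication field, and the output file name are illustrative assumptions, not from the original post:

```python
# pipelines.py -- a minimal sketch; ExamplePipeline, the 'url' field
# and items.jsonl are illustrative assumptions
import json

from scrapy.exceptions import DropItem


class ExamplePipeline:
    def __init__(self):
        # (2) constructor: do any necessary parameter initialization
        self.seen_urls = set()
        self.file = None

    def open_spider(self, spider):
        # (3) called when the spider object is opened
        self.file = open('items.jsonl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # (4) called when the spider object is closed
        self.file.close()

    def process_item(self, item, spider):
        # (1) must return an item; raising DropItem discards the item,
        # and no later pipeline will process it (simple deduplication here)
        url = item.get('url')
        if url in self.seen_urls:
            raise DropItem(f'duplicate item: {url}')
        self.seen_urls.add(url)
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```

The pipeline only takes effect after it is registered in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.ExamplePipeline': 300} (the project path here is hypothetical).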

3. Spider

(1) The corresponding file lives under the spiders folder (a complete minimal spider follows this list).

(2) __init__: initializes the crawler's name and the start_urls list.

(3) start_requests: generates Request objects, hands them to Scrapy to download, and returns responses.

(4) parse: parses the corresponding items from the returned response; the items automatically enter the pipeline. If needed, it also extracts URLs, which are automatically handed back to the scheduler as new requests, and the loop continues.

(5) start_requests(self): this method is called only once; it reads the start_urls content and starts the crawling loop.

(6) name: sets the crawler's name.

(7) start_urls: sets the URLs for the first batch of pages to crawl.

(8) allowed_domains: the list of domains the spider is allowed to crawl.

(9) log: logging.
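Here is a minimal spider tying these attributes and methods together. The target site and selectors come from the public Scrapy tutorial target (quotes.toscrape.com) and are used purely as an illustration:

```python
# spiders/quotes.py -- a minimal sketch; the spider name, domain and
# CSS selectors are illustrative
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'                                  # (6) the crawler's name
    allowed_domains = ['quotes.toscrape.com']        # (8) allowed domains
    start_urls = ['https://quotes.toscrape.com/']    # (7) first batch of URLs

    def parse(self, response):
        # (4) parse items from the response; yielded items enter the pipeline
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        self.logger.info('parsed %s', response.url)  # (9) logging
        # extracted URLs are handed back to the scheduler; the loop continues
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Since name and start_urls are declared as class attributes and the default start_requests is used, no custom __init__ is needed here.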

2、 Middleware

1. Definition: middleware is a layer of components between the engine and the downloader.

2. Function: preprocesses the requests sent out and the results returned.

3. Quantity: there can be many; they are loaded and executed in sequence.

4. Location: written in the middleware file; it must be enabled in settings to take effect.

5. Writing one is straightforward.

6. One or more of the following methods must be implemented (a sketch follows this list):

(1) process_request(self, request, spider)

Called when a request passes through the middleware; it must return None, a Request, a Response, or raise IgnoreRequest.

None: Scrapy will continue processing the request.

Request: Scrapy will stop calling other process_request methods and reschedule the returned request.

Response: Scrapy will not call any other process_request or process_exception method; it takes this response directly as the result and calls the process_response functions.

(2) process_response(self, request, response, spider)

Like process_request, it is called automatically every time a result is returned; multiple middlewares can be called in sequence.
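Here is a downloader middleware sketch implementing both methods; the class name, the User-Agent string, and the skip rule for PDF URLs are illustrative assumptions:

```python
# middlewares.py -- a minimal sketch; the class name, UA string and
# PDF skip rule are illustrative
from scrapy.exceptions import IgnoreRequest


class ExampleDownloaderMiddleware:
    def process_request(self, request, spider):
        # (1) preprocess the outgoing request; returning None lets Scrapy
        # continue processing it through the remaining middleware
        request.headers.setdefault('User-Agent', 'example-crawler/1.0')
        if request.url.endswith('.pdf'):
            raise IgnoreRequest('skipping PDF downloads')
        return None

    def process_response(self, request, response, spider):
        # (2) called every time a result comes back; must return a Response
        # (or a Request to reschedule)
        spider.logger.debug('%s returned %s', response.url, response.status)
        return response
```

As noted in item 4, this only takes effect after it is enabled in settings.py, e.g. DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.ExampleDownloaderMiddleware': 543} (the project path here is hypothetical).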

3、 Source code

2. CSDN: https://blog.csdn.net/weixin_44630050

3. Blog Garden: https://www.cnblogs.com/ruigege0000/

4. Welcome to follow the WeChat official account "Fourier transform". It is a personal account, for learning and communication only; reply "gift package" in the background to get big data learning materials.