The Scrapy crawler framework

Time:2020-7-27
  • Framework: a highly general project template that encapsulates common implementation details
  • Scrapy (an asynchronous framework) provides:
    • High-performance network requests
    • High-performance data parsing
    • High-performance persistent storage
    • High-performance whole-site crawling
    • High-performance deep crawling
    • High-performance distributed crawling

Scrapy environment installation

macOS and Linux

  • pip install scrapy

Windows

a. pip3 install wheel

      b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

      # e.g. Twisted-17.1.0-cp35-cp35m-win_amd64.whl; for Python 3.5, choose the cp35 build

      c. Enter the download directory and run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl

      # If installation fails, the cause may be the wheel build rather than the Python version;
      # try downloading the 32-bit build instead - one of the two will succeed

      d. pip3 install pywin32

      e. pip3 install scrapy

After the installation is completed, type scrapy in the terminal to test; if Scrapy prints its version and command list, the installation succeeded.

Basic use of Scrapy

Create a project

  • scrapy startproject proName

    cd proName — enter the project directory before executing the crawler file

proName/
    spiders/          # crawler package (folder)
        __init__.py
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py       # configuration file created for the project
scrapy.cfg            # Scrapy's own configuration file; usually does not need to be modified

Create a crawler file

  • A crawler file is a .py source file
  • scrapy genspider spiderName www.xxx.com (the start website can be modified later)
    • This creates a .py file under the spiders package
# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):   # scrapy.Spider is the parent of all crawler classes
    # name is the name of the crawler file and the unique identifier of the current crawler
    name = 'first'

    # Allowed domain names, usually commented out
    # allowed_domains = ['www.xx.com']

    # The list of starting URLs to be crawled
    # Scrapy automatically sends a GET request to each element of this list
    start_urls = ['http://www.sougou.com/', 'http://www.baidu.com/']

    # parse is called to parse the data; it is called once per element of start_urls
    def parse(self, response):   # response is the response object
        pass

Basic configuration

  • UA spoofing

    Set USER_AGENT in settings.py so requests look like they come from a normal browser

  • Ignoring the robots protocol

    In settings.py, change ROBOTSTXT_OBEY = True to False

  • Specifying the log level

    In settings.py, add LOG_LEVEL = 'ERROR'
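
A minimal settings.py sketch combining the three options above (the User-Agent string is only an example value):

# settings.py
# UA spoofing: pretend to be a regular browser (example value)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Only show ERROR-level log output
LOG_LEVEL = 'ERROR'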

Executing the project

  • scrapy crawl spiderName

  • Execute the project without displaying log output

    scrapy crawl spiderName --nolog

    With --nolog, errors are not displayed either; after setting the log level in settings.py you can simply run the project normally.

Data parsing

  • response.xpath('xpath expression')

  • Differences from etree:

    Taking text/attributes returns a Selector object in which the text data is stored

    • Selector object[0].extract() returns a string
    • Selector object.extract_first() returns a string
    • Selector object.extract() returns a list

Common operations

  • If the list has only one element, use Selector object.extract_first(), which returns a string
  • If the list has more than one element, use Selector object.extract(), which returns a list of strings

spiderName.py file

# -*- coding: utf-8 -*-
import scrapy

class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xx.com']
    start_urls = ['https://duanziwang.com/']

    def parse(self, response):
        # parsing based on an XPath expression
        article_list = response.xpath('/html/body/section/div/div/main/article')
        for article in article_list:
            title = article.xpath('./div[1]/h1/a/text()')[0]               # returns a Selector object
            # <Selector xpath='./div[1]/h1/a/text()' data='...'>
            title = article.xpath('./div[1]/h1/a/text()')[0].extract()     # returns a string
            # 'Life proverbs about health and longevity_Duanzi Wang contains the latest jokes'
            title = article.xpath('./div[1]/h1/a/text()').extract_first()  # returns a string
            # 'Life proverbs about health and longevity_Duanzi Wang contains the latest jokes'
            title = article.xpath('./div[1]/h1/a/text()').extract()        # returns a list
            # ['Life proverbs about health and longevity_Duanzi Wang contains the latest jokes']
            print(title)
            break

Persistent storage

Persistent storage based on terminal commands

  • Can only store the return value of the parse method into a text file with a specified suffix

    Supported suffixes: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'; CSV is the most common

    Command: scrapy crawl spiderName -o filePath

Case study: persistent storage of text data

# -*- coding: utf-8 -*-
import scrapy

class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xx.com']
    start_urls = ['https://duanziwang.com/']

    #Persistent storage based on terminal instruction
    def parse(self, response):
        article_list = response.xpath('/html/body/section/div/div/main/article')  # parsing based on an XPath expression
        all_data = []
        for article in article_list:
            title = article.xpath('./div[1]/h1/a/text()').extract_first()
            content = article.xpath('./div[2]/p//text()').extract()
            content = ''.join(content)
            dic = {
                "title": title,
                "content": content
            }
            all_data.append(dic)
        return all_data
#Terminal instruction
# scrapy crawl spiderName -o duanzi.csv

Persistent storage based on pipeline

Pipeline-based persistent storage is the recommended approach in Scrapy

Implementation process

  • Parse the data (spiderName.py)

  • Instantiate an item-type object (items.py)

    Define the related fields in the item class in items.py

    fieldName = scrapy.Field()

  • Store the parsed data in the item-type object (spiderName.py)

    item['fieldName'] = value assigns a value to the fieldName attribute of the item object

  • Submit the item object to the pipeline (spiderName.py)

    yield item submits the item to the pipeline class with the highest priority

  • After the pipeline receives the item, the data stored in it can be persisted in any form (pipelines.py)

    process_item(): responsible for receiving and persisting the item object

  • Enable the pipeline mechanism in the configuration file settings.py

    Find the following code and uncomment it

    ITEM_PIPELINES = {
        #300 means priority. The smaller the value, the higher the priority
       'duanziPro.pipelines.DuanziproPipeline': 300,
    }

Case study: persistent storage of text data

As above, find the pipeline code in settings.py and uncomment it.

spiderName.py

# -*- coding: utf-8 -*-
import scrapy
from duanziPro.items import DuanziproItem

class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xx.com']
    start_urls = ['https://duanziwang.com/']
    #Persistent storage based on pipeline
    def parse(self, response):
        article_list = response.xpath('/html/body/section/div/div/main/article')  # parsing based on an XPath expression
        for article in article_list:
            title = article.xpath('./div[1]/h1/a/text()').extract_first()
            content = article.xpath('./div[2]/pre/code//text()').extract()
            content = ''.join(content)
            print(content)
            #Instantiate the item object
            item = DuanziproItem()
            #Access the property in the form of brackets and assign it a value
            item['title'] = title
            item['content'] = content
            #Submit item to pipeline
            yield item

items.py

import scrapy

class DuanziproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Two fields are defined using Scrapy's built-in Field class
    # Field is a universal (catch-all) data type
    title = scrapy.Field()
    content = scrapy.Field()

pipelines.py

class DuanziproPipeline(object):
    fp = None

    # Override the parent-class method: executed only once, when the crawler starts
    # Open the file
    def open_spider(self, spider):
        print('open spider')
        self.fp = open('./duanzi.txt', 'w', encoding='utf-8')

    #Close the file
    def close_spider(self, spider):
        print('close spider')
        self.fp.close()

    # process_item receives the item object submitted by the crawler file;
    # it is called once for every item it receives
    def process_item(self, item, spider):
        #Value
        title = item['title']
        content = item['content']
        self.fp.write(title + ":" + content + "\n")
        return item

Pipeline storage details

  • What does a pipeline class in the pipeline file represent?

    A pipeline class corresponds to one storage form (text file, database, ...)

    To back up the data you need multiple pipeline classes (multiple storage forms: MySQL, Redis)

  • return item in process_item

    Passes the item on to the next pipeline class to be executed (ordered by the weights in ITEM_PIPELINES in the configuration file)

Store to MySQL

Add the following code to pipelines.py

import pymysql

class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='123', db='spider',
                                    charset='utf8')

    def process_item(self, item, spider):
        #Value
        title = item['title']
        content = item['content']
        self.cursor = self.conn.cursor()
        #SQL statement
        sql = 'insert into duanzi values ("%s","%s")' % (title, content)
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

Register the MysqlPipeline class in ITEM_PIPELINES in settings.py

ITEM_PIPELINES = {
    #300 means priority. The smaller the value, the higher the priority
    'duanziPro.pipelines.DuanziproPipeline': 300,
    'duanziPro.pipelines.MysqlPipeline': 301,
}
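
The pipeline above assumes that a database named spider already contains a table named duanzi with two text columns. A minimal sketch for creating them with pymysql (the database name, table name and column sizes are assumptions matching the pipeline code):

import pymysql

# one-off setup script; connection parameters match the pipeline above
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='123', charset='utf8')
cursor = conn.cursor()
cursor.execute('CREATE DATABASE IF NOT EXISTS spider')
cursor.execute('USE spider')
# title/content column types are assumptions; adjust sizes as needed
cursor.execute('CREATE TABLE IF NOT EXISTS duanzi (title VARCHAR(255), content TEXT)')
conn.commit()
cursor.close()
conn.close()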

Store to Redis

  • Because newer versions of the redis package do not support storing dictionaries directly, install version 2.10.6

    pip install redis==2.10.6

Add the following code to pipelines.py

from redis import Redis


class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379, password='yourpassword')

    def process_item(self, item, spider):
        self.conn.lpush('duanziList', item)
        # If this raises a data-type error (newer redis versions cannot store
        # dictionaries directly), install the pinned version: pip install redis==2.10.6
        return item

Register the RedisPipeline class in ITEM_PIPELINES in settings.py

ITEM_PIPELINES = {
    #300 means priority. The smaller the value, the higher the priority
    'duanziPro.pipelines.DuanziproPipeline': 300,
    'duanziPro.pipelines.RedisPipeline': 301,
}
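
A small check sketch (assuming a local Redis server and the duanziList key used above) to confirm that items were stored:

from redis import Redis

# connection parameters match the pipeline above
conn = Redis(host='127.0.0.1', port=6379, password='yourpassword')
print(conn.llen('duanziList'))          # number of stored items
print(conn.lrange('duanziList', 0, 2))  # the first few stored entries (returned as bytes)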

Sending requests manually

  • You could add the URLs to the start_urls list, but that is cumbersome

  • Sending a GET request

    yield scrapy.Request(url, callback)

    • url: the URL to request
    • callback: the callback function that parses the data
  • Sending a POST request

    yield scrapy.FormRequest(url, callback, formdata)

    • formdata: a dictionary that stores the request parameters
  • How the parent class's start_requests method sends requests (see the simulation below)

#Simple simulation of the parent class method, mainly look at yield
def start_requests(self):
    for url in self.start_urls:
        #Initiate get request
        yield scrapy.Request(url=url,callback=self.parse)
        #Send a post request, and formdata stores the request parameters
        yield scrapy.FormRequest(url=url,callback=self.parse,formdata={})

Code implementation

  • Mainly in spiderName.py: the parse method is called recursively, with a condition that ends the recursion;

    manual requests sent with yield are used to crawl the whole site

# -*- coding: utf-8 -*-
import scrapy
from duanziPro.items import DuanziproItem


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xx.com']
    start_urls = ['https://duanziwang.com/']
    
    #Send the manual request, and request the data of other page numbers
    #Defining a common URL template
    url = "https://duanziwang.com/page/%d/"
    pageNum = 2

    def parse(self, response):
        article_list = response.xpath('/html/body/section/div/div/main/article')  # parsing based on an XPath expression
        all_data = []
        for article in article_list:
            title = article.xpath('./div[1]/h1/a/text()').extract_first()
            content = article.xpath('./div[2]/pre/code//text()').extract()
            content = ''.join(content)
            #Instantiate the item object
            item = DuanziproItem()
            #Access the property in the form of brackets and assign it a value
            item['title'] = title
            item['content'] = content
            #Submit item to pipeline
            yield item
        if self.pageNum < 5:
            new_url = format(self.url%self.pageNum)
            self.pageNum += 1
            #Recursive implementation of the whole station data crawling, callback specified parsing method
            yield scrapy.Request(url=new_url, callback=self.parse)
  • Implement persistent data storage in pipelines.py
class DuanziproPipeline(object):
    fp = None

    # Override the parent-class method: executed only once, when the crawler starts
    def open_spider(self, spider):
        print('open spider')
        self.fp = open('./duanzi.txt', 'w', encoding='utf-8')

    #Close FP
    def close_spider(self, spider):
        print('close spider')
        self.fp.close()

    # process_item receives the item object submitted by the crawler file;
    # it is called once for every item it receives
    def process_item(self, item, spider):
        #Value
        title = item['title']
        content = item['content']
        self.fp.write(title + ":" + content + "\n")
        #Pass on the item to the next pipeline class to be executed
        return item
  • Enable the pipeline class in settings.py
ITEM_PIPELINES = {
    #300 means priority. The smaller the value, the higher the priority
   'duanziPro.pipelines.DuanziproPipeline': 300,
}

The use of yield in Scrapy

  • Submit an item object to the pipeline

    yield item

  • Manual request sending

    yield scrapy.Request(url,callback)

Five core components

  • Scrapy engine

    Handles the data flow of the whole system and triggers transactions (the core of the framework).

  • Scheduler

    Receives requests from the engine, pushes them into a queue, and returns them when the engine asks for them again.

  • Downloader

    Downloads web content and returns it to the spider (the Scrapy downloader is built on the efficient asynchronous model Twisted).

  • Spiders

    Spiders do the main work: they extract the information they need, the so-called items, from specific web pages. Users can also extract links from them and let Scrapy continue crawling the next page.

  • Item pipeline

    Responsible for processing the entities extracted from web pages by the spider. Its main functions are persisting entities, validating them, and removing unneeded information. After a page is parsed by the spider, the data is sent to the item pipeline and processed by several components in a specific order.

Workflow of five core components


When the crawler file is executed, all five core components are at work.

First, the crawler file (spider) is executed. The spider's role:
(1) The starting URLs are stored in the spider
1: When the spider runs, it first encapsulates each starting URL into a request object
2: The request objects are passed to the engine
3: The engine passes the request objects to the scheduler, which stores them in a queue (first in, first out)
4: The scheduler takes request objects out of the queue and hands them back to the engine
5: The engine sends each request object to the downloader through the downloader middleware
6: The downloader receives the request and downloads the data from the Internet
7: The downloaded data is encapsulated into a response object and handed to the downloader
8: The downloader sends the response object to the engine through the downloader middleware
9: The engine passes the response object, which encapsulates the data, to the response parameter of the spider's parse method
10: When the spider's parse method is called, response therefore holds the response data
11: The parsing code is written in the spider's parse method;
(1) it may parse out another batch of URLs, (2) it may parse out the desired text data
12: The parsed data is encapsulated into an item
13: The item carrying the encapsulated text data is submitted to the engine
14: The engine submits the data to the pipeline for persistent storage (one complete request cycle)
15: If the batch of URLs parsed out in the parse method should also be crawled, manual requests can be sent for them
16: The spider encapsulates this batch of URLs into request objects and submits them to the engine
17: The engine forwards the request objects to the scheduler
18: The scheduler's filter removes duplicate URLs, and the remaining requests are stored in the scheduler's queue
19: The scheduler then sends these requests to the engine again, and the cycle repeats

Engine functions:
1: Processing the data stream  2: Triggering transactions
The engine makes decisions based on the data flowing between components and, according to the data it receives, calls the appropriate method of the next component in the chain.

Downloader middleware: located between the engine and the downloader, it can intercept request and response objects; after intercepting them it can tamper with the request/response headers and the page content.
Spider middleware: located between the spider and the engine; it can also intercept request and response objects, but it is less commonly used.
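
A minimal downloader-middleware sketch (the class name, the User-Agent value and the registration path are illustrative assumptions, not this project's actual code):

# middlewares.py
class FakeUaDownloaderMiddleware(object):
    # Called for every request passing from the engine to the downloader;
    # here the request headers are tampered with to spoof the User-Agent.
    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
        return None  # continue processing the request normally

    # Called for every response returning from the downloader to the engine;
    # the response could be inspected or replaced here.
    def process_response(self, request, response, spider):
        return response

It would be enabled via DOWNLOADER_MIDDLEWARES in settings.py, e.g. 'duanziPro.middlewares.FakeUaDownloaderMiddleware': 543 (path assumed).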