How does each part of Scrapy work? Why is your XPath returning "None"? How do you write multi-level responses? Watch out for Scrapy's pitfalls

Time:2019-11-2

Preface

How do all of Scrapy's modules fit together? The XPath you wrote with the XPath Helper plug-in clearly works in Chrome, so why does it return None in your program? Can Scrapy handle multi-level responses directly, or do you have to fall back to the requests library?

Don't worry. This article is a one-stop guide to these common Scrapy pitfalls.

How each part of Scrapy works

  • Scrapy is a widely used application framework written in pure Python for crawling websites and extracting structured data.

  • The strength of the framework is that users only need to customize a few modules to easily implement a crawler that scrapes web content and images.

  • Scrapy uses the Twisted asynchronous networking framework (its main rival is Tornado) to handle network communication. This speeds up downloads without you having to implement an asynchronous framework yourself, and it exposes various middleware interfaces so you can flexibly meet all kinds of requirements.


  • Scrapy Engine: responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

  • Scheduler: responsible for accepting the requests sent by the engine, arranging and enqueuing them in a certain order, and returning them to the engine when the engine asks for them.

  • Downloader: responsible for downloading all requests sent by the Scrapy engine and returning the responses it obtains to the engine, which hands them to the Spider for processing.

  • Spider: responsible for processing all responses, analyzing them and extracting the data needed to fill the item's fields, and submitting follow-up URLs to the engine, which sends them back into the Scheduler.

  • Item Pipeline: responsible for post-processing the items extracted by the Spider (detailed analysis, filtering, storage, etc.).

  • Downloader middleware: a component you can customize to extend the download functionality.

  • Spider middleware: a component you can customize to extend the communication between the engine and the Spider (for example, responses entering the Spider and requests leaving the Spider). A settings sketch showing how these middlewares and the pipeline are enabled follows this list.
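
As a rough illustration of where these components are wired up, here is a minimal settings.py sketch; the 'myproject...' class paths are hypothetical placeholders, not part of the original article:

# settings.py - a minimal sketch of how the components above are enabled
BOT_NAME = 'myproject'

# Downloader middleware: customize or extend the download step
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyDownloaderMiddleware': 543,
}

# Spider middleware: customize traffic between the engine and the Spider
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.MySpiderMiddleware': 543,
}

# Item pipeline: post-process the items the Spider extracts
ITEM_PIPELINES = {
    'myproject.pipelines.MyItemPipeline': 300,
}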

I once saw someone online explain how Scrapy works as a vivid dialogue, which is easy to understand:

The running process of Scrapy

Once the code is written and the program starts running:

  1. Engine: Hi! Spider, which website are you going to handle?

  2. Spider: The boss wants me to handle xxxx.com.

  3. Engine: Give me the first URL to process.

  4. Spider: Here you are. The first URL is xxxxxxx.com.

  5. Engine: Hi! Scheduler, I have a request here; please sort it and put it in the queue for me.

  6. Scheduler: OK, I'm on it. Wait a moment.

  7. Engine: Hi! Scheduler, give me back the request you have processed.

  8. Scheduler: Here you are. This is the request I have processed.

  9. Engine: Hi! Downloader, please download this request for me according to the boss's downloader middleware settings.

  10. Downloader: OK! Here you are, the downloaded result. (If it fails: Sorry, this request failed to download. The engine then tells the Scheduler: this request failed to download; please record it, and we will download it again later.)

  11. Engine: Hi! Spider, here is the downloaded result, already processed according to the boss's downloader middleware. Please handle it yourself. (Note: by default the responses are passed to the def parse() function.)

  12. Spider: (after processing the data, for the URLs that need to be followed up) Hi! Engine, I have two things here: this is the URL I need to follow up, and this is the item data I obtained.

  13. Engine: Hi! Item Pipeline, I have an item here; please handle it for me! Scheduler, this is a follow-up URL; please handle it for me. Then the loop continues from step 4 until all the needed information has been obtained.

  14. Item Pipeline and Scheduler: OK, doing it now!

Note: the entire program only stops when the Scheduler has no more requests to process. (Scrapy will also re-download URLs that failed to download.)
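
That re-download behaviour is handled by Scrapy's built-in RetryMiddleware and can be tuned in settings.py; a minimal sketch with illustrative values:

# settings.py - retry behaviour for failed downloads (values shown are only an example)
RETRY_ENABLED = True        # turn retrying on or off
RETRY_TIMES = 2             # extra attempts per failed request
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]   # response codes treated as failures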


 

The XPath problem

For some websites, you write an XPath expression with the XPath Helper plug-in and it verifies fine in Chrome, but in response.xpath() it just won't return a value; it is None every time. What's going on?

First of all, understand that the HTML we see in the browser is not necessarily the same as the HTML Scrapy receives, which is why an XPath that works in the browser sometimes fails in the program.

Solutions

  1. If it is a foreign-language website, first watch out for Chrome's automatic translation; it is best to display the original page before writing your XPath.
  2. According to some netizens, in some cases tbody is an extra tag inserted by the browser when it normalizes the HTML; it does not exist in the page's actual source code, so you need to remove it from your XPath by hand (see the sketch after this list).
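
For example, a minimal sketch of the tbody fix (the table id 'data' is a made-up illustration, not from the original article):

# XPath suggested by the browser / XPath Helper:
#   //table[@id='data']/tbody/tr/td/text()
# In Scrapy, drop the tbody that the browser inserted:
rows = response.xpath("//table[@id='data']/tr/td/text()").getall()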

The ultimate solution: the Scrapy shell

Didn't we just say that what Scrapy receives may differ from what we see in the browser? Fine, then let's work directly against the response Scrapy gets!

Open the command line and run scrapy shell <URL>, for example with Baidu:

 

D:\pythonwork>scrapy shell www.baidu.com

 

You'll find that you are now inside the Scrapy shell.

Next, type view(response) as the prompt suggests; your browser opens automatically, showing a locally saved copy of the page.

That's right: this page is the response Scrapy actually received. Work out your XPath directly against this page and it will also work in your program. (For XPath Helper to run on a local file, you need to enable "Allow access to file URLs" in the extension's settings.)
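
You can also test expressions right in the shell before putting them into your spider; a quick sketch (the XPath expressions are just examples):

# inside the Scrapy shell
response.xpath("//title/text()").get()       # first match, or None if nothing matches
response.xpath("//a/@href").getall()         # list of every match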

Some of you will find that certain pages return 403 and the shell can't get in. Think about why: the command line carries no browser User-Agent, and some websites require one, so we have to set the User-Agent ourselves. How?

Run scrapy shell from inside a project directory where you have already written your settings (including the User-Agent); otherwise only the default settings are used.
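
A minimal sketch of both options (the UA string below is purely illustrative, not from the original article):

# settings.py inside your project - picked up when you run scrapy shell from the project directory
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0 Safari/537.36'

# Or override the setting for a single shell session from the command line:
#   scrapy shell -s USER_AGENT='Mozilla/5.0 ...' <url>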


 

 

Writing multi-level requests and responses

When we write a crawler with Scrapy, we often find the data spread across multiple pages, so we need to send several requests to collect enough information. For example, the first page may contain only a brief list, while the details live in the responses to other requests.

Normally we only write something like yield scrapy.Request(next_link, callback=self.parse), which just finishes the current page and then calls back into the next run. But what if our item is not fully populated yet and needs further requests and responses to complete it?

How do we do that? Surely we don't have to fall back to the requests library?

Solution:

I ran into this problem while crawling a wallpaper website: the home page only gives the thumbnail of each picture, while the address of the original image has to be obtained by opening another HTML page.

The solution code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from wallpaper.items import WallpaperItem


class WallspiderSpider(scrapy.Spider):
    name = 'wallspider'
    allowed_domains = ['wall.alphacoders.com']
    start_urls = ['https://wall.alphacoders.com/']


    def parse(self, response):
        picx_list = response.xpath("//div[@class='center']//div[@class='boxgrid']/a/@href").getall()

        for picx in picx_list:
            url = 'https://wall.alphacoders.com/'+str(picx)
            # Callback to the next-level parsing function for each detail page
            yield scrapy.Request(url,callback=self.detail_parse)


    def detail_parse(self, response):
        # Extract the original image URL, size and name from the detail page
        pic_url = response.xpath("//*[@id='page_container']/div[4]/a/@href").get()
        pic_size = response.xpath("//*[@id='wallpaper_info_table']/tbody//span/span[2]/a/text()").get()
        pic_name = response.xpath("//*[@id='page_container']/div/a[4]/span/text()").get()
        # Fill the item with the data collected at this level
        wall_item = WallpaperItem()
        wall_item['pic_url'] = pic_url
        wall_item['pic_size'] = pic_size.split()[0]
        wall_item['pic_name'] = pic_name
        print(wall_item)
        return wall_item
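
For completeness, the WallpaperItem imported above would be declared in the project's items.py with the three fields the spider fills; a minimal sketch (not shown in the original article):

# items.py - a minimal sketch of the item the wallpaper spider fills
import scrapy


class WallpaperItem(scrapy.Item):
    pic_url = scrapy.Field()    # address of the original image
    pic_size = scrapy.Field()   # resolution text, e.g. '1920x1080'
    pic_name = scrapy.Field()   # title of the wallpaper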

Meanwhile, the solution written by the CSDN user kocor is even more instructive. It works as follows:

yield scrapy.Request(item['url'], meta={'item': item}, callback=self.detail_parse)

When Scrapy sends a request with scrapy.Request, it can attach the information collected so far via meta={'item': item}; in the new request's callback you receive it with item = response.meta['item'], then add the newly collected information to that item.

In this way you can collect data across as many levels of requests as you need.

Spider.py file

# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem
 
 
class TencentSpider(scrapy.Spider):
    # Spider name
    name = 'tencent'
    # Domains allowed to crawl
    allowed_domains = ['www.xxx.com']
    # Base URL, used to build absolute links
    base_url = 'https://www.xxx.com/'
    # Start URL of the crawl
    start_urls = ['https://www.xxx.com/position.php']
    # Page counter, starts at 1
    count = 1
    # Maximum number of pages to crawl (only 1 page here)
    page_end = 1
 
    def parse(self, response):
 
 
        nodeList = response.xpath("//table[@class='tablelist']/tr[@class='odd'] | //table[@class='tablelist']/tr[@class='even']")
        for node in nodeList:
            item = TencentItem()
 
            item['title'] = node.xpath("./td[1]/a/text()").extract()[0]
            if len(node.xpath("./td[2]/text()")):
                item['position'] = node.xpath("./td[2]/text()").extract()[0]
            else:
                item['position'] = ''
            item['num'] = node.xpath("./td[3]/text()").extract()[0]
            item['address'] = node.xpath("./td[4]/text()").extract()[0]
            item['time'] = node.xpath("./td[5]/text()").extract()[0]
            item['url'] = self.base_url + node.xpath("./td[1]/a/@href").extract()[0]
            # Follow the detail-page URL, passing the collected item along via meta
            yield scrapy.Request(item['url'], meta={'item': item}, callback=self.detail_parse)

            # There are deeper pages to crawl, so the item is not yielded here yet
            # yield item

        # Pagination
        nextPage = response.xpath("//a[@id='next']/@href").extract()[0]
        # Page-count limit and last-page check
        if self.count < self.page_end and nextPage != 'javascript:;':
            if nextPage is not None:
                # Increment the page counter
                self.count = self.count + 1
                # Request the next page
                yield scrapy.Request(self.base_url + nextPage, callback=self.parse)
        else:
            # End of crawl
            return None
 
    def detail_parse(self, response):
        # Receive the item collected at the previous level
        item = response.meta['item']

        # First-level detail-page data extraction
        item['zhize'] = response.xpath("//*[@id='position_detail']/div/table/tr[3]/td/ul[1]").xpath('string(.)').extract()[0]
        item['yaoqiu'] = response.xpath("//*[@id='position_detail']/div/table/tr[4]/td/ul[1]").xpath('string(.)').extract()[0]

        # Request the second-level detail page
        yield scrapy.Request(item['url'] + "&123", meta={'item': item}, callback=self.detail_parse2)

        # There is a deeper page to crawl, so the item is not returned here yet
        # return item
 
    def detail_parse2(self, response):
        # Receive the item collected at the previous level
        item = response.meta['item']

        # Second-level detail-page data extraction
        item['test'] = "111111111111111111"

        # Finally return the completed item to the engine
        return item

His article address: https://blog.csdn.net/ygc123189/article/details/79160146
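
Finally, to run either spider and export the collected items, Scrapy's built-in feed export can be used from the command line (a usage sketch; the output filenames are arbitrary):

scrapy crawl wallspider -o wallpapers.json
scrapy crawl tencent -o positions.json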