Python example of crawling Douban movies with the Scrapy framework

Time:2020-9-30

This article describes crawling Douban movies in Python with the Scrapy framework. It is shared for your reference, and the details are as follows:

1. Concept

Scrapy is an application framework for crawling website data and extracting structured data. It can be used in a range of programs for data mining, information processing or storing historical data.

Scrapy can be installed conveniently with pip, the Python package management tool. If an error is reported during installation, install the missing packages with pip as well:


pip install scrapy

The composition and structure of Scrapy are shown in the figure below.

Scrapy Engine: responsible for dispatching and passing signals and data between the other components

Scheduler: a queue that stores requests. The engine sends request links to the scheduler, which queues them and returns the first request in the queue to the engine when the engine needs it

Downloader: after the engine sends a request link to the downloader, it downloads the corresponding data from the Internet and hands the returned responses back to the engine

Spiders: the engine sends the downloaded responses to the spiders for parsing, and they extract the web page information we need. If new URL links are found during parsing, the spiders send them to the engine, which stores them in the scheduler

Item Pipeline: the spiders pass the data extracted from the page to the pipeline for further processing, filtering, storage and other operations

Downloader Middleware: a custom extension component used to set proxies, HTTP request headers and other options when a page is requested

Spider Middleware: used to modify data such as the responses entering the spiders and the requests going out

The workflow of Scrapy: first, we give the entry URL to the spider. The spider puts the URL into the scheduler through the engine; after queuing, the scheduler returns the first request, and the engine forwards it to the downloader for downloading. The downloaded data is handed to the spider for parsing. Part of the parsed data is the data we need, which is delivered to the pipeline for cleaning and storage; the new URL links found during parsing are handed to the scheduler again, and the crawling then continues in this cycle.

2. Creating a new Scrapy project

First, open a command line in the folder where the project will be stored and enter scrapy startproject followed by the project name; the Python files required by the project will be created automatically in the current folder. For example, create a project named douban for crawling Douban movies:
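
scrapy startproject douban Db_Project

The optional second argument only sets the directory that holds the project files (so the layout matches the tree below); scrapy startproject douban alone works the same way. The generated directory structure is as follows: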

Db_Project/
  scrapy.cfg         -- the configuration file of the project
  douban/            -- the Python module directory of the project, where you write your Python code
    __init__.py      -- initialization file of the Python package
    items.py         -- used to define the Item data structure
    pipelines.py     -- the pipelines file of the project
    settings.py      -- defines global settings of the project, such as download delay and concurrency
    spiders/         -- the package directory where the crawler code is stored
      __init__.py
      ...

After that, enter the spiders directory and run scrapy genspider with the crawler name and the domain name to generate the crawler file douban.py, which is used to define the crawler's crawling logic and content extraction rules:


scrapy genspider douban movie.douban.com

3. Defining data

The Douban movie page to be crawled is https://movie.douban.com/top250 and each movie entry on it looks as shown below.

We need to crawl the serial number, name, introduction, star rating, number of comments and description of each movie, so these objects are first defined in the items.py file, similar to an ORM, where the scrapy.Field() method defines a data type for each field:

import scrapy
 
 
class DoubanItem(scrapy.Item):
  ranking = scrapy.Field()    # ranking
  name = scrapy.Field()       # film title
  introduce = scrapy.Field()  # introduction
  star = scrapy.Field()       # star rating
  comments = scrapy.Field()   # number of comments
  describe = scrapy.Field()   # description

4. Data crawling

Open the crawler file movie.py created earlier in the spiders folder, shown below. Three variables and a method are created automatically, and the response of the returned data is processed in the parse method. start_urls provides the entry address of the crawler. Note that the crawler automatically filters out domain names other than those in allowed_domains, so pay attention to the assignment of this variable.

# spiders/movie.py
import scrapy
 
 
class MovieSpider(scrapy.Spider):
  # Crawler name
  name = 'movie'
  #Domain names allowed to crawl
  allowed_domains = ['movie.douban.com']
  #Entry URL
  start_urls = ['https://movie.douban.com/top250']
 
  def parse(self, response):
    pass

Before crawling data, you should first set the user agent: find the USER_AGENT variable in the settings.py file and modify it as follows:


USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'

You can start the crawler named movie from the command line with scrapy crawl movie, or write a startup file run.py as follows and run it:


from scrapy import cmdline
cmdline.execute('scrapy crawl movie'.split())

Next, we need to filter the crawled data. Through XPath rules we can easily select the specified elements in the web page. As shown in the figure below, each movie entry is wrapped in an <li> tag under an <ol>, so the XPath //ol[@class='grid_view']/li selects all the movie entries on this page. You can obtain the XPath value through an XPath plug-in for Chrome or the ChroPath plug-in for Firefox. Right-click an element in the browser and inspect it, and the developer tools shown below pop up; the ChroPath panel on the far right directly displays the XPath of the element: //div[@id='wrapper']//li
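
Such a rule can also be tried out interactively before writing it into the spider, for example in the Scrapy shell (a quick sketch; the user agent is set explicitly because Douban tends to reject the default one, and the exact output depends on the live page):

scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0" "https://movie.douban.com/top250"
>>> movie_list = response.xpath("//ol[@class='grid_view']/li")
>>> len(movie_list)    # expect 25 entries per page
>>> movie_list[0].xpath(".//span[@class='title']/text()").extract_first()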
 

The xpath() method of the crawler's response object can directly process an XPath rule string and returns the corresponding page content. These contents are Selector objects, which can be refined further for content selection. Through XPath, the movie name, introduction, evaluation, star rating and other contents can be selected, that is, the DoubanItem data structure defined earlier in the items.py file. Loop through each movie entry in the list, crawl the exact movie information from it and save it as a DoubanItem object item. Finally, return the item object from the spider to the item pipeline through yield.

Besides extracting item data from the page, the crawler also needs to crawl the URL link of the next page and form the corresponding Request. As shown in the figure below, for the next-page information at the bottom of the Douban page, the parameter of the second page is ?start=25&filter=, and splicing it onto the site address https://movie.douban.com/top250 gives the address of the next page. As above, extract this content through XPath; if it is not empty, yield the spliced Request to submit it to the scheduler.

The final crawler file movie.py is as follows:

# -*- coding: utf-8 -*-
import scrapy
from ..items import DoubanItem
 
 
class MovieSpider(scrapy.Spider):
  # Crawler name
  name = 'movie'
  # Domain names allowed to crawl
  allowed_domains = ['movie.douban.com']
  #Entry URL
  start_urls = ['https://movie.douban.com/top250']
 
  def parse(self, response):
    #First grab the movie list
    movie_list = response.xpath("//ol[@class='grid_view']/li")
    for selector in movie_list:
      #Traverse each movie list, accurately grab the required information from it and save it as an item object
      item = DoubanItem()
      item['ranking'] = selector.xpath(".//div[@class='pic']/em/text()").extract_first()
      item['name'] = selector.xpath(".//span[@class='title']/text()").extract_first()
      text = selector.xpath(".//div[@class='bd']/p[1]/text()").extract()
      intro = ""
      for s in text:
        # Put the introduction into a single string, removing whitespace
        intro += "".join(s.split())
      item['introduce'] = intro
      item['star'] = selector.css('.rating_num::text').extract_first()
      item['comments'] = selector.xpath(".//div[@class='star']/span[4]/text()").extract_first()
      item['describe'] = selector.xpath(".//span[@class='inq']/text()").extract_first()
      # print(item)
      # Return the resulting item object to the item pipeline
      yield item
    #Crawls the URL information of the next page in the web page
    next_link = response.xpath("//span[@class='next']/a[1]/@href").extract_first()
    if next_link:
      next_link = "https://movie.douban.com/top250" + next_link
      print(next_link)
      # Submit the Request to the scheduler
      yield scrapy.Request(next_link, callback=self.parse)

XPath selector

/ means to search among the immediate children of the current node; // means to search among all descendants of the current node.

By default the search starts from the root node; . means to search from the current node; @ is followed by a tag attribute; and the text() function takes out the text content.

//div[@id='wrapper']//li means to find the div tag whose id is wrapper starting from the root, and then take out all the li tags under it.

.//div[@class='pic']/em[1]/text() means to search, starting from the current selector, for the first em tag under every div whose class is pic, and take out its text content.

string(//div[@id='endtext']/p[position()>1]) selects the p tags from the second one onward under the div whose id is endtext and takes out their text content.

/bookstore/book[last()-2] selects the third-to-last book element among the children of bookstore.
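
These rules can be verified quickly with Scrapy's Selector on a small HTML snippet; the markup below is made up purely for illustration:

from scrapy.selector import Selector

# A hypothetical fragment mimicking the structure discussed above
html = "<div id='wrapper'><ol class='grid_view'><li><em>1</em></li><li><em>2</em></li></ol></div>"
sel = Selector(text=html)
print(sel.xpath("//div[@id='wrapper']//li/em/text()").extract())           # ['1', '2']
print(sel.xpath("//ol[@class='grid_view']/li/em/text()").extract_first())  # '1'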

CSS selector

You can also use CSS selectors to select elements within the page, taking out text through the ::text pseudo element. Usage is as follows:

#Select the text in the P tag under the div with the class name left
response.css('div.left p::text').extract_first()
 
#Select the text of the element whose class is star under the element whose id is tag
response.css('#tag .star::text').extract_first()

5. Saving the data

When running the crawler file, you can specify the output file location with the -o parameter, and the data will be saved as a JSON or CSV file according to the file suffix, for example:


scrapy crawl movie -o data.csv

Alternatively, in the pipelines.py file, the obtained item data can be processed further and saved to a database through Python.
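
As a minimal sketch of that idea (assuming the project module is named douban, and writing to a JSON Lines file instead of a real database; the class and file names are only examples), a pipeline could look like this, registered in settings.py via ITEM_PIPELINES = {'douban.pipelines.DoubanPipeline': 300}:

# pipelines.py
import json


class DoubanPipeline(object):
  def open_spider(self, spider):
    # Open the output file once when the spider starts
    self.file = open('douban.jl', 'w', encoding='utf-8')

  def process_item(self, item, spider):
    # Write each item as one JSON line; a database insert would go here instead
    self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
    return item

  def close_spider(self, spider):
    self.file.close()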

6. Middleware settings

Sometimes, to deal with a website's anti-crawler mechanisms, some camouflage needs to be set up in the downloader middleware, such as using an IP proxy or rotating the user agent. In the middlewares.py file, create a new user_agent class to add a user agent to the request headers: collect some commonly used user agents from the Internet into a USER_AGENT_LIST list, then randomly choose one from the list and set it as the User-Agent field of the request headers.

import random


class user_agent(object):
  def process_request(self, request, spider):
    #User agent list
    USER_AGENT_LIST = [
      'MSIE (MSIE 6.0; X11; Linux; i686) Opera 7.23',
      'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
      'Opera/9.0 (Macintosh; PPC Mac OS X; U; en)',
      'iTunes/9.0.3 (Macintosh; U; Intel Mac OS X 10_6_2; en-ca)',
      'Mozilla/4.76 [en_jp] (X11; U; SunOS 5.8 sun4u)',
      'iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0) Gecko/20100101 Firefox/9.0',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20120813 Firefox/16.0',
      'Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)',
      'Mozilla/4.8 [en] (X11; U; SunOS; 5.7 sun4u)'
    ]
    # Randomly choose a user agent from the list above
    agent = random.choice(USER_AGENT_LIST)
    # Set it as the User-Agent field of the request headers
    request.headers['User-Agent'] = agent

To enable the downloader middleware, uncomment the DOWNLOADER_MIDDLEWARES lines in the settings.py file and register the user_agent class; the smaller the number, the higher the priority.
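
For example, assuming the project module is named douban (adjust the dotted path to your own module name), the setting would look like this:

DOWNLOADER_MIDDLEWARES = {
  'douban.middlewares.user_agent': 543,
}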


I hope this article is helpful for your Python programming.