In this article we crawl famous quotes from http://quotes.toscrape.com , a website built for beginners to practice their crawling skills.
Before you can start crawling, you must create a new Scrapy project. Go to the directory where you want to store the code and run the following command:
(base) λ scrapy startproject quotes
New Scrapy project 'quotes', using template directory 'd:\anaconda3\lib\site-packages\scrapy\templates\project', created in:
    D:\XXX

You can start your first spider with:
    cd quotes
    scrapy genspider example example.com
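First, switch to the new crawler project directory, that is, the quotes directory. Then run the command that creates the crawler file. Note that the spider cannot have the same name as the project, so the first attempt below fails and the spider is named quote instead: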
D:\XXX (master)
(base) λ cd quotes\

D:\XXX\quotes (master)
(base) λ scrapy genspider quotes quotes.com
Cannot create a spider with the same name as your project

D:\XXX\quotes (master)
(base) λ scrapy genspider quote quotes.com
Created spider 'quote' using template 'basic' in module:
  quotes.spiders.quote
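The startproject command creates a quotes directory that contains the project configuration file scrapy.cfg and a quotes Python package, which holds settings.py and the spiders directory, among other files.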
The robots protocol, also known as robots.txt (always written in lowercase), is an ASCII-encoded text file stored in the root directory of a website. It tells search-engine crawlers which content on the site may be fetched and which may not.
The robots protocol is a convention rather than an enforced rule: the server does not block anything, and compliance is up to the crawler. Scrapy's ROBOTSTXT_OBEY setting controls whether the crawler honors it.
# filename: settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
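To make the allow/disallow idea concrete, here is a minimal sketch that uses the urllib.robotparser module from the Python standard library, not Scrapy itself. The rules in SAMPLE_ROBOTS_TXT and the example.com URLs are made up for illustration; they are not the robots.txt of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file content as a list of lines, so no network access is needed.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) reports whether the rules permit fetching the URL.
allowed = parser.can_fetch("*", "http://example.com/index.html")
blocked = parser.can_fetch("*", "http://example.com/private/data.html")

print("index allowed:", allowed)
print("private allowed:", blocked)
```

A crawler that obeys the convention would skip every URL for which can_fetch returns False. Setting ROBOTSTXT_OBEY = False tells Scrapy not to perform this kind of check.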
Before writing a crawler, we first need to analyze the page to be crawled. Mainstream browsers provide tools or plug-ins for this. Here we use the developer tools of the Chrome browser.
Open the page http://quotes.toscrape.com in Chrome, open the developer tools, and select the "Elements" tab to view its HTML code.
You can see that each quote is wrapped in an element with the class quote.
After analyzing the page, write the crawler. In Scrapy, crawling code is written in a spider. A spider is a class written by the user to crawl data from a single website (or a group of websites).
It defines the initial URLs to download, how to follow links in the pages, how to parse the page content, and how to extract data to generate items.
To create a spider, you must subclass scrapy.Spider and define the following three attributes:
- name: identifies the spider. The name must be unique; you cannot give the same name to different spiders.
- start_urls: the list of URLs the spider crawls at startup. The first page fetched is therefore one of them. Subsequent URLs are extracted from the data obtained from the initial URLs.
- parse(): a method of the spider. When called, the Response object generated after each initial URL is downloaded is passed to it as its only argument. This method is responsible for parsing the returned data (the response), extracting data (generating items), and generating Request objects for URLs that need further processing.
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass
The following briefly describes the implementation of the quote spider.
- scrapy.Spider: the base class for crawlers. Every other spider must inherit from this class (including the other spiders provided by Scrapy and spiders you write yourself).
- name is the name of the crawler, which was specified in the genspider command.
- allowed_domains lists the domain names the crawler is allowed to crawl; pages outside these domains are not followed. This attribute is optional. Note that the value generated here, quotes.com, does not match quotes.toscrape.com: the start URL is still fetched, but links followed from it would be filtered as offsite, so use toscrape.com if you extend the spider to follow links.
- start_urls holds the URLs that Scrapy crawls first. It is an iterable, so if there are multiple pages you can put multiple URLs in the list. It is often built with a list comprehension.
- parse is the default callback function, and the response passed to it is the response to the request for a URL in start_urls. You can also specify other functions to receive the response. A page parsing function usually needs to complete the following two tasks:
1. Extract the data in the page (using regular expressions, XPath, or CSS selectors).
2. Extract the links in the page and generate a download request for each linked page.
The page parsing function is usually implemented as a generator function. Each piece of data extracted from the page and each download request for a linked page is submitted to the Scrapy engine with a yield statement.
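import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page is an element with the class "quote".
        quotes = response.css('.quote')
        for quote in quotes:
            text = quote.css('.text::text').extract_first()
            auth = quote.css('.author::text').extract_first()
            tages = quote.css('.tags a::text').extract()
            # Yield one item (a plain dict) per quote.
            yield dict(text=text, auth=auth, tages=tages)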
- response.css() uses CSS selector syntax directly to extract data from the response.
- Multiple URLs can be written in start_urls, as separate items of the list.
- extract() extracts the data from the selector objects returned by css(). The result is a list of strings; without it, you get selector objects rather than the data.
- extract_first() extracts only the first matching result, and returns None if nothing matches.
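As a small illustration of building start_urls with a list comprehension, the sketch below generates the URLs of the first three pages. The /page/N/ pattern in BASE_URL and the page count are assumptions made for this example; confirm the pattern in the browser before relying on it.

```python
# Assumed URL pattern for the paginated listing (verify it in the browser first).
BASE_URL = "http://quotes.toscrape.com/page/{}/"

# Build the first three page URLs with a list comprehension.
start_urls = [BASE_URL.format(page) for page in range(1, 4)]

for url in start_urls:
    print(url)
```

Inside a spider, the same expression would simply be assigned to the start_urls class attribute.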
Run scrapy crawl quote in the quotes directory to run the crawler project.
What happened after running the crawler?
Scrapy creates a scrapy.Request object for each URL in the spider's start_urls attribute and assigns the parse method to the request as its callback function.
The Request objects are scheduled and executed, and the resulting scrapy.http.Response objects are passed back to the spider's parse() method for processing.
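The request and callback cycle described above can be modeled in a few lines of plain Python. This is a deliberately simplified toy, not Scrapy's actual implementation: the Request and Response classes, fake_download, ToySpider and run are all hypothetical stand-ins, and nothing is fetched from the network.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Response:
    """Toy stand-in for scrapy.http.Response."""
    url: str
    body: str


@dataclass
class Request:
    """Toy stand-in for scrapy.Request: a URL plus the callback to invoke."""
    url: str
    callback: Callable[[Response], Iterable[dict]]


def fake_download(request: Request) -> Response:
    # Hypothetical downloader: returns canned text instead of using the network.
    return Response(url=request.url, body=f"<html>page at {request.url}</html>")


class ToySpider:
    name = "toy"
    start_urls = ["http://quotes.toscrape.com/"]

    def start_requests(self) -> Iterator[Request]:
        # One request per start URL, with parse assigned as the callback.
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response: Response) -> Iterator[dict]:
        # A real spider would extract data here; the toy just reports the URL.
        yield {"url": response.url, "length": len(response.body)}


def run(spider: ToySpider) -> List[dict]:
    # Toy "engine": download each request, then hand the response to its callback.
    items: List[dict] = []
    for request in spider.start_requests():
        response = fake_download(request)
        items.extend(request.callback(response))
    return items


items = run(ToySpider())
print(items)
```

The real engine schedules requests asynchronously and also handles requests yielded by callbacks, but the division of labor is the same: the spider describes what to fetch and how to parse it, and the engine does the fetching and calls back.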
After completing the code, run the crawler to crawl the data. Execute the scrapy crawl <spider_name> command in the shell to run the crawler 'quote' and store the crawled data in a CSV file:
(base) λ scrapy crawl quote -o quotes.csv
2021-06-19 20:48:44 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: quotes)
When the crawler finishes running, a quotes.csv file is generated in the current directory, with the data stored in CSV format.
The -o option supports saving in multiple formats. Using it is simple: just give the output file the matching extension (CSV, JSON, pickle, etc.).
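To see why the same items can be saved in different formats, the following sketch serializes one item to CSV and to JSON with the standard library. It only illustrates the idea and is not Scrapy's exporter code; the item values are placeholders, and to_csv and to_json are hypothetical helper names.

```python
import csv
import io
import json

# Placeholder item shaped like the dicts yielded by parse(); the values are invented.
items = [
    {"text": "An example quote.", "auth": "Example Author", "tages": ["tag1", "tag2"]},
]


def to_csv(rows):
    """Serialize items to CSV text, one row per item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "auth", "tages"])
    writer.writeheader()
    for row in rows:
        # CSV cells are flat strings, so the tag list is joined into one cell.
        writer.writerow({**row, "tages": ",".join(row["tages"])})
    return buffer.getvalue()


def to_json(rows):
    """Serialize items to JSON text; nested lists are kept as lists."""
    return json.dumps(rows, ensure_ascii=False)


csv_text = to_csv(items)
json_text = to_json(items)
print(csv_text)
print(json_text)
```

The item data is identical in both cases; only the serialization differs, which is why switching formats only requires changing the file extension passed to -o.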