Python crawler basics: a brief talk about the framework structure of Scrapy

Time:2022-5-21

Scrapy framework structure

Reflection

  • Why is Scrapy a framework rather than a library?
  • How does Scrapy work?

Project structure

Before you can start crawling, you must create a new Scrapy project. Go to the directory where you want to store the code and run the following command:
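(The command below is the standard scrapy startproject invocation; the project name quotes is assumed here because it matches the directory layout listed further down.)

scrapy startproject quotes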

Note: when you create a project, a new directory for the crawler project is created under the current directory.

These files are:

  • scrapy.cfg: the project's configuration file
  • quotes/: the project's Python module; you will add your code here
  • quotes/items.py: the item definitions for the project
  • quotes/middlewares.py: the spider middleware and downloader middleware (for processing requests and responses)
  • quotes/pipelines.py: the item pipelines for the project
  • quotes/settings.py: the project's settings file
  • quotes/spiders/: the directory where spider code is placed

Schematic diagram of Scrapy

Introduction to each component

1. Engine. The engine processes the data flow of the whole system and triggers events; it is the core of the whole framework.

2. Item. An item defines the data structure of the crawled results; the crawled data will be assigned to Item objects.
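As a minimal sketch of what an item might look like for a quotes crawler (the field names text, author and tags are illustrative assumptions, not something defined by this article):

import scrapy

class QuoteItem(scrapy.Item):
    # each Field declares one attribute of the crawled result
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()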

3. Scheduler. The scheduler accepts requests sent by the engine, adds them to a queue, and provides them back to the engine when the engine asks for them again.

4. Downloader. The downloader downloads web page content and returns it to the spiders.

5. Spiders. A spider defines the crawling logic and the page parsing rules; it is mainly responsible for parsing responses and generating results and new requests.
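For illustration, a minimal spider that parses a page and yields both results and follow-up requests could look like the sketch below; the target site quotes.toscrape.com and the CSS selectors are assumptions chosen only for this example:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # parsing rules: extract one result per quote block
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # new request: follow the pagination link, if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)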

6. Item Pipeline. The item pipeline is responsible for processing the items extracted from web pages by the spiders; its main tasks are cleaning, validating and storing the data.
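A rough sketch of the clean/validate/store idea, assuming the text field from the item sketch above:

from scrapy.exceptions import DropItem

class QuotesPipeline:
    def process_item(self, item, spider):
        # clean: strip surrounding whitespace from the quote text
        if item.get("text"):
            item["text"] = item["text"].strip()
            return item
        # validate: discard items that carry no text at all
        raise DropItem("missing text in %r" % item)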

7. Downloader Middlewares. The downloader middleware is a hook framework sitting between the engine and the downloader; it mainly processes the requests and responses passed between them.
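As a minimal sketch of a downloader middleware that touches both directions (the custom User-Agent value is an arbitrary assumption):

class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # called for every request the engine sends towards the downloader
        request.headers["User-Agent"] = "quotes-bot/1.0"
        return None  # returning None lets the request continue normally

    def process_response(self, request, response, spider):
        # called for every response on its way back to the engine
        spider.logger.debug("got %s for %s", response.status, request.url)
        return response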

8. Spider Middlewares. The spider middleware is a hook framework sitting between the engine and the spiders; it mainly handles the responses that go into the spiders and the results and new requests that the spiders produce.
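A spider middleware sketch that filters what a spider outputs; dropping results without an author field is purely an assumption made for illustration:

class DropAnonymousMiddleware:
    def process_spider_output(self, response, result, spider):
        # runs over the items and requests a spider yields for a response
        for element in result:
            if isinstance(element, dict) and not element.get("author"):
                continue  # silently drop results without an author
            yield element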

Data flow

  • Scrapy Engine: responsible for communication, signals and data transfer among the Spider, Item Pipeline, Downloader and Scheduler.
  • Scheduler: responsible for accepting requests sent by the engine, arranging them in a certain order, placing them in a queue, and returning them to the engine when the engine needs them.
  • Downloader: responsible for downloading all the requests sent by the Scrapy engine and returning the responses it obtains to the engine, which hands them to the spiders for processing.
  • Spider: responsible for processing all responses, analyzing and extracting data from them to obtain the data required by the Item fields, and submitting URLs that need to be followed up to the engine so that they enter the scheduler again.
  • Item Pipeline: where the items obtained from the spiders are processed and post-processed (detailed analysis, filtering, storage and so on).
  • Downloader Middleware: a component that can be customized to extend the download functionality.
  • Spider Middleware: a functional component that can be customized to extend the communication between the engine and the spiders (such as responses entering the spiders and requests leaving the spiders); a sketch of enabling these pluggable components in settings.py follows this list.
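To tie these pieces together, pluggable components are switched on in settings.py; the module paths and class names below come from the sketches above and are assumptions rather than part of the original project:

# settings.py (excerpt) - the integer values control the order components run in
ITEM_PIPELINES = {
    "quotes.pipelines.QuotesPipeline": 300,
}
DOWNLOADER_MIDDLEWARES = {
    "quotes.middlewares.CustomHeaderMiddleware": 543,
}
SPIDER_MIDDLEWARES = {
    "quotes.middlewares.DropAnonymousMiddleware": 543,
}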

This concludes this article on the basics of Python crawlers and the framework structure of Scrapy. For more information about the Scrapy framework structure, please search developeppaer's previous articles or continue to browse the related articles below. I hope you will support developeppaer in the future!