Introduction to Python Crawlers in 8 Minutes: The Simplest Basic Tutorial I Have Ever Seen

Date: 2020-02-07

1. Basic Introduction

1.1 What is a crawler?

A spider (also known as a web crawler) is a program that sends requests to a website, obtains its resources, and analyzes and extracts useful data from them.

At the technical level, a crawler simulates the behavior of a browser: it requests a site, downloads the HTML code, JSON data, or binary data (images, videos) the site returns, then extracts the data it needs and stores it for later use.


1.2 Basic workflow of a crawler

How users obtain data from the web:

Method 1: the browser submits a request -> downloads the page code -> renders it into a page

Method 2: simulate the browser to send a request (get the page code) -> extract the useful data -> store it in a database or a file

A crawler does exactly what Method 2 describes.


1. Initiate a request

Use an HTTP library to send a request to the target site.

A request consists of request headers, a request body, etc.

Limitation of the requests module: it cannot execute JS or CSS code.

2. Get the response content

If the server responds normally, you get a Response.

The Response may contain HTML, JSON, images, video, etc.

3. Parse the content

Parsing HTML data: regular expressions (the re module), XPath (most commonly used), BeautifulSoup, CSS selectors

Parsing JSON data: the json module

Parsing binary data: write it to a file in "wb" mode

4. Save the data

Store it in a database (MySQL, MongoDB, Redis) or in a file.
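A minimal sketch of these four steps, assuming the requests and lxml libraries; the URL, XPath expression, and output file name are illustrative placeholders, not from the original article:

```python
import requests
from lxml import etree

# 1. Initiate a request (URL and headers are placeholders)
url = "https://example.com/"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)

# 2. Get the response content
html = response.text

# 3. Parse the content (the XPath expression is just an example)
tree = etree.HTML(html)
titles = tree.xpath("//title/text()")

# 4. Save the data to a file
with open("result.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))
```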

1.3 The HTTP protocol: request and response

[Diagram: the HTTP protocol, request and response]

Request: the user sends a request to the server through a socket client (the browser).

Response: the server receives the request, analyzes the request information sent by the user, and returns the data (the returned data may contain links to other resources, such as images, JS, CSS, etc.).

Note: after the browser receives the response, it parses the content and displays it to the user; a crawler simulates the browser to send the request, receives the response, and then extracts the useful data.

1.3.1 Request

(1) Request method

Common request methods: GET / POST

(2) Request URL

A URL (Uniform Resource Locator) identifies a unique resource on the Internet; for example, an image, a file, or a video can each be uniquely located by its URL.

(3) Request headers

User-Agent: if the request headers carry no User-Agent, the server may treat you as an illegitimate client;

Cookie: cookies are used to keep the login state.

Note: a crawler usually adds request headers.

Header fields worth paying attention to:

Referer: where the visit came from (some large sites implement anti-hotlinking based on the Referer, so crawlers should also take care to simulate it)

User-Agent: the browser being used (add it, otherwise you will be treated as a crawler)

Cookie: remember to carry it in the request headers

(4) Request body

For a GET request the request body is empty (the parameters of a GET request are appended to the URL and can be seen directly); for a POST request the request body is form data.

Note: 1. Login forms, file uploads, etc. are carried in the request body. 2. To capture a login POST, enter a wrong username and password and submit; after a successful login the page usually redirects, so the POST cannot be captured.
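A small sketch of carrying these header fields with the requests library; all header values and the URL are illustrative assumptions:

```python
import requests

# Placeholder header values; real crawlers copy them from the
# browser's developer tools (Network panel)
headers = {
    "User-Agent": "Mozilla/5.0",        # without it the server may treat us as a bot
    "Referer": "https://example.com/",  # some sites check where the visit came from
    "Cookie": "sessionid=xxxx",         # keeps the login state
}

response = requests.get("https://example.com/page", headers=headers)
print(response.status_code)
```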

1.3.2 Response

(1) Response status code

200: success

301: redirect

404: resource not found

403: access forbidden

502: server error

(2) Response headers

Fields worth noting in the response headers: Set-Cookie: BDSVRTM=0; path=/ (there may be several of these), which tells the browser to save the cookie.

(3) The preview pane shows the response body, i.e. the page source, which may be:

JSON data

web page HTML, images

binary data, etc.


2. Basic Modules

2.1 requests

requests is a simple, easy-to-use HTTP library implemented in Python; it is an upgrade over urllib.

Open source address:

https://github.com/kennethrei…

Chinese API:

http://docs.python-requests.o…

2.2 re (regular expressions)

Regular expressions are used in Python through the built-in re module.

Disadvantages: data extraction is unstable and labor-intensive.
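A tiny sketch of extracting data with the re module; the HTML snippet and the pattern are made up for illustration:

```python
import re

html = '<div class="title"><a href="/article/1">First article</a></div>'

# A non-greedy pattern capturing the link and the anchor text; patterns
# like this break easily when the page layout changes, hence "unstable"
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')
for link, text in pattern.findall(html):
    print(link, text)
```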

2.3 XPath

XPath (XML Path Language) is a language for finding information in XML documents; it can be used to traverse the elements and attributes of an XML document.

In Python, XPath queries are mainly done with the lxml library (inside the Scrapy framework lxml is not needed; XPath can be used directly).

lxml is an HTML/XML parser whose main job is to parse and extract HTML/XML data.

Like the re module, lxml is implemented in C. It is a high-performance Python HTML/XML parser, and with XPath syntax we can quickly locate specific elements and node information.
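A minimal sketch of using lxml with XPath; the HTML and the expressions are illustrative:

```python
from lxml import etree

html = """
<ul>
  <li class="item"><a href="/page/1">one</a></li>
  <li class="item"><a href="/page/2">two</a></li>
</ul>
"""

tree = etree.HTML(html)
# XPath expressions locate elements and attributes directly
links = tree.xpath('//li[@class="item"]/a/@href')
texts = tree.xpath('//li[@class="item"]/a/text()')
print(links)   # ['/page/1', '/page/2']
print(texts)   # ['one', 'two']
```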

2.4 BeautifulSoup

Like lxml, BeautifulSoup is an HTML/XML parser whose main job is to parse and extract HTML/XML data.

Using BeautifulSoup requires importing the bs4 library.

Disadvantage: slower than regular expressions and XPath.

Advantage: easy to use.
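A minimal sketch with bs4; the HTML snippet is made up:

```python
from bs4 import BeautifulSoup

html = '<div class="item"><a href="/page/1">one</a><a href="/page/2">two</a></div>'

# "lxml" is a common parser choice; the built-in "html.parser" also works
soup = BeautifulSoup(html, "lxml")
for a in soup.select("div.item a"):
    print(a["href"], a.get_text())
```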

2.5 json

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It suits data-exchange scenarios such as the interaction between a website's front end and back end.

In Python, JSON data is handled mainly with the json module. Online JSON viewer:

https://www.sojson.com/simple…
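A small sketch of the json module; the data is made up:

```python
import json

# Parse a JSON string (as returned by a website) into Python objects
text = '{"code": 0, "data": [{"name": "python"}, {"name": "crawler"}]}'
obj = json.loads(text)
print(obj["data"][0]["name"])   # python

# Serialize Python objects back to a JSON string
print(json.dumps(obj, ensure_ascii=False))
```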

2.6 threading

Use the threading module to create threads: inherit directly from threading.Thread, then override the __init__ method and the run method.
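A minimal sketch of this pattern; the work done in run() is only a placeholder:

```python
import threading

class MyThread(threading.Thread):
    def __init__(self, name):
        super().__init__()      # initialize the parent Thread class first
        self.name = name        # Thread exposes "name" as a settable attribute

    def run(self):              # the thread's work goes into run()
        print(f"{self.name} is running")

t = MyThread("worker-1")
t.start()                       # start() runs run() in a new thread
t.join()
```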


3. Method Examples

3.1 GET example

demo_get.py

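The original code screenshot is not reproduced here; below is a minimal sketch of what a GET demo might look like, with the URL, headers, and parameters as illustrative assumptions:

```python
# demo_get.py - sketch of a simple GET request (URL and parameters are assumptions)
import requests

url = "https://www.baidu.com/s"
headers = {"User-Agent": "Mozilla/5.0"}
params = {"wd": "python"}       # query-string parameters of the GET request

response = requests.get(url, headers=headers, params=params)
response.encoding = "utf-8"
print(response.status_code)
print(response.text[:200])      # first part of the returned page
```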

3.2 POST example

demo_post.py

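A sketch of a POST demo; the URL and form fields are assumptions, using an echo service only for illustration:

```python
# demo_post.py - sketch of a POST request with form data
import requests

url = "https://httpbin.org/post"          # echo service, used here only for illustration
headers = {"User-Agent": "Mozilla/5.0"}
data = {"username": "test", "password": "123456"}   # placeholder form data

response = requests.post(url, headers=headers, data=data)
print(response.status_code)
print(response.json()["form"])            # httpbin echoes the submitted form back
```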

3.3 Using a proxy

demo_proxies.py

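A sketch of sending a request through a proxy with requests; the proxy address is a placeholder:

```python
# demo_proxies.py - sketch of sending a request through a proxy
import requests

url = "https://httpbin.org/ip"
headers = {"User-Agent": "Mozilla/5.0"}

proxies = {
    "http": "http://127.0.0.1:8888",      # replace with a real proxy before running
    "https": "http://127.0.0.1:8888",
}

response = requests.get(url, headers=headers, proxies=proxies, timeout=5)
print(response.text)
```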

3.4 Fetching Ajax data

demo_ajax.py

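A sketch of fetching Ajax-style JSON data; the API URL and headers are assumptions, and the real interface is normally found in the browser's Network/XHR panel:

```python
# demo_ajax.py - sketch of requesting the JSON interface behind an Ajax page
import requests

# Placeholder interface; find the real one in the browser's Network/XHR panel
url = "https://httpbin.org/json"
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",  # many Ajax endpoints expect this header
}

response = requests.get(url, headers=headers)
data = response.json()                     # parsed with the json module under the hood
print(data)
```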

3.5 Multithreading example

demo_thread.py
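A sketch of a multithreaded fetcher built on the threading.Thread pattern from section 2.6; the URLs are placeholders:

```python
# demo_thread.py - sketch of crawling several pages with threads
import threading
import requests

class CrawlThread(threading.Thread):
    def __init__(self, url):
        super().__init__()
        self.url = url

    def run(self):
        headers = {"User-Agent": "Mozilla/5.0"}
        response = requests.get(self.url, headers=headers)
        print(self.url, response.status_code, len(response.text))

# Placeholder URL list
urls = ["https://httpbin.org/get", "https://httpbin.org/html"]
threads = [CrawlThread(u) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
```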


4. Crawler Frameworks

4.1 The Scrapy framework

Scrapy is a widely used application framework written in pure Python for crawling website data and extracting structured data.

Scrapy uses the Twisted asynchronous networking framework to handle network communication, which speeds up downloads without us having to implement an asynchronous framework ourselves, and it provides various middleware interfaces that can flexibly satisfy all kinds of requirements.

4.2 Scrapy architecture

[Scrapy architecture diagram]

4.3 Main components of Scrapy

Scrapy Engine: responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler: receives the Requests sent by the Engine, arranges and enqueues them in a certain order, and returns them to the Engine when the Engine asks for them.

Downloader: downloads all the Requests sent by the Scrapy Engine and returns the obtained Responses to the Engine, which hands them to the Spider for processing.

Spider: processes all Responses, analyzes and extracts data from them to fill the Item fields, and submits URLs that need to be followed up to the Engine, which puts them back into the Scheduler.

Item Pipeline: processes the Items obtained by the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).

Downloader Middlewares: components you can think of as customizing and extending the download functionality.

Spider Middlewares: functional components that extend and intercept the communication between the Engine and the Spider (for example, the Responses going into the Spider and the Requests coming out of it).

4.4 How Scrapy operates

Engine: Hi! Spider, which website do you want to handle?

Spider: The boss wants me to handle xxxx.com.

Engine: Give me the first URL to process.

Spider: Here you are. The first URL is xxxxxxx.com.

Engine: Hi! Scheduler, I have a Request here; please sort it and put it in the queue for me.

Scheduler: OK, working on it. Wait a moment.

Engine: Hi! Scheduler, give me the Request you have processed.

Scheduler: Here you are. This is the Request I have processed.

Engine: Hi! Downloader, please download this Request for me according to the boss's downloader middleware settings.

Downloader: OK! Here you are, downloaded. (If it fails: Sorry, this Request failed to download. The Engine then tells the Scheduler: this Request failed to download; please record it, and we will download it again later.)

Engine: Hi! Spider, here is the downloaded content, already processed according to the boss's downloader middlewares. Please handle it yourself. (Note: by default, Responses are handed to the def parse() function.)

Spider: (after processing the data, for URLs that need follow-up) Hi! Engine, I have two things here: this is the URL I need to follow up, and this is the Item data I extracted.

Engine: Hi! Pipeline, I have an Item here; please handle it for me! Scheduler, here is a follow-up URL that needs your handling. Then the cycle repeats from step 4 until all the information the boss needs has been obtained.

Pipeline and Scheduler: OK, doing it right away!

4.5 Four steps to make a Scrapy crawler

1. Create a new crawler project: scrapy startproject mySpider

2. Define the target (write items.py): open items.py under the mySpider directory

3. Make the crawler (spiders/xxxspider.py): scrapy genspider guishi365 "guishi365.com"

4. Store the content (pipelines.py): design a pipeline to store the crawled content
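A minimal sketch of steps 2 and 3, assuming Scrapy is installed; the item fields and the parsing logic are illustrative assumptions, not the original demo:

```python
# mySpider/items.py - step 2: define the target fields (fields are assumptions)
import scrapy

class StoryItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()


# mySpider/spiders/guishi365_spider.py - step 3: write the spider
class Guishi365Spider(scrapy.Spider):
    name = "guishi365"
    allowed_domains = ["guishi365.com"]
    start_urls = ["http://guishi365.com/"]

    def parse(self, response):
        # The XPath expressions are placeholders for whatever the real page needs
        for href in response.xpath("//a/@href").getall():
            item = StoryItem()
            item["title"] = response.xpath("//title/text()").get()
            item["url"] = response.urljoin(href)
            yield item
```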


5. Common Tools

5.1 Fiddler

Fiddler is a packet-capture tool, mainly used for capturing packets from mobile phones.

5.2 XPath Helper

The XPath Helper plug-in is a free Chrome tool for parsing crawled pages. It helps users with problems such as not being able to locate elements correctly when building an XPath expression.

Installing and using the Chrome plug-in XPath Helper:

https://jingyan.baidu.com/art…


6. Distributed Crawlers

6.1 scrapy-redis

scrapy-redis provides some Redis-based components (pip install scrapy-redis) that make distributed crawling with Scrapy easier to implement.

6.2 distributed strategy

Master (core server): runs a Redis database and does not crawl itself; it is only responsible for URL fingerprint deduplication, Request allocation, and data storage.