1、 Basic introduction
1.1 What is a crawler
A crawler (also known as a spider) is a program that sends requests to a website, obtains resources, and then analyzes and extracts useful data from them.
At the technical level, it simulates the behavior of a browser: it requests a site, downloads the HTML code / JSON data / binary data (images, videos) the site returns, then extracts the data it needs and stores it for later use.
1.2 Basic workflow of a crawler
How users obtain network data:
Method 1: the browser submits a request -> downloads the web page code -> renders it into a page
Method 2: simulate a browser sending a request (obtain the page code) -> extract the useful data -> store it in a database or file
A crawler does exactly what Method 2 describes.
1. Initiate a request
Use an HTTP library to send a request to the target site.
A request includes request headers, a request body, etc.
Limitation of the requests module: it cannot execute JS or CSS code.
2. Get the response content
If the server responds normally, you get a response.
A response can contain HTML, JSON, images, video, etc.
3. Parse the content
Parsing HTML data: regular expressions (the re module), XPath (the most commonly used), Beautiful Soup, CSS selectors.
Parsing JSON data: the json module.
Parsing binary data: write to a file in "wb" mode.
4. Save the data
Save to a database (MySQL, MongoDB, Redis) or to a file.
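The four steps above can be sketched in a few lines. This is a minimal illustration, not a complete crawler: the URL is a placeholder, and the regular expression only extracts the page title.

```python
import re
import requests  # third-party library: pip install requests

def parse_titles(html):
    """Step 3: parse the content; here, pull out <title> text with the re module."""
    return re.findall(r"<title>(.*?)</title>", html, re.S)

def crawl(url):
    """Steps 1-2: send the request and receive the response."""
    headers = {"User-Agent": "Mozilla/5.0"}  # simulate a browser
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.text

# Step 4: save the data (needs network access, so shown as comments):
# html = crawl("https://www.example.com")
# with open("titles.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(parse_titles(html)))
```

The parsing step is kept as a separate function so it can be reused and tested without touching the network.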
1.3 HTTP requests and responses
Request: the user sends request information to the server through a socket client (the browser).
Response: the server receives the request, analyzes the request information sent by the user, and returns data (the returned data may contain links to other resources, such as images, JS, and CSS).
PS: after the browser receives the response, it parses the content and displays it to the user; a crawler instead extracts the useful data after simulating the browser's request and receiving the response.
(1) Request method
Common request methods: GET / POST.
(2) Request URL
A URL (Uniform Resource Locator) identifies a unique resource on the Internet; an image, a file, or a video can each be uniquely located by its URL.
(3) Request headers
User-Agent: if the request carries no User-Agent header, the server may treat the client as an illegitimate user.
Cookie: cookies are used to keep login state.
Note: a crawler should generally add request headers.
Headers to pay attention to:
Referer: where the visit came from (some large sites implement anti-hotlinking through the Referer header, so crawlers should simulate it as well).
User-Agent: the browser being used (add it, otherwise the server will treat you as a crawler).
Cookie: carry it in the request headers when login state is required.
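A typical header dictionary for a crawler might look like this; all values here are illustrative placeholders, not real credentials.

```python
# Typical request headers a crawler sends (values are illustrative)
headers = {
    # Identify as a normal browser, otherwise many servers reject the request
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
    # Claim the visit came from the site itself (anti-hotlink checks)
    "Referer": "https://www.example.com/",
    # Carry login state when the page requires it (placeholder value)
    "Cookie": "sessionid=xxxx",
}
```

This dictionary is passed to the HTTP library with each request, e.g. `requests.get(url, headers=headers)`.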
(4) Request body
In a GET request the body is empty (the parameters are appended to the URL, where they can be seen directly); in a POST request the body carries the form data.
PS: 1. Information such as login credentials and uploaded files is attached in the request body. 2. To observe a login POST, enter a wrong username and password and submit; the POST can then be captured. After a successful login the page usually redirects, so the POST cannot be captured.
(1) Response status codes
301: permanent redirect
404: file not found
403: access forbidden
502: server error
(2) Response headers to note: Set-Cookie: BDSVRTM=0; path=/ (there may be several of these, telling the browser to save the cookies).
(3) Preview is the source of the web page, such as the page HTML, images, binary data, etc.
2、 Basic modules
2.1 requests
requests is a simple, easy-to-use HTTP library implemented in Python, a higher-level improvement over urllib.
Open source address:
2.2 re regular expressions
Regular expressions are used through Python's built-in re module.
Disadvantages: unstable when the data is messy, and writing patterns is a lot of work.
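A small example of extracting links with re; the HTML snippet is made up for illustration.

```python
import re

html = ('<div class="item"><a href="/p/1">First</a></div>'
        '<div class="item"><a href="/p/2">Second</a></div>')

# Non-greedy groups capture the href and the link text of each item
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')
links = pattern.findall(html)
print(links)  # [('/p/1', 'First'), ('/p/2', 'Second')]
```

The `.*?` non-greedy quantifier is important: a greedy `.*` would swallow everything up to the last closing tag.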
2.3 XPath and lxml
XPath (XML Path Language) is a language for finding information in XML documents; it can be used to traverse elements and attributes in an XML document.
In Python, the lxml library is mainly used for XPath queries (inside the Scrapy framework, lxml is not called directly; XPath is used through the framework itself).
lxml is an HTML/XML parser whose main job is parsing and extracting HTML/XML data. Like re, it is implemented in C, which makes it a high-performance Python HTML/XML parser; with the XPath syntax covered earlier, we can quickly locate specific elements and node information.
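A short sketch of using lxml with XPath (requires `pip install lxml`); the HTML fragment and class names are invented for the example.

```python
from lxml import etree  # third-party: pip install lxml

html = """
<ul>
  <li class="book"><a href="/b/1">Python</a></li>
  <li class="book"><a href="/b/2">Scrapy</a></li>
</ul>
"""

tree = etree.HTML(html)  # parse (tolerates imperfect HTML)
titles = tree.xpath('//li[@class="book"]/a/text()')  # node text
hrefs = tree.xpath('//li[@class="book"]/a/@href')    # attribute values
print(titles)  # ['Python', 'Scrapy']
print(hrefs)   # ['/b/1', '/b/2']
```

`text()` selects text nodes and `@href` selects attribute values; both return plain Python lists.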
2.4 Beautiful Soup
Like lxml, Beautiful Soup is an HTML/XML parser; its main job is parsing and extracting HTML/XML data.
Using Beautiful Soup requires importing the bs4 library.
Disadvantage: slower than re and XPath.
Advantage: easy to use.
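A minimal Beautiful Soup sketch (requires `pip install beautifulsoup4`); the HTML snippet is a made-up example.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = '<div class="item"><a href="/p/1">First</a></div>'
soup = BeautifulSoup(html, "html.parser")  # stdlib parser; "lxml" also works

link = soup.find("a")        # first matching tag
print(link.get_text())       # First
print(link["href"])          # /p/1
```

`find` returns the first match; `find_all` returns every match as a list.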
2.5 json
The json module is mainly used to handle JSON data in Python. Online JSON formatting website:
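The two core functions are `json.loads` (JSON text to Python objects) and `json.dumps` (Python objects to JSON text); the sample string below is invented.

```python
import json

# A JSON string such as an API might return
raw = '{"name": "spider", "pages": 10, "ok": true}'

data = json.loads(raw)   # JSON text -> Python dict
print(data["pages"])     # 10
print(data["ok"])        # True (JSON true becomes Python True)

text = json.dumps(data, ensure_ascii=False)  # dict -> JSON text
```

`ensure_ascii=False` keeps non-ASCII characters (e.g. Chinese) readable in the output instead of escaping them.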
2.6 threading
Use the threading module to create threads: inherit directly from threading.Thread, then override the __init__ method and the run method.
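A sketch of that subclassing pattern; the class and task names are hypothetical.

```python
import threading

class CrawlThread(threading.Thread):
    """Hypothetical worker thread: override __init__ and run."""

    def __init__(self, name, tasks):
        super().__init__()      # initialise the Thread machinery first
        self.name = name        # Thread exposes a settable name property
        self.tasks = tasks
        self.done = []

    def run(self):
        # run() executes in the new thread after start() is called
        for task in self.tasks:
            self.done.append(f"{self.name} handled {task}")

t = CrawlThread("worker-1", ["url1", "url2"])
t.start()   # launches the thread, which calls run()
t.join()    # wait for it to finish
print(t.done)
```

Calling `super().__init__()` before anything else is required, otherwise the Thread machinery is not set up.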
3、 Method examples
3.1 GET method example
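A hedged sketch of a GET request with requests; the URL and query parameter in the commented usage are placeholders.

```python
import requests  # third-party: pip install requests

def get_page(url, params=None):
    """GET request: params are appended to the URL as a query string."""
    headers = {"User-Agent": "Mozilla/5.0"}  # simulate a browser
    resp = requests.get(url, params=params, headers=headers, timeout=10)
    resp.encoding = resp.apparent_encoding   # guard against garbled text
    return resp.text

# Usage (requires network access):
# html = get_page("https://www.example.com/search", params={"q": "python"})
# print(html[:200])
```

Passing the query string through `params` lets requests handle URL encoding for you.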
3.2 POST method example
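A sketch of a POST request: the form data travels in the request body, as described in section 1.3. The URL and field names are placeholders; real sites use their own field names.

```python
import requests  # third-party: pip install requests

def post_form(url, form):
    """POST request: the form dict is sent as the request body."""
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.post(url, data=form, headers=headers, timeout=10)
    return resp.text

# Usage (requires network; field names depend on the target site):
# html = post_form("https://example.com/login",
#                  {"username": "tom", "password": "secret"})
```

Use `data=` for form-encoded bodies; for a JSON body, requests also accepts `json=` instead.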
3.3 Adding a proxy
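A sketch of routing requests through a proxy so the target site sees the proxy's IP; the proxy address is a placeholder.

```python
import requests  # third-party: pip install requests

# Proxy server address: a placeholder, replace with a real proxy
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}

def get_via_proxy(url):
    """Send the GET request through the configured proxy."""
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    return resp.text

# Usage (requires a running proxy and network access):
# print(get_via_proxy("https://www.example.com"))
```

Rotating through a pool of such proxy addresses is a common way to avoid IP bans.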
3.4 Fetching Ajax data example
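Ajax endpoints usually return JSON rather than HTML, so the response is parsed with `resp.json()` instead of an HTML parser. The endpoint, header use, and field names below are illustrative assumptions.

```python
import json
import requests  # third-party: pip install requests

def get_ajax(url, params=None):
    """Fetch an Ajax endpoint and parse its JSON body."""
    headers = {
        "User-Agent": "Mozilla/5.0",
        # Many sites use this header to distinguish Ajax calls from page loads
        "X-Requested-With": "XMLHttpRequest",
    }
    resp = requests.get(url, params=params, headers=headers, timeout=10)
    return resp.json()  # parse the JSON body into Python objects

# The parsing step works the same on a canned response:
sample = json.loads('{"data": [{"title": "first"}, {"title": "second"}]}')
titles = [row["title"] for row in sample["data"]]
print(titles)  # ['first', 'second']
```

The real endpoint URL is found by watching the browser's network panel (or Fiddler) while the page loads.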
3.5 Multithreading example
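A sketch of a multithreaded crawler: worker threads pull URLs from a shared queue. The fetch function is injected so the example runs with a stand-in; swap in a real `requests.get(...).text` call for actual crawling.

```python
import queue
import threading

def worker(url_queue, results, fetch):
    """Take URLs from the shared queue until it is empty."""
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            break  # no more work
        results.append(fetch(url))  # list.append is thread-safe in CPython
        url_queue.task_done()

def crawl_all(urls, fetch, n_threads=4):
    """Crawl all URLs using n_threads worker threads."""
    url_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)
    results = []
    threads = [threading.Thread(target=worker, args=(url_queue, results, fetch))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker to finish
    return results

# Stand-in fetch function; replace with lambda u: requests.get(u).text
pages = crawl_all(["u1", "u2", "u3"], fetch=lambda u: f"page of {u}")
print(sorted(pages))  # ['page of u1', 'page of u2', 'page of u3']
```

Because threads finish in arbitrary order, the results are sorted before printing; real crawlers usually attach the URL to each result instead.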
4、 Crawler frameworks
Scrapy is an application framework written in pure Python for crawling website data and extracting structured data; it is very widely used.
Scrapy uses the Twisted ['twɪstɪd] asynchronous networking framework to handle network communication. This speeds up downloads without requiring us to implement the asynchronous machinery ourselves, and its various middleware interfaces make it flexible enough for all kinds of requirements.
4.3 Main components of Scrapy
Scrapy Engine: responsible for communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler: receives requests sent by the engine, arranges and queues them in a certain way, and returns them to the engine when the engine asks for them.
Downloader: downloads all requests sent by the Scrapy Engine and returns the responses obtained to the engine, which hands them to the Spider for processing.
Spider: processes all responses, analyzes and extracts data from them to fill the Item fields, and submits the URLs that need following up to the engine, where they enter the Scheduler again.
Item Pipeline: processes the Items obtained from the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).
Downloader Middlewares: components that let you customize and extend the download function.
Spider Middlewares: functional components that extend the engine's communication with the Spider (for example, responses entering the Spider and requests leaving it).
4.4 How Scrapy runs
Engine: Hi! Spider, which website are you going to handle?
Spider: The boss wants me to handle xxxx.com.
Engine: Give me the first URL that needs processing.
Spider: Here you are. The first URL is xxxxxxx.com.
Engine: Hi! Scheduler, I have a request here; please sort it and put it in the queue for me.
Scheduler: OK, processing it. Please wait a moment.
Engine: Hi! Scheduler, give me the request you have processed.
Scheduler: Here you are. This is the request I have processed.
Engine: Hi! Downloader, please download this request for me according to the boss's downloader middleware settings.
Downloader: OK! Here you are, the download is done. (If it fails: Sorry, this request failed to download. The engine then tells the Scheduler: this request failed to download; record it and we will download it again later.)
Engine: Hi! Spider, here is something that has been downloaded and already handled by the boss's downloader middlewares; please process it yourself. (Note: by default, responses are handed to the def parse() function.)
Spider: (after processing the data, for URLs that need following up) Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the Item data I obtained.
Engine: Hi! Pipeline, I have an Item here; please handle it for me! Scheduler! Here is a follow-up URL you need to handle for me. (The cycle then repeats from step 4 until all the needed information has been obtained.)
Pipeline and Scheduler: OK, doing it now!
4.5 Four steps to make a Scrapy crawler
1. Create a new crawler project: scrapy startproject mySpider
2. Define the target (write items.py): open items.py under the mySpider directory
3. Make the crawler (spiders/xxxspider.py): scrapy genspider guishi365 "guishi365.com"
4. Store the content (pipelines.py): design a pipeline to store the crawled content
5、 Common tools
Fiddler is a packet capture tool, mainly used for capturing packets from mobile phones.
The XPath Helper plug-in is a free Chrome tool for parsing crawled pages. It helps users solve problems such as XPath expressions failing to locate elements.
Installation and use of the Chrome plug-in XPath Helper:
6、 Distributed crawlers
6.1 scrapy-redis
scrapy-redis provides a set of Redis-based components (pip install scrapy-redis) that make distributed crawling with Scrapy easy to implement.
6.2 Distributed strategy
Master (core server): runs a Redis database; it does not crawl, and is only responsible for URL fingerprint deduplication, request allocation, and data storage.