Life is short, I use Python.
Xiaobai Learns Python Crawlers (1): Opening
Xiaobai Learns Python Crawlers (2): Preparation (1) Installing the Basic Libraries
Xiaobai Learns Python Crawlers (3): Preparation (2) Introduction to Linux
Xiaobai Learns Python Crawlers (4): Preparation (3) Introduction to Docker
Xiaobai Learns Python Crawlers (5): Preparation (4) Database Basics
Xiaobai Learns Python Crawlers (6): Preparation (5) Installing the Crawler Frameworks
The origin of the network
This is actually a bit of trivia: can you guess where computer networks originated?
Silicon Valley? A university? A laboratory? Those are close, but not accurate enough.
The exact answer is the US Department of Defense, in the context of the Cold War between the United States and the Soviet Union.
Yes, you read that right: the US military. The most advanced technology is often applied in the military field first, and then gradually makes its way into civilian use over time.
In 1969, ARPANET was born under the leadership of the Advanced Research Projects Agency (ARPA) of the US Department of Defense.
At first, ARPANET had only four nodes, connecting host computers at UCLA, UC Santa Barbara, the Stanford Research Institute and the University of Utah.
ARPANET is recognized worldwide as the originator of computer networks.
URI, URL and URN
A crawler is a program that simulates a browser to make HTTP requests. This requires us to understand what happens between entering a URL in the browser and getting the web page.
First, we introduce a set of concepts: URI, URL and URN.
- URI = Uniform Resource Identifier, a compact string used to identify an abstract or physical resource.
- URL = Uniform Resource Locator, a string that locates a resource through its primary access mechanism. A complete URL can include: protocol, host, port, path, parameters, anchor.
- URN = Uniform Resource Name, which identifies a resource by a unique name or ID in a specific namespace.
Doesn't make sense yet? That's OK, there is no need to memorize the definitions. Let's take an example.
For example, the address of the picture above, https://cdn.geekdigging.com/p…, is a URL and also a URI. URLs are a subset of URIs: every URL is a URI, but not every URI is a URL, because URIs also include a subclass called URN. On today's web URNs are rarely used, so almost all URIs are URLs. Ordinary web links can be called URLs or URIs, depending on personal preference.
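To make the parts of a URL more concrete, here is a minimal sketch that splits one with Python's standard `urllib.parse` module. The URL is made up for illustration and does not come from this article's site.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, invented for this example only.
url = "https://www.example.com:8443/articles/index.html?page=2&lang=en#comments"

parts = urlsplit(url)

print(parts.scheme)           # protocol
print(parts.hostname)         # host
print(parts.port)             # port
print(parts.path)             # path
print(parse_qs(parts.query))  # parameters, as a dict of lists
print(parts.fragment)         # anchor
```

Each attribute corresponds to one of the components listed above: protocol, host, port, path, parameters and anchor.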
What is hypertext?
Hypertext is a word, phrase, or chunk of text that can be linked to another document or text. Hypertext includes text hyperlinks and graphic hyperlinks.
The web pages we visit in browsers are written in HTML, the HyperText Markup Language. HTML code contains a series of tags, including tags for hyperlinks and images.
Let's take a look at the source code of a real website. In the Chrome browser, press F12 to open the developer tools.
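The same tags can also be picked out by a program. The following is a rough sketch using the standard library's `html.parser`. The HTML snippet and the `LinkCollector` class name are invented for illustration, and a real page would be far messier.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the targets of <a href="..."> and <img src="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])


# Made-up HTML: one text hyperlink and one graphic hyperlink.
html = """
<html><body>
  <a href="/about.html">About</a>
  <a href="https://www.example.com/"><img src="/logo.png" alt="logo"></a>
</body></html>
"""

collector = LinkCollector()
collector.feed(html)
print(collector.links)
print(collector.images)
```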
HTTP and HTTPS
What is HTTP?
HTTP, the Hypertext Transfer Protocol, is a stateless application-layer protocol based on requests and responses. It usually runs on top of TCP/IP to transmit data. It is the most widely used network protocol on the Internet, and all WWW documents must comply with this standard. HTTP was originally designed to provide a way to publish and receive HTML pages.
What is HTTPS?
As the book "Illustrated HTTP" puts it, HTTPS is HTTP wearing an SSL shell. HTTPS is a transport protocol for secure communication over a computer network. It communicates through HTTP, uses SSL/TLS to establish a secure channel, and encrypts the data packets. The main purpose of HTTPS is to authenticate the identity of the website server while protecting the privacy and integrity of the exchanged data.
PS: TLS (Transport Layer Security) is a transport-layer encryption protocol. Its predecessor is the SSL protocol, first released by Netscape in 1995, and the two names are sometimes used interchangeably.
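On the client side, the "identity authentication" mentioned above shows up as certificate checks. As a small sketch, Python's standard `ssl` module can build a default client context, and we can inspect which checks it turns on. Nothing is sent over the network here.

```python
import ssl

# Default client-side TLS settings from the standard library.
context = ssl.create_default_context()

# The server's certificate must be valid and signed by a trusted CA ...
print(context.verify_mode == ssl.CERT_REQUIRED)
# ... and it must match the host name we asked for.
print(context.check_hostname)
```

Such a context can be passed to `urllib.request.urlopen(url, context=context)` when requesting an HTTPS page.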
Nowadays, more and more websites and apps are moving to HTTPS, for example:
- Apple required all iOS apps to use HTTPS encryption by January 1, 2017, otherwise the app could not be listed in the App Store;
- Starting with Chrome 56, released in January 2017, Google shows a risk warning for pages that are not encrypted with HTTPS, reminding users at a prominent position in the address bar that the page is "Not secure";
- The official requirements document for Tencent's WeChat Mini Programs requires the backend to use HTTPS for network communication; domain names and protocols that do not meet the conditions cannot be requested.
The HTTP protocol itself is very simple. It stipulates that only the client can actively initiate a request, and the server returns a response after receiving the request. At the same time, HTTP is a stateless protocol: the protocol itself does not keep a record of the client's previous requests.
To show this process more intuitively, we again open the Chrome browser and press F12 to open the developer tools.
Look at the first row, www.geekdigging.com. The columns in that row are:
- Name: the name of the request.
- Status: the status code; 200 means a normal response.
- Type: the document type. Here we requested an HTML document.
- Initiator: the request source, marking which object or process initiated the request.
- Size: the size of the requested resource.
- Time: the time consumed, in ms.
- Waterfall: a visualization of the network request timeline.
We can click on that row to see more details:
The detail view includes Headers (header information), Preview (a preview of the response), Response (the specific HTML code of the response), Cookies, and Timing (the time spent over the whole request cycle).
The General section lists: Request URL, the URL of the request; Request Method, the request method; Status Code, the response status code; Remote Address, the address and port of the remote server; and Referrer Policy, the policy that controls how the Referer is sent.
An HTTP request message consists of a request line, request headers, a blank line and a request body.
The request line is divided into three parts: the request method, the request URL and the HTTP protocol version.
For example: GET /index.html HTTP/1.1.
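The layout of a request message can be sketched in a few lines of Python. This only builds and takes apart a string and sends nothing. The host and header values are placeholders chosen for the example.

```python
# Request line: method, request URL, protocol version.
request_line = "GET /index.html HTTP/1.1"

# Hypothetical header values, for illustration only.
headers = {
    "Host": "www.example.com",
    "User-Agent": "demo-crawler/0.1",
    "Accept": "text/html",
}
body = ""  # a GET request normally has an empty body

# Lines end with CRLF; a blank line separates the headers from the body.
message = (
    request_line + "\r\n"
    + "".join(f"{name}: {value}\r\n" for name, value in headers.items())
    + "\r\n"
    + body
)

# Taking the message apart again.
method, target, version = request_line.split(" ")
head, _, parsed_body = message.partition("\r\n\r\n")

print(method, target, version)
print(head.split("\r\n")[1:])  # the header lines
```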
HTTP/1.1 defines eight request methods, and PATCH was added later, giving the nine listed below:
- GET: requests a page and returns the page content.
- POST: mostly used to submit forms or upload files; the data is contained in the request body.
- PUT: the data sent from the client to the server replaces the content of the specified document.
- DELETE: asks the server to delete the specified page.
- PATCH: a supplement to the PUT method, used to partially update a known resource.
- HEAD: similar to a GET request, except that the response has no body; it is used to get the headers.
- OPTIONS: allows the client to query the capabilities of the server, such as which methods it supports.
- TRACE: echoes back the request received by the server, mainly for testing or diagnostics.
- CONNECT: reserved in HTTP/1.1 for proxy servers that can switch the connection to tunnel mode.
The most commonly used are GET and POST.
GET: entering a URL directly in the browser and pressing Enter initiates a GET request. The request parameters are included directly in the URL: the parameter names and their values are appended to the URL, and a question mark (?) marks the end of the URL path and the beginning of the request parameters. The length of the parameters is limited, because different browsers impose different limits on the length of the address; a limit of about 1024 characters is commonly quoted. Therefore, if a large amount of data needs to be transmitted, the GET method is not suitable.
POST: allows the client to provide more information to the server. The POST method encapsulates the request parameters in the HTTP request body in the form of name/value pairs, which can carry a large amount of data. The POST method itself places no limit on the size of the transmitted data, and the parameters are not displayed in the URL.
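The difference between the two methods can be seen without sending anything. In this sketch, `urllib.request.Request` objects are only constructed, not opened. The URL and parameters are invented for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters.
base_url = "https://www.example.com/search"
params = {"q": "python crawler", "page": 1}

# GET: the parameters are appended to the URL after a question mark.
get_request = Request(base_url + "?" + urlencode(params))

# POST: the same parameters travel in the request body instead.
post_request = Request(base_url, data=urlencode(params).encode("utf-8"))

print(get_request.get_method(), get_request.full_url, get_request.data)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Note that `urllib` chooses the method from whether `data` is present, so no method name has to be passed explicitly here.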
Because the amount of information the request line can carry is very limited, many things the client wants to tell the server have to be put in the request headers. The request headers provide the server with additional information, such as User-Agent, which indicates the identity of the client and lets the server know whether the request comes from a browser or a crawler, and whether the browser is Chrome or Firefox. HTTP/1.1 specifies 47 header field types. The format of an HTTP header field is similar to the dictionary type in Python: key-value pairs separated by a colon.
The following is a brief description of some common header information.
- Accept: a request header field that specifies which types of content the client can accept.
- Accept-Language: specifies the languages accepted by the client.
- Accept-Encoding: specifies the content encodings that the client can accept.
- Host: specifies the host name and port number of the requested resource; its content is the location of the origin server or gateway of the request URL. Starting with HTTP/1.1, every request must contain this header.
- Cookie: also commonly seen in the plural, Cookies. This is data that a website stores on the user's machine in order to identify the user and track the session. Its main function is to maintain the current session. For example, after we enter a user name and password and log in to a website successfully, the server saves the login state in a session. Later, when we refresh or request other pages of the site, we find that we are still logged in. Cookies deserve the credit for this: they carry information that identifies our session on the server. Each time the browser requests a page of the site, it adds the cookies to the request headers and sends them to the server. The server identifies us through the cookies and determines that we are logged in, so the returned result is the page content that is visible only after login.
- Referer: identifies the page from which the request was sent. The server can use this information for processing such as traffic source statistics and hotlink protection.
- User-Agent: UA for short, a special string header that lets the server identify the client's operating system and version, and browser and version. Adding this header to a crawler lets it pass itself off as a browser; without it, the request is likely to be recognized as coming from a crawler.
- Content-Type: also known as the Internet media type or MIME type. In the HTTP headers, it indicates the media type of the data in a specific request. For example, text/html represents HTML, image/gif represents a GIF image, and application/json represents JSON. For more mappings, see this cross-reference table (http://tool.oschina.net/commons).
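To see why the User-Agent matters for a crawler, here is a small sketch with the standard `urllib.request` module. It prints the User-Agent that urllib would send by default, then builds a request with browser-like headers. The header values are examples only, and nothing is actually requested.

```python
from urllib.request import Request, build_opener

# What urllib sends when we do not set a User-Agent ourselves.
default_headers = dict(build_opener().addheaders)
print(default_headers["User-agent"])  # something like "Python-urllib/3.x"

# Example header values imitating a browser (made up for this sketch).
browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html",
    "Accept-Language": "en-US,en;q=0.9",
}
request = Request("https://www.example.com/", headers=browser_headers)

# urllib stores header names via str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
print(request.header_items())
```

The default value openly announces a Python client, which is exactly the kind of signal a server can use to tell crawlers from browsers.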
The request body generally carries the form data of a POST request; for a GET request, the request body is empty.
Note that the way the data is submitted is closely tied to the Content-Type set in the request headers.
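As a sketch of that relationship, the same data can be encoded as a form or as JSON, and the Content-Type header has to say which one it is. The login URL and field values below are invented, and the requests are only built, never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and form fields.
url = "https://www.example.com/login"
payload = {"username": "xiaobai", "password": "secret"}

# Form submission: name=value pairs joined by "&".
form_request = Request(
    url,
    data=urlencode(payload).encode("utf-8"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# JSON submission: the body is a serialized JSON document.
json_request = Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

print(form_request.get_header("Content-type"), form_request.data)
print(json_request.get_header("Content-type"), json_request.data)
```

If the header and the body encoding disagree, the server will usually fail to parse the submitted data.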
After receiving and processing the request, the server returns the response to the client. Similarly, the response must follow a fixed format to be parsed correctly by the browser. An HTTP response also consists of three parts: the response line, the response headers and the response body, which mirror the format of the HTTP request.
The response line also has three parts: the HTTP protocol version supported by the server, the status code, and a brief reason phrase describing the status code.
The status code indicates the response status of the server. For example, 200 means the server responded normally, 404 means the page was not found, and 500 means an internal server error.
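Python's standard library already knows these status codes. The short sketch below splits a sample response line and looks up the reason phrases for the three codes mentioned above, using `http.HTTPStatus`.

```python
from http import HTTPStatus

# A sample response line: protocol version, status code, reason phrase.
status_line = "HTTP/1.1 200 OK"
version, code, reason = status_line.split(" ", 2)
print(version, int(code), reason)

# Look up the standard reason phrase for each code.
phrases = {code: HTTPStatus(code).phrase for code in (200, 404, 500)}
print(phrases)
```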
The response headers contain the server's response information for the request, such as Content-Type, Server, Set-Cookie, etc. The following is a brief description of some common response headers.
- Date: identifies the time when the response was generated.
- Last-Modified: specifies the last modification time of the resource.
- Content-Encoding: specifies the encoding of the response content.
- Server: contains information about the server, such as its name and version number.
- Set-Cookie: sets cookies. The Set-Cookie field in the response headers tells the browser to store this content in its cookies, and the browser will carry the cookie in subsequent requests.
- Expires: specifies the expiration time of the response, which lets a proxy server or browser keep the loaded content in its cache. If the content is accessed again before it expires, it can be loaded directly from the cache, reducing server load and shortening loading time.
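The round trip between Set-Cookie and Cookie can be sketched with the standard `http.cookies` module. The cookie name and value are made up, and this is only a simplified picture of what a browser does.

```python
from http.cookies import SimpleCookie

# Hypothetical value of a Set-Cookie response header.
set_cookie_value = "sessionid=abc123; Path=/; HttpOnly"

# The client parses and stores the cookie ...
jar = SimpleCookie()
jar.load(set_cookie_value)
morsel = jar["sessionid"]
print(morsel.value, morsel["path"], morsel["httponly"])

# ... and sends name=value back in the Cookie header of the next request.
cookie_header = "; ".join(f"{name}={m.value}" for name, m in jar.items())
print("Cookie:", cookie_header)
```

Attributes such as Path and HttpOnly stay on the client; only the name and value are sent back to the server.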
The most important part is the response body. The actual response data is all in the response body. For example, when a web page is requested, its response body is the HTML code of the page; when a picture is requested, its response body is the binary data of the image.
When writing a crawler, we mainly obtain the source code of the web page or JSON data from the response body, and then extract the content we need from it.
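Putting the three parts of a response together, the following sketch takes apart a hand-written raw response and extracts data from its JSON body. The response bytes are invented for the example. In a real crawler a library such as `urllib` or `requests` does this parsing for us.

```python
import http.client
import io
import json

# A made-up raw HTTP response: response line, headers, blank line, body.
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: application/json; charset=utf-8\r\n"
    b"Server: demo-server\r\n"
    b"\r\n"
    b'{"title": "Hello", "tags": ["python", "crawler"]}'
)

stream = io.BytesIO(raw_response)

# 1. Response line.
status_line = stream.readline().decode("iso-8859-1").rstrip("\r\n")

# 2. Response headers (read up to the blank line).
headers = http.client.parse_headers(stream)

# 3. Response body: everything that is left.
body = stream.read()

charset = headers.get_content_charset() or "utf-8"
data = json.loads(body.decode(charset))

print(status_line)
print(headers.get_content_type(), headers["Server"])
print(data["title"], data["tags"])
```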