03 Basic principles of Python crawlers

Time:2021-1-27

A crawler is a program that simulates a user's actions in a browser or an application and automates that process.

What happens in the background when we enter a URL in the browser and press Enter? Let's say you type in http://www.sina.com.cn/

Put simply, the process involves the following four steps:

  • Find the IP address corresponding to the domain name.
  • Send the request to the server corresponding to the IP.
  • The server responds to the request and sends back the content of the web page.
  • The browser parses the content of the web page.

The essence of a web crawler

In essence, a crawler issues the same HTTP requests as a browser.

A browser and a web crawler are two different web clients, and both obtain web pages in the same way.

Put simply, a web crawler does the browser's job: given a URL, it returns the required data directly to the user, with no need to operate the browser manually step by step.
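As a minimal sketch of this idea, the function below fetches a page by URL using only the Python standard library. The name fetch_page and the 10-second timeout are illustrative assumptions, not part of the original text.

```python
from urllib.request import urlopen


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Download the page at `url` and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset announced by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example use (needs network access):
#   html = fetch_page("http://www.sina.com.cn/")
#   print(html[:200])
```

The caller only supplies the URL; DNS lookup, the connection and the HTTP exchange all happen inside urlopen.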

How does the browser send and receive this data?

Introduction to HTTP

The purpose of HTTP (Hypertext Transfer Protocol) is to provide a way to publish and receive HTML (Hypertext Markup Language) pages.

HTTP protocol layer (for reference)

HTTP is built on top of TCP. The protocols at each layer of the TCP/IP reference model are shown in the figure below; HTTP is an application-layer protocol. The default port is 80 for HTTP and 443 for HTTPS.

[Figure: protocols at each layer of the TCP/IP reference model]
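The default ports can be made concrete with a small sketch. The helper name effective_port and the DEFAULT_PORTS table are assumptions for illustration; only the values 80 and 443 come from the text above.

```python
from urllib.parse import urlsplit

# Default ports stated above: 80 for HTTP, 443 for HTTPS.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url: str) -> int:
    """Return the port a client would connect to for `url`."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port              # an explicit port in the URL wins
    return DEFAULT_PORTS[parts.scheme]  # KeyError for schemes not listed
```

For example, effective_port("http://www.sina.com.cn/") gives 80, while a URL with an explicit ":8080" gives 8080.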

HTTP workflow

An HTTP operation is called a transaction. The whole process is as follows:

1) Address resolution

Suppose the client browser requests this page: http://localhost.com:8080/index.htm

From the URL we extract the protocol name, host name, port, object path and so on. For this address the result is:

  • Protocol name: http
  • Host name: localhost.com
  • Port: 8080
  • Object path: /index.htm

In this step, DNS is needed to resolve the domain name localhost.com to the IP address of the host.
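The decomposition and the DNS lookup can be sketched with the standard library as follows. The function names split_address and resolve are illustrative assumptions; the lookup itself is delegated to the operating system's resolver.

```python
import socket
from urllib.parse import urlsplit


def split_address(url: str) -> dict:
    """Break a URL into protocol name, host name, port and object path."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
    }


def resolve(host: str, port: int) -> list:
    """Return the distinct IP addresses the resolver reports for `host`."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in infos})


# Example use (the lookup needs a working DNS resolver):
#   address = split_address("http://localhost.com:8080/index.htm")
#   print(resolve(address["host"], address["port"]))
```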

2) Encapsulating the HTTP request packet

Combine the parts above with information about the client machine itself and encapsulate them into an HTTP request packet.
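A rough sketch of what such a request looks like on the wire is shown below. The function name build_request, the choice of headers and the User-Agent value are assumptions made for illustration; a real browser sends many more headers.

```python
def build_request(host: str, port: int, path: str) -> bytes:
    """Assemble a minimal HTTP/1.1 GET request as raw bytes."""
    lines = [
        f"GET {path} HTTP/1.1",            # method, object path, protocol version
        f"Host: {host}:{port}",            # host name and port from the URL
        "User-Agent: sketch-crawler/0.1",  # hypothetical client information
        "Accept: text/html",
        "Connection: close",
    ]
    # Header lines end with CRLF; an empty line marks the end of the headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")
```

Calling build_request("localhost.com", 8080, "/index.htm") produces the request for the example address used in this section.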

3) Encapsulate into TCP packets and establish a TCP connection (the TCP three-way handshake)

Before any HTTP exchange can begin, the client (web browser) must first establish a connection with the server over the network. This connection is made with TCP. TCP and IP together form the foundation of the Internet, the well-known TCP/IP protocol family, which is why the Internet is also called a TCP/IP network.

HTTP is an application-layer protocol that sits above TCP. By rule, a higher-level protocol can only operate once the lower-level connection is in place, so the TCP connection must be established first. The default port for HTTP over TCP is 80; in this example it is 8080.
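In Python, establishing the TCP connection might look like the sketch below. The helper name open_tcp_connection and the 5-second timeout are assumptions; the three-way handshake itself is carried out by the operating system when the socket connects.

```python
import socket


def open_tcp_connection(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection to (host, port) and return the connected socket."""
    # create_connection resolves the host name and connects; the operating
    # system performs the three-way handshake before this call returns.
    return socket.create_connection((host, port), timeout=timeout)


# Example use (needs a server listening at that address):
#   sock = open_tcp_connection("localhost.com", 8080)
#   print(sock.getpeername())
#   sock.close()
```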

4) The client sends the request command

After the connection is established, the client sends a request to the server. The request consists of a request line (the request method, the uniform resource identifier and the protocol version number) followed by MIME-like header information, including request modifiers and client information, and possibly a body.
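Sending the request over an already connected socket can be sketched as follows. The name send_request is an assumption, and the sketch assumes the request carries "Connection: close", so that the server closing the connection marks the end of the response.

```python
import socket


def send_request(sock: socket.socket, request: bytes) -> bytes:
    """Send a raw HTTP request and read the raw response until the peer closes."""
    sock.sendall(request)
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:          # empty read: the server closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

Reading until end-of-stream is the simplest approach; a client that keeps the connection open would instead have to use the Content-Length header to know where the response ends.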

5) Server response

After receiving the request, the server replies with a status line, which contains the protocol version number and a success or error code, followed by MIME-like header information, including server information and entity information, and possibly the content itself.

  1. Entity body: after sending the header information to the browser, the server sends a blank line to mark the end of the headers. It then sends the data the user actually requested, in the format described by the Content-Type response header.
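The structure of a response (status line, headers, blank line, body) can be illustrated by a small parser. This is a simplified sketch: the name parse_response is an assumption, and it expects a status line with a reason phrase and ignores details such as repeated headers and chunked transfer encoding.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into (version, status, reason, headers, body)."""
    # The first blank line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: example\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html></html>"
)
version, status, reason, headers, body = parse_response(sample)
print(status, headers["content-type"], body)
```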

6) The server closes the TCP connection

In general, once the web server has sent the requested data to the browser, it closes the TCP connection. However, if the browser or the server adds this line to its header information

Connection: keep-alive

the TCP connection remains open after the response is sent, so the browser can keep sending requests over the same connection. Keeping the connection open saves the time needed to establish a new connection for each request and also saves network bandwidth. (In HTTP/1.1, connections are persistent by default.)
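Connection reuse can be sketched with http.client from the standard library. The function name fetch_over_one_connection is an assumption; whether the connection is actually kept open also depends on the server.

```python
import http.client


def fetch_over_one_connection(host: str, port: int, paths) -> list:
    """Request several paths over a single HTTP connection; return status codes."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    statuses = []
    try:
        for path in paths:
            conn.request("GET", path, headers={"Connection": "keep-alive"})
            response = conn.getresponse()
            response.read()  # the body must be read before the next request
            statuses.append(response.status)
    finally:
        conn.close()
    return statuses


# Example use (needs a server listening at that address):
#   print(fetch_over_one_connection("localhost.com", 8080, ["/index.htm", "/index.htm"]))
```

If the server does close the connection between requests, http.client opens a new one automatically, so the sketch still works but without the saving described above.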

HTTPS

HTTPS (full name: Hypertext Transfer Protocol over Secure Socket Layer) is an HTTP channel designed for security; in short, it is the secure version of HTTP. An SSL layer is added beneath HTTP, and SSL is the security foundation of HTTPS. The port used is 443.

SSL (Secure Sockets Layer), designed by Netscape, is a secure transport protocol used mainly on the web, where it has been widely deployed. It uses certificate authentication to ensure that the data exchanged between the client and the web server is encrypted and secure.
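As a sketch of how a client adds this layer beneath HTTP, Python's ssl module can wrap an ordinary TCP socket. The function names below are assumptions; note that the ssl module negotiates TLS, the successor to SSL.

```python
import socket
import ssl


def make_client_context() -> ssl.SSLContext:
    """Create a context that verifies the server certificate and host name."""
    return ssl.create_default_context()


def open_tls_connection(host: str, port: int = 443, timeout: float = 10.0) -> ssl.SSLSocket:
    """Open a TCP connection and wrap it in an encrypted, authenticated channel."""
    context = make_client_context()
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        # The handshake runs here; an invalid certificate raises an ssl.SSLError.
        return context.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise


# Example use (needs network access):
#   tls = open_tls_connection("www.sina.com.cn")
#   print(tls.version())
#   tls.close()
```

Once the wrapped socket is returned, HTTP requests are written to it exactly as they would be to a plain TCP socket.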

There are two basic types of encryption algorithm:

1) Symmetric encryption: there is only one key, the same key is used for encryption and decryption, and both operations are fast. Typical symmetric encryption algorithms include DES, AES, RC5 and 3DES.

The main problem with symmetric encryption is sharing the secret key. Unless your computer (the client) knows the secret key held by the other computer (the server), it cannot encrypt or decrypt the communication stream. The solution to this problem is asymmetric encryption.
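The defining property, one shared key for both directions, can be shown with a toy XOR cipher. This is only an illustration and is not secure; it is not DES or AES, which are not available in the Python standard library. The name xor_cipher and the sample key are assumptions.

```python
from itertools import cycle


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XOR each byte with the repeating key (NOT secure)."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


shared_key = b"secret"                       # both sides must know this key
ciphertext = xor_cipher(b"hello server", shared_key)
plaintext = xor_cipher(ciphertext, shared_key)  # the same key decrypts
print(ciphertext, plaintext)
```

The sketch also makes the key-sharing problem visible: anyone who learns shared_key can read the traffic.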

2) Asymmetric encryption: two keys are used, a public key and a private key. The private key is kept secret by one party (usually the server), while anyone on the other side can obtain the public key.

The keys come in pairs, and the private key cannot feasibly be deduced from the public key. Encryption and decryption use different keys: data encrypted with the public key can only be decrypted with the private key, and data encrypted with the private key can only be decrypted with the public key. Asymmetric encryption is relatively slow. Typical asymmetric algorithms include RSA and DSA.
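The public/private key relationship can be illustrated with textbook RSA on tiny numbers. This is a teaching sketch only: the primes, the exponent and the helper name apply_key are illustrative assumptions, and real RSA uses very large primes together with padding.

```python
# Textbook RSA with tiny primes (illustration only, NOT secure).
p, q = 61, 53
n = p * q                  # modulus shared by both keys: 3233
phi = (p - 1) * (q - 1)    # 3120
e = 17                     # public exponent, coprime with phi
d = pow(e, -1, phi)        # private exponent (Python 3.8+): 2753

public_key = (e, n)        # anyone may know this
private_key = (d, n)       # kept secret by the server


def apply_key(message: int, key: tuple) -> int:
    """Encrypt or decrypt an integer smaller than n with the given key."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


ciphertext = apply_key(65, public_key)            # encrypt with the public key
recovered = apply_key(ciphertext, private_key)    # decrypt with the private key
print(ciphertext, recovered)
```

Swapping the two keys also works: a value transformed with private_key is recovered with public_key, which is the basis of digital signatures.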

Advantages of HTTPS communication:

  • The key generated by the client can be obtained only by the client and the server;
  • Only the client and the server can recover the plaintext from the encrypted data;
  • The communication between the client and the server is secure.
