Python crawler basics (1)

  • Common request header parameters
In the HTTP protocol, a request sent to the server can carry data in three places. The first is the URL, the second is the body (in a POST request), and the third is the request headers. Some header parameters that come up often in web crawlers:
1. User-Agent: The browser name, which is often used in web crawlers. When a page is requested, the server can tell from this parameter what kind of client sent the request.
2. Referer: Indicates which URL the request came from. It can also be used as an anti-crawler technique: if the request does not come from the specified page, the server may refuse to respond normally.
3. Cookie: The HTTP protocol is stateless. That is, when the same person sends two requests, the server has no way of knowing whether both come from the same person. A cookie is therefore used to identify the sender.
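As a rough sketch of how these three headers can be attached to a request with urllib, the snippet below builds a Request object without sending it. The URL, the header values, and the cookie string are placeholders chosen for illustration, not values from any real site.

```python
from urllib import request

# All values below are illustrative placeholders (assumptions).
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Referer": "http://example.com/list",
    "Cookie": "sessionid=abc123",
}

# Building the Request does not send anything over the network.
req = request.Request("http://example.com/detail", headers=headers)

# urllib normalizes header names with str.capitalize(),
# so "User-Agent" is stored as "User-agent".
for name, value in req.header_items():
    print(name, "=", value)
```

Passing req to request.urlopen would send the request with these headers attached.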
  • Common response status codes

1. 200: The request succeeded and the server returned data normally.

2. 301: Permanent redirect. For example, accessing an older address of a site redirects to www.jd.com.

3. 302: Temporary redirect. For example, when you access a page that requires login and you are not logged in, you are redirected to the login page.

4. 400: Bad request. The server could not process the requested URL; in other words, the request URL is wrong. (A URL that simply does not exist on the server returns 404.)

5. 403: The server refuses access because of insufficient permissions.

6. 500: Internal server error. There may be a bug in the server.
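To make the grouping of these codes concrete, here is a small sketch that uses the standard library's http.HTTPStatus to look up the official phrase for a code and label its class. The helper name describe_status is hypothetical, introduced only for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a label such as '301 Moved Permanently (redirect)'."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {status.phrase} ({kind})"

for code in (200, 301, 302, 400, 403, 500):
    print(describe_status(code))
```

A crawler can use the same ranges to decide whether to parse a response, follow a redirect, or retry later.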


  • urllib library
The urllib library is one of the most basic network request libraries in Python. It can simulate a browser's behavior, send a request to a specified server, and save the data the server returns.
  • urlopen function:

In Python 3's urllib library, all the methods related to network requests are gathered under the urllib.request module. Let's first look at the basic use of urlopen:

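from urllib import request
resp = request.urlopen("http://www.baidu.com")  # fetch Baidu's home page
print(resp.read())  # raw bytes of the response body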

In fact, if you open Baidu in a browser and right-click to view the page source, you will find that it matches the data we just printed. That is to say, the three lines of code above have fetched the full source of Baidu's home page. The Python code for a basic URL request really is very simple. Here is a more detailed explanation of the urlopen function:

1. url: The URL to request.

2. data: The data to send with the request. If it is set, the request becomes a POST request.

3. Return value: The return value is a file-like handle object (an http.client.HTTPResponse for HTTP URLs) with methods such as read(size), readline, readlines, getcode, etc.
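The effect of the data parameter can be checked without touching the network. The sketch below uses request.Request, which urlopen also accepts in place of a URL string, so that the HTTP method can be inspected before anything is sent. The URL and form fields are placeholders, not taken from the original text.

```python
from urllib import parse, request

url = "http://example.com/login"  # placeholder URL (assumption)

# Without data, urllib prepares a GET request.
get_req = request.Request(url)

# data must be bytes, so the form fields are URL-encoded and then
# encoded to UTF-8.
form = parse.urlencode({"user": "demo", "lang": "en"}).encode("utf-8")
post_req = request.Request(url, data=form)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

Calling request.urlopen(url, data=form) builds the same kind of POST request internally.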


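  • urlretrieve function:
This function makes it easy to save a file from the web to the local disk. The following code downloads Baidu's home page to a local file:
from urllib import request
# the local file name baidu.html is an illustrative choice
request.urlretrieve("http://www.baidu.com/", "baidu.html")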
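To try urlretrieve without network access, the following sketch copies from a file:// URL instead of a website. It assumes a temporary directory and hypothetical file names (source.html, saved.html); the call itself has the same shape as a real download.

```python
import tempfile
from pathlib import Path
from urllib import request

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a remote page: a small local HTML file.
    source = Path(tmp) / "source.html"
    source.write_text("<html><body>hello</body></html>", encoding="utf-8")

    target = Path(tmp) / "saved.html"

    # urlretrieve(url, filename) copies the resource at url into
    # filename and returns a (filename, headers) tuple.
    saved_path, headers = request.urlretrieve(source.as_uri(), str(target))

    saved_text = Path(saved_path).read_text(encoding="utf-8")
    print(saved_text)
```

Replacing source.as_uri() with an http:// URL turns this into an actual download.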
  • urlencode function:

When you send a request with a browser and the URL contains Chinese or other special characters, the browser encodes them automatically. If you send the request from code, you have to encode them yourself, and urlencode is the tool for that. urlencode converts dictionary data into URL-encoded data. Sample code:

from urllib import parse

data = {"name": "reptile", "greet": "hello world", "age": 100}
qs = parse.urlencode(data)
print(qs)  # name=reptile&greet=hello+world&age=100

The parse_qs function decodes URL-encoded parameters back into a dictionary.
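A short round trip shows how the two functions relate. This sketch reuses the sample dictionary; note that parse_qs returns every value as a list of strings, because a key may appear more than once in a query string.

```python
from urllib import parse

params = {"name": "reptile", "greet": "hello world", "age": 100}

# Encode the dictionary into a query string.
qs = parse.urlencode(params)
print(qs)  # name=reptile&greet=hello+world&age=100

# Decode it again; values come back as lists of strings.
decoded = parse.parse_qs(qs)
print(decoded)  # {'name': ['reptile'], 'greet': ['hello world'], 'age': ['100']}
```

The number 100 does not survive as an integer: after decoding it is the string '100'.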


  • ProxyHandler processor (proxy settings)

Many websites check how many times an IP visits within a certain period (through traffic statistics, system logs, etc.). If the number of visits is higher than a normal person could produce, the site blocks that IP. We can therefore set up several proxy servers and switch proxies every so often; even if one IP is blocked, we can switch to another IP and keep crawling. In urllib, ProxyHandler is used to set up a proxy server. The following code shows how to use a custom opener with a proxy:

from urllib import request

# Without a proxy
# resp = request.urlopen("")
# print(resp.read().decode("utf-8"))

# With a proxy
handler = request.ProxyHandler({"http":"218.66.82:32512"})

opener = request.build_opener(handler)
req = request.Request("")  # target URL goes here
resp = opener.open(req)
print(resp.read())
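The idea of switching proxies every so often can be sketched as follows. The proxy addresses are hypothetical local placeholders and next_opener is a helper name made up for this example; nothing is sent over the network here, the code only builds the openers.

```python
from itertools import cycle
from urllib import request

# Hypothetical proxy addresses; replace with proxies you actually control.
PROXIES = ["127.0.0.1:8001", "127.0.0.1:8002", "127.0.0.1:8003"]
_proxy_pool = cycle(PROXIES)

def next_opener():
    """Build an opener that routes HTTP traffic through the next proxy."""
    proxy = next(_proxy_pool)
    handler = request.ProxyHandler({"http": proxy})
    return proxy, request.build_opener(handler)

# Each call moves on to the next proxy and wraps around at the end.
for _ in range(4):
    proxy, opener = next_opener()
    print("next request would go through", proxy)
```

Calling opener.open(url) on the returned opener would send the request through that proxy. Simple round-robin rotation is only one option; picking a proxy at random is another common choice.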

