Method of IP proxy configuration for the Python 3 Scrapy crawler framework

Time:2020-9-17

What is Scrapy?

Scrapy is an application framework for crawling websites and extracting structured data. It is well known and powerful. A framework in this sense is a project template that already integrates many features (high performance, asynchronous downloading, queues, distribution, parsing, persistence, and so on) and is highly general-purpose. When learning a framework, the focus is on its characteristics and on how each feature is used.

1、 Background

While working on a crawler project, I ran into the problem of configuring an IP proxy. The solutions I found online either use Alibaba Cloud's IP proxy or collect proxy IPs already published on the Internet and configure them in the settings file. Both methods have problems.

1. For the Alibaba Cloud IP proxy method, the tutorials online configure a user name and password for Alibaba Cloud's IP proxy and then encrypt and decrypt them. When I followed those steps, I found that the parameters of the IP proxy on Alibaba Cloud include no user name or password.

2. For the other method found online, you add a proxy IP pool to the settings file and then add some code to middlewares.py, but the proxy IPs are not guaranteed to be usable.
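The availability concern in method 2 can be narrowed with a quick pre-check before the pool is used. The following is a minimal sketch and not part of the original setup: the helper names is_reachable and filter_reachable are hypothetical, and a successful TCP connection only shows that the proxy's port is open, not that the proxy will actually relay requests.

```python
import socket
from urllib.parse import urlsplit


def is_reachable(proxy_url, timeout=3.0):
    """Return True if a TCP connection to the proxy's host and port succeeds."""
    parts = urlsplit(proxy_url)
    try:
        host, port = parts.hostname, parts.port
    except ValueError:
        # urlsplit raises ValueError when the port is not a valid number
        return False
    if host is None or port is None:
        return False
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


def filter_reachable(proxies, timeout=3.0):
    """Keep only the proxy URLs whose host and port accept a connection."""
    return [p for p in proxies if is_reachable(p, timeout)]
```

Calling filter_reachable on a list of proxy URLs returns the subset that accepted a connection, which can then be used as the pool.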

2、 Improvement method

1. Based on the limitations of the two methods mentioned in the background, I combine them here.

2. Improvement method:

1) Use Alibaba Cloud's IP proxy API to generate a pool of 50 proxy IPs (they are generated after logging in with your own Alibaba Cloud account, so their validity is guaranteed).

2) Add the following class directly to middlewares.py. PROXIES holds the IPs generated on Alibaba Cloud; because they are private, they are masked with **** here.


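import random

class my_proxy(object):
  def process_request(self, request, spider):
    # Proxy pool generated on Alibaba Cloud (masked here for privacy)
    PROXIES = ['http://****.****.****.****:8080']
    ip = random.choice(PROXIES)
    # Scrapy's HttpProxyMiddleware reads the proxy URL from the 'proxy' key
    request.meta['proxy'] = ip

Note: the key inside the square brackets of request.meta must be written exactly as 'proxy'. With any other key, the proxy setting is not applied and the crawler will not run as intended.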
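The two steps above can be put together in a slightly fuller form. The following is a sketch under stated assumptions, not the exact code used in the project: it assumes the proxy API returns plain text with one ip:port entry per line (the real response format may differ), the names build_proxy_pool and RandomProxyMiddleware are hypothetical, and the addresses come from a documentation-only IP range. A stand-in request object is used so the example runs without Scrapy installed.

```python
import random
from types import SimpleNamespace


def build_proxy_pool(api_text, scheme="http"):
    """Turn 'ip:port' lines into proxy URLs, skipping blanks and duplicates.

    Assumption: one 'ip:port' entry per line; adapt to the real API output.
    """
    pool = []
    for line in api_text.splitlines():
        entry = line.strip()
        if not entry or ":" not in entry:
            continue
        url = f"{scheme}://{entry}"
        if url not in pool:
            pool.append(url)
    return pool


class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to each request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("proxy pool is empty")
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # The proxy URL must be stored under the 'proxy' key of request.meta
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # returning None lets the request continue normally


# Stand-in data and request object, for illustration only
sample = "203.0.113.10:8080\n203.0.113.11:8080\n\n203.0.113.10:8080\n"
pool = build_proxy_pool(sample)
middleware = RandomProxyMiddleware(pool)
fake_request = SimpleNamespace(meta={})
middleware.process_request(fake_request, spider=None)
print(fake_request.meta["proxy"])
```

In a real project the middleware also has to be enabled in settings.py through the DOWNLOADER_MIDDLEWARES setting, for example with an entry such as 'myproject.middlewares.RandomProxyMiddleware': 543, where myproject is a placeholder for your own project name. Because this sketch takes the pool as a constructor argument, it would additionally need a from_crawler class method to build the instance inside Scrapy.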

Summary

The above is the method for configuring an IP proxy in the Python 3 Scrapy crawler framework. I hope it is helpful to you. If you have any questions, please leave me a message and I will reply promptly. Thank you very much for your support of the developpaer website!
If you find this article helpful, you are welcome to reprint it. Please indicate the source. Thank you!
