Crawlers (2): Building a Proxy IP Pool

Time:2019-10-6

As mentioned in the previous post, a common anti-crawler technique is to detect the client IP and limit its access frequency. We can bypass this limitation by using proxy IPs. Many websites offer free proxy IPs, such as https://www.xicidaili.com/nt/, from which we can collect a large number of proxies. However, not all of these IPs actually work; in fact, only a small fraction of them do.
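As a quick reminder of the mechanism, here is a minimal sketch of routing a request through a proxy with requests; the proxy address 1.2.3.4:8080 and the target URL are placeholders, not working values:

import requests

# Route HTTP traffic through the (placeholder) proxy instead of our own IP.
proxies = {'http': 'http://1.2.3.4:8080'}
r = requests.get('http://example.com', proxies=proxies, timeout=5)
print(r.status_code)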

 

We can use BeautifulSoup to parse the page and then extract the proxy IP list, or match it directly with regular expressions; the regular-expression approach is faster. In the code below, ip_url is https://www.xicidaili.com/nt/, and random_header is a function that returns a randomly chosen request header.
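The original post only references random_header without defining it, so here is a minimal sketch of what such a function might look like; the User-Agent strings are illustrative examples only:

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

def random_header():
    # Pick a random User-Agent so consecutive requests look less uniform.
    return {'User-Agent': random.choice(USER_AGENTS)}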

import re
import time

import requests


def download_page(url):
    # Fetch a page with a randomized request header.
    headers = random_header()
    data = requests.get(url, headers=headers)
    return data


def get_proxies(page_num, ip_url):
    available_ip = []
    for page in range(1, page_num):
        print('Crawling proxy IPs on page %d' % page)
        url = ip_url + str(page)
        r = download_page(url)
        r.encoding = 'utf-8'
        # Match table rows flagged as Chinese proxies ("Cn") and capture
        # the IP and port columns.
        pattern = re.compile('alt="Cn" />.*?<td>(.*?)</td>.*?<td>(.*?)</td>', re.S)
        ip_list = re.findall(pattern, r.text)
        for ip in ip_list:
            if test_ip(ip):
                print('%s:%s passed the test; added to the list of available proxies' % (ip[0], ip[1]))
                available_ip.append(ip)
        time.sleep(10)  # Pause between pages to avoid triggering the rate limit.
    print('Crawl finished')
    return available_ip
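Putting the two functions together, a call might look like the following; the page count of 5 is arbitrary (note that range(1, page_num) crawls pages 1 through page_num - 1):

if __name__ == '__main__':
    ip_url = 'https://www.xicidaili.com/nt/'
    proxy_pool = get_proxies(5, ip_url)  # crawls pages 1-4
    print('Collected %d working proxies' % len(proxy_pool))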

After collecting the IPs, we still need to verify that each one actually works. How do we check? We can use the proxy IP to request a website that displays the visitor's IP, and then inspect the response.

def test_ip(ip, test_url='http://ip.tool.chinaz.com/'):
    # ip is an (address, port) tuple, e.g. ('1.2.3.4', '8080').
    proxies = {'http': ip[0] + ':' + ip[1]}
    try_ip = ip[0]
    try:
        r = requests.get(test_url, headers=random_header(), proxies=proxies)
        if r.status_code == 200:
            r.encoding = 'gbk'
            # Extract the visitor IP reported by the page.
            result = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', r.text)
            result = result.group()
            print(result)
            # Compare IP prefixes: if the reported IP matches the proxy's,
            # the request really went through the proxy.
            if result[:9] == try_ip[:9]:
                print('%s:%s test passed' % (ip[0], ip[1]))
                return True
            else:
                print('%s:%s proxy failed; the request used the local IP' % (ip[0], ip[1]))
                return False
        else:
            print('%s:%s request code is not 200' % (ip[0], ip[1]))
            return False
    except Exception as e:
        print(e)
        print('%s:%s error' % (ip[0], ip[1]))
        return False

Some tutorials treat any HTTP 200 status code as a success. That is wrong: when the proxy fails, the request may end up going out from your own IP, and of course your own IP can access the site successfully. That is why we compare the IP reported by the test page against the proxy's IP.

 

Finally, we should re-check a proxy IP right before using it, because you never know when it will stop working. It is also wise to store more proxy IPs than you need, so you are not left without a working one when it matters.
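For example, a stored pool can be re-validated just before use. This helper is a hypothetical addition, not part of the original post:

def get_working_proxy(proxy_pool):
    # Re-check stored proxies with test_ip and return the first one
    # that still works, or None if the whole pool is dead.
    for ip in proxy_pool:
        if test_ip(ip):
            return ip
    return None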

 

The code in this article is adapted from https://blog.csdn.net/XRRRICK/article/details/78650764, with some modifications of my own.
