Web Crawlers (I): Anti-Crawler Mechanisms

Time: 2019-10-11

A crawler that crawls for long enough will always get banned. — Lu Xun

 

Some websites, especially older ones, have no anti-crawler mechanism at all. We can crawl them freely and happily, stripping them down to their underwear and pulling all of their data. At most, out of courtesy, we crawl slowly so as not to put too much pressure on their servers. But for websites with anti-crawler mechanisms, we can't be so casual.

 

U-A Check

 

The simplest anti-crawler mechanism is probably the U-A check. When a browser sends a request, it attaches some parameters describing the browser and the current system environment, and this data is carried to the server in the headers of the HTTP request.

 

All we have to do is set our crawler's U-A through the requests library. Generally speaking, a third-party library sends requests with a default U-A; using it directly is tantamount to announcing "I am a crawler, ban me quickly!" Some websites won't even respond without a U-A. Setting the U-A with requests is easy.

import requests

def download_page(url):
    # Pretend to be an ordinary desktop Chrome browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'
    }
    data = requests.get(url, headers=headers)
    return data

Of course, if we repeatedly visit the same website but always use the same U-A, that won't work either. You can build a U-A pool and randomly pick a U-A from it for every visit.
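
A minimal sketch of such a pool (the User-Agent strings below are just examples; any set of real browser U-A strings will do):

import random
import requests

# A hand-picked pool of common browser User-Agent strings (examples only)
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
]

def download_page(url):
    # Pick a random U-A for each request so repeated visits look less uniform
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)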

 

Access Frequency Limitation

 

Generally speaking, real people browse web pages much more slowly than programs do. If someone visits the same website 100 times a second, there is almost no doubt that it's a crawler. Faced with this kind of frequency limit, we have two ways to deal with it.

 

The first is simple. Since visiting too fast gets us banned, we simply visit more slowly. We can call time.sleep() after each request to limit the access speed. A better approach is to let the machine ramp up from slow to fast, find the threshold at which we get blocked, and then keep crawling at a speed slightly below it.
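
For example, a small sketch that pauses for a fixed interval between requests (the one-second delay is just an assumption; use whatever value sits safely below the threshold you found):

import time
import requests

def crawl_slowly(urls, delay=1.0):
    # Wait `delay` seconds between requests to stay under the site's limit
    pages = []
    for url in urls:
        pages.append(requests.get(url))
        time.sleep(delay)
    return pages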

 

The second way is to change our IP. Websites usually identify visitors by IP, so as long as we keep changing IPs we can pass ourselves off as different people. One IP visiting 100 times a second is abnormal, but 100 visits a second spread over 100 IPs looks perfectly normal. So how do we change the IP? In fact, we don't really replace our own IP; we forward our requests through proxy IPs. Many websites offer lists of free proxy IPs, so we just need to crawl them and keep them on hand for a rainy day. However, many proxy IPs do not stay usable for long, so they need to be checked frequently. requests also makes it easy to set a proxy IP.

# Forward the request through a proxy instead of our own IP
proxies = {"http": "http://42.228.3.155:8080"}
requests.get(url, proxies=proxies)
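
Because free proxies die quickly, it helps to test them before use. A minimal sketch, assuming we probe each proxy with an ordinary test request (http://httpbin.org/ip is used here only as an example target):

import random
import requests

def is_alive(proxy, timeout=5):
    # A proxy counts as alive if a test request through it succeeds in time
    try:
        r = requests.get('http://httpbin.org/ip',
                         proxies={'http': proxy}, timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

def get_working_proxy(proxy_pool):
    # Keep only proxies that still respond, then pick one at random
    alive = [p for p in proxy_pool if is_alive(p)]
    return random.choice(alive) if alive else None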

 

Verification Code

 

Some websites require you to enter a verification code no matter what you do, whether logging in or just visiting a page. In this case, we must recognize the verification code in order to crawl the site's content. Simple alphanumeric verification codes can be recognized with OCR, while others, such as sliding verification, require different cracking techniques that will not be discussed in detail here.
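
To illustrate the simple alphanumeric case, here is a rough sketch using the pytesseract OCR library (this assumes Tesseract and Pillow are installed, and it only works on clean, lightly distorted images):

from PIL import Image
import pytesseract

def read_captcha(image_path):
    # OCR is only reliable on clean, undistorted alphanumeric images
    img = Image.open(image_path).convert('L')  # grayscale helps recognition
    return pytesseract.image_to_string(img).strip()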

 

Login Verification

 

Logging in is usually a normal function of the website, and blocking crawlers is just a side effect. We can press F12 to open the developer tools, check what data the website sends when logging in, and then simulate the login with the relevant functions in requests. If I have time later, I will write a separate article about this in detail.
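
As a rough idea of what that simulation looks like, here is a sketch with requests.Session (the login URL and the form field names 'username' and 'password' are hypothetical; copy the real ones from the request captured in the F12 network panel):

import requests

def login_and_get(login_url, username, password, target_url):
    session = requests.Session()
    # Field names are hypothetical; use the ones seen in the captured request
    payload = {'username': username, 'password': password}
    session.post(login_url, data=payload)
    # The session keeps the login cookies, so later requests are authenticated
    return session.get(target_url)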
