Several main strategies to keep crawlers from being blocked

Time: 2022-3-24

Crawler-based data collection has become a need for many companies, enterprises, and individuals, and precisely because of this, anti-crawler techniques keep emerging: time limits, IP limits, CAPTCHA limits, and so on, any of which may cause a crawler to fail. Below are some of the main strategies to keep crawlers from being blocked.
• Dynamically set the User-Agent (randomly switch the User-Agent to simulate the browser information of different users; the scrapy-random-useragent component can be used for this).
• Disable cookies (for simple websites, do not enable the cookies middleware and do not send cookies to the server; some websites detect crawler behavior through cookie usage). Whether CookiesMiddleware is enabled is controlled by COOKIES_ENABLED.
• Enable cookies (for complex websites, use a headless renderer such as scrapy-splash to obtain the complex cookies generated by JavaScript).
• Set a download delay (prevent overly frequent access; set it to 2 seconds or higher).
• Google cache and Baidu cache: if possible, obtain page data from the page caches of search-engine servers such as Google or Baidu.
• Referer: use a fake source, such as a Baidu search link containing a keyword.
• Use an IP address pool: most websites now ban by IP, which can be worked around with a large customized proxy pool such as Yiniu Cloud's.
• Use the Yiniu Cloud crawler proxy middleware code shown below (a settings.py sketch covering the cookie, delay, and Referer points also follows this list).
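
As a reference for the cookie, delay, and Referer points above, here is a minimal settings.py sketch; it assumes a standard Scrapy project, and the values (2-second delay, Baidu keyword link) are only illustrative:

# settings.py (sketch; adjust values for the target site)
# Do not send cookies unless the target site requires them
COOKIES_ENABLED = False
# Slow down requests to avoid frequency-based bans
DOWNLOAD_DELAY = 2
# Send a fake Referer, e.g. a Baidu search link containing a keyword
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'https://www.baidu.com/s?wd=keyword',
}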

#! -*- encoding:utf-8 -*-
import base64
import sys
import random

PY3 = sys.version_info[0] >= 3

def base64ify(bytes_or_str):
    if PY3 and isinstance(bytes_or_str, str):
        input_bytes = bytes_or_str.encode('utf8')
    else:
        input_bytes = bytes_or_str
    output_bytes = base64.urlsafe_b64encode(input_bytes)
    if PY3:
        return output_bytes.decode('ascii')
    else:
        return output_bytes

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Proxy server (product site: www.16yun.cn)
        proxyHost = "t.16yun.cn"
        proxyPort = "31111"
        # Proxy tunnel authentication information
        proxyUser = "username"
        proxyPass = "password"
        request.meta['proxy'] = "http://{0}:{1}".format(proxyHost, proxyPort)
        # Add the authentication header
        encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
        # Set the IP-switching header (as required)
        tunnel = random.randint(1, 10000)
        request.headers['Proxy-Tunnel'] = str(tunnel)

Modify the project configuration file (./project_name/settings.py):
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
Set up downloader middleware
Downloader middleware is a layer of components between the engine (crawler.engine) and the downloader (crawler.engine.download()); multiple downloader middlewares can be loaded and run.
1. When the engine passes a request to the downloader, downloader middleware can process the request (for example, add HTTP headers or proxy information).
2. When the downloader completes the HTTP request and passes the response back to the engine, downloader middleware can process the response (for example, decompress gzip).
To activate a downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES setting. This setting is a dict whose keys are the middleware class paths and whose values are the middleware orders.
Here is an example:
DOWNLOADER_MIDDLEWARES = {
    'mySpider.middlewares.MyDownloaderMiddleware': 543,
}
Writing downloader middleware is very simple. Each middleware component is a Python class that defines one or more of the following methods:
class scrapy.contrib.downloadermiddleware.DownloaderMiddleware
process_request(self, request, spider)
• This method is called for each request that passes through the downloader middleware.
• process_request() must return one of the following: None, a Response object, a Request object, or raise IgnoreRequest.
• If it returns None, Scrapy will continue processing the request, executing the corresponding methods of the other middlewares, until the appropriate downloader handler is called and the request is performed (and its response downloaded).
• If it returns a Response object, Scrapy will not call any other process_request() or process_exception() method, nor the corresponding download function; it will return that response. The process_response() methods of the installed middlewares are still called for every response.
• If it returns a Request object, Scrapy stops calling process_request methods and reschedules the returned request. Once the newly returned request is performed, the middleware chain is called on the downloaded response accordingly.
• If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middlewares are called. If none of them handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
• Parameters:
• request (Request object) – the request being processed
• spider (Spider object) – the spider corresponding to the request
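
To make these return values concrete, here is a minimal sketch of a process_request() implementation; the class name, the blocked-domain list, and the use_cached_stub meta key are hypothetical and only illustrate the behaviors described above:

from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse

class ExampleRequestMiddleware(object):
    # Hypothetical list of domains this crawler should never hit
    BLOCKED_DOMAINS = ('blocked.example.com',)

    def process_request(self, request, spider):
        # Raising IgnoreRequest hands the request to process_exception()/errback
        if any(domain in request.url for domain in self.BLOCKED_DOMAINS):
            raise IgnoreRequest("blocked domain")
        # Returning a Response skips the download and the remaining process_request() calls
        if request.meta.get('use_cached_stub'):
            return HtmlResponse(url=request.url, body=b'<html></html>', request=request)
        # Returning None lets the remaining middlewares and the downloader continue
        return None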
process_response(self, request, response, spider)
Called when the downloader completes the HTTP request and passes the response to the engine
• process_response() must return one of the following: a Response object, a Request object, or raise an IgnoreRequest exception.
• If it returns a Response (which can be the same as the incoming response or a new object), that response will be processed by the process_response() methods of the other middlewares in the chain.
• If it returns a Request object, the middleware chain stops and the returned request is rescheduled for download. This is handled the same way as a request returned by process_request().
• If it raises an IgnoreRequest exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
• Parameters:
• request (Request object) – the request corresponding to the response
• response (Response object) – the response being processed
• spider (Spider object) – the spider corresponding to the response
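
Correspondingly, here is a minimal sketch of a process_response() implementation; the class name and the status codes treated as "blocked" are only illustrative:

class ExampleResponseMiddleware(object):
    # Illustrative status codes that suggest the proxy IP was blocked
    RETRY_CODES = (403, 429)

    def process_response(self, request, response, spider):
        # Returning a Request stops the chain and reschedules the download,
        # e.g. so that it is retried through another proxy IP
        if response.status in self.RETRY_CODES:
            return request.replace(dont_filter=True)
        # Returning the Response lets later middlewares and the spider process it
        return response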
Use case:
1. Create the middlewares.py file.
In Scrapy, proxy IP and User-Agent switching is controlled through DOWNLOADER_MIDDLEWARES in settings.py. Create a middlewares.py file in the same directory as settings.py to wrap all requests.

middlewares.py

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import random
import base64

from settings import USER_AGENTS
from settings import PROXIES

# Randomly switch the User-Agent
class RandomUserAgent(object):
    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)
        request.headers.setdefault("User-Agent", useragent)

# Randomly switch the proxy
class RandomProxy(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_passwd'] is None:
            # Proxy usage without account authentication
            request.meta['proxy'] = "http://" + proxy['ip_port']
        else:
            # Base64-encode the account and password (as bytes, then back to str)
            base64_userpasswd = base64.b64encode(
                proxy['user_passwd'].encode('utf-8')).decode('ascii')
            # Matches the signaling format expected by the proxy server
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_userpasswd
            request.meta['proxy'] = "http://" + proxy['ip_port']

Why do HTTP proxies use Base64 encoding?
The principle of an HTTP proxy is simple: the client establishes a connection to the proxy server over HTTP, and the protocol signaling carries the IP and port of the remote host to connect to. If authentication is required, authorization information must be added as well. After receiving the signaling, the proxy server first authenticates the client and then establishes a connection to the remote host. Once the connection succeeds, it returns 200 to the client, indicating that authentication passed. The specific signaling format is as follows:
CONNECT 59.64.128.198:21 HTTP/1.1
Host: 59.64.128.198:21
Proxy-Authorization: Basic bGV2I1TU5OTIz
User-Agent: OpenFetion
Here Proxy-Authorization carries the authentication information; the string after Basic is the Base64 encoding of the username and password joined as username:password.
HTTP/1.0 200 Connection established
After the client receives this signaling, the connection has been established successfully, and the data intended for the remote host can now be sent to the proxy server. Once the proxy server has established the connection, it caches the connection keyed by IP address and port number; when it receives further signaling, it looks up the corresponding connection in the cache by IP address and port and forwards the data through it.
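
As a quick check of the encoding described above, a few lines of Python reproduce a Basic value; the credentials here are placeholders, not the ones from the signaling example:

import base64

# Join username and password with a colon, then Base64-encode the result
credentials = "username:password"
token = base64.b64encode(credentials.encode('utf-8')).decode('ascii')
print("Proxy-Authorization: Basic " + token)
# -> Proxy-Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=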
2. Modify settings.py to configure USER_AGENTS and PROXIES.
• Add USER_AGENTS:
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
]
• Add the proxy IP setting PROXIES:
Proxy IPs can be purchased from Yiniu Cloud's crawler proxy service:
PROXIES = [
    {'ip_port': 't.16yun.cn:31111', 'user_passwd': '16yun:16yun'},
    {'ip_port': 't.16yun.cn:31112', 'user_passwd': '16yun:16yun'}
]
• Disable cookies unless specifically required, to prevent some websites from blocking crawlers based on cookies: COOKIES_ENABLED = False
• Set a download delay: DOWNLOAD_DELAY = 3
• Finally, set DOWNLOADER_MIDDLEWARES in settings.py and add your own downloader middleware classes.
DOWNLOADER_MIDDLEWARES = {
    #'mySpider.middlewares.MyCustomDownloaderMiddleware': 543,
    'mySpider.middlewares.RandomUserAgent': 1,
    'mySpider.middlewares.ProxyMiddleware': 100
}

If you need protection or other compliance-related services, you can contact me on WeChat: 18039632519.

This work adopts the CC license; reprints must credit the author and link to this article.