Simulated Login with Scrapy Crawlers

Time: 2020-6-30

Want to crawl a website's data? Log in to the website first! For most large websites, the first barrier to crawling their data is logging in. Now follow along with me and learn how to simulate a login.

Why simulate a login?

There are two kinds of websites on the Internet: those that require login and those that don't. (That's stating the obvious!)

For websites that don't require login, we can fetch the data directly, which is simple and convenient. For websites where the data, or part of it, is only visible after logging in, we have to log in obediently. (Unless you hack straight into someone else's database, which I strongly advise against!)

Therefore, for websites that require login, we need to simulate the login: on the one hand to obtain the information and data on pages that are only visible after logging in, and on the other hand to obtain the post-login cookie for subsequent requests.

The idea behind simulated login

When simulated login comes up, everyone's first reaction is probably: pfft, isn't that easy? Open the browser, enter the URL, find the user name and password boxes, type them in, click login, and you're done!

There is nothing wrong with this method; this is exactly how simulated login with Selenium works.

In addition, requests can directly carry cookies from an already logged-in session, which is equivalent to bypassing the login.

We can also use requests to send a POST request, attaching the information the website needs for login to that request.
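As a rough illustration of these two requests-based approaches, here is a minimal sketch; the URLs, form field names, and cookie values below are placeholders for illustration, not any particular site's real login interface:

import requests

# Approach 1: reuse cookies copied from an already logged-in browser session
cookies = {"sessionid": "xxx", "userid": "xxx"}  # placeholder values
resp = requests.get("https://example.com/profile", cookies=cookies)

# Approach 2: POST the login form; the session keeps the returned cookies automatically
session = requests.Session()
login_data = {"username": "xxx", "password": "xxx"}  # field names depend on the site
session.post("https://example.com/login", data=login_data)
resp = session.get("https://example.com/profile")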

These are the three common ways to simulate logging in to a website. Scrapy uses the latter two, since the first one is unique to Selenium.

The two approaches to simulated login in Scrapy:

1. Request directly with cookies from an already logged-in session
2. Attach the information the website requires for login to a POST request

Simulated login examples

Simulated login with cookies

Each login method has its advantages, disadvantages, and usage scenarios. Let's look at the scenarios where logging in with cookies is a good fit:

1. The cookie expiration time is very long, so we can log in once and not worry about the login expiring; this is common on some less rigorous websites.
2. We can get all the data we need before the cookie expires.
3. We can combine it with other programs, for example using Selenium to save the post-login cookies to a local file and then reading that file before Scrapy sends its requests (see the sketch after this list).
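Here is a minimal sketch of that third scenario. The site URLs, the cookies.json file name, and the interactive prompt are assumptions for illustration, not part of any real project. First, a small one-off script saves the browser's cookies:

import json
from selenium import webdriver

# Log in through a real browser, then dump its cookies to a local file
driver = webdriver.Chrome()
driver.get("https://example.com/login")
input("Log in in the browser window, then press Enter...")
with open("cookies.json", "w") as f:
    json.dump(driver.get_cookies(), f)
driver.quit()

Then the spider reads that file before sending its requests:

import json
import scrapy

class CookieFileSpider(scrapy.Spider):
    name = "cookie_file"
    start_urls = ["https://example.com/profile"]

    def start_requests(self):
        # Convert Selenium's list of cookie dicts into the {name: value} mapping Scrapy expects
        with open("cookies.json") as f:
            cookies = {c["name"]: c["value"] for c in json.load(f)}
        yield scrapy.Request(self.start_urls[0], cookies=cookies, callback=self.parse)

    def parse(self, response):
        pass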

Next, we'll demonstrate this login method using Renren (renren.com), a site most of us had long forgotten.

Let’s start by creating a scrapy project:

> scrapy startproject login

In order to crawl smoothly, please set the robots protocol to False in settings.py:

ROBOTSTXT_OBEY = False

Next, we create a crawler:

> scrapy genspider renren renren.com

Let's open renren.py. The generated code is as follows:

# -*- coding: utf-8 -*-
import scrapy


class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']

    def parse(self, response):
        pass

We know that start_urls holds the first web page addresses we need to crawl; this is where data crawling starts. Suppose I need to crawl the data of my Renren personal center page. I log in to Renren and go to the personal center page, whose URL is http://www.renren.com/972990680/profile. If I put this URL directly into start_urls and send the request as-is, think about it: can it succeed?

No, it can't! Because we haven't logged in, we can't see the personal center page at all.

So where do we add our login code?

What we can be sure of is that we have to log in before the framework requests the pages in start_urls.

Looking at the source code of the Spider class, we find the following:

    def start_requests(self):
        cls = self.__class__
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

    def make_requests_from_url(self, url):
        """ This method is deprecated. """
        return Request(url, dont_filter=True)

We can see from the source code that this method takes each URL from start_urls and constructs a Request object for it. So we can override start_requests to do something extra, namely add the cookie information to the Request object we construct.

After overriding start_requests, the code looks like this:

# -*- coding: utf-8 -*-
import scrapy
import re

class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    #Website of personal center page
    start_urls = ['http://www.renren.com/972990680/profile']

    def start_requests(self):
        #After logging in, use Chrome's debug tool to get cookies from the request
        cookiesstr = "anonymid=k3miegqc-hho317; depovince=ZGQT; _r01_=1; JSESSIONID=abcDdtGp7yEtG91r_U-6w; ick_login=d2631ff6-7b2d-4638-a2f5-c3a3f46b1595; ick=5499cd3f-c7a3-44ac-9146-60ac04440cb7; t=d1b681e8b5568a8f6140890d4f05c30f0; societyguester=d1b681e8b5568a8f6140890d4f05c30f0; id=972990680; xnsid=404266eb; XNESSESSIONID=62de8f52d318; jebecookies=4205498d-d0f7-4757-acd3-416f7aa0ae98|||||; ver=7.0; loginfrom=null; jebe_key=8800dc4d-e013-472b-a6aa-552ebfc11486%7Cb1a400326a5d6b2877f8c884e4fe9832%7C1575175011619%7C1%7C1575175011639; jebe_key=8800dc4d-e013-472b-a6aa-552ebfc11486%7Cb1a400326a5d6b2877f8c884e4fe9832%7C1575175011619%7C1%7C1575175011641; wp_fold=0"
        cookies = {i.split("=")[0]:i.split("=")[1] for i in cookiesstr.split("; ")}

        #Send the request with cookies
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies
        )

    def parse(self, response):
        #Find the keyword "闲欢" ("leisure") on the personal center page and print the matches
        print(re.findall("闲欢", response.body.decode()))

First, I log in to renren.com with my account. After logging in, I use Chrome's debug tools to copy the cookie from a request, and then add that cookie to the Request object. Then, in the parse method, I look for the keyword on the page and print the matches.

Let’s run this crawler:

> scrapy crawl renren

In the run log, we can see the following lines:

2019-12-01 13:06:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.renren.com/972990680/profile?v=info_timeline> (referer: http://www.renren.com/972990680/profile)
['xianhuan ','xianhuan','xianhuan ','xianhuan','xianhuan ','xianhuan']
2019-12-01 13:06:55 [scrapy.core.engine] INFO: Closing spider (finished)

We can see that we have printed the information we need.

We can add COOKIES_DEBUG = True to the settings to see how the cookies are passed along.

After adding this configuration, we can see the following information in the log:

2019-12-01 13:06:55 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://www.renren.com/972990680/profile?v=info_timeline>
Cookie: anonymid=k3miegqc-hho317; depovince=ZGQT; _r01_=1; JSESSIONID=abcDdtGp7yEtG91r_U-6w; ick_login=d2631ff6-7b2d-4638-a2f5-c3a3f46b1595; ick=5499cd3f-c7a3-44ac-9146-60ac04440cb7; t=d1b681e8b5568a8f6140890d4f05c30f0; societyguester=d1b681e8b5568a8f6140890d4f05c30f0; id=972990680; xnsid=404266eb; XNESSESSIONID=62de8f52d318; jebecookies=4205498d-d0f7-4757-acd3-416f7aa0ae98|||||; ver=7.0; loginfrom=null; jebe_key=8800dc4d-e013-472b-a6aa-552ebfc11486%7Cb1a400326a5d6b2877f8c884e4fe9832%7C1575175011619%7C1%7C1575175011641; wp_fold=0; JSESSIONID=abc84VF0a7DUL7JcS2-6w

Simulated login by sending a POST request

We'll use logging in to GitHub as an example to describe this login method.

Let's first create a GitHub crawler:

> scrapy genspider github github.com

To simulate login with a POST request, we first need to know the login URL and the required parameters. Through the debug tool, we can see the login request information as follows:

[Screenshot: the GitHub login request captured in the browser's debug tool]

From the request information, we can see that the login URL is https://github.com/session and the required login parameters are:

commit: Sign in
utf8: ✓
authenticity_token: bbpX85KY36B7N6qJadpROzoEdiiMI6qQ5L7hYFdPS+zuNNFSKwbW8kAGW5ICyvNVuuY5FImLdArG47358RwhWQ==
ga_id: 101235085.1574734122
login: [email protected]
password: xxx
webauthn-support: supported
webauthn-iuvpaa-support: unsupported
required_field_f0e5: 
timestamp: 1575184710948
timestamp_secret: 574aa2760765c42c07d9f0ad0bbfd9221135c3273172323d846016f43ba761db

That's quite a lot of parameters for one request, phew!

Apart from our user name and password, everything else has to be obtained from the login page. Note also the required_field_f0e5 parameter: its name is different every time the page is loaded, so it is clearly generated dynamically, but its value is always passed empty. That saves us one parameter; we can simply leave it out.

The position of other parameters on the page is as follows:

[Screenshot: the positions of the other parameters in the login page's HTML]

We use XPath to extract these parameters. The code is as follows (I've written the user name and password as xxx; fill in your real credentials when you run it):

# -*- coding: utf-8 -*-
import scrapy
import re

class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    #Login page URL
    start_urls = ['https://github.com/login']

    def parse(self, response):
        #Get request parameters
        commit = response.xpath("//input[@name='commit']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        ga_id = response.xpath("//input[@name='ga_id']/@value").extract_first()
        webauthn_support = response.xpath("//input[@name='webauthn-support']/@value").extract_first()
        webauthn_iuvpaa_support = response.xpath("//input[@name='webauthn-iuvpaa-support']/@value").extract_first()
        # required_field_157f = response.xpath("//input[@name='required_field_4ed5']/@value").extract_first()
        timestamp = response.xpath("//input[@name='timestamp']/@value").extract_first()
        timestamp_secret = response.xpath("//input[@name='timestamp_secret']/@value").extract_first()

        #Construct post parameter
        post_data = {
            "commit": commit,
            "utf8": utf8,
            "authenticity_token": authenticity_token,
            "ga_id": ga_id,
            "login": "[email protected]",
            "password": "xxx",
            "webauthn-support": webauthn_support,
            "webauthn-iuvpaa-support": webauthn_iuvpaa_support,
            # "required_field_4ed5": required_field_4ed5,
            "timestamp": timestamp,
            "timestamp_secret": timestamp_secret
        }

        #Print parameters
        print(post_data)

        #Send post request
        yield scrapy.FormRequest(
            "https://github.com/session",  # the login request URL
            formdata=post_data,
            callback=self.after_login
        )

    #Operation after successful login
    def after_login(self, response):
        #Locate the issues field on the page and print it
        print(re.findall("Issues", response.body.decode()))

We use FormRequest to send the POST request. After running the crawler, an error is reported:

2019-12-01 15:14:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/login> (referer: None)
{'commit': 'Sign in', 'utf8': '✓', 'authenticity_token': '3P4EVfXq3WvBM8fvWge7FfmRd0ORFlS6xGcz5mR5A00XnMe7GhFaMKQ8y024Hyy5r/RFS9ZErUDr1YwhDpBxlQ==', 'ga_id': None, 'login': '[email protected]', 'password': '54ithero', 'webauthn-support': 'unknown', 'webauthn-iuvpaa-support': 'unknown', 'timestamp': '1575184487447', 'timestamp_secret': '6a8b589266e21888a4635ab0560304d53e7e8667d5da37933844acd7bee3cd19'}
2019-12-01 15:14:47 [scrapy.core.scraper] ERROR: Spider error processing <GET https://github.com/login> (referer: None)
Traceback (most recent call last):
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/cxhuan/Documents/python_workspace/scrapy_projects/login/login/spiders/github.py", line 40, in parse
    callback=self.after_login
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/http/request/form.py", line 32, in __init__
    querystr = _urlencode(items, self.encoding)
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/http/request/form.py", line 73, in _urlencode
    for k, vs in seq
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/http/request/form.py", line 74, in <listcomp>
    for v in (vs if is_listlike(vs) else [vs])]
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/utils/python.py", line 107, in to_bytes
    'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got NoneType
2019-12-01 15:14:47 [scrapy.core.engine] INFO: Closing spider (finished)

Looking at the error message, it seems that one of the parameter values came back as None. Checking the printed parameters, we find that ga_id is None. Let's modify the code: when ga_id is None, pass an empty string instead.

The modification code is as follows:

ga_id = response.xpath("//input[@name='ga_id']/@value").extract_first()
if ga_id is None:
    ga_id = ""
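An equivalent, more compact way is to fold the fallback into the extraction itself, since extract_first accepts a default value that is returned instead of None when nothing matches:

# extract_first(default=...) avoids the explicit None check
ga_id = response.xpath("//input[@name='ga_id']/@value").extract_first(default="")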

Run the crawler again, and this time we see the results:

Set-Cookie: _gh_sess=QmtQRjB4UDNUeHdkcnE4TUxGbVRDcG9xMXFxclA1SDM3WVhqbFF5U0wwVFp0aGV1UWxYRWFSaXVrZEl0RnVjTzFhM1RrdUVabDhqQldTK3k3TEd3KzNXSzgvRXlVZncvdnpURVVNYmtON0IrcGw1SXF6Nnl0VTVDM2dVVGlsN01pWXNUeU5XQi9MbTdZU0lTREpEMllVcTBmVmV2b210Sm5Sbnc0N2d5aVErbjVDU2JCQnA5SkRsbDZtSzVlamxBbjdvWDBYaWlpcVR4Q2NvY3hwVUIyZz09LS1lMUlBcTlvU0F0K25UQ3loNHFOZExnPT0%3D--8764e6d2279a0e6960577a66864e6018ef213b56; path=/; secure; HttpOnly

2019-12-01 15:25:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/> (referer: https://github.com/login)
['Issues', 'Issues']
2019-12-01 15:25:18 [scrapy.core.engine] INFO: Closing spider (finished)

We can see that the information we need has been printed and the login is successful.

For form requests, Scrapy's FormRequest also provides a from_response method that automatically extracts the form from the page; we only need to pass in the user name and password to send the request.

Let’s look at the source code of this method:

    @classmethod
    def from_response(cls, response, formname=None, formid=None, formnumber=0, formdata=None,
                      clickdata=None, dont_click=False, formxpath=None, formcss=None, **kwargs):

        kwargs.setdefault('encoding', response.encoding)

        if formcss is not None:
            from parsel.csstranslator import HTMLTranslator
            formxpath = HTMLTranslator().css_to_xpath(formcss)

        form = _get_form(response, formname, formid, formnumber, formxpath)
        formdata = _get_inputs(form, formdata, dont_click, clickdata, response)
        url = _get_form_url(form, kwargs.pop('url', None))

        method = kwargs.pop('method', form.method)
        if method is not None:
            method = method.upper()
            if method not in cls.valid_form_methods:
                method = 'GET'

        return cls(url=url, method=method, formdata=formdata, **kwargs)

We can see that this method has many parameters, all of which are for locating the form. If there is only one form on the login page, Scrapy can locate it easily; but what if the page contains multiple forms? Then we need these parameters to tell Scrapy which one is the login form.

Of course, the premise of this method is that the action attribute of the form on the page contains the URL to submit the request to.
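If a page did contain several forms, a sketch of how those locating parameters could be used might look like this; the formid value "login" is hypothetical, and formname, formxpath, or formcss work the same way:

# Locate the login form explicitly when the page has more than one form
yield scrapy.FormRequest.from_response(
    response,
    formid="login",  # hypothetical id of the login form on the page
    formdata={"login": "user", "password": "pass"},
    callback=self.after_login
)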

In the case of GitHub, the login page has only one form, so we just need to pass in the user name and password. The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
import re

class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['http://github.com/login']

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            response,  # from_response automatically finds the form in the response
            formdata={"login": "[email protected]", "password": "xxx"},
            callback=self.after_login
        )
    #Operation after successful login
    def after_login(self, response):
        #Locate the issues field on the page and print it
        print(re.findall("Issues", response.body.decode()))

After running the crawler, we can see the same results as before.

Isn't this request much simpler? We no longer have to struggle to find all those request parameters. Pretty amazing, right?

Summary

This article introduced several ways to simulate logging in to a website with Scrapy; you can practice with them yourself. Of course, we haven't touched on captchas. Captchas are a complex and difficult topic, which I will cover later.