Introduction to Scrapy – crawl the Douban Movie Top 250!

Time: 2021-11-25

This lesson only covers Scrapy running in a Python 3 environment (i.e. Scrapy 1.3+).

Which website should we crawl?

For foreigners, the go-to site for practicing Scrapy crawlers is the official practice website: http://quotes.toscrape.com

We Chinese, of course, use Douban! https://movie.douban.com/top250


Step 1: setup and preparation

  1. To give Scrapy a clean environment to run in, virtualenv is a good choice.
>>> mkdir douban250 && cd douban250
>>> virtualenv -p python3.5 doubanenv

First, make sure you have virtualenv and Python 3.x installed. The command above creates a virtualenv based on the Python 3.5 interpreter.

virtualenv tutorial: Liao Xuefeng's Python tutorial – virtualenv

  2. Activate the virtualenv and install Scrapy
>>> source doubanenv/bin/activate
>>> pip install scrapy
  3. Use Scrapy to initialize a project; for example, we'll name it douban_crawler
>>> scrapy startproject douban_crawler

A directory structure is generated:

douban_crawler/
    scrapy.cfg
    douban_crawler/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py

And the command line will give you a prompt:

You can start your first spider with:
    cd douban_crawler
    scrapy genspider example example.com
  4. Follow the prompt and perform these two steps!
>>> cd douban_crawler
>>> scrapy genspider douban movie.douban.com/top250

After running genspider, a spiders/ directory with the new spider is added to the structure:

douban_crawler/
    scrapy.cfg
    douban_crawler/
        spiders/
            __init__.py
            douban.py
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py

Set up the project in PyCharm

  1. First, open douban_crawler/ in PyCharm.
  2. Then set PyCharm's virtual environment:

Preferences > Project: douban_crawler > Project Interpreter
Click the settings icon > Add Local > Existing environment, then switch the preset Python interpreter to the one inside the newly created virtualenv: doubanenv > bin > python3.5

Start crawling

The preparations are finished and you can start crawling!

Open the douban.py file in the spiders/ directory:

# douban_crawler/ > spiders/ > douban.py

import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com/top250']
    start_urls = ['http://movie.douban.com/top250/']

    def parse(self, response):
        pass

start_urls holds the URLs we want to crawl!

In start_urls, change http://movie.douban.com/top250/ to https://movie.douban.com/top250/.

Next, we'll rewrite the parse() function to do the parsing.

Parse the Douban Top 250 entries

Open https://movie.douban.com/top250/ in Chrome or Firefox.

Use "Inspect" from the right-click menu to examine the elements, and you can see:

Each entry is wrapped in an '.item' element.

Under each entry:

The src attribute of '.pic a img' contains the cover image URL
The href attribute of '.info .hd a' contains the Douban link
The text of '.info .hd a .title' contains the title; since each movie has several alternative titles, we only take the first one
'.info .bd .star .rating_num' contains the rating
'.info .bd .quote span.inq' contains the one-line quote

In addition, the director, year, cast, synopsis and other details can only be crawled by following each entry's own page. Let's crawl the five fields above first!

Fill in the parse() function according to the analysis above:

# douban_crawler/ > spiders/ > douban.py

# -*- coding: utf-8 -*-
import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com/top250']
    start_urls = ['https://movie.douban.com/top250/']

    def parse(self, response):
        items = response.css('.item')
        for item in items:
            yield {
                'cover_pic': item.css('.pic a img::attr(src)').extract_first(),
                'link': item.css('.info .hd a::attr(href)').extract_first(),
                'title': item.css('.info .hd a .title::text').extract_first(),
                'rating': item.css('.info .bd .star .rating_num::text').extract_first(),
                'quote': item.css('.info .bd .quote span.inq::text').extract_first()
            }

.css() takes selectors much like jQuery or pyquery, but it can also use ::text or ::attr(href) to pull out text or attributes directly, which is arguably more convenient than jQuery's selectors.

Of course, .css() only returns selector objects. If you need the actual text or attribute values, you need .extract() or .extract_first().

.extract() returns a list of all matching text or attribute values, while .extract_first() returns only the first match.
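To make the difference concrete, here is a minimal sketch that runs Scrapy's Selector on a handcrafted HTML snippet (not Douban's real markup):

# A tiny demo of .extract() vs .extract_first() on a handcrafted snippet
from scrapy.selector import Selector

html = '<div class="hd"><a><span class="title">Title A</span><span class="title">Title B</span></a></div>'
sel = Selector(text=html)

print(sel.css('.hd a .title::text').extract())        # ['Title A', 'Title B'], every match
print(sel.css('.hd a .title::text').extract_first())  # 'Title A', only the first match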

Verify whether the parsing statement is correct in the shell

Here's a useful trick: verify the CSS selectors you just wrote in the Scrapy shell first.

Open the Terminal at the bottom left of the PyCharm window and enter:

>>> scrapy shell https://movie.douban.com/top250

A pile of log output will scroll by, ending with a block of hints:

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x10d276f28>
[s]   item       {}
[s]   request    <GET https://movie.douban.com/top250>
[s]   response   <403 https://movie.douban.com/top250>
[s]   settings   <scrapy.settings.Settings object at 0x10e543128>
[s]   spider     <DefaultSpider 'default' at 0x10fa99080>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

Look at the response line

What? A 403 error was returned?

It turns out we didn't set a User-Agent request header for our crawler, and Douban rejects GET requests without one, so it returned a 403 error.

At this time, quickly add a line in settings.py

# douban_crawler/ > settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36' 

After saving the file, run again

>>> scrapy shell https://movie.douban.com/top250

This time the prompt shows:

[s]   response   <200 https://movie.douban.com/top250>

At this time, you can start to check whether the parsing statement can get the desired result:

response.css('.item')

A list of <Selector> objects is returned. Let's take the first one:

>>> items = response.css('.item')
>>> item = items[0]

# Copy in the parsing statement for cover_pic
>>> item.css('.pic a img::attr(src)').extract_first()

which returns

'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg'

That proves the parsing statement is correct; the other four can be verified one by one:

>>> item.css('.pic a img::attr(src)').extract_first()
'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg'
>>> item.css('.info .hd a::attr(href)').extract_first()
'https://movie.douban.com/subject/1292052/'
>>> item.css('.info .hd a .title::text').extract_first()
'肖申克的救赎'
>>> item.css('.info .bd .star .rating_num::text').extract_first()
'9.6'
>>> item.css('.info .bd .quote span.inq::text').extract_first()
'希望让人自由。'

At this point, exit the shell with exit() and rerun the crawler:

>>> scrapy crawl douban

You can see the parsed data output!

Turn pages and crawl all 250 entries

We've just crawled the site, but only 25 entries are shown on this page. How do we crawl all 250 by turning pages?

Using Chrome's "Inspect" tool, we can find the selector for the "next page" link on the Douban page:

response.css('.paginator .next a::attr(href)')

Now let's rewrite douban.py:

# douban_crawler/ > spiders/ > douban.py
import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com/top250']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        items = response.css('.item')
        for item in items:
            yield {
                'cover_pic': item.css('.pic a img::attr(src)').extract_first(),
                'link': item.css('.info .hd a::attr(href)').extract_first(),
                'title': item.css('.info .hd a .title::text').extract_first(),
                'rating': item.css('.info .bd .star .rating_num::text').extract_first(),
                'quote': item.css('.info .bd .quote span.inq::text').extract_first()
            }
        next_page = response.css('.paginator .next a::attr(href)').extract_first()
        if next_page:
            next_page_real = response.urljoin(next_page)
            yield scrapy.Request(next_page_real, callback=self.parse, dont_filter=True)

The code above splices the relative link into a real URL with response.urljoin(),
then yields a new Request whose callback is parse() itself, so the crawler recursively walks through every page.

Note! When crawling Douban you must add the dont_filter=True option. By default Scrapy filters out requests it considers duplicates or outside allowed_domains; since allowed_domains here was set to 'movie.douban.com/top250' (a URL rather than a bare domain), the follow-up page requests would otherwise be dropped, so we turn the filtering off for these requests.
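To see what response.urljoin() actually does, here is a rough sketch using the standard library's urljoin, which Scrapy's Response.urljoin is built on; the relative href below is only an assumed example of what a "next page" link might look like:

# Rough sketch of the URL splicing done by response.urljoin()
from urllib.parse import urljoin

base = 'https://movie.douban.com/top250'   # plays the role of response.url
next_href = '?start=25&filter='            # assumed example of a relative next-page href
print(urljoin(base, next_href))            # https://movie.douban.com/top250?start=25&filter=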

Run the crawler again

>>> scrapy crawl douban

All 250 entries have been crawled!

Store data to file

Very simple: when running the crawler, add -o to write the output to a file!

>>> scrapy crawl douban -o douban.csv

You'll then find a .csv file in the current directory containing the results we want!

You can also store the data as .json, .xml, .pickle, .jl (JSON lines) and other formats, all supported by Scrapy!
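For example, the following commands should work the same way, with Scrapy picking the exporter from the file extension (the file names are arbitrary):

>>> scrapy crawl douban -o douban.json
>>> scrapy crawl douban -o douban.jl      # JSON lines, one item per line
>>> scrapy crawl douban -o douban.xml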

Use items to structure the crawled data

Scrapy's Item plays much the same role as a model in Django or other MVC frameworks: it turns the data into a fixed structure so it can be conveniently saved and displayed.

Open the items.py file and define a data type named DoubanItem as follows.

# douban_crawler > items.py
import scrapy

class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    rating = scrapy.Field()
    cover_pic = scrapy.Field()
    quote = scrapy.Field()

With DoubanItem defined, we can modify the parse() function in douban.py so that all the crawled information is converted into item form:

# -*- coding: utf-8 -*-
import scrapy
from douban_crawler.items import DoubanItem

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com/top250']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        items = response.css('.item')
        for item in items:
            yield DoubanItem({
                'cover_pic': item.css('.pic a img::attr(src)').extract_first(),
                'link': item.css('.info .hd a::attr(href)').extract_first(),
                'title': item.css('.info .hd a .title::text').extract_first(),
                'rating': item.css('.info .bd .star .rating_num::text').extract_first(),
                'quote': item.css('.info .bd .quote span.inq::text').extract_first()
            })

        next_page = response.css('.paginator .next a::attr(href)').extract_first()
        if next_page:
            next_page_real = response.urljoin(next_page)
            yield scrapy.Request(next_page_real, callback=self.parse, dont_filter=True)

Very simple, only two lines have been modified:

  1. Import DoubanItem.
  2. The parse() function used to yield a plain dict; now we simply pass that dict into DoubanItem to get a DoubanItem object (see the short sketch below).
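
Roughly speaking, a DoubanItem behaves like a dict with a fixed set of allowed keys. A small sketch (the values are placeholders, and the import only works inside the project):

# Sketch: DoubanItem supports dict-style construction and access
from douban_crawler.items import DoubanItem

item = DoubanItem({'title': 'some title', 'rating': '9.6'})  # placeholder values
print(item['title'])   # 'some title'
print(dict(item))      # back to a plain dict, just like the pipeline will do later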

Now you can run scrapy crawl douban again and check whether the output has been converted into DoubanItem form.

Store data to mongodb

With the DoubanItem data structure in place, we can save the results into MongoDB!

To save to MongoDB we need the pipelines component: yes, the pipelines.py file from the directory listing above.

Before that, make sure of two things:

  1. The MongoDB service is running. If not, start it locally (e.g. with sudo mongod).
  2. The pymongo package is installed. If not, run pip install pymongo. (Commands are shown below.)
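
In command form (the exact way to start MongoDB depends on how it was installed; this is just the simplest local case):

>>> sudo mongod           # start a local MongoDB instance (add --dbpath if your config needs it)
>>> pip install pymongo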

When using a pipeline, remember four words!

Open!

(opening the crawler) corresponds to open_spider, which is called when the spider is opened.

We open the MongoDB connection here.

Take!

(taking on the crawl) corresponds to from_crawler, which has several characteristics:

  1. It is a class method, so it must be decorated with @classmethod.
  2. As long as this method exists, it will be called to create the pipeline.
  3. It must return an instance of our pipeline class.
  4. It can access all of Scrapy's core components, which means it can read the settings object.

Transform!

(transforming the item) corresponds to process_item, which has several characteristics:

  1. It must return an item object or raise a DropItem exception.
  2. Each crawled item is passed into this method, where you can process it as needed.

Close!

(closing the crawler) corresponds to close_spider, which is called when the spider is closed.

We close the MongoDB connection here.

With the above in mind, here is the code!

# douban_crawler > pipelines.py
import pymongo


class DoubanCrawlerPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri = crawler.settings.get('MONGO_URI'),
            mongo_db = crawler.settings.get('MONGO_DB')
        )

    def process_item(self, item, spider):
        self.db['douban250'].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

At the same time, add the MongoDB configuration in settings.py:

MONGO_URI = "localhost"
MONGO_DB = "douban"

There is one more very important step!
Uncomment the ITEM_PIPELINES section in settings.py!!

ITEM_PIPELINES = {
   'douban_crawler.pipelines.DoubanCrawlerPipeline': 300,
}

Now run the crawler:

scrapy crawl douban

The crawled information has been saved to the database!
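
To double-check, you can query MongoDB directly with pymongo; a minimal sketch, assuming the MONGO_URI / MONGO_DB settings above and the douban250 collection used in the pipeline:

# Sketch: verify that the crawled items landed in MongoDB
import pymongo

client = pymongo.MongoClient('localhost')   # same value as MONGO_URI
db = client['douban']                       # same value as MONGO_DB
print(db['douban250'].count_documents({}))  # should be 250 once the crawl has finished
print(db['douban250'].find_one({}, {'_id': 0, 'title': 1, 'rating': 1}))
client.close()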
