Storing crawler data in MongoDB with Python

Time:2019-10-24

The previous two articles introduced Python crawlers and MongoDB, so here I will store the data that a crawler scrapes in MongoDB. First, let me introduce the site we are going to crawl: readfree. It is a very good website. Just by signing in, you can download three books for free every day. A site with a conscience. Below I will scrape the site's daily recommended books.

Using the methods described in the previous articles, we can easily find the name and author of the book in the source code of the web page.

Once we find them, we copy their XPath expressions and extract them. The source code is shown below:


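# coding=utf-8

import re
import requests
from lxml import etree
import pymongo
import sys

# Python 2 only: make utf-8 the default encoding so that str() on scraped
# Unicode text does not raise UnicodeEncodeError. Python 3 strings are already
# Unicode, and reload()/sys.setdefaultencoding() are not available there.
if sys.version_info[0] == 2:
  reload(sys)
  sys.setdefaultencoding('utf-8')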

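def getpages(url, total):
  # Read the starting page number from the URL, e.g. 1 in '?page=1'.
  nowpage = int(re.search(r'(\d+)', url).group(1))
  urls = []

  for i in range(nowpage, total + 1):
    # Swap the page number in the URL for i. Flags such as re.S must not be
    # passed positionally here: the fourth positional argument of re.sub is count.
    link = re.sub(r'(\d+)', str(i), url)
    urls.append(link)

  return urls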

def spider(url):
  # Download the page and parse it into an element tree.
  html = requests.get(url)

  selector = etree.HTML(html.text)

  # XPath expressions copied from the browser: book titles and their authors.
  book_name = selector.xpath('//*[@id="container"]/ul/li//div/div[2]/a/text()')
  book_author = selector.xpath('//*[@id="container"]/ul/li//div/div[2]/div/a/text()')

  saveinfo(book_name, book_author)

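def saveinfo(book_name, book_author):
  # Connects to the MongoDB server on localhost:27017 (the pymongo default).
  connection = pymongo.MongoClient()
  BookDB = connection.BookDB
  BookTable = BookDB.books

  # zip() pairs each title with its author and stops at the shorter list,
  # so a missing author cannot cause an IndexError.
  for name, author in zip(book_name, book_author):
    # One dictionary, and one insert, per book.
    books = {}
    books['name'] = str(name).replace('\n', '')
    books['author'] = str(author).replace('\n', '')
    BookTable.insert_one(books)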

if __name__ == '__main__':
  url = 'http://readfree.me/shuffle/?page=1'
  urls = getpages(url,3)

  for each in urls:
    spider(each)

Note that when writing to the database, you should not write the dictionary to the database all at once. That is what I did at first, and I found only three records in the database, with everything else missing. So write the books one by one.
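To make the "one by one" point concrete, here is a minimal sketch of building one dictionary per book before inserting. The names build_documents and save_books are hypothetical, the sample titles and authors are made up, and collection is only assumed to behave like a pymongo collection such as BookTable above.

```python
def build_documents(names, authors):
    """Return one dict per book, pairing each title with its author."""
    return [{'name': n, 'author': a} for n, a in zip(names, authors)]


def save_books(collection, names, authors):
    """Insert the books one document at a time.

    `collection` is assumed to behave like a pymongo collection,
    i.e. to offer insert_one().
    """
    for doc in build_documents(names, authors):
        collection.insert_one(doc)


# Made-up sample data, only to show the shape of the result.
docs = build_documents(['Book A', 'Book B'], ['Author A', 'Author B'])
print(len(docs))
```

Each book gets its own fresh dictionary, so every insert creates a separate record. pymongo also offers insert_many(), which accepts a list of such dictionaries in a single call, if you prefer fewer round trips.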

Also, under Python 2 the default encoding settings at the beginning of the source code must not be omitted, otherwise the program may report encoding errors. (I really feel that Python makes it easy to slip up on encoding, which is embarrassing.)

Some of you may have noticed that I converted the extracted information into a string and then used the replace() method to get rid of the \n characters. I did this because the line breaks before and after the extracted book information were quite annoying.
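The clean-up described above can also be sketched as a small helper. This is only one possible variant: clean_text is a hypothetical name, the sample value is made up, and it uses strip() instead of replace(), which trims surrounding spaces as well as line breaks but leaves any line break inside the text untouched.

```python
def clean_text(value):
    """Convert a scraped value to text and trim surrounding whitespace."""
    return str(value).strip()


raw_title = '\n    Some Book Title\n  '  # made-up sample value
print(clean_text(raw_title))
```

If line breaks can also appear in the middle of a title, replace('\n', '') as used in the crawler is still the right tool for those.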

As a friendly reminder, don't forget to have MongoDB running while the program runs. Then check out the results.
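If you want the script to tell you when MongoDB is not running, one rough approach is to check whether anything accepts connections on MongoDB's default port, 27017, before crawling. The sketch below uses only the standard library; mongo_is_reachable is a hypothetical helper, and it assumes the server runs on localhost with the default port. It only shows that something is listening there, not that it is really MongoDB.

```python
import socket


def mongo_is_reachable(host='localhost', port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except socket.error:
        # Connection refused, timed out, or host not resolvable.
        return False
    conn.close()
    return True


if __name__ == '__main__':
    if mongo_is_reachable():
        print('Something is listening on port 27017; start the crawler.')
    else:
        print('Nothing is listening on port 27017; start MongoDB first.')
```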

OK, that's it. If you find any errors or possible improvements in the code, please leave me a message. Thank you.