Crawling Douban book information with Python

Time: 2019-11-29

After crawling the Maoyan top-100 movie list, let's turn to Douban's book information (mainly the book details, the score, and the rating distribution; reviews are not crawled). This is an original post; please contact me before reprinting.


Demand: crawl the details and scores of all books under a given Douban tag

Language: Python

Support libraries:

  • Regex, requests, and parsing: re, requests, bs4, lxml (the latter three need to be installed); storage: openpyxl (also needs installing, e.g. `pip install requests beautifulsoup4 lxml openpyxl`)
  • Random numbers and delays: random, time

Steps: three in total

  1. Visit the tag page to get the links to all books under the tag
  2. Visit the book links one by one to get each book's information and score
  3. Persist the book information (Excel is used here; a database would also work)

I. Visit the tag page to get the links of all books under the tag

As usual, let's first look at Douban's robots.txt; we must not crawl anything it forbids.
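A quick programmatic check is also possible; here is a minimal sketch using the standard library's urllib.robotparser (what the file allows may of course change over time):

    from urllib.robotparser import RobotFileParser

    # fetch and parse Douban's robots.txt, then test one tag-page URL
    rp = RobotFileParser('https://www.douban.com/robots.txt')
    rp.read()
    print(rp.can_fetch('*', 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4'))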

For the tag page to be crawled in this step, take the novel tag as an example: https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4

Let’s take a look at its HTML structure

Each book sits in its own <li> tag, and all we need from it is the link to the book's page (the href on the cover image and title).

    With that, you can write a regular expression or use bs4 to extract each book's link; a short sketch of both follows.
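    Here is a quick sketch of both approaches (the selector follows the page structure above; note that Douban's markup may have changed since this was written):

     import re
     import requests
     from bs4 import BeautifulSoup

     html = requests.get('https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4',
                         headers={'user-agent': 'Mozilla/5.0'}).text
     # bs4: each book's title link sits at li > div.info > h2 > a
     soup = BeautifulSoup(html, 'lxml')
     links = [a['href'] for a in soup.select('#subject_list ul li div.info h2 a')]
     # regex: pull the same hrefs straight out of the raw html
     links_re = re.findall(r'href="(https://book\.douban\.com/subject/\d+/)"', html)
     print(len(links), links[:3])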

    As you can see, each page displays only 20 books, so you need to traverse all the pages, and the page links follow a regular pattern:

    Page 2: https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T

    Page 3: https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=40&type=T

    That is to say, start increases by 20 for each successive page.
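    Building the full page list is then just arithmetic; a tiny sketch (novel tag as the example):

     # page n (0-based) starts at item 20 * n; only ~50 pages actually return data
     base = 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4'
     page_urls = [base + '?start=' + str(20 * n) + '&type=T' for n in range(50)]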

    Here’s the code:

    # -*- coding: utf-8 -*-
     # @Author  : yocichen
     # @Email   : [email protected]
     # @File    : labelListBooks.py
     # @Software: PyCharm
     # @Time    : 2019/11/11 20:10
     
     import re
     import openpyxl
     import requests
     from requests import RequestException
     from bs4 import BeautifulSoup
     import lxml
     import time
     import random
     
     src_list = []
     
     def get_one_page(url):
         '''
         Get the html of a page by requests module
         :param url: page url
         :return: html / None
         '''
         try:
             head = ['Mozilla/5.0', 'Chrome/78.0.3904.97', 'Safari/537.36']  # UA fragments; one is picked at random below
             headers = {
                 'user-agent':head[random.randint(0, 2)]
             }
             response = requests.get(url, headers=headers, proxies={'http': '171.15.65.195:9999'})  # the proxy is optional; if it fails, replace it or drop the proxies argument
             if response.status_code == 200:
                 return response.text
             return None
         except RequestException:
             return None
     
     def get_page_src(html, selector):
         '''
         Get book's src from label page
         :param html: label page's html
         :param selector: src selector
         :return: src(list)
         '''
         # html = get_one_page(url)
         if html is not None:
             soup = BeautifulSoup(html, 'lxml')
             res = soup.select(selector)
             pattern = re.compile('href="(.*?)"', re.S)
             src = re.findall(pattern, str(res))
             return src
         else:
             return []
     
     def write_excel_xlsx(items, file):
         '''
         Write the useful info into excel(*.xlsx file)
         :param items: book's info
         :param file: memory excel file
         :return: the num of successful item
         '''
         wb = openpyxl.load_workbook(file)
         ws = wb.worksheets[0]
         sheet_row = ws.max_row
         item_num = len(items)
         # Write book's info
         for i in range(0, item_num):
             ws.cell(sheet_row+i+1, 1).value = items[i]
         # Save the work book as *.xlsx
         wb.save(file)
         return item_num
     
     if __name__ == '__main__':
         total = 0
         for page_index in range(0, 50):  # why 50? Douban appears to list more pages, but those after 50 return no data, so only the first 50 are accessible
             # novel label src : https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=
             # program label src : https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=
             # computer label src : https://book.douban.com/tag/%E8%AE%A1%E7%AE%97%E6%9C%BA?start=
             # masterpiece label src : https://book.douban.com/tag/%E5%90%8D%E8%91%97?start=
             url = 'https://book.douban.com/tag/%E5%90%8D%E8%91%97?start=' + str(page_index * 20) + '&type=T'  # to crawl a different tag, replace the URL-encoded tag name (the part after /tag/)
             one_loop_done = 0
             # only get html page once
             html = get_one_page(url)
             for book_index in range(1, 21):
                 selector = '#subject_list > ul > li:nth-child('+str(book_index)+') > div.info > h2'
                 src = get_page_src(html, selector)
                 row = write_excel_xlsx(src, 'masterpiece_books_src.xlsx')  # this storage file needs to be created first
                 one_loop_done += row
             total += one_loop_done
             print(one_loop_done, 'done')
         print('Total', total, 'done')

    The comments should make this clear: first fetch the page's HTML, then traverse each page with a regex or bs4 to collect the book links, and save them into an Excel file.

    Note: if you want to use my code directly, just look at the link of the tag page you want, replace the URL-encoded tag name in the code accordingly, and create an Excel file in advance to store the crawled book links (see the snippet below).
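    For example, the empty workbook can be created once with openpyxl (the file name matches the script above):

     import openpyxl

     wb = openpyxl.Workbook()  # a new workbook with one empty sheet
     wb.save('masterpiece_books_src.xlsx')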


    II. Visit the book links one by one and crawl the book information and scores

    In the previous step we crawled the src (page links) of all books under the novel tag. This step visits each book's src in turn and crawls the book's specific information.

    First look at the HTML structure of the information to be crawled

    Here is the book information page structure

    And here is the structure of the scoring section

    In this way, we can use regular expressions and the bs4 library to match the data we need. (I first tried to do it with pure regex; it was hard to write and didn't work well.)

    Look at the code below

    # -*- coding: utf-8 -*-
     # @Author  : yocichen
     # @Email   : [email protected]
     # @File    : doubanBooks.py
     # @Software: PyCharm
     # @Time    : 2019/11/9 11:38
     
     import re
     import openpyxl
     import requests
     from requests import RequestException
     from bs4 import BeautifulSoup
     import lxml
     import time
     import random
     
     def get_one_page(url):
         '''
         Get the html of a page by requests module
         :param url: page url
         :return: html / None
         '''
         try:
             head = ['Mozilla/5.0', 'Chrome/78.0.3904.97', 'Safari/537.36']
             headers = {
                 'user-agent':head[random.randint(0, 2)]
             }
             response = requests.get(url, headers=headers) #, proxies={'http':'171.15.65.195:9999'}
             if response.status_code == 200:
                 return response.text
             return None
         except RequestException:
             return None
     
     def get_request_res(pattern_text, html):
         '''
         Get the book info by re module
         :param pattern_text: re pattern
         :param html: page's html text
         :return: book's info
         '''
         pattern = re.compile(pattern_text, re.S)
         res = re.findall(pattern, html)
         if len(res) > 0:
             return res[0].split('<')[0].strip()
         else:
             return 'NULL'
     
     def get_bs_res(selector, html):
         '''
         Get the book info by bs4 module
         :param selector: css selector of the info
         :param html: page's html text
         :return: book's info (str) / None
         '''
         soup = BeautifulSoup(html, 'lxml')
         res = soup.select(selector)
         if res is not None and len(res) > 0:
             return res[0].string
         else:
             return None
     
     def get_bs_img_res(selector, html):
         '''
         Get the img label by bs4 module
         :param selector: css selector of the img label
         :param html: page's html text
         :return: img label (str)
         '''
         soup = BeautifulSoup(html, 'lxml')
         res = soup.select(selector)
         if len(res) > 0:
             return str(res[0])
         else:
             return ''
     
     def parse_one_page(html):
         '''
         Parse the useful info of html by re and bs4 modules
         :param html: page's html text
         :return: book's info (dict)
         '''
         book_info = {}
         book_name = get_bs_res('div > h1 > span', html)
         # print('Book-name', book_name)
         book_info['Book_name'] = book_name
         # info > a:nth-child(2)
         author = get_bs_res('div > span:nth-child(1) > a', html)
         if author is None:
             author = get_bs_res('#info > a:nth-child(2)', html)
         # print('Author', author)
         author = author.replace(" ", "")
         author = author.replace("\n", "")
         book_info['Author'] = author
     
         publisher = get_request_res(u'出版社:</span>(.*?)<', html)
         # print('Publisher', publisher)
         book_info['publisher'] = publisher
     
         publish_time = get_request_res(u'出版年:</span>(.*?)<', html)
         # print('Publish-time', publish_time)
         book_info['publish_time'] = publish_time
     
         ISBN = get_request_res(u'ISBN:</span>(.*?)<', html)
         # print('ISBN', ISBN)
         book_info['ISBN'] = ISBN
     
         img_label = get_bs_img_res('#mainpic > a > img', html)
         pattern = re.compile('src="(.*?)"', re.S)
         img = re.findall(pattern, img_label)
         if len(img) != 0:
             # print('img-src', img[0])
             book_info['img_src'] = img[0]
         else:
             # print('src not found')
             book_info['img_src'] = 'NULL'
     
         book_intro = get_bs_res('#link-report > div:nth-child(1) > div > p', html)
         # print('book introduction', book_intro)
         book_info['book_intro'] = book_intro
     
         author_intro = get_bs_res('#content > div > div.article > div.related_info > div:nth-child(4) > div > div > p', html)
         # print('author introduction', author_intro)
         book_info['author_intro'] = author_intro
     
         grade = get_bs_res('div > div.rating_self.clearfix > strong', html)
         if len(grade) == 1:
             # print('Score no mark')
             book_info['Score'] = 'NULL'
         else:
             # print('Score', grade[1:])
             book_info['Score'] = grade[1:]
     
         comment_num = get_bs_res('#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span', html)
         # print('commments', comment_num)
         book_info['commments'] = comment_num
     
         five_stars = get_bs_res('#interest_sectl > div > span:nth-child(5)', html)
         # print('5-stars', five_stars)
         book_info['5_stars'] = five_stars
     
         four_stars = get_bs_res('#interest_sectl > div > span:nth-child(9)', html)
         # print('4-stars', four_stars)
         book_info['4_stars'] = four_stars
     
         three_stars = get_bs_res('#interest_sectl > div > span:nth-child(13)', html)
         # print('3-stars', three_stars)
         book_info['3_stars'] = three_stars
     
         two_stars = get_bs_res('#interest_sectl > div > span:nth-child(17)', html)
         # print('2-stars', two_stars)
         book_info['2_stars'] = two_stars
     
         one_stars = get_bs_res('#interest_sectl > div > span:nth-child(21)', html)
         # print('1-stars', one_stars)
         book_info['1_stars'] = one_stars
     
         return book_info
     
     def write_bookinfo_excel(book_info, file):
         '''
         Write book info into excel file
         :param book_info: a dict
         :param file: memory excel file
         :return: the num of successful item
         '''
         wb = openpyxl.load_workbook(file)
         ws = wb.worksheets[0]
         sheet_row = ws.max_row
         sheet_col = ws.max_column
         i = sheet_row
         j = 1
         for key in book_info:
             ws.cell(i+1, j).value = book_info[key]
             j += 1
         done = ws.max_row - sheet_row
         wb.save(file)
         return done
     
     def read_booksrc_get_info(src_file, info_file):
         '''
         Read the src file and access each src, parse html and write info into file
         :param src_file: src file
         :param info_file: memory file
         :return: the num of successful item
         '''
         wb = openpyxl.load_workbook(src_file)
         ws = wb.worksheets[0]
         row = ws.max_row
         done = 0
         for i in range(868, row+1):  # apparently resumes an earlier run from row 868; use range(1, row+1) to process the whole src list
             src = ws.cell(i, 1).value
             if src is None:
                 continue
             html = get_one_page(str(src))
             book_info = parse_one_page(html)
             done += write_bookinfo_excel(book_info, info_file)
             if done % 10 == 0:
                 print(done, 'done')
         return done
     
     if __name__ == '__main__':
         # url = 'https://book.douban.com/subject/1770782/'
         # html = get_one_page(url)
         # # print(html)
         # book_info = parse_one_page(html)
         # print(book_info)
         # res = write_bookinfo_excel(book_info, 'novel_books_info.xlsx')
         # print(res, 'done')
         res = read_booksrc_get_info('masterpiece_books_src.xlsx', 'masterpiece_books_info.xlsx')  # args: the src file from step one, and the storage file for book info (create it first)
         print(res, 'done')

    Note: if you want to use this directly, you only need to supply the parameters: the first is the src file obtained in the previous step, the second is the file that will store the book information (create it in advance).


    III. Persistent storage of book information (Excel)

    Excel is used to store both the src list of books and each book's detailed information; reading and writing go through the openpyxl library, in the write_*/read_* functions above.
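    A minimal sketch of reading the stored results back with openpyxl, e.g. to spot-check a finished crawl (file name as used above):

     import openpyxl

     wb = openpyxl.load_workbook('masterpiece_books_info.xlsx')
     ws = wb.worksheets[0]
     # print the first few stored rows as tuples
     for row in ws.iter_rows(min_row=1, max_row=5, values_only=True):
         print(row)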


    Results

    The crawled src list of the novel tag:

    The crawled book details:

    Epilogue

    It took about two days to write this, on and off. A crawler's work is quite detail-oriented: you have to analyze HTML pages and write regular expressions. Having said that, bs4 is really simple to use: you can just copy a selector from the browser, which greatly improves efficiency. Also, this single-threaded crawler is slow, and there are many other deficiencies (e.g., untidy code, not robust enough); corrections are welcome.

    Reference material

    [1] Douban robots.txt https://www.douban.com/robots.txt

    [2] https://blog.csdn.net/jerrygaoling/article/details/81051447

    [3] https://blog.csdn.net/zhangfn2011/article/details/7821642

    [4] https://www.kuaidaili.com/free