Python crawlers step by step: from crawling one chapter of a novel to crawling every novel on a site



Many good novels can be read online but not downloaded. This walkthrough shows how to crawl every novel on a website.

Knowledge points:

  1. requests

  2. xpath

  3. The approach to crawling every novel on a site

Development environment:

  1. Version: Anaconda 5.2.0 (Python 3.6.5)

  2. Editor: PyCharm

Third-party libraries:

  1. requests

  2. parsel

Analyzing the web page

Target site: (screenshot)

  • Using the browser developer tools

    • Network panel

    • Elements panel

Crawling one chapter of a novel

  • Using the requests library (requesting page data)

  • Wrapping the request steps in a function

  • Using CSS selectors (parsing page data)

  • File operations (data persistence)
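
The request step from the list above can be wrapped in a small function. This is only a minimal sketch: the name `fetch_html`, the `get` parameter (there so the function can be tried without network access) and the 10-second timeout are assumptions, not part of the original script.

```python
import requests

# Browser-like User-Agent, the same one used by the scripts in this article.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/75.0.3770.142 Safari/537.36"
}


def fetch_html(url, get=requests.get, timeout=10):
    """Request a page and return its decoded HTML (hypothetical helper)."""
    response = get(url, headers=HEADERS, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    # Let requests guess the encoding from the page bytes.
    response.encoding = response.apparent_encoding
    return response.text
```

Calling `fetch_html(chapter_url)` then returns the page source as a string, ready to be handed to a selector.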

# -*- coding: utf-8 -*-
"""Crawl one chapter of a novel."""
import requests
import parsel

# Request the page data
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}

# The chapter URL is left blank here; fill it in before running
response = requests.get('', headers=headers)
response.encoding = response.apparent_encoding
html = response.text

# Extract content from the page
sel = parsel.Selector(html)

title = sel.css('.content h1::text').extract_first()
contents = sel.css('#content::text').extract()
contents2 = []
for content in contents:
    # Strip whitespace from both ends of each text fragment
    contents2.append(content.strip())

# Write the content to a text file
with open(title + '.txt', mode='w', encoding='utf-8') as f:
    f.write('\n'.join(contents2))

Crawling a whole novel

  • Refactoring the crawler

    There are many chapters to crawl. The crudest approach is to wrap the single-chapter script in a for loop as it stands; here it is first refactored into functions.

  • Crawling the index page

    To crawl every chapter, you only need each chapter's URL, and the book's index page lists them all.
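
Links on an index page are often relative, so each one has to be resolved against the page URL before it can be requested. A minimal sketch using the standard library's `urljoin`; the `example.com` URL and the `hrefs` values are placeholders, not the article's target site.

```python
from urllib.parse import urljoin

# Hypothetical index page URL and href values as they might appear on it.
index_url = "https://example.com/book/1/"
hrefs = ["1.html", "/book/1/2.html", "https://example.com/book/1/3.html"]

# urljoin resolves each href against the page it was found on,
# whether it is relative, root-relative, or already absolute.
chapter_urls = [urljoin(index_url, href) for href in hrefs]

for url in chapter_urls:
    print(url)
```

All three forms resolve to a full URL under `https://example.com/book/1/`, which avoids hand-concatenating a base URL string.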

import requests
import parsel

"""Get the page source."""

# Send requests that look like they come from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}

def download_one_chapter(target_url):
    """Download one chapter and save it as '<title>.txt'."""
    # response is the object the server returns for target_url
    # (in PyCharm, Ctrl + left click on a name jumps to its definition)
    response = requests.get(target_url, headers=headers)

    # Auto-detect the encoding so the text decodes correctly
    response.encoding = response.apparent_encoding

    # .text returns the page content as a string
    # print(response.text)
    html = response.text

    # --- Extract information from the page source ---
    # Parse the string into a Selector object
    sel = parsel.Selector(html)

    # parsel is the selector library used by Scrapy
    # '.content h1' and '#content' are CSS selectors; '::text' is a
    # pseudo-element that selects the text inside the matched tag
    # extract_first() returns the first match
    title = sel.css('.content h1::text').extract_first()
    # extract() returns every match as a list
    contents = sel.css('#content::text').extract()

    # --- Clean the data: strip surrounding whitespace ---
    # The list comprehension is equivalent to a for loop that calls
    # contents1.append(content.strip()) for each item
    contents1 = [content.strip() for content in contents]
    # Join the list into a single string
    text = '\n'.join(contents1)

    # --- Save the chapter ---
    # Open the file for writing
    file = open(title + '.txt', mode='w', encoding='utf-8')

    # Only strings can be written
    file.write(text)

    # Close the file
    file.close()

# Get the chapter links from a novel's index (table of contents) page
def get_book_links(book_url):
    response = requests.get(book_url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text
    sel = parsel.Selector(html)
    # Every chapter link sits in an <a> inside a <dd>
    links = sel.css('dd a::attr(href)').extract()
    return links

# Download a whole novel
def get_one_book(book_url):
    links = get_book_links(book_url)
    for link in links:
        # '' is the prefix prepended to each extracted link (left blank here)
        print('' + link)
        download_one_chapter('' + link)

if __name__ == '__main__':
    # target_url = ''
    # The argument can be passed by keyword or by position
    # download_one_chapter(target_url=target_url)
    # To download a different novel, just change this URL
    book_url = ''
    get_one_book(book_url)
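
The cleaning and saving steps can be made a little more defensive. The sketch below is one possible variant, not the original code: it drops fragments that are empty after stripping, replaces characters that are not valid in file names, and uses a `with` block so the file is closed even if the write fails. The helper names `clean_lines`, `safe_filename` and `save_chapter` are hypothetical.

```python
import os
import re


def clean_lines(fragments):
    """Strip each text fragment and drop the ones that end up empty."""
    stripped = (fragment.strip() for fragment in fragments)
    return [line for line in stripped if line]


def safe_filename(title, fallback="untitled"):
    """Replace characters that are not allowed in file names."""
    name = re.sub(r'[\\/:*?"<>|]', "_", title or "").strip()
    return name or fallback


def save_chapter(title, fragments, directory="."):
    """Write one chapter to '<directory>/<title>.txt' and return the path."""
    path = os.path.join(directory, safe_filename(title) + ".txt")
    # The with statement closes the file even if write() raises.
    with open(path, mode="w", encoding="utf-8") as f:
        f.write("\n".join(clean_lines(fragments)))
    return path
```

The `fallback` name matters because `extract_first()` returns `None` when the title selector matches nothing, and `None + '.txt'` would raise a `TypeError`.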

Crawling every novel on the site

  • Crawling the index page

    To crawl every novel, you only need the index page URL of each book.
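
Put together, a whole-site crawl is a nested loop: site index, then each book's index, then each chapter. The sketch below only shows that control flow under stated assumptions. `crawl_site` and `get_book_urls` are hypothetical names (the article does not show how book links are extracted from the site index), while `get_chapter_urls` and `download_chapter` play the roles of `get_book_links` and `download_one_chapter`. A dictionary stands in for the site so the example runs without network access.

```python
def crawl_site(site_index_url, get_book_urls, get_chapter_urls, download_chapter):
    """Walk site index -> book index -> chapter pages; return the chapter count."""
    count = 0
    for book_url in get_book_urls(site_index_url):
        for chapter_url in get_chapter_urls(book_url):
            download_chapter(chapter_url)
            count += 1
    return count


# Stand-in data so the sketch runs offline: each key maps to the links
# that would be found on that page.
fake_site = {
    "site": ["book-a", "book-b"],
    "book-a": ["a/1", "a/2"],
    "book-b": ["b/1"],
}

downloaded = []
total = crawl_site("site", fake_site.get, fake_site.get, downloaded.append)
print(total, downloaded)
```

Passing the three steps in as functions keeps the loop independent of any particular site's selectors, so only the link-extraction functions need to change for a different site.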

