Python crawlers, step by step: from crawling a single chapter to crawling every novel on a site

Time: 2020-2-8

Preface

The text and images in this article come from the Internet and are for learning and exchange only; they are not for any commercial use. Copyright belongs to the original authors. If you have any concerns, please contact us promptly so we can handle them.

PS: If you need Python learning materials, you can get them yourself via the link below:

[http://note.youdao.com/noteshare?id=3054cce4add8a909e784ad934f956ce]

Many good novels can only be read online but not downloaded. This article teaches you how to crawl all the novels of a website.

Knowledge points:

  1. requests

  2. xpath

  3. The approach to crawling a whole site's novels

Development environment:

  1. Version: Anaconda 5.2.0 (Python 3.6.5)

  2. Editor: PyCharm

Third party Library:

  1. requests

  2. parsel
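
Both libraries are published on PyPI; if they are not already in your environment, pip install requests parsel will add them (requests ships with Anaconda, while parsel usually has to be installed separately).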

Perform web page analysis

Target site: http://www.shuquge.com (the original screenshot is omitted here)

  • Use of developer tools

    • Network panel

    • Elements panel

Crawling a single chapter of a novel

  • Use the requests library (request the web page data)

  • Encapsulate the page-request steps

  • Use CSS selectors (parse the web page data)

  • File operations (data persistence)

# -*- coding: utf-8 -*-
import requests
import parsel

"""Crawl a single chapter of a novel"""

# Request the web page data
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}

response = requests.get('http://www.shuquge.com/txt/8659/2324752.html', headers=headers)
# apparent_encoding guesses the encoding from the page content
response.encoding = response.apparent_encoding
html = response.text
print(html)


# Extract the content from the web page
sel = parsel.Selector(html)

title = sel.css('.content h1::text').extract_first()
contents = sel.css('#content::text').extract()
contents2 = []
for content in contents:
    # Strip the whitespace at both ends of each line
    contents2.append(content.strip())

print(contents)
print(contents2)

print("\n".join(contents2))

# Write the content to a text file
with open(title + '.txt', mode='w', encoding='utf-8') as f:
    f.write("\n".join(contents2))

Crawling a whole novel

  • Refactor the crawler

    There are many chapters to crawl; the simplest approach is to loop over them with a for loop.

  • Crawl the index page

    To crawl every chapter, you first need the URL of each chapter.

import requests
import parsel

"""Get the page source code"""

# Send the request while pretending to be a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}


def download_one_chapter(target_url):
    # The URL to request, e.g.
    # target_url = 'http://www.shuquge.com/txt/8659/2324753.html'
    # The response object returned by the server
    # (PyCharm tip: Ctrl + left-click jumps to a definition)
    response = requests.get(target_url, headers=headers)

    # Universal decoding: guess the encoding from the page content
    response.encoding = response.apparent_encoding

    # The text attribute holds the page content as a string
    # print(response.text)
    html = response.text

    """Get information out of the page source"""
    # Use parsel to turn the string into a selector object
    sel = parsel.Selector(html)

    # parsel exposes the same selector API as Scrapy
    # extract() pulls the contents out of the matched tags
    # ::text is a pseudo-element (selects text); the rest is a CSS selector (selects tags)
    # Extract the first match
    title = sel.css('.content h1::text').extract_first()
    # Extract all matches
    contents = sel.css('#content::text').extract()
    print(title)
    print(contents)

    """Data cleaning: remove the surrounding whitespace"""
    # contents1 = []
    # for content in contents:
    #     # Strip the whitespace at both ends
    #     contents1.append(content.strip())
    # print(contents1)
    # The same loop as a list comprehension
    contents1 = [content.strip() for content in contents]
    print(contents1)
    # Join the list into a single string
    text = '\n'.join(contents1)
    print(text)
    """Save the novel content"""
    # Open the file for writing
    file = open(title + '.txt', mode='w', encoding='utf-8')

    # Only strings can be written
    file.write(title + '\n')
    file.write(text)

    # Close the file
    file.close()


# Takes the table-of-contents URL of one novel
def get_book_links(book_url):
    response = requests.get(book_url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text
    sel = parsel.Selector(html)
    links = sel.css('dd a::attr(href)').extract()
    return links


# Download one whole novel
def get_one_book(book_url):
    links = get_book_links(book_url)
    for link in links:
        print('http://www.shuquge.com/txt/8659/' + link)
        download_one_chapter('http://www.shuquge.com/txt/8659/' + link)


if __name__ == '__main__':
    # target_url = 'http://www.shuquge.com/txt/8659/2324754.html'
    # download_one_chapter(target_url=target_url)  # keyword argument
    # To download a different novel, just change the URL
    book_url = 'http://www.shuquge.com/txt/8659/index.html'
    get_one_book(book_url)

Crawling every novel on the site

  • Crawl the index pages

    To crawl every novel on the site, you just need the index page of each book, as sketched below.
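
The original article stops here without code for this last step, so below is a minimal sketch of the idea that reuses get_book_links and download_one_chapter from the previous section. It assumes, based on the index URL used above, that every book's table of contents lives at http://www.shuquge.com/txt/<book_id>/index.html and that book IDs are roughly sequential integers; the ID range and the one-second delay are illustrative choices, not part of the original.

import time

def get_one_book_generic(book_url):
    # Like get_one_book above, but derives the chapter URL prefix from
    # book_url instead of hard-coding '/txt/8659/'
    base = book_url.rsplit('/', 1)[0] + '/'
    for link in get_book_links(book_url):
        download_one_chapter(base + link)


def get_all_books(first_id, last_id):
    # Assumption: book index pages follow the pattern seen above and
    # the IDs can simply be enumerated
    for book_id in range(first_id, last_id + 1):
        book_url = 'http://www.shuquge.com/txt/{}/index.html'.format(book_id)
        try:
            get_one_book_generic(book_url)
        except Exception as e:
            # Gaps in the ID range are expected; skip them and move on
            print('skipped', book_url, e)
        time.sleep(1)  # be polite to the server


# Example: get_all_books(8659, 8661)

If enumerating IDs misses books, an alternative is to collect the book index URLs from the site's category pages with the same requests + parsel pattern, after confirming the right selector in the Elements panel.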
