A crawler example in the simplest case

Date: 2022-05-21

Document created: March 6, 2020

Last revised: none

Relevant software information:

Win 10, Python 3.8.2, PySimpleGUI 4.16.0, BS4 4.8.2

Note: feel free to quote or modify this article; just credit the source and author. The author does not guarantee that the content is correct, and accepts no responsibility for any consequences of using it.

Title: a crawler example in the simplest case

I often read novels online, but I regularly run into chapters that fail to download or are full of ads, so I wrote a simple crawler. It targets only novel sites that serve page data over plain requests: no headers, cookies, login, authentication, or VIP account required.

Requirements:

  1. The novel's table of contents fits on a single page; https://www.wfxs.org/html/2/ is used as the example
  2. Get the URL of each chapter
  3. Create a directory for the whole novel
  4. Create a text file for each chapter
  5. Use multithreading to finish faster
  6. Show a simple display of the threads in progress, the number completed, and the total

Thread description

After trying several approaches, I often found that the main program ended before the threads had all finished, so the total chapter count was always wrong. The fix is to maintain a record area managed by the program itself: each thread registers on start and deregisters on completion, which makes it unambiguous when all threads are done.
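The idea can be sketched in isolation. In this minimal, hypothetical version (the `Tracker` class and its names are illustrative, not the article's code), each worker registers a key in a dict before starting and deletes it when finished, so the main loop only exits once the dict is empty:

```python
import _thread
import time

class Tracker:
    """Minimal thread record area: workers register on start and
    deregister on completion, so the main loop can tell when all
    work has truly finished."""
    def __init__(self, max_threads=10):
        self.max = max_threads
        self.queue = {}               # key -> task description

    def get_a_key(self):
        # return the first unused key, or None if the record is full
        for i in range(self.max):
            if i not in self.queue:
                return i
        return None

    def start(self, task_id):
        key = self.get_a_key()
        self.queue[key] = task_id     # register before the thread starts
        _thread.start_new_thread(self.work, (key, task_id))

    def work(self, key, task_id):
        time.sleep(0.01)              # stand-in for downloading a chapter
        del self.queue[key]           # deregister: marks this thread done

tracker = Tracker()
for n in range(5):
    tracker.start(n)
while tracker.queue:                  # exits only when every worker is done
    time.sleep(0.01)
```

Because the workers themselves remove their keys, an empty `queue` is proof of completion, which is exactly the property the article's `WEB` class relies on.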

Output

[Screenshots of the running crawler GUI omitted]

Description and code

  1. Libraries used
from pathlib import Path
from bs4 import BeautifulSoup as bs
from copy import deepcopy
import urllib.request as request
import _thread
import PySimpleGUI as sg
  2. The web-page handling class
class WEB():
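The article never shows the class's initializer, yet the methods below reference `self.max`, `self.queue`, `self.buffer`, `self.temp`, `self.base`, `self.count`, and `self.not_allow`. A plausible `__init__`, with assumed default values, might look like this:

```python
class WEB():
    # Hypothetical initializer: the article uses these attributes but never
    # shows where they are set; every value below is an assumption.
    def __init__(self, base='https://www.wfxs.org', max_threads=10):
        self.base      = base          # site root, prepended to chapter links
        self.max       = max_threads   # upper limit on concurrent threads
        self.queue     = {}            # thread record area: key -> [chapter, url]
        self.buffer    = {}            # finished chapters waiting to be saved
        self.temp      = []            # failed chapters, to be retried later
        self.count     = 0             # number of chapters saved so far
        self.root      = None          # output dir, set by create_subdirectory
        self.not_allow = '\\/:*?"<>|'  # characters illegal in Windows file names
```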
  • Create directory: if the directory already exists, add a number to distinguish it
def create_subdirectory(self, name):
        i, path = 1, Path(name)
        while path.is_dir():
            i += 1
            path = Path(name+str(i))
        self.root = path
        path.mkdir()
  • Read the content of the chapter

    The content sits directly inside head; reading head.text straight away would also pick up the text of the child tags, so first remove the other child tags, then read head.text

    <html…><head><title>…</title><meta…/>…chapter content text</head>
def chapter_content(self, html):
        # head(True) finds all child *tags* (not strings); extracting them
        # leaves only the chapter's bare text behind. Iterating over
        # html.head directly while extracting would skip elements.
        for tag in html.head(True): tag.extract()
        chapter_text = self.form(html.head.text).strip()
        return chapter_text
  • Handle the \xa0 (non-breaking space) characters in the text and remove redundant blank lines
def form(self, text):
        text = text.replace('\xA0', '')
        while '\n\n' in text:
            text = text.replace('\n\n', '\n')
        return text
  • Get a new, unused thread-record key
def get_a_key(self):
        for i in range(self.max):
            if i not in self.queue:
                return i
        return None
  • Read the author name from the catalogue page

    find_all with tag meta and attribute name="author"; read content, split the string from the right, and take the rightmost piece

    <meta content="Ye Tiannan" name="author" />
def get_auther(self, html):
        return html.find_all(
            name ='meta', attrs={'name':'author'}
            )[0].get('content').rsplit(sep=None, maxsplit=1)[-1]
  • Read all chapter names and links from the catalogue page

    find_all with tag <dd>; the chapter name is the text of tag <a>, the link is the href value of tag <a>; if the chapter name is an empty string, skip it

    <dd><a href="/html/2/3063.html">Chapter 15: Qingyuan Clinic</a></dd>
def get_chapters(self, html):
        chapters = html.find_all(name='dd')
        result = []
        for chapter in chapters:
            title = chapter.a.text.split('(')[0]
            if title != '':
                link = self.base + chapter.a.get('href')
                result.append([self.valid(title), link])
        return result
  • Read the book's introduction from the catalogue page

    The introduction is in the tag <p> with class="tl pd8 pd10", in the text after <br>

    <p class="tl pd8 pd10">Author: the novel "Top-Quality Natural Medicine" written by Ye Tiannan…<br/>
def get_description(self, html):
        return self.form(html.find(
            name='p',
            attrs={'class':"tl pd8 pd10"}).br.text)
  • Read the book title from the catalogue page

    The book title is the text in tag <h1>; because a directory will be created from the title, illegal characters must be removed from it

    <h1 class="tc h10">Top-Quality Natural Medicine</h1>
def get_name(self, html):
        return self.valid(html.h1.text)
  • Load the title, author, introduction, and chapter names and links from the catalogue page
def load_catalog(self, url):
        status, html     = self.load_html(url)
        if status != 200:
            return None, None, None, None, None
        name        = self.get_name(html)
        auther      = self.get_auther(html)
        description = self.get_description(html)
        chapters    = self.get_chapters(html)
        return status, name, auther, description, chapters
  • Load the chapter's novel text and put it into the buffer for saving; whenever the page-load status is not 200, remove the chapter from the thread record area so it can be queued again later
def load_chapter(self, key, chapter, url):
        status, html = self.load_html(url)
        if status != 200:
            self.temp.append([chapter, url])
            del self.queue[key]
        else:
            chapter_text = self.chapter_content(html)
            self.buffer[key] = [chapter, chapter_text]
        return
  • Fetch the HTML for a URL; if there is an error or the status code is not 200, return None to signal an error, and the page will be read again later

    The page's encoding is big5; decoding errors are ignored, so undecodable characters are skipped

    <meta http-equiv="Content-Type" content="text/html; charset=big5" />
def load_html(self, url):
        try:
            response = request.urlopen(url)
            status   = response.getcode()
        except:
            return None, ''
        else:
            if status == 200:
                data = str(response.read(), encoding='big5', errors='ignore')
                html = bs(data, features='html.parser')
                return status, html
            else:
                return None, ''
  • Delete thread record
def queue_delete(self, key):
        del self.queue[key]
  • Add the thread to the record area and start it; the commented-out line is the single-threaded (non-thread) variant
def queue_insert(self, chapter, url):
        key = self.get_a_key()
        self.queue[key] = [chapter, url]
        # self.load_chapter(key, chapter, url)
        _thread.start_new_thread(self.load_chapter, (key, chapter, url))
  • Check whether the thread record has reached the upper limit; this caps the number of concurrent threads, and no new thread is started while it is full
def queue_is_full(self):
        return len(self.queue) == self.max
  • Check whether the thread record is empty to confirm that all threads have completed
def queue_not_empty(self):
        return len(self.queue) != 0
  • Store the novel's description file, containing the title, author, and introduction; if the file already exists, append a number to distinguish it
def save_book(self, name, auther, description):
        # body not shown in the article; this reconstruction follows the
        # same numbered-file pattern as save_chapter below
        i, path = 1, self.root.joinpath(name+'.txt')
        while path.is_file():
            i += 1
            path = self.root.joinpath(name+str(i)+'.txt')
        with open(path, 'wt', encoding='utf-8') as f:
            f.write(name+'\n'+auther+'\n'+description)
  • Store the chapter text of the novel; if the file name already exists, append a number to distinguish it; then delete the chapter from the save buffer and the thread record
def save_chapter(self):
        buffer = deepcopy(self.buffer)
        for key, value in buffer.items():
            i, path = 1, self.root.joinpath(value[0]+'.txt')
            while path.is_file():
                i += 1
                path = self.root.joinpath(value[0]+str(i)+'.txt')
            with open(path, 'wt', encoding='utf-8') as f:
                f.write(value[1])
            self.count += 1
            del self.buffer[key]
            del self.queue[key]
  • Remove illegal characters from the file name to avoid errors when saving
def valid(self, text):
        return ''.join((char for char in text if char not in self.not_allow))
  • Main program
    • If the catalogue cannot be loaded, end the program
    • Save the novel's description file
    • Build a simple GUI that displays progress, can be closed at any time, and confirms that all threads have completed
url = 'https://www.wfxs.org/html/2/'  # novel catalogue URL
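The article stops at the bullet points above, so here is one way the main program might be assembled. This is a hypothetical reconstruction: `WEB` is the class described in this article, while the window layout, element key, and polling timeout are assumptions.

```python
# Hypothetical main loop, reconstructed from the bullet points above.
web = WEB()
status, name, auther, description, chapters = web.load_catalog(url)
if status != 200:
    raise SystemExit('catalogue failed to load')   # end the program

web.create_subdirectory(name)              # numbered directory for the novel
web.save_book(name, auther, description)   # title, author, introduction

layout = [[sg.Text('', size=(40, 1), key='-STATUS-')], [sg.Button('Exit')]]
window = sg.Window('Novel crawler', layout)
total = len(chapters)

# Keep dispatching threads until every chapter is fetched and saved.
while chapters or web.queue_not_empty():
    event, values = window.read(timeout=100)
    if event in (sg.WIN_CLOSED, 'Exit'):
        break                              # the user may quit at any time
    while chapters and not web.queue_is_full():
        title, link = chapters.pop(0)
        web.queue_insert(title, link)      # register the thread and start it
    while web.temp:                        # re-queue chapters that failed
        chapters.append(web.temp.pop())
    web.save_chapter()                     # flush finished chapters to disk
    window['-STATUS-'].update(
        f'{len(web.queue)} running, {web.count}/{total} done')
window.close()
```

The loop condition pairs the pending chapter list with `queue_not_empty()`, which is exactly the "record area" check from the threading notes: the program only ends on its own once every thread has deregistered.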

This work is licensed under a CC license; any reprint must credit the author and link to the original article.

Jason Yang