Fast Construction of Novel Search Station: 2. Content Page Analysis

Time:2019-9-11

Third-party libraries

  1. jsoup
  2. OkHttp

Elements to parse

  1. Previous chapter
  2. Next chapter
  3. Catalog
  4. Content


Table design

/**
 * content
 */
private String content;
@Field("content_title")
private String contentTitle;
@Field("chapter_url")
private String chapterUrl;
@Field("next_chapter_url")
private String nextChapterUrl;
@Field("last_chapter_url")
private String lastChapterUrl;
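The parsing code passes these fields to `document.selectFirst(...)`, so they appear to hold one CSS selector per element for each source site. The following is a minimal sketch of how such per-site selectors could be kept and looked up by host. `SiteSelectors`, `SiteRegistry`, the host name and every selector string are hypothetical, and the author's `getSite(url)` may well work differently.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical holder mirroring the five fields of the table design.
final class SiteSelectors {
    final String contentTitle;
    final String chapterUrl;
    final String nextChapterUrl;
    final String lastChapterUrl;
    final String content;

    SiteSelectors(String contentTitle, String chapterUrl, String nextChapterUrl,
                  String lastChapterUrl, String content) {
        this.contentTitle = contentTitle;
        this.chapterUrl = chapterUrl;
        this.nextChapterUrl = nextChapterUrl;
        this.lastChapterUrl = lastChapterUrl;
        this.content = content;
    }
}

// Hypothetical in-memory registry keyed by lower-cased host name.
final class SiteRegistry {
    private final Map<String, SiteSelectors> byHost = new HashMap<>();

    void register(String host, SiteSelectors selectors) {
        byHost.put(host.toLowerCase(Locale.ROOT), selectors);
    }

    // Returns the selectors for the URL's host, or empty if the site is unknown.
    // URI.create throws IllegalArgumentException for a malformed URL.
    Optional<SiteSelectors> lookup(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(byHost.get(host.toLowerCase(Locale.ROOT)));
    }
}

class SiteRegistryDemo {
    public static void main(String[] args) {
        SiteRegistry registry = new SiteRegistry();
        // All selector strings below are made up for illustration.
        registry.register("novel.example.com", new SiteSelectors(
                "h1.title", "a.catalog", "a.next", "a.prev", "div#content"));

        registry.lookup("https://novel.example.com/book/1/2.html")
                .ifPresent(s -> System.out.println(s.content));
    }
}
```

Keeping the selectors as data rather than code means a new site can be supported by adding a record instead of writing a new parser.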

Parsing code

public BookChapter content(String url) {
    BookChapter bookChapter = new BookChapter();

    // Look up the selectors configured for the site this URL belongs to.
    BookSite bookSite = getSite(url);
    try {
        Document document = download(url);

        // Chapter title
        Element titleElement = document.selectFirst(bookSite.getContentTitle());
        if (titleElement != null) {
            bookChapter.setName(titleElement.text());
        }

        // Link back to the catalog (chapter list)
        Element chapterElement = document.selectFirst(bookSite.getChapterUrl());
        if (chapterElement != null) {
            bookChapter.setChapterUrl(chapterElement.absUrl("href"));
        }

        // Link to the next chapter
        Element nextElement = document.selectFirst(bookSite.getNextChapterUrl());
        if (nextElement != null) {
            bookChapter.setNextChapterUrl(nextElement.absUrl("href"));
        }

        // Link to the previous ("last") chapter
        Element lastElement = document.selectFirst(bookSite.getLastChapterUrl());
        if (lastElement != null) {
            bookChapter.setLastChapterUrl(lastElement.absUrl("href"));
        }

        // Chapter body: strip links, scripts and styles before storing the HTML
        Element contentElement = document.selectFirst(bookSite.getContent());
        if (contentElement != null) {
            contentElement.select("a").remove();
            contentElement.select("script").remove();
            contentElement.select("style").remove();

            bookChapter.setContent(contentElement.html());
        }

    } catch (IOException e) {
        // On a download failure the chapter is returned with its fields unset.
        log.error(e.getMessage(), e);
    }

    return bookChapter;
}
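The links are read with `absUrl("href")`, which resolves a relative `href` against the document's base URI, so it only returns a usable URL if `download` supplies the page URL as that base. The sketch below uses the standard library's `java.net.URI.resolve` purely to illustrate the same kind of resolution for ordinary cases. It is not jsoup's implementation, and the URLs are made up.

```java
import java.net.URI;

class AbsUrlDemo {
    // Resolves an href found on a page against that page's own URL.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://novel.example.com/book/1/2.html";

        // Relative to the current directory
        System.out.println(resolve(page, "3.html"));
        // prints https://novel.example.com/book/1/3.html

        // Relative to the site root
        System.out.println(resolve(page, "/book/1/"));
        // prints https://novel.example.com/book/1/

        // Already absolute: returned unchanged
        System.out.println(resolve(page, "https://other.example.org/a.html"));
        // prints https://other.example.org/a.html
    }
}
```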

Final result

[Image: final result of the content page analysis]

Difficulty

The technique itself is not difficult; the hard part is day-to-day maintenance.
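One way to ease that maintenance, offered as a suggestion rather than something the original project is known to do, is to check each parsed chapter for unset fields. In the parsing code a selector that matches nothing leaves its field unset, so an empty field hints that a site may have changed its layout. `ParsedChapter` and `missingFields` are hypothetical stand-ins for the article's `BookChapter`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for BookChapter, holding only the parsed fields.
class ParsedChapter {
    String name;
    String chapterUrl;
    String nextChapterUrl;
    String lastChapterUrl;
    String content;
}

class SelectorHealthCheck {
    // Lists, in declaration order, the fields left null or blank after parsing.
    static List<String> missingFields(ParsedChapter c) {
        List<String> missing = new ArrayList<>();
        if (isBlank(c.name)) missing.add("name");
        if (isBlank(c.chapterUrl)) missing.add("chapterUrl");
        if (isBlank(c.nextChapterUrl)) missing.add("nextChapterUrl");
        if (isBlank(c.lastChapterUrl)) missing.add("lastChapterUrl");
        if (isBlank(c.content)) missing.add("content");
        return missing;
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    public static void main(String[] args) {
        ParsedChapter c = new ParsedChapter();
        c.name = "Chapter 1";
        c.content = "<p>Some text</p>";
        System.out.println(missingFields(c));
        // prints [chapterUrl, nextChapterUrl, lastChapterUrl]
    }
}
```

A missing field is not always an error: the first chapter has no previous link and the final chapter has no next link, so a check like this is best read across many pages of the same site rather than one.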
