Data acquisition practice (IV) — download answers to linear algebra exercises

Time:2022-1-28

1. General

Some time ago, I was reading the third edition of the linear algebra textbook “how to learn linear algebra”, which many people recommend. Every chapter of this edition comes with a lot of exercises.

The official website provides answers to the exercises chapter by chapter, but there are two problems: the site is hosted abroad, so access is slow and unreliable, and the answers are interleaved with advertisements, which makes them hard to read.
So I decided to crawl the answers and turn them into a PDF that is convenient to read and does not depend on the network.

2. Acquisition process

Fetching the web pages themselves is easy and needs no explanation. What differs from the earlier data acquisition exercises is:

  1. The pages contain mathematical formulas, which only display correctly after front-end JS has converted them. Fetching the raw HTML source is therefore not enough; you need the full set of HTML elements as rendered in the browser.
  2. After loading the page, remove the unnecessary elements (header, footer, menu, advertisements, etc.) before saving it, so that only the relevant part of the page is collected.
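
The first point could be sketched roughly as follows. This is a minimal sketch and not the code used for this article: the function name `fetchRenderedHtml` and the choice of `networkidle0` as the wait condition are assumptions, and `page` is assumed to be a Puppeteer page created elsewhere. Waiting for the network to go idle is only a heuristic; a page that renders its formulas late may need an additional wait.

```javascript
// Minimal sketch. Assumption: `page` is a Puppeteer Page created elsewhere,
// for example with `const page = await browser.newPage()`.
async function fetchRenderedHtml(page, url) {
  // Wait until the network has gone idle, so that front-end scripts
  // have had a chance to run and render the formulas.
  await page.goto(url, { waitUntil: "networkidle0" });
  // page.content() serializes the current DOM, including elements
  // that were created or changed by JS after the initial HTML arrived.
  return page.content();
}
```

The function only depends on the `goto` and `content` methods of the page object, so it can be tried out with a stub before pointing it at a real site.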

[Figure: image.png]
The steps on a green background are done with Puppeteer.
The steps on a blue background are done after collection, with PDF-related command-line tools.

2.1 Remove elements from the web page (green background)

await page.evaluate(() => {
  // CSS selectors of the elements that are not part of the answers
  const domToRemove = [
    "#top-bar-wrap",
    "#site-header",
    "#main > .page-header",
    "#content > article > ul",
    "#content > article > .entry-content > center",
    "#content > article > .entry-content > .google-auto-placed",
    "#content > article > .entry-content > #amzn_assoc_ad_div_adunit0_0",
    "#content > article > .entry-content > #related_posts",
    ".post-tags",
    "nav",
    "section",
    ".addthis-smartlayers",
    "#right-sidebar",
    "footer",
  ];
  for (const selector of domToRemove) {
    for (const dom of document.querySelectorAll(selector)) {
      // Key step: detach the matched element from the DOM tree
      dom.remove();
    }
  }
});

// Save the cleaned page as an HTML file so that it can be converted to PDF later
await savePage(
  page,
  "./output/linearAlgebraExercises",
  exercies[i] + ".html"
);
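
The `savePage` helper is not shown in this article. A hypothetical stand-in, assuming it simply writes the current DOM to a file, might look like the sketch below; the returned path and the directory creation are my own assumptions rather than the behaviour of the real helper.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical stand-in for the savePage helper used above. The real helper
// is not shown in the article, so everything here is an assumption.
async function savePage(page, dir, fileName) {
  // Serialize the DOM as it is now, i.e. after the unwanted elements were removed.
  const html = await page.content();
  // Create the output directory if it does not exist yet.
  fs.mkdirSync(dir, { recursive: true });
  const target = path.join(dir, fileName);
  fs.writeFileSync(target, html, "utf8");
  return target;
}
```

Because the page is serialized after `page.evaluate` has finished, the saved file contains only the cleaned-up content.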

2.2 Generate the PDF document (blue background)

There are many tools for converting HTML files to PDF, including plenty of Python and Node.js libraries; pick whichever you are familiar with.
I use pandoc. The conversion quality is quite good, and the mathematical formulas are displayed correctly.

# Example command for converting HTML to PDF
pandoc input.html -t latex -o output.pdf
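
With one HTML file per chapter, the same command can be run in a loop. The following is a minimal sketch under my own assumptions: the function names `pandocArgs` and `convertAll` are hypothetical, each PDF is written next to its HTML file, and pandoc plus a LaTeX engine are assumed to be installed and on the PATH.

```javascript
const { execFileSync } = require("child_process");
const path = require("path");

// Build the pandoc arguments for one HTML file, mirroring the command above.
// Assumption: the PDF is written next to the HTML file, with the same base name.
function pandocArgs(htmlFile) {
  const pdfFile = path.join(
    path.dirname(htmlFile),
    path.basename(htmlFile, ".html") + ".pdf"
  );
  return [htmlFile, "-t", "latex", "-o", pdfFile];
}

// Convert every given HTML file, one pandoc invocation per file.
function convertAll(htmlFiles) {
  for (const file of htmlFiles) {
    execFileSync("pandoc", pandocArgs(file), { stdio: "inherit" });
  }
}
```

Passing the arguments as an array to `execFileSync` avoids shell quoting problems if a file name contains spaces.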

The result looks like this:
[Figure: image.png]

There are also many small tools for merging multiple PDFs. I use pdftk.

# Example command for merging PDFs
pdftk input1.pdf input2.pdf input3.pdf cat output output.pdf
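
When there are many chapter PDFs, the order of the inputs matters, because pdftk concatenates them in the order given. Below is a minimal sketch of building the merge command from a list of files; the function names are hypothetical, and it assumes the file names contain the chapter number so that a numeric-aware sort puts them in reading order.

```javascript
const { execFileSync } = require("child_process");

// Sort file names so that "2.pdf" comes before "10.pdf".
function sortNaturally(files) {
  return [...files].sort((a, b) =>
    a.localeCompare(b, undefined, { numeric: true })
  );
}

// Build the pdftk arguments, mirroring the command above.
function pdftkMergeArgs(pdfFiles, outputFile) {
  return [...sortNaturally(pdfFiles), "cat", "output", outputFile];
}

// Merge the PDFs; assumes pdftk is installed and on the PATH.
function mergePdfs(pdfFiles, outputFile) {
  execFileSync("pdftk", pdftkMergeArgs(pdfFiles, outputFile), {
    stdio: "inherit",
  });
}
```

A plain string sort would place `10.pdf` before `2.pdf`, which is why the sketch uses a numeric-aware comparison.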

3. Summary

The whole process is very simple. The only technical point worth mentioning is probably that the unnecessary parts are removed on the fly while the web pages are being fetched.

Simple as it is, the process is complete from end to end. With the details polished, it is essentially a pipeline for producing e-books automatically.

4. Precautions

The crawled data is for research and learning only. The code in this article follows these rules:

  1. If the website has a robots.txt, follow its conventions
  2. Crawl at a rate similar to normal browsing, so as not to increase the load on the server
  3. Only fetch fully public data, and never touch data that may involve privacy
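
The first two rules could be sketched as follows. This is a deliberately simplified illustration and not the code used for this article: `isDisallowed` and `sleep` are hypothetical helpers, and the robots.txt check only understands `User-agent: *` groups with plain `Disallow:` path prefixes (no `Allow` rules, no wildcards, no groups that list several user agents). A real crawler should use a complete robots.txt parser.

```javascript
// Simplified robots.txt check: only handles "User-agent: *" groups and
// plain "Disallow:" path prefixes. Allow rules and wildcards are ignored.
function isDisallowed(robotsTxt, urlPath) {
  let appliesToAll = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    // Drop comments and surrounding whitespace.
    const line = rawLine.split("#")[0].trim();
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      appliesToAll = value === "*";
    } else if (field === "disallow" && appliesToAll && value !== "") {
      if (urlPath.startsWith(value)) return true;
    }
  }
  return false;
}

// Pause between requests so the crawl stays close to a normal browsing rate.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

In a crawl loop, one would skip any URL for which `isDisallowed` returns true and call `await sleep(...)` with a delay of a few seconds between page loads.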