How to use Node.js to crawl any web page's resources and output them to a local PDF file

Time: 2019-10-8

Requirements:

  • Use Node.js to crawl web page resources, with an out-of-the-box configuration
  • Output the crawled web page content in PDF format

If you’re a developer, read on. Otherwise, please go directly to my GitHub repository and follow the documentation there.

Repository address: includes the accompanying documentation and source code

Technologies used for this requirement: Node.js and Puppeteer

  • Puppeteer official address: puppeteer address
  • Node.js Official Address: Link Description
  • Puppeteer is an official Node library that controls headless Chrome through the DevTools protocol. With Puppeteer’s API you can drive Chrome directly to simulate most user actions, either for UI tests or to visit pages as a crawler and collect data.
  • Environment and installation
  • Puppeteer itself requires Node 6.4 or above, but for the extremely convenient async/await syntax, Node 7.6 or above is recommended. In addition, headless Chrome has fairly high requirements on the versions of the server’s system libraries: the dependencies that ship with CentOS 6 are too old, so getting headless Chrome to run on it is difficult, and upgrading those dependencies can cause all sorts of server problems (including but not limited to breaking SSH), so a server with a newer system is the better choice. (The latest version of Node.js is recommended; a small runtime check is sketched after this list.)
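As a quick aside, here is a minimal sketch (not part of the original setup, purely illustrative of the version requirement described above) that verifies at runtime that the running Node version supports async/await before loading Puppeteer:

// A minimal sketch: make sure the running Node version is at least 7.6,
// the async/await threshold mentioned above. Purely illustrative.
const [major, minor] = process.version.replace('v', '').split('.').map(Number);
if (major < 7 || (major === 7 && minor < 6)) {
  console.error(`Node ${process.version} is too old for async/await; please upgrade.`);
  process.exit(1);
}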

Let’s warm up with a small test and crawl some resources.

const puppeteer = require('puppeteer'); // import the dependency
(async () => { // use an async function for clean asynchronous code
  const browser = await puppeteer.launch(); // open a new browser
  const page = await browser.newPage(); // open a new page
  await page.goto('https://www.jd.com/'); // go to the page at this url
  const result = await page.evaluate(() => { // this result array will contain the src addresses of all the images
    let arr = []; // write the processing logic inside the arrow function
    const imgs = document.querySelectorAll('img');
    imgs.forEach(function (item) {
      arr.push(item.src)
    })
    return arr
  });
  // at this point 'result' is the crawled data, which can be saved with the 'fs' module
  await browser.close(); // close the headless browser when we are done
})()
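The last comment mentions saving the result with the fs module. As a minimal sketch (the file name images.json is only an example, not part of the article), the following lines could be placed inside the async function above, where result is in scope:

// Persist the crawled image addresses to disk; 'images.json' is an illustrative name.
const fs = require('fs');
fs.writeFileSync('./images.json', JSON.stringify(result, null, 2));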

Copy the code above and run it with the command `node filename`. The Puppeteer package actually opens another browser for us, reopens the web page, and fetches its data.

  • The example above only crawls the images on the home page. Suppose my needs expand further: now I want to follow the <a> tags on the home page, crawl the title of every linked page, and finally put them all into an array.
  • Our async function is divided into five steps. Only puppeteer.launch(), browser.newPage(), and browser.close() are fixed steps.
  • page.goto specifies which web page we want to crawl data from. You can change the URL inside it, and you can also call this method multiple times.
  • The page.evaluate function holds the logic for processing the data we want to extract from the page we have entered.
  • Both page.goto and page.evaluate can be called multiple times inside the async function, which means we can enter one page, process its logic, and then call page.goto again to move on to the next page.

Note that all of this logic runs in a browser that the Puppeteer package opens for us where we cannot see it and then processes the logic, which is why we finally call browser.close() to shut that browser down.

Now let’s improve the previous code and crawl the expanded set of resources.

const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.jd.com/');
  const hrefArr = await page.evaluate(() => {
    let arr = [];
    const aNodes = document.querySelectorAll('.cate_menu_lk');
    aNodes.forEach(function (item) {
      arr.push(item.href)
    })
    return arr
  });
  let arr = [];
  for (let i = 0; i < hrefArr.length; i++) {
    const url = hrefArr[i];
    console.log(url) // you can print the url here
    await page.goto(url);
    const result = await page.evaluate(() => { // console.log inside this callback has no visible effect
      return $('title').text(); // return the title text of each page
    });
    arr.push(result) // push the corresponding value into the array on every iteration
  }
  console.log(arr) // the collected data can be saved locally with Node.js's fs module
  await browser.close()
})()

A big pitfall: console.log inside the page.evaluate callback above does not print to our terminal, and external variables cannot be accessed inside it; values can only be returned.
Before using a selector, you must first test in the console of the target page whether it can actually select the DOM nodes you want. For example, JD cannot be handled with querySelector here;
because the page itself uses jQuery, we can use jQuery selectors instead. In short, any selector the site’s own developers can use, we can use as well; anything else will not work.
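For reference, here is a minimal sketch (not from the original article) of two common workarounds: forwarding the page’s console output to our terminal with page.on('console', ...), and passing an external value into page.evaluate as an explicit argument:

const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Forward console messages from inside the page to our terminal,
  // since console.log inside page.evaluate is otherwise invisible to us.
  page.on('console', msg => console.log('PAGE LOG:', msg.text()));
  await page.goto('https://www.jd.com/');
  // External variables are not visible inside the callback,
  // so pass them explicitly as extra arguments to page.evaluate.
  const selector = 'img';
  const count = await page.evaluate((sel) => {
    console.log('running inside the page'); // surfaces via the listener above
    return document.querySelectorAll(sel).length;
  }, selector);
  console.log(`found ${count} elements for "${selector}"`);
  await browser.close();
})()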

Next, let’s crawl the Node.js official home page and generate a PDF directly.

Whether or not you know Node.js or Puppeteer crawlers, anyone can do this: please read this section carefully and execute each step in order.

Requirement of this project: given a web address, crawl its content and then output the PDF document we want. Please note that it should be a high-quality PDF document.

  • Step 1: install Node.js. The Node.js Chinese official site, http://nodejs.cn/download/, is recommended; download the package for your operating system.
  • Step 2: after downloading and installing Node.js, start the Windows command-line tool (open the system search on Windows, type cmd, and press Enter).
  • Step 3: check whether the environment variables were configured automatically. Enter node -v in the command-line tool; if a version such as v10.*** appears, Node.js was installed successfully.
  • Step 4: if entering node -v in step 3 still shows no version string, restart the computer.
  • Step 5: open the project folder, open the command-line tool there (on Windows you can type cmd directly into the folder’s address bar), and run npm i cnpm nodemon -g.
  • Step 6: download the Puppeteer crawler package. After step 5, you can install it with cnpm i puppeteer --save.
  • Step 7: after the download in step 6 finishes, open the project’s url.js and replace the default address with the web address you need to crawl (the default is http://nodejs.cn/). A sketch of what url.js typically looks like follows this list.
  • Step 8: enter nodemon index.js on the command line to crawl the corresponding content and automatically output it to the index.pdf file in the current folder.
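For orientation, url.js in this kind of setup is usually just a one-line module whose value index.js reads via require('./url'); the exact contents of the repository’s file may differ, so treat this as a sketch:

// url.js - a minimal sketch; the real file in the repository may differ.
module.exports = 'http://nodejs.cn/';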

TIPS: the design idea of this project is one web page per PDF file. So after crawling a single page, copy index.pdf out of the folder, then change the URL and crawl again to generate a new PDF file. Of course, you can also write a loop to crawl multiple pages and generate multiple PDF files in one run.

For pages that lazy-load their images, the crawled content will be the content as it appears in its loading state. Pages with anti-crawler mechanisms will also cause problems for the crawler, but this approach works for most websites.
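One common workaround for the lazy-loading problem (not part of this project’s index.js, just a hedged sketch) is to scroll the page to the bottom before exporting, so the lazy-loaded images get a chance to load:

// Assumes an already-created Puppeteer 'page'; scrolls down step by step
// so lazy-loaded images are triggered before the PDF is generated.
async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      let total = 0;
      const step = 300;
      const timer = setInterval(() => {
        window.scrollBy(0, step);
        total += step;
        if (total >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}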

const puppeteer = require('puppeteer');
const url = require('./url');
(async () => {
  const browser = await puppeteer.launch({ headless: true })
  const page = await browser.newPage()
  // open the page to be crawled
  await page.goto(url, { waitUntil: 'networkidle0' })
  // choose the path of the PDF file to output; the crawled content is exported to this PDF, and an existing file at this path will be overwritten
  let pdfFilePath = './index.pdf';
  // configure the output: A4 paper is chosen here so the PDF is convenient to print
  await page.pdf({
    path: pdfFilePath,
    format: 'A4',
    scale: 1,
    printBackground: true,
    landscape: false,
    displayHeaderFooter: false
  });
  await browser.close()
})()

Document Deconstruction Design

Data is very precious in this era. Following the design logic of a web page, you can pick out the addresses of specific href attributes and either fetch those resources directly, or visit them with the page.goto method again and then call page.evaluate() to process the logic, or output the corresponding PDF files; of course, you can also output multiple PDF files in one go, as sketched below.~
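As a hedged illustration of outputting multiple PDF files in one run (the URL list and file names below are made up, not part of the project), a loop version might look roughly like this:

const puppeteer = require('puppeteer');
// Hypothetical list of pages; in the real project a single address lives in url.js.
const urls = ['http://nodejs.cn/', 'http://nodejs.cn/api/'];
(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  for (let i = 0; i < urls.length; i++) {
    await page.goto(urls[i], { waitUntil: 'networkidle0' });
    // One PDF per page, e.g. index-0.pdf, index-1.pdf (naming is illustrative).
    await page.pdf({ path: `./index-${i}.pdf`, format: 'A4', printBackground: true });
  }
  await browser.close();
})()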

There is not much more to introduce here. After all, Node.js is enormously capable and may well be able to do anything in the future. For such a short, high-quality tutorial, please bookmark it
or share it with your friends, thank you.

That is the whole content of this article. I hope it is helpful to your study, and I hope you will continue to support developpaer.