Nightmare: a crawler framework based on Electron

Time:2020-7-13

Author: William
This is an original article; please credit the author and source when reposting.

Electron lets you build desktop applications in pure JavaScript by calling Chrome's rich native interfaces. You can think of it as a variant of Node.js that focuses on desktop applications instead of web servers. Its browser-based application model makes all kinds of responsive interaction very convenient. Next, we introduce Nightmare, a framework built on Electron.

Nightmare is an Electron-based framework for automated web testing and crawling (strictly speaking, crawling is a use we have added on top of the framework ourselves XD), since it offers the same kind of automated testing as PhantomJS. It can simulate user behaviour on a page to trigger asynchronous data loading, or visit a URL directly, like the request library, to grab data. You can also set delays on the page, and it is easy to trigger a script or a behaviour by hand (note: if the page checks an event's isTrusted property, the simulated event will not trigger the handler).

Using Nightmare

To speed up npm downloads, you can use Taobao's mirror registry. Install Nightmare directly with npm (the Electron binary dependency is fairly large, so the installation may take a while).

Write a simple startup script, app.js:

const Nightmare = require('nightmare')
const nightmare = new Nightmare({
  show: true,
  openDevTools: {
    mode: 'detach'
  }
})

nightmare.goto('https://www.hujiang.com')
  .evaluate(function () {
    // This function runs in the browser, so window / document are available here; evaluate() returns a promise
    console.log('hello nightmare')
    console.log('5 second close window')
  })
  .wait(5000)
  .end()
  .then(() => {
    console.log('close nightmare')
  })

This script prints hello nightmare in the debugging console of the opened browser window, closes the window after five seconds, and then prints close nightmare in the terminal running the script.

How Nightmare works

Nightmare combines the browser environment provided by Electron with the I/O capabilities of Node.js, which makes it very convenient to implement a crawler application. Nightmare's official website has a more detailed introduction.

Common operations:

  • Browser events: goto, back, forward, refresh
  • User events: click, mousedown, mouseup, mouseover, type, insert, select, check, uncheck, scrollTo
  • Script injection: .js and .css files can be injected into the page. The principle is similar to Tampermonkey user scripts: you write your own JS code and inject it very conveniently
  • wait: waits either for a delay time or for a DOM element to appear
  • evaluate: runs a script function in the browser's environment and returns a promise
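
These operations are meant to be chained. The following is a minimal sketch, assuming an existing Nightmare instance is passed in as nightmare; the helper name grabInnerHTML and its parameters are hypothetical and only serve as illustration:

```javascript
// Hypothetical helper: open a page, wait for an element, return its HTML.
// `nightmare` is assumed to be an instance created with require('nightmare')().
function grabInnerHTML (nightmare, url, selector) {
  return nightmare
    .goto(url) // browser event: navigate to the page
    .wait(selector) // wait until the DOM element appears
    .evaluate(function (sel) {
      // Runs inside the page, so document is available here.
      // Extra arguments given to evaluate() arrive as parameters.
      const el = document.querySelector(sel)
      return el ? el.innerHTML : null
    }, selector)
}
```

Passing the selector as an argument to evaluate matters because the function body executes in the page and cannot see variables from the Node.js side.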

A complete Nightmare crawler application

We use crawling Zhihu topics as the application scenario. The data we need is topic information with the following fields: topic name / topic avatar / number of followers / number of questions / number of top answers. The last three are only available from the parent topic's page, so we must crawl the parent topic first and then its child topics, and the child-topic data is only loaded on hover. With request / superagent you would have to work out the topic ID and send it in an HTTP request to get this data, whereas Nightmare can simply trigger the hover event to load it.

The first step is to set the depth of topics to crawl. By default, the root is Zhihu's root topic:

// interactive and util are modules of the demo project (not shown in this article)
/**
 * Crawl the topic pages from the given URL down to the given depth and save the result to the specified file
 * @param {string} rootUrl - the URL of the top-level topic
 * @param {number} deep - the depth of pages to crawl
 * @param {string} toFile - the file name to save to
 * @param {function} cb - callback invoked after completion
 */
async function crawlerTopicsFromRoot (rootUrl, deep, toFile, cb) {
  rootUrl = rootUrl || 'https://www.zhihu.com/topic/19776749/hot'
  toFile = toFile || './topicsTree.json'
  console.time()
  const result = await interactive
    .iAllTopics(rootUrl, deep)
  console.timeEnd()
  util.writeJSONToFile(result['topics'], toFile, cb)
}

crawlerTopicsFromRoot('', 2, '', _ => {
  console.log('finished crawling')
})
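
util.writeJSONToFile is not shown in this article. A minimal sketch of what such a helper might look like, assuming it takes the data, a file path, and a Node-style callback (the body below is an assumption, not the demo's actual code):

```javascript
const fs = require('fs')

// Hypothetical stand-in for util.writeJSONToFile: serialise data as
// pretty-printed JSON and call cb(err) once the write has finished.
function writeJSONToFile (data, toFile, cb) {
  let text
  try {
    text = JSON.stringify(data, null, 2)
  } catch (err) {
    // e.g. circular references cannot be serialised
    return cb(err)
  }
  fs.writeFile(toFile, text, 'utf8', cb)
}
```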

Next comes the core interaction function. Before you start crawling, check Zhihu's robots.txt file to see what may be crawled and at what interval; otherwise it is easy to run into timeout errors.

// Excerpt from inside the crawl loop of an async function; queue, childTopics and i come from the surrounding code
// Get the information of the corresponding topic
const cntObj = queue.shift()
const url = `https://www.zhihu.com/topic/${cntObj['id']}/hot`
const topicOriginalInfo = await nightmare
  .goto(url)
  .wait('.zu-main-sidebar') // wait for the element to appear
  .evaluate(function () {
    // Grab this part of the page
    return document.querySelector('.zu-main-sidebar').innerHTML
  })
// ... several steps omitted ...
// Get the numeric information of its child topics
const hoverElement = `a.zm-item-tag[href$='${childTopics[i]['id']}']`
const waitElement = `.avatar-link[href$='${childTopics[i]['id']}']`
const topicAttached = await nightmare
  .mouseover(hoverElement) // trigger the hover event
  .wait(waitElement)
  .evaluate(function () {
    return document.querySelector('.zh-profile-card').innerHTML
  })
  .then(val => {
    return parseRule.crawlerTopicNumbericalAttr(val)
  })
  .catch(error => {
    console.error(error)
  })
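
The queue.shift() call suggests a breadth-first traversal, but interactive.iAllTopics itself is not shown in this article. The following is only a sketch of how a depth-limited traversal with a pause between page visits could be organised. The names crawlBreadthFirst, fetchTopic and delayMs are hypothetical; fetchTopic(id) stands for the Nightmare steps for one topic and is assumed to resolve to an object with a childTopics array.

```javascript
// Resolve after ms milliseconds; used to space out page visits.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Hypothetical breadth-first crawl that visits at most `deep` levels below the root.
async function crawlBreadthFirst (rootId, deep, fetchTopic, delayMs = 1000) {
  const queue = [{ id: rootId, fid: '-1', depth: 0 }]
  const seen = new Set([rootId]) // a topic may be reachable from several parents
  const topics = []

  while (queue.length > 0) {
    const current = queue.shift()
    const info = await fetchTopic(current.id)
    const children = info.childTopics || []
    topics.push({ id: current.id, fid: current.fid, cids: children.map(c => c.id) })

    // Only enqueue children while we are above the depth limit.
    if (current.depth < deep) {
      for (const child of children) {
        if (!seen.has(child.id)) {
          seen.add(child.id)
          queue.push({ id: child.id, fid: current.id, depth: current.depth + 1 })
        }
      }
    }
    // Pause between requests to respect the site's crawl interval.
    if (queue.length > 0) await sleep(delayMs)
  }
  return topics
}
```

The seen set is there because the topics form a graph rather than a strict tree, so the same child can appear under more than one parent.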

Cheerio is a jQuery-style selector library. It can be applied to an HTML fragment to obtain the corresponding DOM elements, on which you can then perform DOM operations such as adding, deleting, modifying and querying. Here it is mainly used to query the DOM and extract data.

const cheerio = require('cheerio')
/*** capture the numeric attributes: number of questions / top answers / followers */
const crawlerTopicNumbericalAttr = function (html) {
  const $ = cheerio.load(html)
  const keys = ['questions', 'top-answers', 'followers']
  const obj = {}
  obj['avatar'] = $('.Avatar.Avatar--xs').attr('src')
  keys.forEach(key => {
      obj[key] = ($(`div.meta a.item[href$=${key}] .value`).text() || '').trim()
  })
  return obj
}
/*** capture topic information*/
const crawlerTopics = function (html) {
  const $ = cheerio.load(html)
  const  obj = {}
  const childTopics = crawlerAttachTopic($, '.child-topic')  
  obj['desc'] = $('div.zm-editable-content').text() || ''
  if (childTopics.length > 0) {
      obj['childTopics'] = childTopics
  }
  return obj
}

/*** capture the information ID / name of the sub topic*/
const crawlerAttachTopic = function ($, selector) {
  const topicsSet = []
  $(selector).find('.zm-item-tag').each((index, elm) => {
      const self = $(elm)
      const topic = {}
      topic['id'] = self.attr('data-token')
      topic['value'] = self.text().trim()
      topicsSet.push(topic)
  })
  return topicsSet
}

That completes a simple crawler. In the end we get data in the following format:

{
  "value": "rootValue",
  "id": "19776749",
  "fatherId": "-1",
  "desc": "All Zhihu topics form a rooted, acyclic directed graph through parent-child relationships. The \"root topic\" is the parent topic at the top of all topics. Its top answers are Zhihu's Top 1000 highest-voted answers. Please do not bind a question directly to the \"root topic\", as that would make the topic too broad.",
  "cids": [
      "19778317",
      "19776751",
      "19778298",
      "19618774",
      "19778287",
      "19560891"
  ]
},
{
  "id": "19778317",
  "value": "life, art, culture and activities",
  "avatar": "https://pic4.zhimg.com/6df49c633_xs.jpg",
  "questions": "3.7M",
  "top-answers": "1000",
  "followers": "91K",
  "fid": "19776749",
  "desc": "Topics centered on collective human behavior and human social civilization; the content mainly covers four areas: life, art, culture and activities.",
  "cids": [
      "19551147",
      "19554825",
      "19550453",
      "19552706",
      "19551077",
      "19550434",
      "19552266",
      "19554791",
      "19553622",
      "19553632"
  ]
},
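
Each record stores only IDs (its own id, a parent ID, and cids for its children), so the flat list has to be linked back together before it can be walked as a tree. A minimal sketch, assuming the records are held in an array; buildTopicTree is a hypothetical helper and not part of the demo:

```javascript
// Hypothetical helper: turn flat topic records into a nested tree by
// following each record's cids. Children that were never crawled are skipped.
function buildTopicTree (records, rootId) {
  const byId = new Map(records.map(r => [r.id, r]))

  function build (id, path) {
    const record = byId.get(id)
    // Stop on a missing record, or if this id is already on the current path.
    if (!record || path.has(id)) return null
    const nextPath = new Set(path).add(id)
    const children = (record.cids || [])
      .map(cid => build(cid, nextPath))
      .filter(Boolean)
    return { id: record.id, value: record.value, children }
  }

  return build(rootId, new Set())
}
```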

Summary

The biggest advantage of Nightmare as a crawler is that you only need to know the URL of the page that holds the data to get the corresponding synchronous / asynchronous data; there is no need to analyze in detail which parameters the HTTP requests must carry. You only need to know which operation makes the page update its data, and then you get the data by grabbing the updated HTML fragment. In the demo the Nightmare window is shown while the crawler runs, but it can be hidden in real runs, which speeds things up to some extent. The project below also contains a second crawler, for Zhihu's activity feed.
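
Showing or hiding the window is controlled by the show option already used in app.js. A minimal sketch of switching between the two modes; the helper name nightmareOptions is hypothetical:

```javascript
// Hypothetical helper: show the window and DevTools only when debugging.
function nightmareOptions (debug) {
  return debug
    ? { show: true, openDevTools: { mode: 'detach' } }
    : { show: false }
}
```

The returned object would be passed to the Nightmare constructor in place of the literal options used in app.js.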

Demo source code address: https://github.com/williamsta…
