Node.js crawler hands-on project: Lianjia

Time:2021-6-14

Introduction

As a front-end beginner, I have always wanted to build a few projects and work my way toward full stack.
The trouble is having no backend. After some searching, I learned to write local APIs with Node.js and Express for the front-end pages to call.
But where does the data come from?
Someone said, "Generate it with Mock.js!"
Fine, so I pulled in Mock.js to generate some random data in a loop.
Lists could load more items, table rows got filled in, add, delete and query all worked, and the chart went from a flat line to real ups and downs.
But in the end, that is still fake data.
If you want real data, you have to crawl the web and get a bit of hands-on experience.
This is very basic, so please go easy on me.
Reprinted from: Node.js crawler hands-on project: Lianjia

Screenshots

Approach

1. How does the crawler work?
By requesting the URL of the site we want to crawl, we get the page's HTML document; then we find the data we want to save and inspect the element nodes that hold it. There is always some pattern to how they are laid out, so we follow the pattern, traverse the DOM, and store the data. For example, visit Lianjia:

First, looking at a listing page like this, the data we need is basically each property's
picture, link, address, name, location, layout, floor area, features, type, and price.

Next, press F12 to see where the data is

You can see that each listing is stored in its own li element.
Next, find where the picture is:

For a page of ten li elements like this, we can find every picture with a selector such as '.house-lst .pic-panel img', so in code we can grab all ten image elements at once, then traverse them and store the data. A sketch of this is shown below.
The other fields work the same way; if you can't find them, look at the source code and think about why.
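To make the idea concrete, here is a minimal sketch of fetching one listing page with superagent and parsing it with cheerio. The URL and the selectors ('.house-lst li', '.pic-panel img') are assumptions based on the page structure described above, not the project's exact code.

const superagent = require('superagent');
const cheerio = require('cheerio');

//Fetch one listing page and pull out the fields we care about
function fetchPage(url) {
    return superagent.get(url).then(res => {
        const $ = cheerio.load(res.text);
        const houses = [];
        //Each listing sits in its own li under the list container (assumed selector)
        $('.house-lst li').each((i, li) => {
            houses.push({
                img: $(li).find('.pic-panel img').attr('src'),
                title: $(li).find('a').first().text().trim(),
                //address, layout, area, price, etc. are read the same way
            });
        });
        return houses;
    });
}

//Usage: fetchPage('https://bj.lianjia.com/xinfang/').then(console.log);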

2. How do we crawl the data from every page?
With the method above we can save everything on the first page, but to reach the data on every page we still need to find the pattern in the page URLs. Let's visit the second page and see what changes.



So on top of the original path, we just append /pg{i}/ in a loop, as sketched below.
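A quick sketch of the idea; the base URL and page count are placeholders for illustration.

const baseUrl = 'https://bj.lianjia.com/xinfang/';   //placeholder base path
const totalPage = 10;                                //placeholder page count

for (let i = 1; i <= totalPage; i++) {
    const pageUrl = `${baseUrl}pg${i}/`;
    console.log(pageUrl);   // .../pg1/, .../pg2/, ...
}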

Technology stack

  1. Backend: Node.js + Express + MongoDB (mongoose) + superagent + cheerio
  2. Front end: React + react-router (v4) + Ant Design + Baidu Map + ECharts (added later)
  3. Interaction: Express API + socket.io

Steps

1、 Create the project

npm install -g create-react-app
create-react-app nodejs-spider
cd nodejs-spider

2、 Backend

1. Install dependency package

npm install --save express mongodb mongoose superagent cheerio socket.io body-parser

2. Create a new server.js for the backend service
From the demo you can see that crawling happens page by page: the next page is only fetched after the current one is done. If we don't enforce that, the code ignores how long each request takes, immediately prints the "crawling" message for every page followed by "crawl complete", while the requests actually keep running afterwards with no way of knowing when they really finish. So how do we tell the front end the real crawl progress? That is why we need ES7's async/await here; a rough sketch follows after the reference link.
Portal: Experience the ultimate solution to async – ES7's async/await
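Here is a minimal server.js sketch, assuming the fetchPage() helper from the earlier sketch, a placeholder page count, and a 'progress' event name that matches the front-end listener; the real project code will differ in detail.

const express = require('express');
const http = require('http');
const socketIo = require('socket.io');

const app = express();
const server = http.createServer(app);
const io = socketIo(server);

io.on('connection', (socket) => {
    //The front end emits 'request' when the user clicks the crawl button
    socket.on('request', async () => {
        const totalPage = 10;   //placeholder page count
        for (let i = 1; i <= totalPage; i++) {
            //await guarantees this page is fully fetched before the next request starts
            const houses = await fetchPage(`https://bj.lianjia.com/xinfang/pg${i}/`);
            //save `houses` to MongoDB via mongoose here
            socket.emit('progress', { progress: `Crawled page ${i} of ${totalPage}` });
        }
        socket.emit('progress', { progress: 'Crawl complete!' });
    });
});

server.listen(3001);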

3、 Front end
1. Install dependency package

npm install --save react-router antd

2. Configure the environment
Portal: create-react-app + antd + less configuration

3. Routes and components
The overall layout is a top navigation bar + content + footer.
The header and footer are shared; the content area switches between two components via two routes.

//Route export
import Map from '../components/Map';
import Chart from '../components/Chart';

export default [
    {
        path: '/',
        name: 'map',
        component: Map
    },
    {
        path: '/page/Chart',
        name: 'data analysis',
        component: Chart
    }
]
//Route rendering
<Content style={{ padding: '0 50px' }}>
      <Switch>
            {routers.map((route, i) => {
                  return <Route key={i} exact path={route.path} component={route.component}/>
            })}
      </Switch>
</Content>
//Route navigation
<Menu
     ...
   >
  {routers.map(function (route, i) {
      return (
          <Menu.Item key={i}>
            <Link to={route.path}>
              {route.name}
            </Link>
          </Menu.Item>
      )
    })}
</Menu>

4. Socket.io communication
Real-time communication, used to monitor the crawl progress on the backend.

//On click, tell the backend to start crawling
        socket.emit('request', 'Received crawl request...');
        //Listen for progress messages from the backend and refresh the progress in real time
        socket.on('progress', function (data) {
            // console.log(data);
            this.setState({
                progress: data.progress,
                loading: true,
            });
            if (data.progress === 'Crawl complete!') {
                this.setState({
                    loading: false,
                });
            }
        }.bind(this));

5. Using the Baidu Map API
Baidu Map's API is open to developers for free; first apply for a key.
Then follow the portals:
Portal: When the React framework meets Baidu Map; Baidu Map API examples
Note
VM10127:1 Uncaught TypeError: Failed to execute 'appendChild' on 'Node': parameter 1 is not of type 'Node'.
This error means that the info window object passed to openInfoWindow(infoWindow) in the Baidu Map API must be a real DOM node that appendChild can accept, not a React virtual DOM component. So the window content can only be built by jQuery-style string concatenation, which is fiddly and needs care. A sketch is shown below.
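A minimal sketch of the workaround, assuming the Baidu Maps JS API v2 (BMap.Point, BMap.Marker, BMap.InfoWindow, map.openInfoWindow); the house.* fields and CSS class are placeholders for illustration.

//Place a marker and open an info window built from a plain HTML string instead of JSX
const point = new BMap.Point(house.lng, house.lat);
const marker = new BMap.Marker(point);
map.addOverlay(marker);

marker.addEventListener('click', () => {
    const content =
        '<div class="info">' +
        '<p>' + house.name + '</p>' +
        '<p>' + house.price + '</p>' +
        '</div>';
    const infoWindow = new BMap.InfoWindow(content);
    map.openInfoWindow(infoWindow, point);
});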

6. The data-analysis part will be fleshed out with ECharts when I have time.

Summary

A very basic introductory crawler example, but doing even the basics well is not easy.
Throughout this write-up I haven't shown much code; the focus is on the ideas and how to implement them. The full code can be downloaded from GitHub so we can learn from each other.
There are always more solutions than problems. Bugs are endless, but there are patterns to deal with them.

Source code

Github