Node.js operations and maintenance scheme at the 10-million-PV level


In a high-traffic Node scheme, how do you guarantee a minimum SLA, track problems efficiently, and close the loop? How do you keep request traffic and business operation transparent? Here is how Baidu Netdisk does it.


  • Node operations and maintenance at the 10-million-PV level
  • Availability guarantee: a minimum SLA of 99.98%
  • Problem tracking and closed-loop construction
  • Transparent traffic and operation


Node.js operations and maintenance scheme at the 10-million-PV level

We divide the whole operations and maintenance scheme into six steps, with two goals: traffic transparency and operation transparency.

Step 1: traffic tracking

To deal with the distributed traffic links that microservices introduce, we borrowed ideas from Spring Cloud and designed two tracking fields, traceid and tracecode, which record node information along the whole traffic link. Once a request lands in a container, the identification information in the Response Header can be used to track and locate that traffic.
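
As an illustration of how two such fields might be carried on a request, here is a minimal sketch. The header names (x-trace-id, x-trace-code), the random ID format, and the reading of tracecode as a comma-separated list of visited nodes are all assumptions made for this example; the article does not specify them.

```javascript
const crypto = require('crypto');

// Hypothetical header names: the article names the fields, not the headers.
const TRACE_ID_HEADER = 'x-trace-id';
const TRACE_CODE_HEADER = 'x-trace-code';

// Reuse the upstream traceid when present, otherwise start a new trace.
function resolveTraceId(headers) {
  const incoming = headers[TRACE_ID_HEADER];
  return typeof incoming === 'string' && incoming.length > 0
    ? incoming
    : crypto.randomBytes(8).toString('hex');
}

// Assumption: tracecode accumulates the nodes a request has passed through.
function appendTraceCode(headers, nodeName) {
  const incoming = headers[TRACE_CODE_HEADER];
  return incoming ? `${incoming},${nodeName}` : nodeName;
}

// Build the tracking headers this node should attach to its response.
function traceHeaders(requestHeaders, nodeName) {
  return {
    [TRACE_ID_HEADER]: resolveTraceId(requestHeaders),
    [TRACE_CODE_HEADER]: appendTraceCode(requestHeaders, nodeName),
  };
}

// Usage inside a Node http handler (req.headers keys are lowercased by Node):
//   const trace = traceHeaders(req.headers, 'node-web-01');
//   for (const [name, value] of Object.entries(trace)) res.setHeader(name, value);
```

Keeping the logic in plain functions means the same code can be called from whatever framework middleware the service uses.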

Step 2: gray release

With large traffic, a gray release and A/B test scheme is a must. Combining the upstream configuration capability of in-router with BNS (naming service) nodes, we divide user traffic into whitelist, intranet, and external, and divide the deployment level into single machine, single side, and full rollout, so that the flow can be controlled.

Explanation of terms:
in-router: the Nginx cluster that handles load balancing for Node traffic.
BNS: Baidu Naming Service. It uses virtual domain names to lay out IDCs or clusters logically, which makes traffic easier to locate and handle.
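
The routing decision itself lives in the in-router upstream configuration, which the article does not show. The following is only a sketch of the decision logic in JavaScript; the user IDs, IP prefixes, and the upstream names gray and stable are hypothetical.

```javascript
// Hypothetical configuration: every value here is illustrative.
const grayConfig = {
  whitelistUids: new Set(['1001', '1002']),
  intranetPrefixes: ['10.', '192.168.'],
  // Traffic classes currently admitted to the gray deployment.
  grayClasses: new Set(['whitelist']),
};

// Classify a request as whitelist, intranet, or external traffic.
function classifyTraffic(uid, clientIp, config) {
  if (config.whitelistUids.has(uid)) return 'whitelist';
  if (config.intranetPrefixes.some((prefix) => clientIp.startsWith(prefix))) {
    return 'intranet';
  }
  return 'external';
}

// Send a request to the gray upstream only if its class is admitted.
function pickUpstream(uid, clientIp, config) {
  const trafficClass = classifyTraffic(uid, clientIp, config);
  return config.grayClasses.has(trafficClass) ? 'gray' : 'stable';
}
```

Widening a rollout then amounts to adding intranet, and later external, to grayClasses, which mirrors the staged progression described above.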

Step 3: log grading

The whole log system is divided into four levels. Each level records different information in detail: [when, where, who, what error occurred, what problem it caused]. In addition, we implemented event loop lag-time monitoring for Node to monitor the pressure on the Node server.
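
The article does not say how the lag-time monitoring was implemented. One possible approach, sketched here, uses monitorEventLoopDelay from Node's built-in perf_hooks module; the reporting interval, the 100 ms threshold, and the onReport callback are assumptions for illustration.

```javascript
const { monitorEventLoopDelay } = require('perf_hooks');

// Periodically report event loop delay. Returns a function that stops monitoring.
function startLagMonitor({ intervalMs = 1000, thresholdMs = 100, onReport }) {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();

  const timer = setInterval(() => {
    // Histogram values are in nanoseconds; convert to milliseconds.
    const p99Ms = histogram.percentile(99) / 1e6;
    const maxMs = histogram.max / 1e6;
    onReport({ p99Ms, maxMs, overloaded: p99Ms > thresholdMs });
    histogram.reset(); // start a fresh window for the next report
  }, intervalMs);
  timer.unref(); // monitoring alone should not keep the process alive

  return function stop() {
    clearInterval(timer);
    histogram.disable();
  };
}

// Usage:
//   const stop = startLagMonitor({ onReport: (r) => console.log(JSON.stringify(r)) });
```

A sustained rise in the reported delay suggests the event loop is being blocked or the server is under heavy load.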

Log level:

  1. Business log
  2. Business framework log
  3. Daemon log
  4. Script log (including environment log)
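
A log record carrying the fields listed above could look like the following sketch. The level identifiers and the JSON field names (where, who, error, impact) are assumptions; the article only lists the level names and the kinds of information recorded.

```javascript
// Level identifiers follow the list above; the spellings are assumptions.
const LOG_LEVELS = ['business', 'framework', 'daemon', 'script'];

// Produce one JSON log line with the when/where/who/error/impact fields.
function formatLog(level, { where, who, error, impact }, now = new Date()) {
  if (!LOG_LEVELS.includes(level)) {
    throw new Error(`unknown log level: ${level}`);
  }
  return JSON.stringify({
    level,
    when: now.toISOString(),
    where,
    who,
    error,
    impact,
  });
}

// Usage:
//   console.log(formatLog('daemon', {
//     where: 'worker-3', who: 'pid 4121',
//     error: 'worker exited', impact: 'worker restarted',
//   }));
```

One JSON object per line keeps the records easy for a downstream parsing service to consume.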

Step 4: alarm classification

After the parsing service processes the logs, the data is sent to the monitoring platform, which uses classification and thresholds to determine whether an alarm should be triggered. Alarm channels include email, SMS, telephone, and so on.
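
The classification-plus-threshold decision can be sketched as follows. The classification names, threshold values, and channel assignments are hypothetical; the article does not give the actual rules used by the monitoring platform.

```javascript
// Hypothetical rules: per-classification threshold (events per window) and channels.
const alarmRules = {
  fatal: { threshold: 1, channels: ['phone', 'sms', 'email'] },
  error: { threshold: 10, channels: ['sms', 'email'] },
  warning: { threshold: 100, channels: ['email'] },
};

// Given event counts for one time window, return the alarms to trigger.
function evaluateAlarms(counts, rules) {
  const alarms = [];
  for (const [classification, rule] of Object.entries(rules)) {
    const count = counts[classification] || 0;
    if (count >= rule.threshold) {
      alarms.push({ classification, count, channels: rule.channels });
    }
  }
  return alarms;
}
```

For example, with these rules a window containing 12 error events and 3 warning events triggers only the error alarm, delivered by SMS and email.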

Step 5: traffic coloring

After receiving an alarm, the person on duty can get the specific information from the monitoring platform. For a complex problem, cookie coloring can pin a request to a specified traffic link and operation, which lets most problems be traced and located.
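
A minimal sketch of reading such a coloring cookie is shown below. The cookie name x_color and its value format "link:operation" are invented for this example; the article does not describe the real format.

```javascript
// Hypothetical cookie name and value format ("<link>:<operation>").
const COLOR_COOKIE = 'x_color';

// Parse a Cookie request header into a plain name -> value object.
function parseCookies(cookieHeader) {
  const cookies = {};
  for (const part of (cookieHeader || '').split(';')) {
    const idx = part.indexOf('=');
    if (idx === -1) continue;
    cookies[part.slice(0, idx).trim()] = part.slice(idx + 1).trim();
  }
  return cookies;
}

// Return the requested link and operation, or null for uncolored traffic.
function getColor(cookieHeader) {
  const raw = parseCookies(cookieHeader)[COLOR_COOKIE];
  if (!raw) return null;
  const [link, operation] = raw.split(':');
  return { link, operation: operation || null };
}

// Usage:
//   const color = getColor(req.headers.cookie);
//   if (color) { /* route to color.link and enable color.operation */ }
```

Because uncolored requests yield null, normal user traffic is unaffected by the coloring path.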

Step 6: traffic tracing

To better determine where a user's traffic went, we use traceid and tracecode together with requestid to reproduce the user's traffic path and the internal logic that ran.
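
Assuming every log entry carries traceid and requestid, replaying a user's path reduces to filtering and ordering the entries. The sketch below works on an in-memory array; the entry shape (node, when, message) is hypothetical, and a real system would query a log store instead.

```javascript
// Group the log entries of one trace by requestid, in time order.
function reconstructTrace(logEntries, traceid) {
  const byRequest = new Map();
  const matching = logEntries
    .filter((entry) => entry.traceid === traceid)
    .sort((a, b) => a.when - b.when);

  for (const entry of matching) {
    if (!byRequest.has(entry.requestid)) byRequest.set(entry.requestid, []);
    byRequest.get(entry.requestid).push(`${entry.node}: ${entry.message}`);
  }
  return byRequest;
}

// Usage:
//   for (const [requestid, steps] of reconstructTrace(entries, 't1')) {
//     console.log(requestid, steps.join(' -> '));
//   }
```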

Finally: closed loop construction

Finally, by integrating these six steps, each part is connected to the next and the whole link is joined end to end, which achieves the two operations and maintenance goals: traffic transparency and operation transparency.


In short, these six steps together meet the requirements of a high-performance Node application with a guaranteed minimum SLA, and make both traffic and operation transparent.

This article outlines an operations and maintenance scheme for Node under large traffic. It is meant as a starting point for discussion, and the specific details are left for readers to work through themselves.
It is not an introductory article, so some operations and maintenance background is needed to follow it. If you have Node operations and maintenance needs, I hope it gives you some inspiration and reference.
