How do you guarantee a minimum SLA, efficient problem tracking, and a closed feedback loop for a node deployment under heavy traffic? How do you keep request traffic and business operations transparent? Here is how Baidu disk does it.
- Node operation and maintenance at the 10-million PV level
- Availability guarantees and a guaranteed minimum service level
- Problem tracking and closed loop construction
- Transparent traffic and operation
We divide the whole operation and maintenance scheme into six steps to meet the goals above:
Step 1: traffic tracking
Microservices spread a single request across a distributed traffic link. To deal with this, we borrowed ideas from spring cloud and designed two tracking fields, one of which is tracecode, to track node information along the whole traffic link, down to the container where a request lands. Traffic can then be traced and located from the identifying information in the Response Header.
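A minimal sketch of the idea, not Baidu disk's actual implementation: each hop reuses the incoming tracking ID or creates one, appends its own node identity, and echoes both in the response headers. The header names `x-trace-id` and `x-tracecode`, the ID format, and the comma-separated layout are assumptions made for illustration; the article only names the tracecode field.

```typescript
// Hypothetical header names; the article only names the `tracecode` field.
const TRACE_ID_HEADER = "x-trace-id";
const TRACECODE_HEADER = "x-tracecode";

type HeaderMap = Record<string, string | undefined>;

// Generate a random 16-character hex ID (illustrative format only).
function newTraceId(): string {
  let id = "";
  for (let i = 0; i < 16; i++) {
    id += Math.floor(Math.random() * 16).toString(16);
  }
  return id;
}

// Build the tracking headers for a response: keep the upstream ID so the
// whole link shares it, and append this node's identity to the tracecode.
function buildTrackingHeaders(incoming: HeaderMap, nodeId: string): Record<string, string> {
  const traceId = incoming[TRACE_ID_HEADER] ?? newTraceId();
  const upstreamCode = incoming[TRACECODE_HEADER];
  const tracecode = upstreamCode ? `${upstreamCode},${nodeId}` : nodeId;
  return { [TRACE_ID_HEADER]: traceId, [TRACECODE_HEADER]: tracecode };
}
```

Reading the two headers of any response then tells you which request it belonged to and which nodes it passed through.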
Step 2: gray release
Under heavy traffic, a gray release and A/B test scheme is a must. Combining the upstream configuration capability with BNS (naming service) nodes, we divide user traffic into groups (external traffic being one of them) and divide deployment into stages that end with the total quantity, so that the flow stays under control.
Explanation of terms:
- In Router: the NGX cluster that handles load processing for node traffic
- BNS: baidu naming service, which uses virtual domain names to lay out IDCs or clusters logically, making traffic easier to locate and handle
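One way to picture the traffic split, as a sketch under stated assumptions: a list of chosen users goes to the gray upstream first, and everyone else is bucketed by a stable hash so that a configurable percentage sees the new version. The upstream labels, the `whitelist` field, and the hash are hypothetical. The article places this capability in the upstream configuration and BNS, so application code like this only illustrates the decision logic.

```typescript
interface GrayConfig {
  whitelist: string[];  // user IDs always sent to the gray upstream (assumed)
  grayPercent: number;  // 0..100, share of the remaining users sent to gray
}

// Simple polynomial rolling hash, kept within 32 bits, so the same user
// always lands in the same bucket.
function stableHash(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (hash * 31 + input.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Decide which upstream serves this user.
function pickUpstream(userId: string, config: GrayConfig): "gray" | "stable" {
  if (config.whitelist.indexOf(userId) !== -1) return "gray";
  const bucket = stableHash(userId) % 100; // 0..99
  return bucket < config.grayPercent ? "gray" : "stable";
}
```

Raising `grayPercent` step by step to 100 corresponds to moving from a partial rollout to the total quantity.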
Step 3: log grading
The whole log system is split by source into the log types listed below and graded into 4 levels. Each level records different information, in detail: [when, where, who, what went wrong, what problems it caused]. In addition, event loop lag time monitoring is implemented for node to monitor the pressure on the node server.
- Business log
- Business framework log
- Daemon log
- Script log (including environment log)
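The record layout and the lag measurement can be sketched as follows. This is an illustration rather than the actual logging code: the four level names and the field names are assumptions (the article gives only the count of levels and the five questions a record answers). Lag is measured here with a plain timer; Node's built-in `perf_hooks.monitorEventLoopDelay()` is a higher-resolution alternative.

```typescript
// The article states there are 4 levels; these particular names are assumed.
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogEntry {
  level: LogLevel;
  when: string;    // ISO timestamp
  where: string;   // module or file that wrote the record
  who: string;     // user or request identity
  error: string;   // what went wrong
  impact: string;  // what problem it caused
}

// Serialize one record as a single JSON line for the parsing service.
function formatEntry(entry: LogEntry): string {
  return JSON.stringify(entry);
}

// Lag = how much later than scheduled a timer actually fired.
function lagMs(scheduledAt: number, firedAt: number, intervalMs: number): number {
  return Math.max(0, firedAt - scheduledAt - intervalMs);
}

// Sample event loop lag every `intervalMs` and report it via `onSample`.
// Returns a function that stops the monitor.
function startLagMonitor(intervalMs: number, onSample: (lag: number) => void): () => void {
  let last = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    onSample(lagMs(last, now, intervalMs));
    last = now;
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A busy event loop fires the timer late, so a rising lag value is a direct signal of pressure on the node server.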
Step 4: alarm classification
After the log is processed by the parsing service, the data is sent to the monitoring platform, which uses a threshold to determine whether an alarm should be triggered. Alarm channels include email, SMS, telephone, and so on.
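The threshold check can be sketched like this, assuming each rule ties one metric to a threshold and a set of channels. The rule shape and the metric names are hypothetical; the article only says that thresholds decide whether to trigger and that email, SMS, and telephone are among the channels.

```typescript
type Channel = "email" | "sms" | "phone";

interface AlarmRule {
  metric: string;
  threshold: number;    // trigger when the value is at or above this
  channels: Channel[];
}

// Return every channel that should be notified for one metric sample.
function channelsToNotify(metric: string, value: number, rules: AlarmRule[]): Channel[] {
  const notified: Channel[] = [];
  for (const rule of rules) {
    if (rule.metric !== metric || value < rule.threshold) continue;
    for (const channel of rule.channels) {
      if (notified.indexOf(channel) === -1) notified.push(channel);
    }
  }
  return notified;
}
```

Stacking several rules on the same metric gives escalation: a mild breach sends an email, while a severe one also reaches SMS and telephone.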
Step 5: coloring
After receiving the alarm, the personnel on duty can get the specific information from the monitoring platform. For a complex problem, Cookie coloring lets them mark the traffic and specify the link and operations to follow, so that most problems can be traced and located.
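Detecting a colored request might look like the sketch below. The cookie name `trace_color` and the meaning of its value are assumptions for illustration; the article does not specify them.

```typescript
// Hypothetical cookie name; the article does not specify one.
const COLOR_COOKIE = "trace_color";

// Parse a Cookie request header into name/value pairs.
function parseCookies(header: string): Record<string, string> {
  const cookies: Record<string, string> = {};
  for (const part of header.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue;
    const key = part.slice(0, eq).trim();
    const raw = part.slice(eq + 1).trim();
    if (!key) continue;
    try {
      cookies[key] = decodeURIComponent(raw);
    } catch {
      cookies[key] = raw; // keep malformed percent-encoding as-is
    }
  }
  return cookies;
}

// A colored request carries the cookie; its value names what to trace.
// Returns null for ordinary, uncolored traffic.
function colorOf(cookieHeader: string | undefined): string | null {
  if (!cookieHeader) return null;
  const value = parseCookies(cookieHeader)[COLOR_COOKIE];
  return value ? value : null;
}
```

A service that sees a non-null color can log that request in more detail or route it along a specified link, while ordinary traffic is unaffected.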
Step 6: traffic tracing
To better determine where a user's traffic went, we use the requestid to reproduce the user's traffic flow and the internal logic that ran.
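As a sketch of the reconstruction, assuming every log record carries the requestid plus a timestamp and the name of the node that wrote it (the `node` and `message` fields are hypothetical), filtering by requestid and sorting by time yields the path a request took:

```typescript
interface LogRecord {
  requestid: string;
  timestamp: number;  // epoch milliseconds
  node: string;       // service or container that wrote the record (assumed field)
  message: string;
}

// Collect all records for one request and order them by time.
// `filter` returns a new array, so the input is not mutated by the sort.
function traceRequest(records: LogRecord[], requestid: string): LogRecord[] {
  return records
    .filter((r) => r.requestid === requestid)
    .sort((a, b) => a.timestamp - b.timestamp);
}

// Summarize the path of one request as "node -> node -> node".
function describeFlow(records: LogRecord[], requestid: string): string {
  return traceRequest(records, requestid).map((r) => r.node).join(" -> ");
}
```

The ordered messages for the same requestid show the internal logic that ran at each hop.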
Finally: closed loop construction
By integrating these six steps, each part feeds into the next and the whole link is chained together, which achieves the goal of transparent operation and maintenance.
In general, these six steps meet the requirements of a high-performance node application with a minimum SLA guarantee, and they make traffic and operations transparent.
This article outlines an operation and maintenance scheme for node under heavy traffic. It is meant as a starting point to prompt better ideas, and the specific details are left for you to work through yourself.
It is not an introductory article, so some operation and maintenance background is needed to follow it. If you need to operate and maintain node services, I hope it gives you some inspiration and reference.