Reading guide: "It is easy to prevent trouble before it arises, but hard to remove it once it has arisen." (Ma Wensheng, Ming Dynasty, from a memorial on appointing surveillance officials to settle displaced people)
As a programmer, have you noticed that every holiday, at every major tourist spot, there is a programmer with a laptop open dealing with a production problem? When a flood of alarms arrives, how do we tell whether the fault lies in our own service or in a service we depend on? When a major incident happens at midnight, how do we wake up the right people so they can respond quickly? These questions will be familiar to many engineers, and the importance of monitoring is self-evident. How do we build a sound monitoring system that helps engineers find and locate problems efficiently? This article describes the monitoring practice of Baidu's game microservices: building on Baidu's mature monitoring infrastructure, we have put together a fairly complete monitoring system.
The full text is 4583 words and the expected reading time is 9 minutes.
As the business has grown rapidly, each game server engineer now maintains two to three microservices on average, and new business scenarios are likely to add more. To understand the running state of the whole microservice system efficiently, and to find problems and clear faults quickly when the business misbehaves, the game server team has put a great deal of work into monitoring.
Our initial monitoring was built on Argus (log and server monitoring), the Monitor platform (business monitoring) and SIA (visual monitoring), and covered the basics. However, because it was not designed as a system and was not tied closely to the business, the overall effect was not ideal: many problems were still reported to us by customer service or product colleagues. One of the most painful points for engineers following up was that locating a problem often took a long time, which hurt the business. We therefore sorted out the problems we faced, redesigned and optimized the monitoring system as a whole, and focused on fault location and tight integration with the business, which greatly improved the efficiency of locating problems.
The rest of this article walks through how we built the monitoring system, in the hope that it is useful to readers.
2、 Discussion on microservice monitoring
In the early stage we simply added monitors of various kinds on top of Baidu's monitoring infrastructure, and the results were not ideal because there was no overall design. Although our monitoring capability at that stage was incomplete and weak, these scattered monitors still helped engineers find many system problems and laid the foundation for the later systematic, multi-dimensional monitoring.
2.1 Log and server monitoring
Baidu's Argus monitoring platform monitors machine status and business logs. The game microservices use its machine and log monitoring capabilities to cover their online services.
At first our use of Argus was one-dimensional and did not go deep enough into business scenarios. For example, we had not designed alarm thresholds for the case where only some instances of a service have a problem, nor multi-dimensional alarms. The following introduces Argus's monitoring capability and process:
The overall data flow of Argus is as follows. It supports telephone, SMS and Baidu Ruliu (instant messaging) alarms.
For log monitoring, the solution best known in the industry is the ELK stack (Elasticsearch + Logstash + Kibana). Beats can optionally be installed on each server as a log collection client; Logstash then collects, parses and filters the logs centrally and sends the data to Elasticsearch for storage and analysis; finally, Kibana displays the data.
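To make the collect, parse, filter, store and analyze stages concrete, here is a minimal sketch in plain Python. It is not Beats, Logstash or Elasticsearch configuration, and it is not how Argus works internally; the log format, the function names and the in-memory "store" are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical log format (an assumption): "<time> <LEVEL> <service> <message>"
LINE_RE = re.compile(
    r"^(?P<time>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)$"
)


def parse(line):
    """Parse stage: turn one raw log line into a structured record, or None."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def keep(record):
    """Filter stage: drop unparsable lines and low-value DEBUG noise."""
    return record is not None and record["level"] != "DEBUG"


def run_pipeline(lines):
    """Collect -> parse -> filter -> store (here the store is just a list)."""
    return [record for record in map(parse, lines) if keep(record)]


def errors_per_service(records):
    """Analysis stage: the kind of aggregate a dashboard or alarm rule reads."""
    return Counter(r["service"] for r in records if r["level"] == "ERROR")
```

Feeding a handful of raw lines through `run_pipeline` and then `errors_per_service` gives a per-service error count, which is the sort of value a log-based alarm threshold would be set on.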
2.2 Service polling monitoring
The core interfaces are probed by regular polling on Baidu's Monitor platform to help monitor online service quality. The Monitor platform supports visual configuration, but each scenario needs its own custom configuration. As the business iterated rapidly, the efficiency and ease of use of this kind of monitoring could no longer keep up with business needs.
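The idea of polling a core interface can be sketched as follows. This is not the Monitor platform's configuration or API; `ProbeResult`, `probe`, `poll_once` and the 500 ms latency limit are hypothetical names and values. Each check is passed in as a callable that returns an HTTP status code, so the sketch runs without network access.

```python
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    name: str
    ok: bool
    latency_ms: float
    detail: str


def probe(name, call, max_latency_ms):
    """Call one interface once and judge it on status code and latency."""
    start = time.monotonic()
    try:
        status = call()
    except Exception as exc:  # any failure to get a response counts as down
        elapsed = (time.monotonic() - start) * 1000
        return ProbeResult(name, False, elapsed, f"exception: {exc}")
    elapsed = (time.monotonic() - start) * 1000
    if status != 200:
        return ProbeResult(name, False, elapsed, f"status {status}")
    if elapsed > max_latency_ms:
        return ProbeResult(name, False, elapsed, "too slow")
    return ProbeResult(name, True, elapsed, "ok")


def poll_once(checks, max_latency_ms=500.0):
    """Run every configured check once; a scheduler would call this regularly."""
    return [probe(name, call, max_latency_ms) for name, call in checks.items()]
```

In a real deployment each callable would wrap an HTTP request to the interface, and `poll_once` would be run on a fixed interval with failed results forwarded to the alarm channel.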
2.3 Service visual monitoring
With the company's SIA intelligent monitoring system we visualize service traffic, availability, performance and other indicators, so that business engineers can see the online state of a service at a glance and be alerted on abnormal states. However, the business did not make full use of SIA's intelligent monitoring capabilities, so the visualization was of limited help and the intelligent features went unused.
Figure 3 Monitoring visualization
Among industry tools for visual monitoring, Kibana and Grafana are very mature and can meet most of a business's display needs; they are worth a look.
3、 Evolution of microservice monitoring
As described above, the monitoring measures of the initial stage helped engineers find and locate some problems, but many problems remained, mainly in the following four areas:
- Risks were exposed too late: by the time an alarm fired, users had usually already been affected;
- Monitoring lacked unified planning: monitoring items were disorganized and coverage was incomplete;
- Monitoring capability was weak and could not provide useful information about an exception;
- Alarms were chaotic, and engineers were bombarded with alarm messages.
Weighing the cost and benefit of building the overall monitoring system, we did not throw away the earlier monitoring but improved the existing basic capabilities. We first designed the monitoring system as a whole, and then strengthened each part of it according to that design.
3.1 Monitoring system design
Objective: effective prevention, timely discovery and quick stop loss.
Implementation: based on this objective, we took the following approach.
The monitoring system for the microservices is designed around four aspects: risk control, intelligent monitoring, intelligent alarm and efficient positioning. The overall process is as follows:
Each of the four aspects is introduced below.
3.1.1 Risk control design
The earlier an online problem is found, the better. Engineers differ in experience, and code review alone cannot reliably keep problems from reaching production, so the game business R&D team has invested in automated test cases and in the release process to reduce the number of problems. The following are the main risk control items for R&D; implementing them prevents more than 95% of online problems.
3.1.2 Intelligent monitoring design
The initial monitoring of the game business was added piecemeal: log monitoring used Argus and visual monitoring used the SIA intelligent monitoring platform, with no overall consideration of coverage or of how the monitoring systems work together. This exposed some problems, such as:
- Monitoring split by monitored object covers each single dimension well, but how do we detect abnormal fluctuations of the system as a whole?
- An instance sees a sudden rise in pvlost because of an occasional network or disk failure. How do we get this information efficiently?
- When system availability fluctuates, is the cause one machine room, one specific interface, or a failure in calls to a downstream service?
(1) Intelligent anomaly detection
Using the intelligent anomaly detection algorithms of the SIA system, we bring latency, traffic, SLA, revenue and other indicators into the monitoring system, which makes it possible to detect both periodic and aperiodic abnormal fluctuations. The main algorithms are briefly introduced below.
By applying these detection algorithms to the traffic, latency, revenue and other indicators of the game business, periodic and aperiodic fluctuations, and even relatively slow declines, can be detected effectively, which greatly improves the coverage of anomaly detection.
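The SIA algorithms themselves are not described in this article, so the following is only a generic stand-in that illustrates the two situations mentioned: a value that breaks from its periodic pattern, and a slow decline that no single step would reveal. The function names, the 3-sigma rule and the 20% drop limit are assumptions, not SIA's method.

```python
from statistics import mean, pstdev


def periodic_anomaly(same_slot_history, current, k=3.0):
    """Compare `current` with values seen at the same time slot in earlier
    periods (for example the same minute on previous days). Flag it when it
    lies more than k standard deviations from their mean."""
    baseline = mean(same_slot_history)
    spread = pstdev(same_slot_history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > k * spread


def slow_decline(series, window=3, max_drop=0.2):
    """Flag a gradual decline by comparing the average of the latest `window`
    points with the average of the earliest `window` points in the series."""
    if len(series) < 2 * window:
        return False
    early = mean(series[:window])
    late = mean(series[-window:])
    if early <= 0:
        return False
    return (early - late) / early > max_drop
```

A revenue series that loses about 3% per point never shows a dramatic jump between neighbouring points, yet `slow_decline` flags it once the accumulated drop passes the limit.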
(2) Full scene monitoring coverage
We cover monitoring from four quadrants so that problems are exposed with no blind spots. For service-level monitoring we also added multi-dimensional filtering, so that problems can be discovered from a macro view and then located efficiently at the micro level.
Here we focus on data monitoring. For the particular scenarios of the game business, we worked out in detail which data and scenarios need monitoring to ensure complete coverage. Some data-related monitoring items are listed below.
(3) Multi-dimensional visualization to assist monitoring
Multi-dimensional filtering: by service, interface, error code, machine room and machine instance;
Multi-dimensional visualization of anomalies: for example, the distribution of pvlost by interface, machine and machine room;
Visualization of error distribution: by interface and error code.
Figure 6 multi dimensional monitoring visualization
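The multi-dimensional views listed above boil down to grouping the same failed requests by different dimensions. A minimal sketch, assuming each lost request is recorded with its interface, error code, machine room and instance (the field names and sample data are hypothetical):

```python
from collections import Counter

# Hypothetical records of lost requests (pvlost), one dict per request.
FAILED_REQUESTS = [
    {"interface": "/pay", "error_code": 5001, "room": "bj", "instance": "bj-01"},
    {"interface": "/pay", "error_code": 5001, "room": "bj", "instance": "bj-01"},
    {"interface": "/login", "error_code": 4003, "room": "bj", "instance": "bj-01"},
    {"interface": "/pay", "error_code": 5002, "room": "bj", "instance": "bj-02"},
    {"interface": "/login", "error_code": 4003, "room": "gz", "instance": "gz-01"},
]


def distribution(records, dimension):
    """Count failures per value of one dimension (what one chart would show)."""
    return Counter(record[dimension] for record in records)


def top_share(records, dimension):
    """Return the most affected value of a dimension and its share of failures."""
    value, count = distribution(records, dimension).most_common(1)[0]
    return value, count / len(records)


def drill_down(records, **filters):
    """Multi-dimensional filtering: keep records matching every given field."""
    return [r for r in records if all(r[k] == v for k, v in filters.items())]
```

With the sample data, `top_share(FAILED_REQUESTS, "room")` points at one machine room holding 80% of the failures, and `drill_down(FAILED_REQUESTS, room="bj", interface="/pay")` narrows the view to a single interface in that room.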
3.1.3 Intelligent alarm design
Alarms are designed hierarchically as a whole: different alarm scopes and alarm methods are set for different scenarios, to cut down the flood of unimportant alarm messages. The overall design of the alarm application is as follows:
(1) Intelligent merge filtering and automatic upgrade
Intelligent filtering: screen alarm messages to reduce the flood of alarm information;
Intelligent alarm merging: combine related messages so that each alarm gives a fuller picture, further reducing the volume of alarm messages;
Automatic alarm escalation: this solves the problem of alarms not reaching the person on duty. With different thresholds, the alarm is extended to a wider range of recipients, and its form is escalated from mail -> Ruliu -> SMS -> telephone. The alarm call can be set to redial continuously until someone responds.
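The three mechanisms above (filtering, merging, escalation) can be sketched in a few lines. This is an illustration only, not the alarm platform's implementation: the level names, the grouping key, the minute thresholds and the channel label `im` (standing for the internal instant-messaging channel) are all assumptions.

```python
from collections import defaultdict

LEVELS = {"info": 0, "warning": 1, "critical": 2}

# Escalation ladder: mail -> instant message -> SMS -> telephone.
# The minute thresholds are illustrative assumptions.
ESCALATION = [(0, "mail"), (5, "im"), (10, "sms"), (15, "phone")]


def filter_alarms(alarms, min_level="warning"):
    """Filtering: drop alarms below the minimum level."""
    return [a for a in alarms if LEVELS[a["level"]] >= LEVELS[min_level]]


def merge_alarms(alarms):
    """Merging: collapse alarms with the same service and rule into one
    message that lists every affected instance."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[(alarm["service"], alarm["rule"])].append(alarm)
    return [
        {
            "service": service,
            "rule": rule,
            "count": len(items),
            "instances": sorted({a["instance"] for a in items}),
        }
        for (service, rule), items in groups.items()
    ]


def channel_for(minutes_unacknowledged):
    """Escalation: the longer nobody acknowledges, the louder the channel."""
    chosen = ESCALATION[0][1]
    for threshold, channel in ESCALATION:
        if minutes_unacknowledged >= threshold:
            chosen = channel
    return chosen
```

Running `merge_alarms(filter_alarms(alarms))` turns a burst of per-instance alarms into one message per service and rule, and `channel_for` is consulted on each retry until someone acknowledges the alarm.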
(2) Style content customization
Ordinary instance alarms and service alarms output their alarm information in a fixed format;
For core logic, the alarm content is defined in rich text so that the alarm information and the problem are shown in full together with the context of the problem. This greatly increases the amount of information and gives enough useful information to locate the problem.
Figure 8 alarm content style customization
3.1.4 Efficient positioning capability support
Efficient exposure of alarm information: for key core logic, a trace link + robot approach delivers alarms efficiently with customized output, so that information is passed on effectively;
Efficient confirmation of alarm information: after an abnormality is alarmed, engineers need to confirm the complete online log data and quickly retrieve the request data from that moment. The real-time trace system solves this problem.
(1) Core logic robot trace link information
For core logic, alarm exposure has basically reached minute-level alarming plus automatic problem location. From the alarm message one can see the line number of the offending code and the cause of the error, which greatly improves the efficiency of locating problems.
Of course, this kind of alarm is still costly to implement. For example, if an error occurs while items are being granted to a user after a recharge in the game business, we expose the request parameters, the function that failed and the specific cause of the error. With this data the problem is immediately clear, but it requires more customized implementation, so there is a certain access cost.
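One way to picture this kind of customized alarm is a wrapper around a core function that, on failure, reports the request parameters, the function that failed, the line number and the cause. The sketch below is an assumption about how such a hook could look, not the team's actual code: `send_to_robot` merely appends to a list in place of a real robot channel, and `grant_items` and `deduct_stock` are hypothetical business functions.

```python
import functools
import traceback

SENT_ALARMS = []  # stand-in for the robot channel, for illustration only


def send_to_robot(alarm):
    SENT_ALARMS.append(alarm)


def alarm_on_error(func):
    """Wrap a core-logic function so that any exception produces an alarm
    carrying the request parameters, the failing function and line, and the
    cause, and is then re-raised unchanged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # The last traceback entry is the frame where the error was raised.
            frame = traceback.extract_tb(exc.__traceback__)[-1]
            send_to_robot({
                "entry": func.__name__,
                "failed_in": frame.name,
                "line": frame.lineno,
                "params": {"args": args, "kwargs": kwargs},
                "cause": f"{type(exc).__name__}: {exc}",
            })
            raise
    return wrapper


def deduct_stock(item_id, count):
    raise RuntimeError(f"stock service rejected item {item_id}")


@alarm_on_error
def grant_items(user_id, item_id, count=1):
    deduct_stock(item_id, count)
    return True
```

Each core function still has to be wrapped and its alarm content chosen by hand, which is the access cost the text refers to.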
(2) Real time trace system access
With Baidu's trace capability, business data can be collected non-invasively, and the access cost is very low. For timeliness, the Baidu datahub message queue is used and dstream builds the index in real time, so that key information becomes searchable on the fault location platform within 5 minutes of leaving the data source, which greatly improves the efficiency of R&D fault location.
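Retrieval by key information relies on an index that is updated as records arrive. The sketch below shows the idea with an in-memory inverted index; it does not use or imitate the datahub or dstream APIs, and the class name, field names and the numeric `ts` timestamp are assumptions.

```python
from collections import defaultdict


class TraceIndex:
    """In-memory stand-in for a real-time trace index: every ingested record
    becomes searchable immediately by the value of its indexed fields."""

    def __init__(self, indexed_fields=("trace_id", "user_id")):
        self.indexed_fields = indexed_fields
        self.records = []
        self.index = defaultdict(list)  # (field, value) -> positions in records

    def ingest(self, record):
        """Called for each record consumed from the message queue."""
        position = len(self.records)
        self.records.append(record)
        for field in self.indexed_fields:
            if field in record:
                self.index[(field, record[field])].append(position)

    def search(self, field, value, since=None):
        """Return records whose `field` equals `value`, optionally only those
        with a timestamp `ts` at or after `since`."""
        hits = [self.records[p] for p in self.index.get((field, value), [])]
        if since is not None:
            hits = [r for r in hits if r["ts"] >= since]
        return hits
```

After an alarm, an engineer would call something like `search("user_id", ...)` with the identifier from the alarm message to pull up the full request records from that moment.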
4、 Micro service monitoring panorama
4.1 User reach
Multi-dimensional visual monitoring helps engineers quickly analyze the likely cause of a problem from the visual interface. Intelligent alarms provide timely detection and business reports provide a detailed view of business health, so that engineers are fully aware of the state of the system;
4.2 Monitoring tools
The Argus monitoring, SIA intelligent monitoring and robot tools provided by the company together cover the system fully. For long-term business data, such as an application's daily active users, download success rate and white screen rate, customized monitoring covers these scenarios;
4.3 Monitoring indicators
Monitoring indicators fall roughly into the categories above, which together give effective coverage;
4.4 Monitoring objects
From servers, business logs and service status to business data, business core logic and core scenarios, a comprehensive inventory of monitoring objects ensures that all of them are covered.
5、 Summary and Prospect
Through systematic construction of monitoring capability, we have reached a fairly good state in timeliness, positioning efficiency and coverage. Engineers learn of major online problems right away and have ample supporting information to locate them efficiently. In summary, the overall monitoring practice comes down to the following aspects.
(1) Systematic design and implementation
A monitoring system should first make clear which problems it solves and which goals it pursues. Only once the problems and goals are understood should we think about how to solve and achieve them fully. Following this kind of systematic analysis and breakdown, we built our monitoring system around risk control, intelligent monitoring, intelligent alarm and efficient positioning to reach the desired goal.
(2) Apply layered thinking to monitoring and alarms, and concentrate firepower on core logic
For both monitoring and alarms we focus on important functions and core logic; where a single existing tool cannot reach the goal, we combine several tools. For general logic the emphasis is on breadth of coverage, which existing tools provide in full.
(3) Easy implementation
The SIA intelligent monitoring and Argus monitoring provided by the company support aggregation, so homogeneous content can be monitored in one step. Heterogeneous or differentiated services can be connected non-invasively in whatever form the business already uses, which makes adding monitoring much more efficient.
(4) Make full use of the company's existing capabilities and combine them in new ways to improve efficiency
Different monitoring tools each have strengths and weaknesses; we play to the strengths of each to get the best overall monitoring effect. For some core logic, we combine robot alarms with the content customization capability of trace to report and locate core logic problems efficiently.
Although the monitoring system has achieved good results, the automation of fault handling and disaster recovery still needs improvement, and resource usage is not yet managed intelligently: scaling resources up or down still depends on manual intervention. The next optimization goal is full automation of fault handling and of intelligent resource scaling, so as to improve the overall maintainability and availability of the system.
———- END ———-
Baidu geek said