What is SRE?
When I first encountered SRE, many people thought of it as a full-stack role at Google that could solve many problems single-handedly.
After digging deeper, I found that SRE does address many problems, but there are far too many of them for one role or one person to handle efficiently and quickly.
For example: how to do capacity assessment, how to run fault drills, how to implement rate limiting, how to implement circuit breaking, how to make monitoring alerts more effective.
Clearly, solving these problems requires testers, developers, operations engineers, and other related roles to build the system together. So SRE is best understood as a systematic methodology that guides that construction.
What is the goal of SRE?
The goal of building an SRE system is to “improve stability”.
In SRE, two metrics measure progress toward this goal:

| Metric | Meaning |
| --- | --- |
| MTBF (Mean Time Between Failures) | The average time the system runs normally between failures |
| MTTR (Mean Time To Repair) | The average time it takes to recover from a failure |
From their definitions, the two metrics map to the system's operating states as follows:

| Metric | System state it measures |
| --- | --- |
| MTBF | Normal operation |
| MTTR | Failure and recovery |
Intuitively, a stable system is one that runs normally for a long time and recovers quickly when a failure does occur.
So “increase MTBF, reduce MTTR” becomes the concrete form of the goal “improve stability”, and we can judge the effectiveness of any SRE work by whether it increases MTBF or reduces MTTR.
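As a concrete illustration, here is a minimal sketch of how MTBF and MTTR could be computed from an incident log; the window and all timestamps are made-up example data:

```python
from datetime import datetime, timedelta

# Hypothetical incident log for a four-week window: (start, recovered) pairs.
# All timestamps are made-up illustration data.
window = (datetime(2023, 1, 1), datetime(2023, 1, 29))
incidents = [
    (datetime(2023, 1, 5, 10, 0), datetime(2023, 1, 5, 10, 30)),
    (datetime(2023, 1, 18, 2, 0), datetime(2023, 1, 18, 3, 0)),
]

downtime = sum((end - start for start, end in incidents), timedelta())
uptime = (window[1] - window[0]) - downtime

mtbf = uptime / len(incidents)    # mean time between failures: grow this
mttr = downtime / len(incidents)  # mean time to repair: shrink this

print(f"MTBF: {mtbf}, MTTR: {mttr}")
```

Every SRE work item can then be judged by whether it moves one of these two numbers in the right direction.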
With this goal in mind, a question arises: aren't MTBF and MTTR a bit too coarse? Even if we can collect their time data through alerting, notifications, or other means, it is still not clear how to act on them.
In fact, MTTR can be subdivided into four metrics, corresponding to the four stages of a failure:

| Metric | Meaning | Stage |
| --- | --- | --- |
| MTTI (Mean Time To Identify) | Mean time to discover a failure | Failure discovery: from occurrence to response |
| MTTK (Mean Time To Know) | Mean time to identify the cause | Failure localization: until the root cause, or its approximate scope, is located |
| MTTF (Mean Time To Fix) | Mean time to fix | Failure recovery: taking measures to restore the business |
| MTTV (Mean Time To Verify) | Mean time to verify the fix | Recovery verification: confirming the service has recovered after the fix |
MTBF can likewise be subdivided into two stages: Pre-MTBF (before a failure: fault prevention) and Post-MTBF (after a failure: fault improvement).
With the stages broken down, we can target our work accordingly. For reference, Mr. Zhao Cheng's SRE stability assurance planning maps work items to stages as follows:
| Pre-MTBF (fault prevention) | MTTI (failure discovery) | MTTK (failure localization) | MTTF & MTTV (failure recovery) | Post-MTBF (fault improvement) |
| --- | --- | --- | --- | --- |
| Fault drills | AIOps | Log analysis | Disaster-recovery switchover | Fault review |
| Capacity assessment | Sentiment monitoring | Distributed tracing | Service degradation | Improvement acceptance |
| Continuous delivery | Monitoring and alerting | Root-cause localization | Rate limiting | Fault simulation |
| Automation | | | Circuit breaking | Chaos engineering |
| Architecture design | | | Capacity stress testing | |
In terms of system building, each stage has a corresponding mechanism:

| Pre-MTBF (fault prevention) | MTTI (failure discovery) | MTTK (failure localization) | MTTF & MTTV (failure recovery) | Post-MTBF (fault improvement) |
| --- | --- | --- | --- | --- |
| Drill building / On-Call | Emergency response | Emergency response | Emergency response | Review and improvement / On-Call |
With such a clear division into stages, the work of each phase becomes well defined, and we can proceed in a targeted way without fear of going astray.
In Pre-MTBF we can invest in architecture design, provide design-for-failure service-governance mechanisms such as rate limiting, degradation, and circuit breaking, and create the conditions for rapid isolation.
In Post-MTBF we need to run thorough fault reviews, summarize the shortcomings, and implement the improvement measures.
AIOps capabilities can also be applied here to improve problem-localization efficiency and alert accuracy, reducing MTTI and MTTK.
The basis of failure identification: appropriate SLIs and corresponding SLOs
We know that the goal, increase MTBF and reduce MTTR, is measured almost entirely around the “failure” dimension. But when exactly is the system considered to have failed?
We need a judgment condition, a standard, for defining “failure or not”.
Anyone familiar with monitoring systems will think of “alerts”, which may indicate a failure; “may”, because in real scenarios an alert does not always mean one.
The SRE system offers a better, more precise standard: the SLO (Service Level Objective) defines “failure or not”.
But before we can talk about SLOs, we have to talk about the related SLI (Service Level Indicator).
What is an SLI?
Anyone who has built a monitoring system knows that a monitored object produces a large number of metrics, most of which contribute very little.
An SLI is a metric that stands out from the crowd by satisfying two principles:

- it can indicate whether the target object is stable;
- it is strongly related to user experience, or directly perceived by users.

An SLI can therefore express “whether the target object is stable or not”.
The VALET selection method
When selecting SLIs, many people may still be confused; principles alone are hard to apply.
The industry is well aware of this, so there is a selection method, VALET, that classifies candidate metrics by their characteristics:
| Category | Meaning |
| --- | --- |
| Volume (capacity) | The maximum capacity the service commits to, e.g. QPS, TPS, session count, throughput, connection count |
| Availability | Whether the service responds normally, e.g. the success rate of non-5xx responses, or whether tasks execute successfully |
| Latency (delay) | Whether responses are fast enough. Latency distributions are typically long-tailed rather than normal, so targets should be set at specific percentiles (confidence intervals) |
| Error (error rate) | How many errors occur, e.g. 5xx, 4xx, or custom-defined errors |
| Ticket (manual intervention) | Whether manual intervention is needed, e.g. manually recovering a failed task |
With these categories we can quickly pick out SLIs, a very practical skill in real scenarios.
Even so, some manual judgment is unavoidable, whether we are screening existing metrics or metrics to be added in the future.
What is an SLO?
Given that an SLI expresses “whether the target object is stable”, combining SLI + target + time window expresses the current state of stability much more precisely.
For example: over one hour, 90% of request latencies are at or below 80 ms.
That is an SLO.
If, in the example above, the 90th-percentile latency rises above 80 ms, the SLO is missed and there may be a failure.
However, if we naively treat a single SLO as the criterion for “stability”, we fall into the same alert-storm and false-alarm trap seen in the monitoring field.
In reality, we usually judge whether a business is in a faulty state by several different criteria together.
Therefore we can combine multiple SLOs with a logical AND to express service stability more precisely:

Availability = SLO1 & SLO2 & SLO3

All SLOs must be met before the service is considered up to standard!
In short, SLOs make the expression of service stability more precise and reliable.
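As a sketch, with made-up latency and error numbers, the AND combination could look like this (the nearest-rank percentile helper is my own illustrative choice, not a prescribed method):

```python
import math

# One hour of request latencies (ms) and error counts -- made-up data.
latencies_ms = [42, 50, 55, 61, 70, 72, 75, 79, 80, 400]
errors, total = 0, len(latencies_ms)

def quantile(values, q):
    """Nearest-rank percentile: smallest v with at least q of values <= v."""
    ordered = sorted(values)
    return ordered[math.ceil(q * len(ordered)) - 1]

slo1 = quantile(latencies_ms, 0.90) <= 80   # 90% of latencies <= 80 ms
slo2 = quantile(latencies_ms, 0.99) <= 200  # 99% of latencies <= 200 ms
slo3 = (total - errors) / total >= 0.95     # success rate >= 95%

availability = slo1 and slo2 and slo3       # Availability = SLO1 & SLO2 & SLO3
print(slo1, slo2, slo3, availability)       # the single 400 ms outlier fails SLO2
```

Two of the three SLOs hold here, but the one slow outlier breaks the 99% latency SLO, so the service as a whole is not up to standard.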
On the time dimension
The time dimension in an SLO is the evaluation cycle, and it covers two scenarios:

- the time dimension: evaluating from the perspective of failure duration;
- the request dimension: evaluating from the percentage of successful requests.
Time dimension: evaluating from failure duration
This defines an SLO violation by how long the SLI stays below its threshold.
For example: the request success rate stays below 95% for 10 consecutive minutes.
But this approach is coarse in its time granularity: the success rate may dip below 95% for less than 10 minutes and still be an anomaly worth attention. The request dimension supplements this.
Request dimension: evaluating from the percentage of successful requests
This judges an SLO violation by whether the SLI over the whole statistical cycle falls below the threshold.
For example: a request success rate below 95% over one day is a violation.
This effectively compensates for the coarseness of the time dimension; the two views are usually complementary.
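A small sketch, with made-up per-minute success rates, of how the two views complement each other (function names and thresholds are my own illustrative choices):

```python
# Per-minute success rates over two hypothetical 2-hour cycles -- made-up data.
sustained_dip = [0.99] * 50 + [0.90] * 12 + [0.99] * 58   # one long dip
scattered_dips = ([0.90] * 5 + [0.99] * 5) * 12           # many short dips

def time_dim_violated(rates, threshold=0.95, minutes=10):
    """Time dimension: SLI below threshold for `minutes` consecutive minutes."""
    run = 0
    for r in rates:
        run = run + 1 if r < threshold else 0
        if run >= minutes:
            return True
    return False

def request_dim_violated(rates, threshold=0.95):
    """Request dimension: SLI over the whole cycle below threshold."""
    return sum(rates) / len(rates) < threshold

print(time_dim_violated(sustained_dip), request_dim_violated(sustained_dip))
print(time_dim_violated(scattered_dips), request_dim_violated(scattered_dips))
```

Each view catches a violation the other misses (one long dip versus many short dips), which is why the two are usually deployed together.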
On SLOs and availability
Availability is usually quoted as “a number of nines”, e.g. three nines (99.9%) or four nines (99.99%).
However, availability figures have long been criticized for their accuracy. Expressing availability through SLOs makes the figure trustworthy, because its foundation is: an SLI that expresses whether the target is stable + a target tuned to business characteristics + a time window tuned to the business.
Through continuous adjustment and optimization, the availability figure keeps getting more accurate and closer to real business performance.
On SLOs and failures
As argued above, SLOs effectively express whether stability is up to standard, so alerting on SLOs effectively tells us whether the system is in a failure state.
An alert based on a combination of multiple SLOs is more robust still: it converges alerts and makes them more accurate and actionable, preventing “crying wolf”.
On this basis, the next section introduces the error budget, which quantifies the SLO and strengthens this advantage further.
Quantitative data for guiding the work: the error budget
We have set SLOs, but how do we act on them in day-to-day work? It is not that intuitive.
We need a quantifiable number that can drive alerting and track the SLO.
In SRE, that number, the error budget, is derived directly from the SLO.
What is an error budget?
The error budget can be understood as “a reminder of how many chances you still have left to make mistakes”.
For example, over a four-week cycle with 4,653,680 application requests, the SLOs below yield the following error budgets:

| SLO | Error budget |
| --- | --- |
| 90% of latencies <= 80 ms | 465,368 |
| 99% of latencies <= 200 ms | 46,536 |

Converting the targets into counts like this is more intuitive and has a stronger visceral impact.
We can therefore normalize the data into an error budget to better drive progress toward the stability goal.
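The derivation itself is just `total × (1 − target)`. A sketch using the numbers above; exact arithmetic via `Fraction` (with string targets) avoids float rounding surprises when truncating:

```python
from fractions import Fraction

# Four-week cycle with 4,653,680 requests, as in the example above.
total_requests = 4_653_680

def error_budget(total, slo_target):
    """Requests allowed to miss the SLO target over the cycle (truncated)."""
    return int(total * (1 - Fraction(slo_target)))

print(error_budget(total_requests, "0.90"))  # -> 465368
print(error_budget(total_requests, "0.99"))  # -> 46536
```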
Putting error-budget data to work
Stability burn-down chart
Display the error budget as a score, with a bar (burn-down) chart showing its status in real time. A cycle is typically set to four calendar weeks, after which the budget resets.
For special scenarios the error budget can be increased appropriately, but that should be judged case by case.
Fault grading: once the error budget is normalized into a count, the percentage consumed by a single incident can determine its severity level, so all SLOs, however different, share one grading rule, achieving a unified standard.
Typically there are five levels, P0 to P4, with P0 the highest. A common setup:
| Budget consumed by one incident | Severity level |
| --- | --- |
| <= 5% | P4 |
| 5% – 20% | P3 |
| 20% – 30% | P2 |
| 30% – 50% | P1 |
| > 50% | P0 |
For example, with an error budget of 25,000, an incident that produces more than 5,000 error requests has consumed over 20% of the budget and is graded P2, and so on.
The exact thresholds should be set according to the business and its tolerance.
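The grading rule above is easy to encode; a minimal sketch with the thresholds copied from the table (the function name is my own):

```python
# Grade an incident by the share of the error budget it consumed,
# using the P0-P4 thresholds from the table above.
def fault_level(consumed, budget):
    ratio = consumed / budget
    if ratio <= 0.05:
        return "P4"
    if ratio <= 0.20:
        return "P3"
    if ratio <= 0.30:
        return "P2"
    if ratio <= 0.50:
        return "P1"
    return "P0"

# The example from the text: budget 25,000, just over 5,000 error
# requests, i.e. slightly more than 20% consumed -> P2.
print(fault_level(5_001, 25_000))  # -> P2
print(fault_level(1_000, 25_000))  # -> P4
```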
Stability consensus mechanism
Everyone knows the driver's-license points system: when you find you have only one point left, you drive very carefully, to avoid the re-education or license suspension a violation would bring.
The error budget works the same way: once little of it is left, you become vigilant and formulate action measures to avoid missing the SLO and the stability target.
How do we formulate those measures? Two principles apply.

1. When the remaining budget is sufficient or barely consumed, tolerate problems
In daily operations we run into network jitter or momentary device failovers that make the system unstable for a very short time. A few customers hit it, give feedback, and complain that the business is unstable; engineers immediately drop their work to investigate, then spend a lot of time afterwards writing summaries and reports.
This consumes a great deal of the engineers' time and energy, the findings rarely help the business much, their own work goes unfinished, and the time of everyone who helped is wasted as well.
Overall the cost-effectiveness is poor, and the effect ripples outward; if it happens often enough, the ripples turn into a “tsunami”!
With SLOs and the error budget as judgment criteria, the response is now clear: if the budget is sufficient, tolerate the problem; do not complain about it, and do not respond to it at high priority.
2. When the remaining budget is being consumed too fast or is nearly exhausted, SRE has the authority to suspend and reject any production change

Think of the system as a sick engineer who insists on working: the output is poor, and there is a risk of collapsing outright.
Would you have the heart to assign new tasks, or let the work continue in that state?
Better to restore health first, so the work can be done well afterwards!
By this analogy, the team should give priority to fixing the problems that affect stability; only after they are solved, and a fresh error budget arrives with the next cycle, should the normal pace of change resume.
Both principles need everyone's recognition and commitment. Because many parties must cooperate, a shared consensus keeps the collaboration smooth and efficient.
Precisely because of that multi-party cooperation, the mechanism must be driven “top-down”, e.g. from the technology VP or CTO level, and when disputes arise they can be escalated step by step and decided from the CTO's perspective.
Alerting based on the error budget
In the past we often received floods of alert messages of very low value, a “crying wolf” effect that made everyone distrust alerts.
The consequences are serious: useful information is likely to be drowned out, damaging the business and spreading the blame widely.
The industry's usual remedy is called “alert convergence”: merging similar alerts, e.g. from the same cluster or for the same exception type, before notifying anyone.
But this still buries people in information and does not let them localize and handle the problem quickly. Why?
Because simply merging alerts leaves the amount of information unchanged; it only becomes more precise if the information is refined further by other means, such as a so-called alert decision tree.
And that kind of construction is not cheap, involving convergence-rule design, object-hierarchy modeling, decision-logic implementation, and more.
Alerting on the error budget achieves alert convergence naturally, because it is grounded in the service's SLOs.
It also means we alert only on what affects stability; such alerts demand a rapid response, they are few in number, and they are highly accurate.
The simple approach is to set alert thresholds at the fault-grading levels; finer and more precise methods touch on the AIOps field.
Google's several SLO- and error-budget-based alerting algorithms are worth studying.
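As a sketch of the simple approach, assuming the grading thresholds from earlier and a hypothetical `notify()` hook, an error-budget alert can be very small:

```python
# Simplified two-tier alert on error-budget consumption. The thresholds
# reuse the earlier fault-grading ratios; notify() is a hypothetical hook.
def check_error_budget(consumed, budget, notify):
    ratio = consumed / budget
    if ratio > 0.50:
        notify(f"P0: {ratio:.0%} of the error budget is gone")
    elif ratio > 0.20:
        notify(f"P2 or higher: {ratio:.0%} of the error budget is gone")
    # below 20% consumed: tolerated, no alert -- convergence falls out naturally

alerts = []
check_error_budget(consumed=13_000, budget=25_000, notify=alerts.append)
print(alerts)  # one actionable alert instead of a stream of raw metric alerts
```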
How to measure whether an SLO is effective
We have set SLOs, but do they actually reflect the stability of the business? And does the error budget derived from them effectively guide the work?
We still need to validate, and keep optimizing.
Here we examine the situation along three dimensions and respond with three strategies.
What are the three dimensions?
We can evaluate along these three dimensions:

| Dimension | Values |
| --- | --- |
| SLO attainment | Met or Missed |
| Manual effort (toil) | High or Low |
| Customer satisfaction | High or Low |

These three binary dimensions combine into eight different situations.
What are the three strategies?
We can respond with the following three strategies:

- Tighten the SLO

When user satisfaction is low but the target is met, tighten the SLO and narrow the target, adjusting gradually until it reflects the real situation.

- Relax the SLO

When user satisfaction is high but the target is missed, loosen the constraint appropriately and widen the target, perhaps increasing the release cadence to accelerate business growth.

- Continuously optimize for the specific problem

Analyze the cause of the problem case by case and optimize accordingly. For example, when all three dimensions meet expectations, increase the iteration pace to improve delivery efficiency; when they do not, analyze the business characteristics, keep adjusting and optimizing the SLO, and implement the improvement measures.
Sorting out the specific situations, the response table looks like this:

| SLO attainment | Manual effort (toil) | Customer satisfaction | Strategy |
| --- | --- | --- | --- |
| Met | Low | High | Continuous optimization: relax the release and deployment process and increase velocity, or defer releases for now and spend more effort improving service reliability |
| Met | High | High | Continuous optimization: if alerting is producing misleading false positives, reduce its sensitivity; otherwise temporarily relax the SLO or reduce the manual effort, fix the product, and improve fault self-healing |
| Missed | Low | Low | Continuous optimization: alerting is not sensitive enough; increase its sensitivity |
| Missed | High | Low | Continuous optimization: reduce the manual effort, fix the product, and improve fault self-healing |
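The table reads naturally as a lookup keyed on the three binary dimensions. A sketch covering the four rows listed above (the wording of each action is paraphrased, and untabulated combinations fall through to manual analysis):

```python
# (slo_met, toil_high, satisfaction_high) -> strategy, from the table above.
strategies = {
    (True,  False, True):  "relax release process / defer and invest in reliability",
    (True,  True,  True):  "reduce alert sensitivity, or relax SLO and cut toil",
    (False, False, False): "alerting not sensitive enough: tighten it",
    (False, True,  False): "cut toil, fix the product, improve self-healing",
}

def review(slo_met, toil_high, satisfaction_high):
    key = (slo_met, toil_high, satisfaction_high)
    return strategies.get(key, "analyze case by case (combination not tabulated)")

print(review(False, True, False))
```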
How to put SLOs into practice?
We have said plenty about the merits of SLOs; how do we actually start implementing them?
Pieces of the answer have appeared above; this section lays them out.
Identify the core links
SRE practice ultimately serves the business, so we start by analyzing the business and finding its core.
A business has many applications, but only some of them create core value; from the perspectives of user access, business performance, and business characteristics, we can screen out the complete links that carry that core value.
Sorting applications into core and non-core from the business perspective, and thereby identifying the core links, is our guiding principle.
In truth there are no good automated means for this; the work is so close to the users that, short of machine-learning inference, there is no good shortcut.
So a great deal of manual work is involved: mapping the architecture, talking with business owners, cataloguing the technology stack, and so on.
It is worth the effort, though, because it yields a comprehensive understanding of the whole business and makes all later work easier.
Sort out the relationships between applications
Once the core links are sorted out, the core and non-core applications fall out with them; after all, core links are composed of core applications.
The direct dependencies between applications come in two strengths, strong and weak. The combinations are classified as follows:

| Application role | Depends on | Dependency strength |
| --- | --- | --- |
| Core | Core | Strong |
| Core | Non-core | Weak |
| Non-core | Non-core | Weak |

With the relationships sorted out, we can divide and conquer, setting SLOs for each.
Set application SLOs
There are four principles for setting an application's SLO:

1. Core applications' SLOs should be stricter; non-core applications' can be looser

This keeps the attention on the core business.

2. Strongly dependent core applications should have consistent SLOs

Think of them as segments of the same road: once any segment is blocked, traffic on the whole road is affected.

3. Across weak dependencies, core applications should apply service-governance measures such as degradation, circuit breaking, and rate limiting toward non-core applications

The aim is to limit the impact of non-core applications on core ones and protect the users' core experience.

4. As an error-budget strategy, core applications should share their error budget

If one core application's error budget is exhausted, the whole link, and with it the user experience, is inevitably affected; in principle, all changes on the link should stop and repair should take priority.
This can be relaxed according to the actual situation, e.g. when a core application still has ample budget and the core link's function is unaffected, but such decisions must be made very carefully.
Of course, after setting the SLOs we must validate them and gather the corresponding evidence, otherwise it is all just “entertaining ourselves”.
There are two main means of validation:

- capacity stress testing
- chaos engineering
Capacity stress testing
Capacity stress testing mainly verifies the Volume class of the SLO; typical capacity metrics are QPS and TPS.
We stress the system against these metrics to expose dependency problems and verify that the various service-governance measures actually work.
For example, simulate user requests and raise the concurrency until TPS reaches the value set in the SLO, then observe whether the service is affected and whether the existing rate-limiting and degradation policies take effect as expected.
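A toy sketch of the idea: drive concurrent calls at a service stub, then check the latency SLO still holds under load. `handle_request()` and every number here are made-up stand-ins, not a real load-test tool:

```python
import concurrent.futures
import random
import time

# Stand-in for the service under test -- simulated 1-5 ms latency.
def handle_request():
    time.sleep(random.uniform(0.001, 0.005))

def timed_call():
    t0 = time.perf_counter()
    handle_request()
    return time.perf_counter() - t0

def stress(concurrency, requests):
    """Fire `requests` calls with `concurrency` workers, return ~p90 latency."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(f.result() for f in
                           [pool.submit(timed_call) for _ in range(requests)])
    return latencies[int(0.9 * len(latencies))]

p90 = stress(concurrency=20, requests=200)
print(f"p90 under load: {p90 * 1000:.1f} ms, 80 ms SLO met: {p90 <= 0.080}")
```

A real stress test would of course drive actual traffic against the deployed service with a load-testing tool; the point is only that the pass/fail criterion is the SLO itself.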
Chaos engineering
Chaos engineering mainly simulates failure scenarios by actively injecting anomalies and faults into the production environment.
For example: cutting power to a data center to verify remote active-active setups, verifying the network under full traffic, filling the disks, or running the CPUs at full load.
Chaos engineering is a complex, systematic discipline. If the real impact of a simulated fault exceeds the estimate, you must also be able to isolate it quickly and restore normal business quickly.
That sentence alone should sound a little scary; it seems to work against stability itself.
So chaos engineering must be implemented very carefully.
The simulation strategy, including the original anomaly-recovery procedures, must be verified repeatedly; only after the impact is confirmed controllable, and a multi-party review or verification passes, should it be run in production.
Chaos engineering is therefore not something to attempt at the beginning of SRE adoption.
It belongs to the advanced stage, when the relatively basic and necessary pieces, service governance, capacity stress testing, distributed tracing, monitoring and alerting, operations automation, and so on, are already quite mature.
In essence, chaos engineering digs problems out of the unknown so the business can understand itself more clearly and protect itself better. In short: “grow through failure”.
When should validation be done?
We know we should validate, but when?
Google's advice, at its core: experiment when the error budget is ample, and avoid the periods when it is nearly spent.
Even in normal operation, meeting the SLO is not a simple matter, and validation must not itself put system stability at risk.
We also have to assess the impact of the fault simulation: will it hurt the company's revenue? Will it hurt the user experience?
If the impact on the business would be large, refine the granularity of the plan and proceed step by step to avoid unforeseeable losses.
In practice, choose the time window according to the business characteristics, budget for the recovery time, and prepare thoroughly; never be careless.
This article is a summary of my own understanding of SRE, written while studying Mr. Zhao Cheng's SRE practical manual.
I hope it is helpful to interested colleagues.