On the theoretical basis of SRE


What is SRE?

When I first came across SRE, many people thought of it as a full-stack role at Google, one that could solve many problems single-handedly.

After a closer look, I found that SRE does address many problems, but there are too many of them for a single role or a single person to solve quickly and efficiently.

For example: how to assess capacity, how to run fault drills, how to rate-limit a service, how to trip a circuit breaker on anomalies, and how to make monitoring alerts more effective.

Solving these problems clearly requires testing, development, operations, and other related roles to build the solution together. SRE is therefore best understood as a systematic methodology that guides this construction.

What is the goal of SRE?

Improve stability

The goal of building an SRE system is to “improve stability”.

In SRE, two indicators measure progress toward that goal:

Indicator | Meaning
MTBF (Mean Time Between Failures) | Mean time between failures
MTTR (Mean Time To Repair) | Mean time to repair

From these definitions, the two indicators map to the system's operating states as follows:

Indicator | System state
MTBF | The system is running normally
MTTR | The system has failed

Our intuitive understanding of stability is exactly this: keep the system running normally for as long as possible, and recover quickly when a fault occurs.

Therefore, “increase MTBF and reduce MTTR” becomes the concrete form of the goal of “improving stability”.

This also gives us a yardstick when building SRE: we can judge whether the work is effective by whether it increases MTBF and reduces MTTR.
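
As a worked illustration of the two indicators, the sketch below computes them from a list of incidents in one observation window. It assumes that MTBF is the total healthy time divided by the number of failures and that MTTR is the total downtime divided by the number of failures. The function name, the window, and the two sample incidents are made up for the example.

```python
from datetime import datetime, timedelta


def mtbf_mttr(window_start, window_end, incidents):
    """Return (MTBF, MTTR) as timedeltas for one observation window.

    incidents is a list of (fault_start, fault_end) pairs that do not overlap
    and lie inside the window (an assumption of this sketch).
    """
    if not incidents:
        raise ValueError("at least one incident is needed to compute the means")
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(incidents), downtime / len(incidents)


# Hypothetical data: a 10-day window with a 1-hour and a 3-hour incident.
start = datetime(2024, 1, 1)
end = start + timedelta(days=10)
incidents = [
    (start + timedelta(days=2), start + timedelta(days=2, hours=1)),
    (start + timedelta(days=6), start + timedelta(days=6, hours=3)),
]
mtbf, mttr = mtbf_mttr(start, end, incidents)
print("MTBF:", mtbf, "MTTR:", mttr)
```

With this sample data the healthy time is 236 hours over two failures, so MTBF is 118 hours and MTTR is 2 hours. Longer healthy stretches raise the first number, and faster recovery lowers the second.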

Breaking down the goal

With this goal in place, a question arises: are MTBF and MTTR too coarse? Even if the timing data for both indicators can be collected from alerts, notifications, or other sources, it is still unclear where to start improving.

In fact, MTTR can be subdivided into four indicators, corresponding to the four stages of a system failure:

Indicator | Meaning | Stage
MTTI (Mean Time To Identify) | Mean time to identify a fault | Fault discovery: from the fault occurring to the first response
MTTK (Mean Time To Know) | Mean time to know the cause | Fault location: until the root cause, or the scope of the root cause, is located
MTTF (Mean Time To Fix) | Mean time to fix | Fault recovery: taking measures to restore the business
MTTV (Mean Time To Verify) | Mean time to verify the fix | Recovery verification: confirming that the service has recovered after the fix
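
To make the four stages concrete, here is a minimal sketch that derives the per-stage means from incident timestamps. It assumes each incident record stores five events and that the four stages follow one another without gaps, so their means add up to MTTR. The event names (`occurred`, `identified`, `located`, `fixed`, `verified`), the helper names, and the sample records are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

# Each stage is the time between two consecutive incident events.
# The event names are hypothetical; use whatever your incident records store.
STAGES = [
    ("MTTI", "occurred", "identified"),
    ("MTTK", "identified", "located"),
    ("MTTF", "located", "fixed"),
    ("MTTV", "fixed", "verified"),
]


def stage_means(incidents):
    """Return the mean duration in minutes of each stage, plus their sum as MTTR."""
    result = {}
    for name, begin, finish in STAGES:
        result[name] = mean(
            (incident[finish] - incident[begin]).total_seconds() / 60
            for incident in incidents
        )
    result["MTTR"] = sum(result[name] for name, _, _ in STAGES)
    return result


def incident_at(t0, *offsets_min):
    """Build one incident record from the minute offsets of its five events."""
    events = ("occurred", "identified", "located", "fixed", "verified")
    return {e: t0 + timedelta(minutes=m) for e, m in zip(events, offsets_min)}


t0 = datetime(2024, 1, 1, 10, 0)
records = [incident_at(t0, 0, 4, 14, 34, 39), incident_at(t0, 0, 6, 26, 36, 41)]
means = stage_means(records)
print(means)
```

For the two sample incidents, the stage means are 5, 15, 15, and 5 minutes, which add up to an MTTR of 40 minutes. Seeing the stages side by side shows which part of the recovery is worth improving first.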

MTBF can likewise be subdivided into two stages:

Stage | Meaning
Pre-MTBF | Fault prevention
Post-MTBF | Fault improvement

With this breakdown into stages, the work can be targeted. For example, Mr. Zhao Cheng's SRE stability assurance plan assigns work to each stage as follows:

Stage | Work items
Pre-MTBF (fault prevention) | Fault drills, capacity assessment, continuous delivery, automation, architecture design
MTTI (fault discovery) | AIOps, public opinion monitoring, monitoring and alerting
MTTK (fault location) | Log analysis, link tracing, root cause location
MTTF & MTTV (fault recovery) | Disaster recovery switchover, service degradation, service rate limiting, circuit breaking on anomalies
Post-MTBF (fault improvement) | Post-incident review, acceptance of improvements, fault simulation, chaos engineering, capacity stress testing

On the process side, each stage has a corresponding mechanism:

Stage | Mechanism
Pre-MTBF (fault prevention) | Drills and on-call
MTTI (fault discovery) | Emergency response
MTTK (fault location) | Emergency response
MTTF & MTTV (fault recovery) | Emergency response
Post-MTBF (fault improvement) | Post-incident review and improvement, on-call

With such a clear division into stages, the work for each phase is well defined and can be done in a targeted way, with little risk of heading in the wrong direction.

For example, in Pre-MTBF we can invest in architecture design and provide design-for-failure service governance measures such as rate limiting, degradation, and circuit breaking, which create the conditions for rapid fault isolation.

In Post-MTBF we need to run thorough post-incident reviews, summarize the shortcomings, and follow through on improvement measures.

AIOps capabilities can also be used here to improve the efficiency of problem location and the accuracy of alerts, reducing MTTI and MTTK.

The basis of fault identification: appropriate SLI, corresponding SLO

We know the goal is to increase MTBF and reduce MTTR, and both are measured around “faults”. But when exactly does the system count as faulty?

We therefore need a criterion that defines whether a fault has occurred.

Readers familiar with monitoring systems will know that an alert may indicate a fault. I say “may” because in real-world scenarios an alert does not always mean one.

The SRE system offers a better and more accurate standard for deciding whether a fault has occurred: the SLO (Service Level Objective).

Before discussing the SLO, we first have to introduce the related SLI (Service Level Indicator).

What is SLI

Anyone who has built a monitoring system knows that a monitored target produces a large number of indicators, and that many of them are of little use.

The indicators that stand out from the rest by satisfying the following two principles, the ones that actually pull their weight, are the SLIs:

  1. It can identify whether the target object is stable
  2. It is strongly related to user experience, or can be perceived by users

An SLI can therefore express whether the target object is stable or unstable.

VALET selection method

When selecting SLIs, many people may still be unsure, since principles alone are hard to apply.

The industry is aware of this problem, so there is also the VALET method, which classifies and screens indicators by their characteristics. The categories are as follows:

Category | Explanation
Volume | The maximum capacity the service commits to, for example QPS, TPS, number of sessions, throughput, number of connections
Availability | Whether the service is working normally, for example the proportion of requests that return a non-5xx status, or the success rate of task executions
Latency | Whether responses are fast enough. Latency is distributed over a range of values, so targets should be specified for different percentile intervals
Errors | How high the error rate is, for example 5xx, 4xx, or custom error codes
Tickets | Whether manual intervention is needed, for example recovering a failed task by hand

With these categories we can quickly pick out SLIs, which is very practical in real use.

Manual work is unavoidable, however, both when screening existing indicators and when screening indicators added in the future.

What is SLO

Since an SLI can express whether the target object is stable, combining SLI + target + time dimension expresses the current state of stability more precisely.

For example: within one hour, 90% of requests have a latency of 80ms or less.

That is an SLO.

If, in the example above, the latency of the fastest 90% of requests rises above 80ms, the SLO is not met, and there may be a fault.

However, if we use a single SLO as the criterion for “stability”, we fall into the same alert storms and false alarms that plague the monitoring field.

In practice, when we judge whether a business is stable, we usually rely on several different criteria.

We can therefore combine multiple SLOs with an AND operation to express service stability more accurately.

The formula is: Availability = SLO1 & SLO2 & SLO3

In other words, the service counts as meeting the standard only when every SLO is met.

In short, SLOs make the expression of service stability more accurate and reliable.
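
The AND combination can be sketched in a few lines. This is only an illustration: the function names, the sample latencies, and the sample status codes are made up, and the three SLOs mirror the examples used in this article (availability by non-5xx responses, and two latency percentiles).

```python
def latency_slo_met(latencies_ms, percent, threshold_ms):
    """True if at least `percent` % of requests finished within threshold_ms."""
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast * 100 >= percent * len(latencies_ms)


def availability_slo_met(status_codes, percent):
    """True if at least `percent` % of requests returned a non-5xx status."""
    ok = sum(1 for code in status_codes if code < 500)
    return ok * 100 >= percent * len(status_codes)


def service_meets_standard(slo_results):
    """Availability = SLO1 & SLO2 & SLO3: every single SLO must be met."""
    return all(slo_results)


# Hypothetical measurements for one hour: ten requests, one of them very slow.
latencies = [20, 30, 35, 40, 45, 50, 60, 70, 75, 300]
statuses = [200] * 10

results = [
    availability_slo_met(statuses, 99.95),  # met: no 5xx responses
    latency_slo_met(latencies, 90, 80),     # met: 9 of 10 are within 80ms
    latency_slo_met(latencies, 99, 200),    # missed: only 90% are within 200ms
]
print(results, service_meets_standard(results))
```

Here two of the three SLOs are met, but the combined result is still a miss, which is exactly the strictness the AND operation is meant to provide.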

On time dimension

The time dimension of an SLO can be expressed either as a duration or as a cycle (statistical period), which covers the following two approaches:

  1. Time dimension: assessed from the perspective of fault duration
  2. Request dimension: assessed from the percentage of successful requests

Time dimension: assessment from the fault perspective

Here an SLO is judged abnormal according to how long the SLI has stayed on the wrong side of its threshold.

For example: the per-minute request success rate is below 95% and stays there for 10 consecutive minutes.

This approach is coarse in terms of time granularity, however.

For example, the per-minute success rate may drop below 95% without the drop lasting 10 minutes. Such cases are still anomalies that deserve attention, so the request dimension is used as a supplement.

Request dimension: from the percentage of successful requests

Here an SLO is judged abnormal according to whether the SLI falls below its threshold over the whole statistical cycle.

For example: the request success rate over one day is below 95%.

This approach makes up for the gaps in the time dimension, and the two are usually used together.
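
The difference between the two approaches is easiest to see on the same data. The sketch below is an assumption-laden illustration: the function names are invented, the thresholds are the 95% and 10 minutes used in the examples above, and the traffic (1,000 requests per minute for one hour) is made up.

```python
def time_dimension_abnormal(per_minute_success_percent, threshold=95, duration=10):
    """True if the success rate stays below threshold for `duration` minutes in a row."""
    run = 0
    for percent in per_minute_success_percent:
        run = run + 1 if percent < threshold else 0
        if run >= duration:
            return True
    return False


def request_dimension_abnormal(success_count, total_count, threshold=95):
    """True if the success rate over the whole cycle is below threshold."""
    return success_count * 100 < threshold * total_count


# Hypothetical hour: 1,000 requests per minute, with a 5-minute dip to 50% success.
short_dip = [99] * 20 + [50] * 5 + [99] * 35
successes = sum(percent * 10 for percent in short_dip)  # percent of 1,000 requests
total = 1000 * len(short_dip)

print(time_dimension_abnormal(short_dip))            # the dip is shorter than 10 minutes
print(request_dimension_abnormal(successes, total))  # the hour as a whole is below 95%
```

The five-minute dip is too short to trip the time-dimension rule, yet it pulls the success rate for the whole hour under 95%, so the request-dimension rule catches it.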

About SLO and availability

Availability is usually described as a number of nines, such as three nines or four nines.

However, availability figures have long been criticized for being inaccurate. Expressing availability as a combination of SLOs makes the figure far more trustworthy,

because it is built on SLIs that express whether the target object is stable, plus targets tuned to the business's characteristics, plus time windows tuned to the business.

Through continuous adjustment and optimization, the availability figure becomes more accurate and moves closer to how the business actually performs.

About SLO and fault

As described above, an SLO expresses whether stability meets the standard, so alerting on an SLO is an effective way to tell whether the system is in a fault state.

Alerting on a combination of several SLOs is more reliable still,

because it both converges alerts and makes them more accurate, which avoids the “cry wolf” effect.

Building on this, the next section introduces the error budget, a quantity derived from the SLO that makes this advantage even more useful.

Quantitative data for guiding work: error budget

Once SLOs are set, how do we use them in day-to-day work? That is not very intuitive.

So we need a quantifiable figure that can be used for alerting and for observing the SLO.

In SRE, such a figure, the error budget, is derived backwards from the SLO.

What is the error budget

The error budget can be understood as a reminder of how many more mistakes you can still afford.

For example, take four weeks as one cycle, during which the application receives 4,653,680 requests. Working backwards from the SLOs below gives the following error budgets:

SLO | Error budget
99.95% availability | 2,326
90% latency <= 80ms | 465,368
99% latency <= 200ms | 46,536

Each budget is the total number of requests multiplied by the share that the SLO allows to fail, for example 4,653,680 × (1 − 99.95%) ≈ 2,326.

Expressed this way, as a count that works like a points balance, the data is more intuitive and has more impact.

We can therefore use the error budget to normalize the data and better drive the stability goal.
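
The derivation is simple arithmetic, and a short sketch makes it reproducible. The function name is made up, and passing the SLO as a string is a choice of this sketch so that `Fraction` keeps the arithmetic exact (with floats, `1 - 0.9` is slightly less than 0.1 and a request would be lost to rounding).

```python
from fractions import Fraction
from math import floor


def error_budget(total_requests, slo_percent):
    """Number of requests allowed to miss the SLO in one cycle.

    slo_percent is a string such as "99.95" so that Fraction stays exact.
    """
    allowed_failure = 1 - Fraction(slo_percent) / 100
    return floor(total_requests * allowed_failure)


total = 4653680  # requests in the four-week cycle from the example above
budgets = {slo: error_budget(total, slo) for slo in ("99.95", "90", "99")}
print(budgets)
```

For 4,653,680 requests this yields 2,326 for the 99.95% SLO, 465,368 for the 90% SLO, and 46,536 for the 99% SLO.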

Using error budget data

Stability burn-down chart

Display the error budget as a remaining balance in a bar chart, updated in real time. A cycle has to be defined, for example four calendar weeks, and the budget is restored at the start of each new cycle.

For special scenarios the error budget can be increased to a reasonable extent, but this should be decided case by case.

Fault grading

Once the error budget is normalized into a count, the percentage consumed by a single incident can determine the fault level. All SLOs can then share one grading rule, which gives a unified standard.

Fault levels are generally divided into five levels (P0 to P4), with P0 the most severe.

A common setting is as follows:

Budget consumed by a single incident | Fault level
proportion <= 5% | P4
5% < proportion <= 20% | P3
20% < proportion <= 30% | P2
30% < proportion <= 50% | P1
50% < proportion | P0

For example, if the error budget is 25,000 and a problem produces just over 5,000 failed requests, it has consumed more than 20% of the budget and is graded P2, and so on.

The exact thresholds should be set according to the business situation and its tolerance.
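
The grading table translates directly into a lookup. The sketch below uses the thresholds from the table above; the function name is invented, and the comparison is done on integers so that the boundary cases (exactly 5%, 20%, 30%, 50%) are graded as the table states.

```python
# (upper bound in percent, level), taken from the grading table above.
LEVELS = ((5, "P4"), (20, "P3"), (30, "P2"), (50, "P1"))


def fault_level(errors_in_incident, error_budget):
    """Grade one incident by the share of the error budget it consumed."""
    for upper_bound, level in LEVELS:
        # Same test as "consumed percent <= upper_bound", without division.
        if errors_in_incident * 100 <= upper_bound * error_budget:
            return level
    return "P0"


print(fault_level(5001, 25000))   # just over 20% of the budget
print(fault_level(5000, 25000))   # exactly 20% of the budget
print(fault_level(12501, 25000))  # just over 50% of the budget
```

Because the input is a share of the budget rather than a raw error count, the same function can grade incidents for any SLO.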

Stability consensus mechanism

Most of us are familiar with the penalty-point system for driving licences. When you have only one point left, you drive very carefully to avoid a violation that would mean retraining or losing the licence.

The error budget works the same way. Once little of it is left, you become vigilant and agree on concrete measures so that the SLO, and with it the stability target, is not missed.

How should these measures be defined? Two principles are worth considering.

1. When the remaining budget is sufficient, or none has been consumed, problems should be tolerated

In day-to-day operation we run into network jitter or momentary device switchovers that make the system unstable for a very short time. A few customers notice it while using the business, or report it, and the complaint becomes “the business is unstable”. Technical staff then drop what they are doing to investigate, and afterwards spend a lot of time on summaries and reports.

This consumes a great deal of time and energy, and the findings are of little help to the business. Meanwhile the engineers' own work is left unfinished, and the time of everyone who assisted is wasted too.

Overall the cost-effectiveness is low, and the effect spreads like ripples. If this happens often, the ripples can add up to a “tsunami”.

With SLOs and the error budget as criteria, the response is clear: if the budget is sufficient, the problem should be tolerated, should not trigger complaints, and should not get a high-priority response.

2. When the remaining budget is being consumed too fast or is nearly exhausted, SRE has the right to suspend and reject any online change

This situation is like an engineer who insists on working while ill. The work he produces is unsatisfactory, and there is a risk that he collapses outright.

Would you have the heart to assign him new tasks, or to let him keep working in that state?

He should first recover, so that he can go on to do good work.

The analogy shows that the team should give priority to the problems that affect stability until they are solved, and return to the normal pace of change once the next cycle brings a new error budget.
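
The two principles can be written down as a small decision rule. The article does not define what “nearly exhausted” or “too fast” mean in numbers, so both thresholds below are assumptions chosen only for illustration: `low_water_percent` (freeze when 10% or less of the budget remains) and `pace_factor` (freeze when consumption runs at more than twice the pace of the cycle). The function name is hypothetical as well.

```python
def change_policy(budget, consumed, elapsed_percent,
                  low_water_percent=10, pace_factor=2):
    """Return "tolerate" or "freeze changes" for the current cycle.

    budget and consumed are error counts; elapsed_percent is how much of the
    cycle has passed. Both thresholds are illustrative assumptions.
    """
    remaining = budget - consumed
    nearly_exhausted = remaining * 100 <= low_water_percent * budget
    # Too fast: the consumed share exceeds pace_factor times the elapsed share.
    too_fast = consumed * 100 > pace_factor * elapsed_percent * budget
    if nearly_exhausted or too_fast:
        return "freeze changes"
    return "tolerate"


print(change_policy(25000, 1000, 50))   # 4% used halfway through the cycle
print(change_policy(25000, 23000, 90))  # only 8% of the budget remains
print(change_policy(25000, 10000, 10))  # 40% used after 10% of the cycle
```

Whatever thresholds a team chooses, the value of such a rule is that it is agreed in advance, so the decision to tolerate or to freeze does not have to be argued during an incident.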

Key points

Both principles need to be accepted and followed by everyone involved. Because many parties must cooperate, only a shared consensus keeps that cooperation smooth and efficient.

For the same reason, the mechanism has to be introduced top-down, with backing at the level of a technology VP or the CTO.

When disputes arise, they can be escalated step by step, with the final decision made at the CTO level.

Alerting based on the error budget

In the past we often received large numbers of alerts of very little value. The result was a “cry wolf” effect, and people stopped trusting alerts.

The consequences are serious: useful information is likely to be drowned out, the business suffers losses, and several parties end up being held responsible.

The industry does have a remedy, known as “alert convergence”.

The common approach is to merge identical or similar alerts before sending them to the recipient, for example the same exception from the same cluster.

But the resulting notification is still crammed with information and does not help locate and handle the problem quickly. Why?

Because the alerts are merely merged, the amount of information stays the same, unless it is refined and computed with other techniques, such as an alert decision tree, which gives more precise results.

Building that is not cheap, however: it involves designing convergence rules, designing the logical hierarchy of objects, implementing the decision logic, and so on.

Alerting on the error budget achieves convergence naturally, because it is based on the service's SLOs.

It also means we focus only on alerts that affect stability. Such alerts must be answered quickly, and there are not many of them.

They are also very accurate.

The simple approach is to alert on the fault-grading thresholds. More refined and accurate methods reach into the AIOps field.

Google's several alerting algorithms based on SLOs and error budgets are a good reference.
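
One of the approaches described in Google's SRE material is to alert on the burn rate, that is, how many times faster than “exactly on budget” the error budget is being spent, and to require a long and a short window to agree before paging. The sketch below is a simplified illustration of that idea, not Google's implementation. The function names and the sample error ratios are made up, and the threshold of 14.4 is just one possible value: it is the rate at which 2% of a 30-day budget would be spent in one hour (0.02 × 720 hours).

```python
def burn_rate(error_ratio, slo_target):
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1 - slo_target)


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only if both windows burn the budget faster than the threshold.

    The long window shows that the burn is significant; the short window
    shows that it is still happening, so the alert resets soon after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


slo = 0.999  # hypothetical availability target
print(should_page(0.02, 0.03, slo))   # both windows burn fast
print(should_page(0.02, 0.001, slo))  # the problem has already stopped
print(should_page(0.005, 0.03, slo))  # a brief spike, not yet significant
```

Because the rule is expressed relative to the SLO, the same threshold works for services with very different traffic volumes and targets.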

How to measure the effectiveness of SLO

We have set SLOs, but do they really reflect the stability of the business, and can the error budget derived from them really guide the work?

That still has to be validated, and the SLOs continually optimized.

To do so, we classify situations along three dimensions and handle them with three strategies.

What are the three dimensions?

We can evaluate along these three dimensions:

Dimension | State
SLO achievement | Met or missed
Manual effort | High or low
Customer satisfaction | High or low

The three dimensions give eight possible combinations.

What are the three strategies?

We can respond with the following three strategies:

  1. Tighten the SLO
    User satisfaction is low even though the SLO is met. The SLO should be tightened, with a stricter target, and adjusted step by step until it reflects the real situation.
  2. Relax the SLO
    User satisfaction is high even though the SLO is missed. The SLO can be loosened, with a more lenient target, and the number of releases increased to speed up business growth.
  3. Continuous optimization
    Analyze the cause of the problem case by case and optimize accordingly.
    For example, when all three dimensions meet expectations, increase the number of iterations to improve delivery efficiency.
    When none of the three dimensions meets expectations, analyze the business characteristics, keep adjusting the SLO, and implement improvement measures.

Coping strategies

Putting the specific situations together gives the following response table:

SLO achievement | Manual effort | Customer satisfaction | Strategy
Met | Low | High | Continuous optimization: the user experience is good, so either relax the release and deployment process to increase velocity, or put further planning on hold and focus on services that need more reliability work
Met | Low | Low | Tighten the SLO
Met | High | High | Continuous optimization: if alerts are producing false positives, reduce alert sensitivity. Otherwise temporarily relax the SLO or reduce manual effort, fix the product, and improve its self-healing ability
Met | High | Low | Tighten the SLO
Missed | Low | High | Relax the SLO
Missed | Low | Low | Continuous optimization: the alerts are not sensitive enough, so increase alert sensitivity
Missed | High | High | Relax the SLO
Missed | High | Low | Continuous optimization: reduce manual effort, fix the product, and improve its self-healing ability
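
The response table is small enough to encode as a lookup, which is handy when the review is run regularly for many services. This is only a sketch: the dictionary keys, the function name, and the shortened strategy labels are choices made for the example, and the order of each key is (SLO achievement, manual effort, customer satisfaction).

```python
from itertools import product

# (SLO achievement, manual effort, customer satisfaction) -> strategy
STRATEGY = {
    ("met", "low", "high"): "optimize: speed up releases or shift focus",
    ("met", "low", "low"): "tighten SLO",
    ("met", "high", "high"): "optimize: reduce false alerts or relax SLO temporarily",
    ("met", "high", "low"): "tighten SLO",
    ("missed", "low", "high"): "relax SLO",
    ("missed", "low", "low"): "optimize: increase alert sensitivity",
    ("missed", "high", "high"): "relax SLO",
    ("missed", "high", "low"): "optimize: reduce manual effort and fix the product",
}


def strategy(slo, manual_effort, satisfaction):
    """Look up the response for one combination of the three dimensions."""
    return STRATEGY[(slo, manual_effort, satisfaction)]


# Every one of the eight combinations has an entry.
all_cases = list(product(("met", "missed"), ("low", "high"), ("low", "high")))
covered = all(case in STRATEGY for case in all_cases)
print(strategy("met", "low", "low"), covered)
```

Note that in this table the choice between tightening and relaxing depends only on whether the SLO and customer satisfaction disagree; manual effort only changes the advice when the two agree.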

How to put SLO into practice?

So much for the benefits of SLOs. How do we actually start putting them into practice?

We touched on this earlier; this section goes through it in detail.

Finding the core link

SRE practice exists to serve the business, so we should start by analyzing the business and identifying its core.

A business has many applications, but only some of them clearly create core value. The end-to-end link that carries this core value can be identified from the perspective of user access, business performance, and business characteristics.

Our guideline is therefore to separate core from non-core applications from the business's point of view, and from there to map out the core link.

There is in fact no good automated way to do this. Because the work is so close to the user, apart from inference with machine learning there seems to be no good solution.

A lot of manual work is therefore involved: mapping the architecture, talking to the business owners, mapping the technology stack, and so on.

But the effort is worth it, because it yields a comprehensive understanding of the whole business and makes later work easier.

Sort out the relationship between applications

Once the core link is mapped, core and non-core applications fall out of it, since the core link is made up of core applications.

Direct dependencies between applications are either strong or weak. The combinations are classified as follows:

Application role | Application role | Dependency strength
Core | Core | Strong
Core | Non-core | Weak
Non-core | Non-core | Weak

With the relationships sorted out, we can divide and conquer and set an SLO for each application.

Set application SLOs

There are four principles for setting an application's SLO.

1. SLOs for core applications should be stricter; those for non-core applications can be relaxed

This keeps our attention on the core business.

2. Core applications with strong dependencies between them should have consistent SLOs

They can be thought of as sections of the same road: once one section is blocked, traffic on the whole road is affected.

3. Where the dependency is weak, core applications should apply service governance measures to non-core applications, such as degradation, circuit breaking, and rate limiting

The main purpose is to reduce the impact of non-core applications on core ones and protect users' interests as far as possible.

4. Error budget policy: core applications should share their error budget

If one core application's error budget is used up, the whole link is inevitably affected, and with it the user experience. In principle, all changes on the link should stop and priority should go to repair.
This can be relaxed according to the actual situation, for example when a core application still has sufficient budget and the core link's function is not affected. Such a decision must be made very carefully.
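
Principles 2 and 4 lend themselves to simple automated checks. The sketch below is an illustration only: the application names (`order`, `payment`, `recommend`), the SLO values, and both function names are hypothetical, and the gate implements the strict form of principle 4 without the case-by-case relaxation.

```python
def inconsistent_core_slos(slo_by_app, strong_dependencies):
    """Return the strongly dependent core pairs whose SLO targets differ."""
    return [(a, b) for a, b in strong_dependencies
            if slo_by_app[a] != slo_by_app[b]]


def link_changes_allowed(remaining_budget_by_app, core_apps):
    """Shared budget: one exhausted core application freezes the whole link."""
    return all(remaining_budget_by_app[app] > 0 for app in core_apps)


# Hypothetical core link: order -> payment, with recommend as a non-core app.
slo_by_app = {"order": "99.95", "payment": "99.9", "recommend": "99"}
core_apps = ["order", "payment"]
strong_dependencies = [("order", "payment")]
remaining = {"order": 1200, "payment": 0, "recommend": 300}

mismatches = inconsistent_core_slos(slo_by_app, strong_dependencies)
allowed = link_changes_allowed(remaining, core_apps)
print(mismatches, allowed)
```

In this example the check flags the order and payment pair because their targets differ, and the gate blocks changes on the link because the payment application has no budget left.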


Once the SLOs are set, we need to verify them and back them with evidence. Otherwise the whole exercise is just self-amusement.

There are two ways to do this:

  1. Capacity stress testing
  2. Chaos engineering

Capacity stress testing

Capacity stress testing mainly verifies the Volume-type SLOs. Typical capacity indicators are QPS and TPS.

We therefore stress-test capacity against these indicators, which exposes dependency problems and shows whether the various service governance measures actually work.

For example, simulate user requests to raise concurrent TPS to the value set in the SLO, then observe whether the service is affected and whether the existing rate-limiting and degradation policies take effect as expected.

Chaos Engineering

The main purpose of chaos engineering is to simulate failure scenarios by deliberately generating anomalies and faults in the online environment.

For example: cutting power to a data center to verify remote active-active deployment, saturating the network with traffic, filling disks, or driving CPUs to full load.

Chaos engineering is a complex, systematic undertaking. If the real impact of a simulated fault exceeds the estimate, we must still be able to isolate it quickly and restore normal business quickly.

Does that sound a little frightening? It seems to run against the very idea of stability.

That is why chaos engineering must be carried out with great care.

The simulation plan has to be verified repeatedly, including the recovery procedure for the injected fault. Only after the impact is confirmed to be controllable, and after review or verification by all parties involved, can it be run online.

So chaos engineering is not something to attempt at the start of an SRE effort.

It belongs to the advanced stage, when the more basic and necessary parts, such as service governance, capacity stress testing, link tracing, monitoring and alerting, and operations automation, are already mature.

In essence, chaos engineering digs problems out of the unknown, so that the business understands itself better and can protect itself. In short, it is about “growing through failure”.

When will the verification be done?

We know verification is needed, but when should it happen?

Google's suggestion, in essence, is to experiment when the error budget is sufficient and to avoid periods when it is running low.

Meeting SLOs under normal business conditions is hard enough, so verification must not add risk to system stability.

We also have to assess the impact of a fault simulation: will it hurt the company's revenue? Will it harm the user experience?

If the business impact is significant, the plan should be broken into finer-grained steps and carried out gradually, to avoid unpredictable losses.

In practice, therefore, choose the time window according to the business characteristics, take the recovery time into account, and prepare thoroughly. This is no place for carelessness.


This article summarizes my understanding of SRE after studying Mr. Zhao Cheng's SRE practical manual.
I hope it will be helpful to all interested colleagues.
