What is SRE? What capabilities does SRE need?


The term SRE should be familiar to everyone; it is talked about all over the industry. But what kind of role is an SRE, and what is an SRE responsible for? Let's clear up those doubts today.

SRE (Site Reliability Engineering) was first proposed by Google. Broadly, it means driving operations through standardization, automation, and scalability, and solving operations problems with software engineering. The fundamental problem the role was created to solve is the business instability caused by rapid iteration from traditional development teams, keeping service quality and stability in balance with the pace of business change.

SRE is positioned differently at different companies; at some, ordinary operations positions are simply labeled SRE, so no single definition fits all. Domestic SREs are mostly distinguished by area of responsibility: SREs for networking, for DBA work, for business services, for security, and so on. By Google's definition, they are all engineers responsible for stable service quality, but the demands placed on an SRE are stringent. Here is my personal understanding:

  • First: broad skills, covering networking, operating systems, monitoring, CI/CD, software development, and so on. You need not be an expert developer, but you should be able to design, build, and iterate on a feature in at least one language.

  • Second: break out of the traditional operations mindset, think from the product's perspective across the whole business architecture, and improve communication and coordination skills while keeping service quality the first priority.

  • Third: always treat software engineering as the way to solve problems and plan work.

  • Fourth: strong troubleshooting, analytical, and abstraction abilities. These three abilities are crucial in SRE work and are ultimately the product of time and accumulated practice.

To sum up, SRE in China can currently be divided roughly into two levels: platform SRE and business SRE. The former maintains the service quality of the platform infrastructure; the latter maintains the stability of business services and is closer to traditional business operations.

The above definitions really only fit large companies. Where SRE culture has not yet taken hold, the SRE position is vaguer: it may effectively be an operations-development engineer, and the problems to solve are broad and varied. When promoting SRE culture and building an SRE code of practice, we need a thorough understanding of the current business and technical architecture, and we must design, plan, and adjust the infrastructure sensibly so that the culture can actually be put into practice.

The following is compiled from material on the Internet. Since the scope of SRE work was first proposed and practiced by Google, Google's scope is listed below. One core point deserves emphasis: the stability of the infrastructure determines how effectively SREs can work, so infrastructure factors must also be taken into account when defining SRE culture and responsibilities.

The following are the key areas mentioned in Site Reliability Engineering: How Google Runs Production Systems:

  • Observability system

  • Fault response

  • Testing and deployment

  • Capacity planning

  • Automated tool development

  • User support

  • Oncall

  • Develop deliverable SLI / SLO / SLA

  • Fault recovery

Observability system

In any enterprise of a certain scale, once the SRE operating model is adopted, building an observability system becomes especially important. An observability system is usually divided into three parts:

  • Metrics: monitoring of various indicators, such as basic resource metrics, service performance metrics, and business call metrics.

  • Logs: collection and monitoring of the operating logs of devices and services.

  • Tracing: call-chain analysis at the business level, which helps business, development, and operations staff quickly identify bottlenecks in the overall call path of a distributed system.
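
Of these three, metrics are the most structured: they are typically collected as labeled time series. As a purely illustrative sketch (the `Counter` class and metric name below are hypothetical, not from any specific monitoring library), a minimal labeled counter might look like this:

```python
from collections import defaultdict

class Counter:
    """A minimal labeled counter, loosely in the spirit of metrics
    client libraries: one value per unique label combination."""
    def __init__(self, name):
        self.name = name
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Sort labels so {"a": 1, "b": 2} and {"b": 2, "a": 1} hit the same series.
        self._values[tuple(sorted(labels.items()))] += amount

    def get(self, **labels):
        return self._values[tuple(sorted(labels.items()))]

requests = Counter("http_requests_total")
requests.inc(path="/api", status="200")
requests.inc(path="/api", status="500")
requests.inc(path="/api", status="200")
```

A real deployment would scrape such counters periodically and store them as time series; error-status series like the one above feed directly into alerting and SLI calculations.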

A complete observability system gives you insight into the system, letting you track its health, availability, and what is happening inside it.

Two points deserve attention when building an observability system:

  • Determine what the quality standard is, and ensure the system continuously approaches or stays within it

  • Pay systematic attention to this work, rather than looking at the system only occasionally

In an enterprise observability system, I believe at least the following capabilities should be included:

  • Complete metric collection: the ability to connect to the monitoring metrics of most devices and technology stacks in the enterprise; built-in metric models for common devices, so that new devices and metrics can be onboarded quickly instead of being built from scratch; and support for log data collection.

  • Massive device support: the number and scale of enterprise IT systems keep growing, so the monitoring system must handle far more devices than before.

  • Monitoring data storage and analysis: monitoring data is the foundation of operations analysis, automation, and intelligence, so storing massive volumes of monitoring data and visualizing and analyzing it are basic capabilities of any monitoring system.

The observability system is the foundation of the whole operations system and must supply the data that supports it.

An enterprise-grade observability system should therefore be a platform: on the one hand, more operational metrics can be onboarded through configuration or development; on the other, it can integrate with more specialized operations tools, consolidating diverse operational data and serving it to more scenarios. Overall, the observability system provides the data foundation for enterprise operations, letting us base incident response and capacity prediction on data rather than on past experience and gut feeling.

Fault response

If something breaks, how do we alert everyone and respond? Tooling helps here, because it lets us define alerting rules that notify humans.

Fault response builds on the data produced by the observability system and, with the help of feedback loops, strengthens our monitoring of the service.

Fault response usually includes the following actions:

  • Attention: notice bottlenecks or anomalies, whether we discover them proactively or they are exposed passively by the observability system

  • Communication: promptly notify the relevant parties of the observed risks, including the blast radius and the proposed remediation

  • Recovery: once the stakeholders agree, fix the risks and anomalies according to the remediation plan

Note that if the observability system has been built well in advance, a fault should usually first surface as a simple alert message or an alert phone call. An observability system on its own, however good, only supports after-the-fact tracing and troubleshooting; it cannot by itself surface problems in time. For that, the observed data must be continuously evaluated against alerting rules, so that the relevant people are notified promptly and risks are exposed.

Alerting is only the first link in fault response; it solves the problem of how to discover the fault. Most fault-response work is about defining handling procedures and providing training, so that people know what to do when an alert arrives. This is largely the distillation of past incidents and operational experience, including turning that experience into abstractions and tools, to make fault response efficient and generalizable (i.e., not dependent on any individual's experience).

For the alerting system as a whole, what must be guaranteed is alert effectiveness; otherwise the whole system is likely to degenerate into a garbage-data generator. Effectiveness means meeting two requirements:

  • Alert timeliness: when the system has a problem, operators must be informed through alerts in time to handle them;

  • Alert accuracy: whenever there is an alert, there must actually be a problem (many enterprises generate large numbers of useless alerts about disks, memory, and the like; how much noise there is depends on automation, the shape of the business, and the alert thresholds);

In day-to-day operations we often find large volumes of irrelevant alerts, drowning operators' attention in a sea of noise. Leaders outside the operations domain usually judge the team by how well it responds to alerts. Suppressing and eliminating ineffective alerts, and preventing operators from being swallowed by alert storms, is therefore a key part of alert management.

Generally, once the observability systems are in place, we can consolidate the various monitoring data into one platform and apply techniques such as trend prediction, short-cycle anomaly detection, suppression of intermittently recovering (flapping) alerts, baseline comparison, and duplicate compression, converging alerts and strengthening their effectiveness.
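
To make duplicate compression concrete, here is a small sketch (the field names and window length are invented for illustration) that collapses repeats of the same alert arriving within a time window into a single alert carrying a repeat count:

```python
def compress_alerts(alerts, window_seconds=300):
    """Collapse repeats of the same (host, check) alert that arrive within
    `window_seconds` of the group's first occurrence. `alerts` is a list of
    dicts with keys host, check, ts (epoch seconds), sorted by ts."""
    compressed = []
    open_group = {}  # (host, check) -> index into `compressed`
    for alert in alerts:
        key = (alert["host"], alert["check"])
        idx = open_group.get(key)
        if idx is not None and alert["ts"] - compressed[idx]["first_ts"] <= window_seconds:
            compressed[idx]["count"] += 1  # duplicate: just bump the count
        else:
            open_group[key] = len(compressed)
            compressed.append({"host": alert["host"], "check": alert["check"],
                               "first_ts": alert["ts"], "count": 1})
    return compressed

alerts = [
    {"host": "web1", "check": "disk", "ts": 0},
    {"host": "web1", "check": "disk", "ts": 60},
    {"host": "db1", "check": "cpu", "ts": 90},
    {"host": "web1", "check": "disk", "ts": 400},  # outside the window: new group
]
```

Real alert managers apply the same idea per grouping key, alongside the other convergence techniques listed above.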



At the same time, for front-line operators, we should model and analyze the multiple monitoring indicators of a system or device together, summarize them into a health score, and give front-line operators a health-based, tiered evaluation of each system, so that its operating state is reflected truly and intuitively and problems can be located quickly.

For example, the overall utilization of a basic resource can be evaluated by a weighted combination of several of its indicators, and an application's overall health can be scored by jointly modeling the utilization of all the resources the application depends on and the application's operations architecture.
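
A health score of this kind boils down to a weighted average of normalized indicators. A minimal sketch, with invented indicator names and weights:

```python
def health_score(indicators, weights):
    """Combine normalized indicator health values (1.0 = fully healthy,
    0.0 = failing) into a single 0-100 score using relative weights."""
    total_weight = sum(weights.values())
    return 100.0 * sum(indicators[name] * weights[name] for name in weights) / total_weight

# Hypothetical application: disk pressure weighted double.
score = health_score(
    indicators={"cpu": 1.0, "mem": 0.5, "disk": 0.0},
    weights={"cpu": 1, "mem": 1, "disk": 2},
)
```

The normalization step (mapping a raw metric such as disk utilization to a 0-1 health value) is where most of the real modeling effort goes.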

Once this process matures, it can be connected into a closed loop with existing internal runbooks and alerts. A simple scenario: when a disk fills up, the alert first triggers a standardized disk inspection that deletes discardable data; if that does not clear the alert, it is routed to front-line operators for manual intervention, and their handling is then distilled back into the standard procedure.

Testing and deployment

Testing and deployment affect overall stability and reliability mainly through prevention: limiting the number of incidents and ensuring that the infrastructure and services stay stable when new code is released.

For anyone who has worked in operations for a long time, perhaps the scariest thing is the release of a new application version. Apart from hardware and network failures, which are natural-disaster-level probability events, the day after a new release is usually the highest-risk period for outages and incidents. For large products, change freezes are therefore commonly imposed on the eve of holidays and important events, to avoid bugs introduced by new releases.

Testing is an activity of finding the right balance between cost and risk. Take on too much risk and you will be constantly fighting system failures; be too conservative and you cannot release new things fast enough for the business to survive in the market.

When plenty of error budget remains (i.e., the system has had little failure-induced downtime over the period), test resources can be reduced and the conditions for going live relaxed, letting the business ship more features and stay agile. When little error budget remains (i.e., failure-induced downtime has been long), test resources must be increased and release gates tightened, flushing out latent risks, avoiding further downtime, and keeping the system in a steady state. This balance between agility and stability must be owned jointly by the operations and development teams.
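
The error-budget arithmetic behind this trade-off is simple to sketch (the release policy below is an illustrative example, not a rule from the text):

```python
def error_budget_remaining(slo, window_minutes, downtime_minutes):
    """Remaining error budget in minutes. E.g. slo=0.999 over a 30-day
    window allows about 43.2 minutes of downtime."""
    allowed = window_minutes * (1.0 - slo)
    return allowed - downtime_minutes

def release_policy(remaining_minutes):
    """Example policy: keep shipping while budget remains, freeze otherwise."""
    return "ship" if remaining_minutes > 0 else "freeze"

remaining = error_budget_remaining(slo=0.999, window_minutes=30 * 24 * 60,
                                   downtime_minutes=20)
```

With 20 minutes of downtime already spent, roughly 23.2 minutes of budget remain, so under this example policy releases may continue.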

Besides testing, application release is also a common responsibility of the operations team. One SRE principle is to encode and automate all repeatable work; moreover, release complexity tends to grow in proportion to system complexity. Large enterprises have therefore begun building automated application-release processes on top of automation frameworks.



With an automated release tool, we can build a pipeline that automates every operation in the deployment process (compiling and packaging, test release, production preparation, alert silencing, service stop, database changes, application deployment, service restart, and so on).
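
Conceptually, such a pipeline is an ordered list of steps that halts at the first failure. A toy sketch with invented step names:

```python
def run_pipeline(steps):
    """Run (name, func) steps in order; each func returns True on success.
    Stop at the first failure and return the execution log."""
    log = []
    for name, step in steps:
        ok = step()
        log.append((name, ok))
        if not ok:
            break
    return log

pipeline = [
    ("compile_and_package", lambda: True),
    ("test_release", lambda: True),
    ("silence_alerts", lambda: True),
    ("deploy", lambda: False),          # simulate a failed deployment
    ("restart_service", lambda: True),  # never reached after the failure
]
log = run_pipeline(pipeline)
```

A production pipeline would add rollback steps and notifications on failure; the halt-on-first-failure skeleton stays the same.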

Capacity planning

Capacity planning is about predicting the future and discovering the limits of the system; it is also about ensuring the system can be improved and grown over time.

The main goal of planning is to manage risk and expectations. For capacity planning, the expectation is how people want the service to respond as the business grows; the risk is the time and money spent on the extra infrastructure needed to meet that growth.

Capacity planning begins with analysis and prediction of the future, and those predictions rest on massive amounts of operational data. So besides the corresponding architecture and planning team, a comprehensive operations data platform is a necessary facility for capacity planning.

Capacity trend warning and analysis collects, organizes, cleans, and stores, in structured form, operational data from monitoring, process management, and other sources, integrating the data scattered across tools and building various data topics on top of it.

The data of these data topics are used to help the operation and maintenance personnel evaluate the problems, including:

  • What is the current capacity

  • When is the capacity limit reached

  • How should I change the capacity

  • Perform capacity planning
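
As one simple way to answer "when will the capacity limit be reached", a least-squares line can be fitted to daily usage samples and projected forward (a real platform would use more robust forecasting; this is only a sketch):

```python
def days_until_limit(samples, limit):
    """samples: list of (day, usage) points. Fit usage = slope*day + intercept
    by least squares and return the number of days after the last sample until
    usage reaches `limit`, or None if usage is flat or shrinking."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    if slope <= 0:
        return None  # no growth trend: the limit is never reached
    day_hit = (limit - intercept) / slope
    return max(0.0, day_hit - samples[-1][0])
```

For example, usage growing 10 units per day from 10 units reaches a limit of 100 on day 9, i.e. 7 days after the last sample on day 2.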

Besides the necessary data, the operations platform also needs data-visualization capabilities, so that operators can make better use of operational data when assessing capacity.

First, the platform needs strong data-retrieval capability. It stores a huge amount of operational data, and to build and verify an exploratory scenario, operators often query specific data repeatedly. If queries are very slow, or support only a few angles, building a scenario takes far too long or becomes impossible. The platform should therefore support keyword search, statistical functions, single- and multi-condition queries, fuzzy multi-dimensional search, and second-level queries over massive data, helping operators analyze data more conveniently and effectively.

Second, the platform needs strong data-visualization capability. As the saying goes, a picture is worth a thousand words. Operators constantly run statistical analyses over each system's operational data (application logs, transaction logs, system logs, and so on), generate real-time reports, perform multi-dimensional, multi-angle in-depth analysis, prediction, and visual display, and use the results to communicate and promote their conclusions to others.

Automated tool development

SRE involves not only operations but also software development; here that means developing the tools and platforms used for operations and SRE work. In Google's SRE model, SRE engineers spend roughly half their time developing new tools and services. Some of these tools automate manual tasks, while others continuously fill gaps in and repair other systems within the SRE estate.

Free yourself and others from repetitive work by writing code. If a task does not need a human, write code so that humans no longer have to participate in it.

SRE despises toil from the heart, replacing the old manual, reactive mode with a more efficient, automated operations system.

Automated operations framework (the original article shows a framework diagram here):


Advantages and necessity of automated operations tools:

  • Efficiency: automated procedures reduce the human effort spent on routine operations, freeing operators' energy for more important areas.

  • Standardized operations: consolidate the many complex, error-prone manual operations into a unified operations portal, making them GUI-driven and manageable;

    at the same time, reduce the manual mistakes caused by operators' lapses, avoiding the classic "drop the database and run" tragedy.

  • Inheritance of operational experience and ability: automation tools distill the experience accumulated by operations teams into code, enabling automated, GUI-driven operations. The team's successors can inherit, reuse, and optimize these tools; encoding the work in this way turns individual ability into team ability and reduces the impact of staff turnover.

The construction of an automated operations system must be grounded in operational scenarios: the ones iterated on and rebuilt repeatedly, the most commonly used in the enterprise.

For example, common scenarios include software installation and deployment, application release and delivery, asset management, automatic alert handling, fault analysis, resource requests, automatic inspection, and so on. The automation system should therefore support configuring many different kinds of automated jobs, covering more scenarios through simple script development, scenario configuration, and visually customized workflows.

User support

At the user-experience layer, the ultimate goal of an SRE is to ensure the stability and availability of the business from the user's perspective. Traditionally, operators have not paid attention to this, because we usually consider only whether the underlying systems and resources are stable; in fact, the stability of the whole business is what SRE must care about, and business stability and availability usually have to be simulated and measured from the user's point of view.

All the SRE work areas mentioned above, whether monitoring, incident response, postmortems, testing and release, capacity planning, or building automation tools, exist to give users a better experience of the business. So we must pay attention to the user experience of the system throughout our operations work.

In practice, we can draw on application logs, monitoring data, synthetic business probes, and other user-experience information. In the operations data platform, correlating and connecting these data sets reconstructs the user's end-to-end call chain, together with the relationships and performance data of each hop. Starting from user-experience data, we then link through to system operating state and device operating state, so that the operations system truly centers on the end user's experience.

This user-experience information plays an irreplaceable role in helping the operations team understand customers' overall experience, monitor system availability, and optimize the system in a targeted way.

In fact, the SRE model emphasizes user experience as the core, with automation and operational data as the means, to guarantee business continuity. Seen this way, it is very different from traditional operations: we are no longer mere installation-and-deployment engineers, but must continuously ensure the stability and reliability of the business above us through a series of technical measures.


Oncall

Oncall, simply put, means keeping the online service running normally. The typical workflow is: receive an alert, investigate its cause, confirm whether the online service really has a problem, locate the problem, and fix it.

Receiving an alert does not always mean a real problem; the alert rule may simply be unreasonable. Alerts and dashboards are not static configuration: they should evolve daily and be adjusted constantly. If a real online problem had no corresponding alert, modify the alert rules. If the current dashboards cannot localize a problem quickly, adjust them and add or remove indicators. The business keeps growing, request volumes keep changing, and some thresholds need continual tuning.

There is no general method for locating a problem. You form a hypothesis from the real-time monitoring data and your own experience, verify it with tools, and so determine the root cause.

But there can be a methodology for resolving problems, called the SOP: standard operating procedure. That is, when this phenomenon occurs, perform that operation to restore the business. SOP documents should be prepared in advance and verified for effectiveness.

Note that locating the problem and resolving it need not happen in that order. A common mistake when a fault occurs is to spend a long time finding the root cause and only then repairing it. The right approach is to check whether an existing SOP can restore the business based on the symptoms. For example, if the current errors occur on only one node, take that node out of rotation immediately and investigate the specifics later. Restoring service is always the first priority. Recovery actions should themselves be tested: if you suspect a restart will fix the problem, restart one instance as a test rather than restarting all services at once. Most situations require on-the-spot analysis, which is a tense and exciting process.
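
In spirit, a SOP lookup is just a mapping from an observed phenomenon to a pre-verified recovery action, with escalation as the fallback. A sketch with invented phenomena and actions:

```python
# Hypothetical SOP table: phenomenon -> pre-verified recovery action.
SOP = {
    "errors_on_single_node": "drain_node",  # pull the node from rotation, diagnose later
    "errors_after_release": "rollback",
    "latency_spike_one_instance": "restart_one_instance",  # test on one, not all
}

def respond(phenomenon):
    """Recovery first: apply a known SOP action if one matches the symptom,
    otherwise escalate for on-the-spot analysis."""
    return SOP.get(phenomenon, "escalate_to_oncall")
```

Keeping the table explicit makes it reviewable and lets each entry be exercised in drills before it is needed in a real incident.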

How long does recovery take? How many faults are tolerable? How do we gauge the stability of a service? We use SLIs and SLOs to measure these questions.

Develop deliverable SLI / SLO / SLA

SLO and SLA are two common terms: service level objectives and service level agreements.

In the cloud-computing era, the major cloud providers have published SLA terms for their services; Amazon's EC2 and S3, for example, each have corresponding SLAs. These big-company SLAs look impressive. How are they generally defined?

It is well known that an SLA cannot be defined without SLOs, but few people know the third concept: the SLI (service level indicator). To define an enforceable SLA, good SLOs and SLIs are essential.

It must also be noted that SLI/SLO/SLA are defined for a service. Without a service to attach them to, these concepts lose their meaning.

Here are some issues to consider when formulating an SLA:

For example, when setting an availability target, it is not enough to simply declare a "four nines" standard, because we must consider questions such as:

  1. How is this availability defined? Suppose the target is availability > 99.9% and the service is deployed in five regions; if one region is down and the others are up, is the target violated? Is availability calculated per region, or across all regions together?

  2. What is the smallest unit of availability calculation? If the service is available for only 50 seconds within a minute, does that minute count as down or up?

  3. Over what cycle is availability calculated: a month or a week? Is a week the last 7 days or a calendar week?

  4. How should SLI and SLO monitoring be designed and planned?

  5. If the error budget is about to run out, what countermeasures are available, such as slowing releases and configuration changes?
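
Questions 1 and 2 can be made concrete in code. The sketch below picks one possible set of answers: one-minute granularity in which a minute is wholly up or wholly down, and an all-regions-together view in which a minute counts as up if any region served traffic:

```python
def availability(minute_up):
    """minute_up: one boolean per minute of the window; the minute is
    treated as wholly up (True) or wholly down (False)."""
    return sum(minute_up) / len(minute_up)

def aggregate_availability(per_region):
    """per_region: list of per-minute boolean series, one per region.
    A minute is up overall if at least one region was up."""
    merged = [any(minute) for minute in zip(*per_region)]
    return availability(merged)

region_a = [True] * 9 + [False]   # one bad minute
region_b = [True] * 10            # fully up
```

With these definitions, region A alone shows 90% availability, but the all-regions view is 100%, which is exactly why question 1 matters in an SLA.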


What is a service?

In short, any useful function provided to customers can be called a service.

Services are generally provided by service providers: the organizations offering the useful function, usually people plus software. Running the software requires computing resources, and to provide its function externally, the software may depend on other software systems.

A customer is a person or company that uses the services provided by a service provider.


What is SLI?

An SLI is a carefully defined measurement indicator. What to measure is determined by the characteristics of each system, and settling on SLIs is a very complex process.

The determination of SLI needs to answer the following questions:

  1. What are the indicators to be measured?

  2. System status at the time of measurement?

  3. How to summarize and process measured indicators?

  4. Can the measurement indicators accurately describe the service quality?

  5. Reliability of measurement index?

Common measurement indicators include the following aspects:

  • Performance

    • Response time (latency)

    • Throughput

    • Request volume (QPS)

    • Freshness (how up to date the data is)

  • Availability

    • Uptime

    • Failure time / frequency

    • Reliability

  • Quality

    • Accuracy

    • Correctness

    • Integrity

    • Coverage

    • Relevance

  • Internal indicators

    • Queue length

    • RAM usage

  • Human factors

    • Time to response

    • Time to fix

    • Fraction fixed

The following is an example: Hotmail's downtime SLI

  • Error rate is the proportion of requests the service returns to users as errors

  • When the error rate exceeds x%, the service is considered down and downtime starts to accrue

  • The downtime is counted only if the elevated error rate persists for more than y minutes

  • Intermittent downtimes shorter than y minutes are not counted.
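
These rules translate directly into a small calculation over a per-minute error-rate series (the series, x, and y below are invented; reading the rule as "at least y minutes" is our interpretation):

```python
def downtime_minutes(error_rates, x_percent, y_minutes):
    """error_rates: one error-rate value (%) per minute. A minute is 'down'
    when its rate exceeds x_percent; a run of down minutes is counted as
    downtime only if it lasts at least y_minutes."""
    total, run = 0, 0
    for rate in error_rates:
        if rate > x_percent:
            run += 1
        else:
            if run >= y_minutes:
                total += run  # long enough to count
            run = 0           # intermittent blip: discarded
    if run >= y_minutes:      # handle a run that extends to the end
        total += run
    return total
```

With x = 5 and y = 3, a 2-minute error spike is ignored while a 3-minute one counts, matching the intermittent-downtime exemption above.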

  2. The system state at the time of measurement, and which circumstances seriously affect the results
  • Are malformed requests, failed requests, or timed-out requests measured?

  • The system load at measurement time (is it at maximum load or not?)

  • Where the measurement is initiated: server side or client side

  • The measurement time window (working days only, or 7 days a week? Are planned maintenance periods included?)

  3. How are the measured indicators aggregated and processed?
  • Over what time interval: a rolling window, or simply by calendar month?

  • Mean or percentile? For example, a response-time SLI for the ticket handling of some service X:

  • Measurement indicator: the time from when a user creates a ticket to when the problem is resolved

  • How to measure: count all user-created tickets, using the timestamps the ticketing system provides

  • Under what conditions: working hours only, excluding legal holidays

  • The SLI data: the 95th-percentile resolution time over a one-week sliding window

  4. Can the measured indicators accurately describe service quality?
  • Performance: timeliness and deviation

  • Accuracy: precision, coverage, data stability

  • Integrity: data loss, invalid data, outlier data

  5. How reliable is the measurement?
  • Is it recognized by both the service provider and the customer?

  • Can it be independently verified, for example by a third-party organization?

  • Is it measured on the client or the server, and at what sampling interval?

  • How are erroneous requests counted?


What is SLO?

An SLO (service level objective) specifies a desired state of the functionality the service provides. What should an SLO contain? All the information needed to describe what functions the service should deliver.

The service provider uses it to specify the expected state of the system; developers write code to achieve that state; customers rely on the SLO for business decisions. An SLO says nothing about what happens if the objective is not met.

An SLO is expressed in terms of SLIs, generally in forms like the following:

  • Average QPS per minute > 100k/s

  • 99% of requests have access latency < 500 ms

  • 99% of minutes have bandwidth > 200 MB/s
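
Checking a latency SLO like the second line reduces to a percentile computation over recent measurements. A sketch using the simple nearest-rank method (the helper names are ours):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    ordered = sorted(values)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[max(rank - 1, 0)]

def latency_slo_met(latencies_ms, p=99, threshold_ms=500):
    """True if the p-th percentile latency is below the threshold."""
    return percentile(latencies_ms, p) < threshold_ms
```

Percentiles are preferred over averages here because a handful of very slow requests can hide behind a healthy-looking mean.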

Several best practices when setting SLOs:

  • Specify the time window used for the calculation

  • Use consistent time windows (an N-hour rolling window, a quarterly rolling window)

  • Include an exemption clause, for example: the SLO should be met 95% of the time

  • If the service is setting an SLO for the first time, the following principles help:

    • Measure the current state of the system

      • Set expectations, not guarantees

      • An initial SLO is not suitable as a tool to force up service quality

    • Improve the SLO gradually

      • Start with lower targets for throughput, response time, and so on

    • Maintain a safety buffer

      • The SLO used internally should be stricter than the SLO declared externally

    • Don't overachieve

      • Planned downtime can keep the SLO from being consistently overfulfilled

What SLO to set depends on the state of the system, and different states call for different SLOs. For a composite service: **Total SLO = SLO1 × weight1 + SLO2 × weight2 + …**
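The weighted aggregation formula above can be sketched as a small helper. The service names, SLO values, and weights are made up for illustration:

```python
def total_slo(components):
    """Weighted aggregate SLO across sub-services.

    `components` maps a service name to a (slo, weight) pair; the weights
    are assumed to sum to 1. Names and numbers are illustrative."""
    weight_sum = sum(weight for _, weight in components.values())
    assert abs(weight_sum - 1.0) < 1e-9, "weights must sum to 1"
    return sum(slo * weight for slo, weight in components.values())

# An API service weighted more heavily than a batch service.
print(total_slo({"api": (0.999, 0.7), "batch": (0.99, 0.3)}))
```

One design question this surfaces: the weights encode how much each sub-service matters to the overall product, so they are a business decision, not just a technical one.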

Why SLO? What are the benefits of setting SLO?

  • For customers, it states the expected quality of service, which can simplify the design of client systems

  • For service providers

    • Predictable service quality

    • Better cost/benefit trade-offs

    • Better risk control (when resources are limited)

    • Faster response to failures, taking the correct measures

Once an SLO is set, how can we ensure the objective will be met? A control system is required to:

Monitor and measure the SLIs, and compare the measured SLI values against the target. If necessary, modify the target, or modify the system so that it meets the target. Then implement the change to the target or to the system. The control system repeats these actions, forming a standard feedback loop that continuously measures and improves the SLO and the service itself.
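One turn of that feedback loop might be sketched like this, with a stubbed-out measurement function standing in for a real monitoring query. All names, the target, and the simulated SLI range are hypothetical:

```python
import random

def measure_sli():
    # Stand-in for a real monitoring query; returns a measured availability.
    return random.uniform(0.990, 1.0)

def run_control_loop(slo_target=0.999, iterations=3):
    """Sketch of the SLO feedback loop: measure the SLI, compare it to the
    target, and record whether corrective action is needed. In a real
    system a miss would trigger paging, a rollback, load shedding, or a
    revision of the target itself."""
    results = []
    for _ in range(iterations):
        sli = measure_sli()
        results.append((sli, sli >= slo_target))
    return results

for sli, ok in run_control_loop():
    print(f"SLI {sli:.4f} -> {'meets target' if ok else 'take action'}")
```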

We have discussed objectives, how they are measured, and the control mechanism for reaching them. But what if, for some reason, the objective is not reached?

Perhaps a large amount of new load arrived, or an underlying dependency failed to reach its nominal SLO, dragging down the SLO of the service built on top of it. This is where the SLA comes in.


An SLA is a contract between two parties; both must agree to it and abide by it. When services are provided externally, the SLA is a very important signal of service quality, and drafting one requires the involvement of both the product and legal departments.

An SLA can be described by a simple formula: SLA = SLO + consequences

  • A series of actions taken when the SLO is not met; note that it may be only partially met

    • For example: the response-time SLO is met while the availability SLO is not

  • Concrete implementation of the actions

    • A common currency is needed for rewards and penalties, such as performance scores

An SLA is a good tool for allocating resources sensibly. The ideal operating state of a service with a clear SLA is: the benefit of adding extra resources to improve the system is less than the benefit of investing those resources in other services.

A simple example is weighing the resources needed to raise a service's availability from 99.9% to 99.99% against the benefit that brings; this ratio is an important basis for deciding whether the service should offer four nines.
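To make that trade-off concrete, here is a small sketch of the downtime budget implied by each availability target; the 30-day period is an assumption for illustration:

```python
def downtime_budget_minutes(availability, period_hours=30 * 24):
    """Allowed downtime in minutes per period for a given availability
    target. The 30-day (720-hour) period is an illustrative assumption."""
    return (1.0 - availability) * period_hours * 60

for target in (0.999, 0.9999):
    minutes = downtime_budget_minutes(target)
    print(f"{target:.2%} availability -> {minutes:.1f} min of downtime per month")
```

Going from three nines to four shrinks the monthly downtime budget from roughly 43 minutes to about 4, which is why each extra nine tends to cost far more than the last.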

Fault review (postmortem)

A fault review looks back at past service exceptions and outages and summarizes them, so that the same problems do not happen again. To keep everyone working together, we want to establish a culture of blamelessness and transparency: individuals should not be afraid of incidents, but confident that if one occurs, the team will respond and improve the system.

Note: in domestic SRE culture, generally only large-scale incidents with significant business impact are reviewed. In fact, if time and energy allow, ordinary incidents should also be reviewed within a small group; so-called major faults accumulate from small problems. Moreover, people working in operations should review their own minor faults promptly, to continuously strengthen their personal fault-handling and repair skills.

I think a key consensus in SRE is acknowledging that systems are imperfect; pursuing a system that never goes down is unrealistic. Working on imperfect systems, we inevitably have to face and live through failures and outages.

Therefore, what matters is not finding the person responsible for the fault, but thoroughly reviewing the root cause of the failure and how to avoid the same fault recurring. System reliability is a direction the whole team works toward together: recovering quickly from failures and learning from them. Everyone should be able to raise problems with confidence, handle outages, and strive to improve the system.

Note: in many enterprises' review processes, the people involved may inadvertently turn root-cause analysis into a determination of responsibility and a series of punishments, trying to suppress faults through penalties. This approach is rarely advisable. Consider: nobody wants incidents to happen. They arise from gaps in knowledge or defects in the rules; no one ever causes a fault knowing in advance that it will be one.

What needs to be kept in mind is that failure is something we can learn from, not something to be afraid or ashamed of!

In daily operations, failures and other incidents are actually a good opportunity for review and learning. Using historical monitoring data, we analyze the root causes of the incident, formulate follow-up response strategies, and codify those strategies into standardized, reusable, automated scenarios on the operations platform, providing standard, fast solutions the next time the same problem occurs. This process is where the real value of a review lies.

The sole purpose of a fault review is to reduce the occurrence of faults. A few practices I currently consider worthwhile:

A fault review needs a written record, including the course of the incident, the timeline, the operations performed, how the fault was recovered, the root-cause analysis, and an analysis of why the fault occurred. The document should anonymize the people involved and be open to everyone in the company. I don't think it makes sense for companies to restrict viewing permissions on incident documents; some companies even publish their postmortems externally.

In the review, the names of those involved should be replaced with code names, which creates a better atmosphere for discussion.

Not every fault review should be required to produce action items. At my previous company, because the leadership had to be "given an explanation", every review produced some measure to prevent the same fault from recurring, such as adding an approval step. This is absurd: having senior leaders approve operations they cannot understand only makes the leaders more miserable and the process long and foul. In the end, everyone forgets why the approval exists, yet no one dares remove it, because if you remove it and something goes wrong, you are responsible.

A blameless culture? I used to think it was great. Later I found that some failures caused by ignoring the process do deserve blame, for example taking a node offline without first checking its TCP connections, or rolling a change out everywhere without a canary. Reckless behavior of this kind that leads to failure should be called out. But there should not be too many rules, or no work can get done.

Reference source:


https://bgbiao.top/post/sre

Operation and maintenance system

On the road of automated operations reform guided by DevOps thinking, we have kept forging ahead and never stopped.

The road is long and hard, but if we keep walking and do not stop, we will get there; the future is promising.

You are welcome to search for and follow the official account k8stech, which regularly publishes articles on operations development, SRE, cloud native, and more.
