Six steps to stability assurance: an operations guide for high-availability systems


Introduction: major sales promotions come around every year, and everyone is familiar with the phrase “promotion stability assurance”. Business scenarios differ, but the playbook usually converges on the same tasks: full-link stress testing, capacity evaluation, rate limiting, emergency plans and so on. Why do we follow these strategies? Beyond experience passed down by word of mouth, what else can we do, and what is the theoretical basis?


1、 Preface

Major sales promotions come around every year, and everyone is familiar with the phrase “promotion stability assurance”. Business scenarios differ, but the playbook usually converges on the same tasks: full-link stress testing, capacity evaluation, rate limiting, emergency plans and so on.

Why should we follow these strategies?

Beyond experience passed down by word of mouth, what else can we do? What is the theoretical basis?

2、 What kind of system is stable?

Let us first answer a different question: what kind of system counts as stable?

The Google SRE books (the SRE trilogy [1]) describe the foundations of system reliability with a hierarchical model, Dickerson’s Hierarchy of Service Reliability, shown in the following figure:

[Figure: Dickerson’s Hierarchy of Service Reliability]

The model was proposed in 2013 by Mikey Dickerson, a Google SRE engineer. It ranks the stability requirements of a system by how fundamental they are, forming a pyramid of stability standards.

The base of the pyramid is monitoring, the most basic stability requirement of any system. A system without monitoring is like a wild horse galloping blindfolded: controllability is out of the question, let alone stability. The next level up is incident response. The time from detecting a problem to resolving it depends directly on the maturity of the emergency response mechanism. A sound emergency strategy ensures that when a fault occurs, every problem is handled in an orderly and appropriate way rather than in a panic. Postmortem and root cause analysis is what we usually call the incident review. Many people dislike this activity, but it is the most effective way to avoid repeating the same mistake. Only after finding the root cause of a fault and the defects behind it can we apply the right remedy and avoid it in future.

If a system were never updated after its initial release, these three areas of work would meet essentially all of its stability requirements. Unfortunately, no such system exists today: applications large and small depend on continuous change and release. To keep the system stable through these iterations, testing and release procedures are essential. An effective test and release strategy keeps every new variable within a controllable range, so that the service as a whole ends up in a stable state. Besides code changes, iteration may also change business scale and traffic, and capacity planning is the safeguard against those changes. Whether the existing system capacity can support the new traffic demand, and whether the overall link contains weak nodes with mismatched capacity, are the questions capacity planning must answer.

At the top of the pyramid are product design (Product) and software development (Development): good product design and software design give the system higher reliability and a highly available product architecture, which in turn improves the user experience.

3、 Stability assurance methods for major promotions

The pyramid model shows the areas of work needed to build and maintain a highly available service. That brings us back to promotion stability: how do we systematically guarantee system stability during a major promotion?

Compared with day-to-day assurance, promotion assurance is characterized by highly concurrent traffic and a short assurance cycle, with explicit requirements on system performance and assurance time (generally about two months).

Given these characteristics, how can we optimize and consolidate system stability in a short time for a high-traffic promotion scenario?

With limited time, casting the net blindly is not the best strategy; we need to start from the key points and the weak points. The first step is therefore to map the current state of the system’s links end to end, including key external dependencies and the impact of key services, and to identify the core focus of the overall assurance effort. Next, we analyze the business data of the promotion to identify the variables that lie outside the system itself. On the basis of these two inputs, we then build up the system around the pyramid’s requirements for monitoring, capacity planning, incident response, and testing and postmortem, to arrive at the final assurance result.

This gives us an essentially complete set of strategic directions for promotion stability assurance:

  1. System & Biz profiling
  2. Monitoring
  3. Capacity planning
  4. Incident response
  5. Testing
  6. Postmortem


1. System & Biz profiling – system link analysis

System link analysis is the foundation of all assurance work. Like a comprehensive health check of the whole application system, it starts from the traffic entries, follows each link, and grades the nodes layer by layer to obtain the overall picture of the system and its core assurance points.

Entry inventory

A system often has a dozen or more traffic entries, from sources such as HTTP, RPC and messages. If not every link can be covered, start with the following three types of entry:

  • Core entries under priority assurance: the SLI promised to users is high, with explicit requirements on data accuracy, response time and reliability; entries that serve enterprise users.
  • Entries associated with asset loss events: those involving company funds or customer funds; paid services.
  • High-traffic entries: the top 5 to 10 by TPS & QPS. These may not involve a high SLI or asset loss, but their traffic is high enough to have a large impact on overall system load.
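
The three entry categories above can be expressed as a simple filter. The following is only a minimal sketch: the `Entry` fields, the sample entry names and the `top_n` value are illustrative assumptions, not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    peak_qps: float
    high_sli: bool = False     # core entry with a high promised SLI (assumed flag)
    asset_loss: bool = False   # touches company or customer funds (assumed flag)


def select_entries(entries, top_n=5):
    """Return the names of the entries that should be analysed first."""
    # The top_n entries by load qualify even without SLI or asset-loss concerns.
    by_load = sorted(entries, key=lambda e: e.peak_qps, reverse=True)[:top_n]
    heavy = {e.name for e in by_load}
    return [e.name for e in entries
            if e.high_sli or e.asset_loss or e.name in heavy]


entries = [
    Entry("order.create", 1200, high_sli=True),
    Entry("refund.apply", 80, asset_loss=True),
    Entry("item.detail", 9000),
    Entry("feedback.submit", 15),
]
selected = select_entries(entries, top_n=2)
print(selected)
```

Here `feedback.submit` is left out because it is neither core, nor asset-related, nor among the heaviest entries.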

Node grading

A traffic entry is like the loose end of a ball of thread. Once it is pulled out, the nodes on the link (HSF, DB, Tair, HBase and other external dependencies) can be followed along the traffic path and given an initial grading by degree of dependency, availability and reliability.

(1) Strong and weak dependencies

  • If the node is unavailable and the business logic of the link is interrupted or severely degraded (beyond a certain tolerance threshold), it is a strong business dependency; otherwise it is a weak one.
  • If the node is unavailable and the execution logic of the link is interrupted (an error is returned), it is a strong system dependency; otherwise it is a weak one.
  • If the node is unavailable and system performance is affected, it is a strong system dependency; otherwise it is a weak one. Under fail-fast design such nodes should not exist, but if one does appear and the application code is not to be changed, it should be treated as a strong dependency.
  • If the node can be degraded without noticeable effect, or a fallback with only minor loss exists, it is a weak dependency.

(2) Low-availability dependencies

  • The node’s service frequently times out in daily operation.
  • The system resources behind the node are insufficient.

(3) High-risk nodes

  • The node’s system has undergone a major version change since the last major promotion.
  • The node is newly launched and has not yet been through a major promotion.
  • The node’s system has had a high-severity failure in the past.
  • Failure of the node carries a risk of asset loss.
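
The grading rules above can be captured as a small record per node. This is a sketch under stated assumptions: the field names and the two sample nodes are hypothetical, and treating "has a fallback" as overriding the other criteria is one possible reading of the rules.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    # What happens when the node is unavailable
    breaks_business_logic: bool = False
    returns_error: bool = False
    hurts_performance: bool = False
    has_fallback: bool = False          # degradable, or a low-loss substitute exists
    # Low-availability signals
    frequent_timeouts: bool = False
    resources_insufficient: bool = False
    # High-risk signals
    major_change_since_last_promo: bool = False
    never_been_through_promo: bool = False
    past_severe_failure: bool = False
    asset_loss_on_failure: bool = False


def grade(node):
    """Apply the three grading rules and return the result as a dict."""
    impacted = (node.breaks_business_logic or node.returns_error
                or node.hurts_performance)
    return {
        "dependency": "strong" if impacted and not node.has_fallback else "weak",
        "low_availability": node.frequent_timeouts or node.resources_insufficient,
        "high_risk": (node.major_change_since_last_promo
                      or node.never_been_through_promo
                      or node.past_severe_failure
                      or node.asset_loss_on_failure),
    }


order_db = grade(Node("order-db", returns_error=True, asset_loss_on_failure=True))
coupon_cache = grade(Node("coupon-cache", hurts_performance=True,
                          has_fallback=True, frequent_timeouts=True))
print(order_db)
print(coupon_cache)
```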

Output data

When the analysis is complete, we should output the following: an analysis of every core link in the business domain, with strong technical and business dependencies, core upstream and downstream systems, and asset loss risks clearly marked.

The figure below shows an example of a single-link analysis.

[Figure: example of a single-link analysis]

2. System & Biz profiling – business strategy synchronization

Unlike building a high-availability system in general, promotion stability assurance is targeted work for a specific business campaign. Business strategy and business data are therefore indispensable inputs before the assurance work begins.

Promotion business data generally falls into two categories: a global assessment of the business, and emergency strategies and campaign mechanics.

Global assessment

This data helps with accurate traffic assessment, peak prediction, staffing schedules for the promotion and so on. It generally includes:

  • Promotion duration (day XX to day XX)
  • Estimated business volume (X times the daily level)
  • Estimated peak dates
  • Traffic distribution across business scenarios

Emergency strategy

This data covers the business variables of this promotion compared with previous ones. It feeds into emergency response plans and high-risk node evaluation, and generally includes two types:

  • Special campaign mechanics
  • Campaign strategies for emergencies

3. Monitoring – monitoring & alert review

Two monitoring approaches are common in the industry: black-box monitoring and white-box monitoring. Black-box monitoring is symptom-oriented: it generally detects anomalies that are happening now (not ones that are about to happen), that is, existing faults. White-box monitoring relies on the system’s internal metrics and can be both symptom-oriented and cause-oriented: it can warn of anomalies the system is about to face, and when an anomaly does occur it can watch lower-level internal metrics at the same time to help locate the root cause. For promotion stability assurance we therefore generally choose white-box monitoring.


From a monitoring perspective, a system can be divided into three layers: business (biz), application and system. The system layer is the foundation and reflects the state of the operating system. The application layer is the JVM layer and covers the running state of the main application process and middleware. The business layer is the top layer and reflects how the service behaves externally from the business point of view.

When reviewing monitoring for promotion stability, we can therefore first set the existing monitoring aside. Starting from the core links and the asset-loss links, list the monitoring needs at the three layers of business, application (middleware, JVM, DB) and system, and then look for the corresponding alerts. If an alert does not exist, add it; if it does, check whether its threshold, timing and recipients are reasonable.


Monitoring generally relies on four golden signals: latency, errors, traffic and saturation. The key monitoring items of each layer can also be classified by these four signals, as follows:

[Table 1: key monitoring items of the biz, application and system layers, classified by the four golden signals]

Alerting

Does every monitored item need an alert? Of course not. Biz-layer alerts should be set up first, because the biz layer shows service behavior most directly and is closest to the user experience. Application-layer and system-layer indicators are mainly for observation; alerts can be set on a few key, high-risk ones to support troubleshooting and early fault detection.

For each alert, the points to consider are generally its level, its threshold and who is notified.

1) Level

The level reflects how severe the problem is when the alert fires. It is generally judged on the following points:

  • Is it associated with GOC?
  • Does it have a serious business impact?
  • Does it cause asset loss?

2) Threshold

The threshold is the trigger condition and duration of an alert, and must be set to suit the specific scenario. The following principles generally apply:

  • Not too sluggish. In a sound monitoring system, every anomaly should trigger a relevant alert.
  • Not too sensitive. An over-sensitive threshold causes frequent alerts, and responders become fatigued and can no longer pick out the real anomalies. If an alert fires frequently, there are generally two causes: the system design is unreasonable or the threshold is.
  • If a single indicator cannot reflect the whole business scenario, combine several indicators.
  • Following the business fluctuation curve, set different conditions and notification strategies for different time periods.
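
As an illustration of the last three principles, the sketch below combines two indicators (success rate and QPS), requires the condition to hold for several consecutive samples so that a single blip does not fire, and picks a different threshold by time of day. All names, thresholds and sample values are assumptions chosen for the example.

```python
def pick_threshold(hour, schedule, default):
    """Return the threshold for `hour`.

    `schedule` holds (start_hour, end_hour, threshold) tuples, end exclusive.
    """
    for start, end, threshold in schedule:
        if start <= hour < end:
            return threshold
    return default


def should_alert(samples, hour, schedule, default, min_consecutive=3):
    """Decide whether to fire, given (success_rate, qps) samples, oldest first.

    Fires only if the success rate stays below the threshold while traffic is
    present for the last `min_consecutive` samples.
    """
    threshold = pick_threshold(hour, schedule, default)
    recent = samples[-min_consecutive:]
    if len(recent) < min_consecutive:
        return False
    return all(rate < threshold and qps > 0 for rate, qps in recent)


schedule = [(0, 2, 0.99)]   # stricter threshold during an assumed 00:00-02:00 peak
samples = [(0.999, 500), (0.97, 800), (0.98, 900), (0.985, 950)]
peak_alert = should_alert(samples, hour=0, schedule=schedule, default=0.95)
offpeak_alert = should_alert(samples, hour=10, schedule=schedule, default=0.95)
print(peak_alert, offpeak_alert)
```

The same samples fire during the peak window but not at 10:00, where the looser default threshold applies.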

3) Recipients & notification method

For a business indicator anomaly (a biz-layer alert), the recipients should include both the people who handle the problem (developers and operations staff) and the people concerned with the business (team leads and business staff), and the notification method should be more immediate, such as a phone call.

Application-layer and system-layer alerts are mainly used to locate the cause of an anomaly. The recipients can be the people who investigate and handle the problem, and low-interruption methods such as DingTalk or SMS can be used.

Besides the layer, the range of recipients can be widened according to the alert level. In particular, for alert indicators associated with GOC faults, the range should be widened and the notification method should be more immediate and direct.

Output data

When the review is complete, we should output the following:

  • System monitoring model, in the same format as Table 1: the points to be monitored at the biz, application and system layers, whether every monitoring point has an indicator, and which are still to be added.
  • System alert model list, which should contain: the associated monitoring indicator (link), the alert level, whether it is pushed to GOC, whether it involves asset loss, whether it is associated with a fault, and whether it is associated with a plan.
  • Business indicators, including the key biz-layer monitoring data.
  • System & application indicators, including the key system indicators of the core systems, which can be used for white-box monitoring and problem location.

4. Capacity planning

The essence of capacity planning is to balance minimizing risk against minimizing cost; pursuing either one alone is unreasonable. To find the best balance, estimate the system’s peak traffic load as accurately as possible, then convert that traffic into capacity using the load ceiling of a single resource unit. The result is the final capacity planning model.

Traffic model evaluation

1) Entry traffic

For a major promotion, the system’s peak entry traffic generally consists of conventional business traffic plus unconventional increments (for example, changes in the traffic mix caused by disaster recovery plans or by changes in marketing strategy).

(a) Conventional business traffic is generally calculated in one of two ways.

Historical traffic algorithm: this assumes that this year’s promotion growth fully follows the historical traffic model. From the daily traffic of the current and previous years it computes the year-on-year growth of overall business volume; from the ratio of promotion traffic to daily traffic in previous years it computes the expected uplift of the promotion over daily levels; the two are then combined to obtain the final estimate.

Because the calculation needs no business input, it can be used at the start of the assurance work, before the business side has given a total volume estimate, to obtain an initial traffic estimate.
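
A minimal sketch of the historical traffic algorithm follows. It assumes the simplest possible way of combining the two factors, multiplying the year-on-year growth by last year's promotion-to-daily ratio; the function name and all figures are illustrative, and a real model would fit several years of data.

```python
def estimate_promo_peak(daily_now, daily_last_year, promo_peak_last_year):
    """Estimate this year's promotion peak from historical traffic only."""
    # Year-on-year growth of ordinary daily traffic.
    yoy_growth = daily_now / daily_last_year
    # How much last year's promotion peak exceeded last year's daily level.
    promo_ratio = promo_peak_last_year / daily_last_year
    estimate = daily_last_year * yoy_growth * promo_ratio
    return {"yoy_growth": yoy_growth, "promo_ratio": promo_ratio,
            "estimate": estimate}


result = estimate_promo_peak(daily_now=1500, daily_last_year=1000,
                             promo_peak_last_year=8000)
print(result)
```

With daily traffic up 1.5 times and a promotion peak of 8 times the daily level last year, the initial estimate is 12000 QPS.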

Business volume conversion algorithm (GMV / DAU / order volume): this takes the estimated total business volume (GMV, DAU or order volume) as input and derives the business volume of each sub-domain from historical promotion and daily conversion models (such as the classic funnel model).

This method depends heavily on the total business volume estimate, so it is suited to the middle and later stages of the assurance work, where it brings business evaluation factors into the initial traffic estimate.
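
The funnel conversion can be sketched as a chain of stage-to-stage conversion rates applied to the estimated total volume. The stage names and rates below are made up for illustration; real rates would come from historical promotion and daily data.

```python
def funnel_volumes(entry_volume, stages):
    """Propagate a total volume through a conversion funnel.

    `stages` is a list of (stage_name, conversion_rate_from_previous_stage).
    Returns the expected volume reaching each stage.
    """
    volumes = {}
    current = entry_volume
    for name, rate in stages:
        current = current * rate
        volumes[name] = current
    return volumes


# Assumed input: 1,000,000 estimated daily active users.
volumes = funnel_volumes(1_000_000, [
    ("detail_view", 0.5),
    ("add_to_cart", 0.25),
    ("create_order", 0.5),
])
print(volumes)
```

Each sub-domain (detail page, cart, order) then takes its own stage volume as the input to its traffic estimate.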

(b) Unconventional increments generally refer to the extra traffic caused by changes in front-end marketing strategy, or by changes in the traffic model after a system emergency plan is executed. For example, when the na61 data center fails and 100% of its traffic is switched to na62, na62 sees an incremental change.

To keep costs down, the unconventional increment P is generally not added in full to the conventional business traffic W when computing the combined entry traffic K. Instead, the probability λ that the unconventional strategy occurs is used as a weight:

K = W + λ × P
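
Reading the description above as K = W + λ × P, a worked example looks like this. The traffic figures and the probability are assumptions for illustration only.

```python
def combined_entry_traffic(w, p, lam):
    """Combine conventional traffic W with an increment P weighted by lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is a probability and must be between 0 and 1")
    return w + lam * p


# Assumed figures: 10000 QPS of conventional traffic, a 4000 QPS increment if
# a data center switch happens, and a 25% chance that it does.
k = combined_entry_traffic(w=10000, p=4000, lam=0.25)
print(k)
```

Planning for 11000 QPS instead of the full 14000 is the cost saving that the weighting buys; the trade-off is that the unweighted case must then be covered by an emergency plan.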

2) Node traffic

Node traffic is derived proportionally from entry traffic according to the traffic branch model. The branch model is built on the system links and follows these principles:

  • For the same entry, traffic on different links is calculated independently, each with its own proportion.
  • For the same node on the same link, if it is called several times, its traffic must be amplified by the number of calls (for example DB and Tair).
  • Pay special attention to DB write traffic: hot spots may appear and cause the DB to hang.
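
The first two principles amount to summing, over every link that reaches a node, the entry traffic times the branch proportion times the number of calls per request. A small sketch, with hypothetical links and numbers:

```python
def node_traffic(links):
    """Sum the traffic reaching one node over all links that pass through it.

    `links` is an iterable of (entry_qps, branch_ratio, calls_per_request).
    """
    return sum(qps * ratio * calls for qps, ratio, calls in links)


# Assumed example for a cache node reached from two entries.
cache_links = [
    (2000, 1.0, 3),    # order creation: every request reads the cache 3 times
    (8000, 0.25, 1),   # item detail: a quarter of requests reach the cache once
]
cache_qps = node_traffic(cache_links)
print(cache_qps)
```

Note how the low-traffic entry dominates here because of its call multiplier, which is why repeated calls must not be overlooked.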

Capacity conversion

1) Little’s Law

Different types of resource nodes (application containers, Tair, DB, HBase and so on) have different ratios for converting traffic into capacity, but they all obey Little’s Law:

average concurrency = average arrival rate (for example QPS) × average response time
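
Little’s Law states that the average number of requests in flight equals the average arrival rate multiplied by the average time each request spends in the system. The sketch below uses it to turn a target QPS into an instance count; the per-instance concurrency limit and all figures are assumptions, and real limits should come from stress tests.

```python
import math


def required_instances(target_qps, avg_rt_ms, concurrency_per_instance):
    """Convert a target QPS into a minimum instance count via Little's Law."""
    # Requests in flight = arrival rate (per second) x response time (seconds).
    in_flight = target_qps * avg_rt_ms / 1000
    instances = math.ceil(in_flight / concurrency_per_instance)
    return in_flight, instances


in_flight, n = required_instances(target_qps=12000, avg_rt_ms=50,
                                  concurrency_per_instance=40)
print(in_flight, n)
```

At 12000 QPS and 50 ms per request, about 600 requests are in flight at any moment, so 15 instances are needed if each can safely handle 40 concurrent requests.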

2) N + X redundancy principle

  • On top of the minimum capacity N required for the target traffic, reserve X units of redundant capacity.
  • X is positively correlated with the target cost and with the failure probability of the resource nodes: the higher the probability of unavailability, the larger X.
  • For an ordinary application container cluster, X = 0.2N is a reasonable choice.

The rules above are only suitable for an initial capacity estimate (before the promotion stress tests, or for newly added dependencies). The final, accurate system capacity still has to be obtained from periodic stress tests of the system.
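
The N + X rule is simple arithmetic; the only design decision in this sketch is rounding X up so that small clusters still get at least one spare unit. The function name and the percentage parameter are illustrative.

```python
import math


def with_redundancy(n, redundancy_percent=20):
    """Return N + X, where X is a percentage of N rounded up to whole units."""
    x = math.ceil(n * redundancy_percent / 100)
    return n + x


total_for_15 = with_redundancy(15)   # X = 3
total_for_7 = with_redundancy(7)     # X = ceil(1.4) = 2
print(total_for_15, total_for_7)
```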

Output data

  • The entry traffic model from the model evaluation, and the resulting capacity conversion for the cluster itself (for a non-entry application, list its rate-limiting points).
  • The branch traffic model from the link analysis, and the resulting capacity conversion for external dependencies.

5. Incident response – emergency plans & pre-plans

To respond quickly to online emergencies under the highly concurrent traffic of a major promotion, relying on the on-the-spot judgment of on-duty engineers is far from enough. In a race against the clock there is little room for strategic thinking, and a wrong decision often leads to a more serious, less controllable impact on the business and the system. To respond quickly and correctly, on-duty engineers should therefore face multiple-choice questions (Which) rather than open-ended questions (What). The options are our business and system plans.

By execution timing and by the kind of problem they solve, plans fall into four categories: technical emergency plans, technical pre-plans, business emergency plans and business pre-plans. Using the earlier link analysis and business evaluation, we can quickly work out which plans each link needs, following these principles:

  • Technical emergency plan: handles the case where a node at some level of the system link is unavailable, for example a strong technical or business dependency, a low-stability node or a high-risk node.
  • Technical pre-plan: balances overall system risk against the availability of individual services, protecting global service reliability through circuit breaking and similar strategies. For example, low-stability weak dependencies are degraded in advance, and offline tasks that would collide with peak traffic are paused in advance.
  • Business emergency plan: handles emergencies caused by business changes and other non-system anomalies, such as business data errors (at nodes sensitive to data correctness) or business strategy adjustments (in coordination with the business emergency strategy).
  • Business pre-plan: makes advance adjustments in line with the overall business strategy (not driven by system needs).

Output data

When the review is complete, each plan should record the following:

  • Execution & closing time (pre-plans)
  • Trigger threshold (emergency plans; must be tied to the relevant alert)
  • Associated impact (system & business)
  • Decision-maker, executor & verifier
  • How to verify that the plan has been enabled
  • Closing threshold (emergency plans)
  • How to verify that the plan has been disabled
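
One way to keep these fields together is a small record per plan, which also makes it possible to look up which emergency plans the current alert values call for. This is a sketch only: the field names, the sample plans and the metric name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Plan:
    name: str
    kind: str                              # "emergency" or "pre"
    impact: str                            # associated system & business impact
    owner: str                             # who decides, executes and verifies
    trigger_metric: Optional[str] = None   # alert metric (emergency plans only)
    trigger_above: Optional[float] = None  # enable when the metric exceeds this
    close_below: Optional[float] = None    # disable when it falls back below this


def plans_to_enable(plans, metrics):
    """Return the names of emergency plans whose trigger threshold is exceeded."""
    return [p.name for p in plans
            if p.kind == "emergency"
            and p.trigger_above is not None
            and p.trigger_metric in metrics
            and metrics[p.trigger_metric] > p.trigger_above]


plans = [
    Plan("degrade-recommendation", "emergency", "no personalized list",
         "on-call lead", "recommend_timeout_rate", 0.2, 0.05),
    Plan("pause-offline-jobs", "pre", "reports delayed", "ops"),
]
to_enable = plans_to_enable(plans, {"recommend_timeout_rate": 0.35})
print(to_enable)
```

Pre-plans are never returned because they run on a schedule rather than on a threshold.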

Interim output – full-link operational map

With the work above complete, we have essentially a global link operational map, containing the branch traffic model of each link, strong and weak dependency nodes, asset loss evaluation, the corresponding plans and handling strategies, and so on. During the promotion the map gives a quick global view of the impact of an emergency, and it can also be used in reverse to check whether plans and capacity are complete and reasonable.


6. Incident response – compiling the operational manual

The operational manual is the operating basis for the whole promotion assurance effort and runs through the entire promotion life cycle. It can be organized in three stages: before, during and after the event.

It should be compiled to be accurate and detailed. Ideally, even on-duty staff unfamiliar with the business and the system can use the manual to respond quickly to online problems.

Before the event

1) Pre-check list

A checklist of items that must be carried out before a major promotion. It usually includes:

  • Restart cluster machines or trigger a manual FGC
  • Clean up shadow table data
  • Check permissions on upstream and downstream machines
  • Check rate-limiting values
  • Check that switches are consistent across machines
  • Check database configuration
  • Check middleware capacity and configuration (DB, cache, NoSQL, etc.)
  • Check that monitoring is effective (business, technical and core alerts)

Each item should carry three columns of data: the executor, the check method and the check result.

2) Pre-plans

All business and technical pre-plans in the domain.

During the event

1) Emergency technical & business plans

The required contents are essentially the same as for pre-plans, with these differences:

  • Execution condition & recovery condition: a specific trigger threshold, corresponding to a monitoring alert item.
  • The decision-maker to be notified.

2) Emergency tools & scripts

Common troubleshooting methods, stop-the-bleeding measures for core alerts (a strong or weak dependency becoming unavailable, etc.), scripts for fetching business-related logs, and so on.

3) Alerts & dashboards

This should include the review results for business, system cluster and middleware alerts, the core business and system dashboards, details of the corresponding log data sources, and so on:

  • Log data source details: data source name, file location, sample and field-splitting format.
  • Review results for business, system cluster and middleware alerts: associated monitoring indicator (link), alert level, whether it is pushed to GOC, whether it causes asset loss, whether it is associated with a fault, and whether it is associated with a plan.
  • Core business & system dashboards: dashboard address and details of the indicators it contains (meaning, whether each is associated with an alert, and the corresponding log).

4) Upstream and downstream machine grouping

This should include the core system and the upstream and downstream systems, grouped by data center and unit cluster, with application names. It is used for the pre-event machine permission check and for command-line troubleshooting during the event.

5) Notes for on-duty staff

The items every shift must complete, the emergency change process, links to the core dashboards, and so on.

6) Core broadcast indicators

Core system & service indicators (CPU, load, RT), indicators the business cares about, and so on. Each indicator should state its specific monitoring address and collection method.

7) Address book and duty roster for the domain & related domains

The duty roster and contact details (telephone) of the domain’s technical staff, team leads and business contacts, plus the duty arrangements of related upstream and downstream systems and basic components (DB, middleware, etc.).

8) On-duty problem log

A running record of work orders, business problems and executions of pre-plans or emergency plans (at least: time, problem description with screenshots, impact analysis, and the decision and resolution process). On-duty staff should complete the record before the end of their shift.

After the event

1) System restoration checklist (rate limits and scale-down)

This generally mirrors the pre-check list and includes adjusting rate-limiting thresholds, scaling the cluster down, and so on.

2) Promotion problem record

A summary and review of the core events encountered during the promotion.

7. Incident response – tabletop drill

The live tabletop drill is the last piece of assurance work in incident response. Using real historical fault cases as the emergency scenario, it simulates emergencies during the promotion in order to test how on-duty staff respond.


Generally, from discovery to resolution an online problem goes through location, investigation, diagnosis and repair, following these principles:

  • As far as possible, restore service first while preserving the scene (machines, logs, load records) for root cause investigation.
  • Avoid blind searching; diagnose and locate in a targeted way using white-box monitoring.
  • Divide the work in an orderly way, with each person performing their own role, to avoid everyone rushing in at once and causing chaos.
  • Assess the scope of impact in real time from the situation on the ground. When a situation cannot be rescued by technical means (for example a strong dependency is unavailable), reframe it as a business problem: scope and degree of impact, whether there is asset loss, and how to coordinate with the business side.
  • The tabletop drill aims to test the fault-handling ability of on-duty staff, focusing on three areas: stop-the-bleeding strategy, division of labor and problem location.

[Figure: international middle platform Double 11 buyer-domain drill]

By fault type, the common stop-the-bleeding strategies are:

  • Rate limiting at the entry: lower the rate limit that the corresponding provider service applies to the calling source. Used when a traffic burst pushes the system itself, or a downstream strong dependency, to full load.
  • Downstream degradation: degrade the call to the corresponding downstream service. Used when a downstream weak dependency is unavailable. A downstream strong business dependency may be degraded only after business approval (part of the business will be impaired).
  • Single-point removal: remove the unavailable node. When the load of a single machine spikes, first take the unavailable service on that machine offline (the machine itself need not be taken offline, so the scene is preserved). Used when a single node in a cluster is unavailable or performs poorly.
  • Switchover: switch traffic away from the affected database or unit. Used when a single database, or a dependency within one unit, fails for reasons of its own (host or network) and the success rate of part of the traffic drops.
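
In a drill it can help to keep this mapping as a lookup table so that the first action is a choice rather than a discussion. The fault-type keys below are hypothetical labels, and the fallback of escalating to the incident commander is an assumption added for the example.

```python
# Fault type -> first stop-the-bleeding action (labels are illustrative).
STRATEGIES = {
    "traffic_burst": "rate-limit at the entry",
    "weak_dependency_down": "degrade the downstream call",
    "strong_dependency_down": "degrade only after business approval",
    "single_node_unhealthy": "remove the node from service, keep the machine",
    "unit_or_db_failure": "switch traffic away",
}


def first_action(fault_type):
    """Look up the first action for a fault type; escalate if it is unknown."""
    return STRATEGIES.get(fault_type, "escalate to the incident commander")


action = first_action("single_node_unhealthy")
unknown = first_action("disk_full")
print(action)
print(unknown)
```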


The Google SRE book lists the following elements of incident management:

  • Recursive separation of responsibilities, that is, a clear division of roles
  • A recognized command post
  • A live incident state document
  • Clear, live handoff

The recursive separation of responsibilities divides the work into the following roles:

  • Incident commander: coordinates the division of labor, takes on any work that has not been assigned, and holds the overall picture. Generally the PM or TL.
  • Operations team: the people who actually handle the incident, split into several small teams by business scenario and system characteristics. Each team has a lead who communicates with the incident commander.
  • Communications lead: the external liaison for the incident, responsible for periodically synchronizing information to the people handling the incident and to concerned parties outside it, and for keeping the incident document up to date in real time.
  • Planning lead: responsible for ongoing external support, for example organizing handover records when a large-scale fault spans several shifts.

Author: Developer Assistant_ LS

This article is original content from Alibaba Cloud and may not be reproduced without permission.