The high availability risks that even Alibaba fears

Time:2022-5-14

I am LeYang, and risk prevention and control is what I enjoy most. I previously took part in building Ant’s Glocal sites from 0 to 1, including their high availability, and I am now working on high availability for Ant security. Whether the subject is a domain, a BG, or a site, the scope and the objects differ, but the idea of high availability is the same. Today I’d like to share some of my thinking on high availability, together with the [nprt formula] that I distilled from it.

This article follows the logic of “what high availability is, why it is needed, how to achieve it, why these methods work, and where software risk comes from”.

High availability is the ability to control risks

High availability is a risk-oriented design approach: it gives the system the ability to control risks and therefore to provide higher availability.

Why high availability

For a company, “why high availability” really means “why does the company want high availability”. Taking the company as the object: on the inside there are people, software (things), and hardware (things); on the outside there are customers, shareholders, and society; and then there is the company itself.

The premise of high availability: nothing is 100% reliable

Everything changes (the only constant is change).

No change is 100% reliable.

Conclusion: nothing is 100% reliable.

Internal cause: neither people nor things are 100% reliable

  • People: people make mistakes.
  • Software: software may have bugs.
  • Hardware: hardware may break.

From a probability standpoint, as long as there are enough changes, the probability that something eventually goes wrong approaches 1.
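
To make this concrete, suppose (purely as an illustrative assumption, not something the text states) that each change goes wrong independently with the same small probability p. The probability of at least one error in n changes is then:

```latex
\[
P(\text{at least one error in } n \text{ changes}) = 1 - (1 - p)^n \longrightarrow 1
\quad \text{as } n \to \infty, \qquad 0 < p \le 1 .
\]
\[
\text{For example, with } p = 0.001 \text{ and } n = 10\,000: \quad
1 - 0.999^{10000} \approx 1 - e^{-10} \approx 0.99995 .
\]
```

Real changes are neither independent nor equally risky, so the numbers are only indicative, but the trend toward 1 holds for any fixed p greater than 0.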

External cause: without high availability, the external impact is large

  • Customers: without high availability, service to customers may be interrupted.
  • Shareholders: without high availability, the stock price may fall.
  • Society: without high availability, social order may be affected.

Root cause (essence): control risk

From the perspective of the company itself: control risks, protect the company’s value, and avoid damage to its fundamentals.

How to achieve high availability

1. Risk-related concepts

  • Risk: the possibility that harm will occur in the future but has not actually occurred yet, written as R.
  • Fault: harm that has occurred or is occurring, i.e. the result of a risk becoming reality.
  • Risk probability: the probability that a risk turns into a fault. It expresses how easily the risk is triggered as a fault, and is written as P(R).
  • Fault influence range: the harm caused by a fault per unit time, written as R(R).
  • Fault influence duration: how long a fault lasts, written as t(R).
  • Fault influence surface: the fault influence range multiplied by the fault influence duration. It represents the total harm done by the fault, and is written as f(R).
  • Risk expectation: the sum, over all risks, of the probability that each risk turns into a fault multiplied by the fault influence surface of that fault. It expresses the potential harm of the risks, and is written as e(R).

2. Risk expectation formula

From the definitions in the previous section, the risk expectation formula can be derived as follows:

$$e(R) = \sum_{i=1}^{n} P(R_i) \cdot f(R_i) = \sum_{i=1}^{n} P(R_i) \cdot R(R_i) \cdot t(R_i)$$

R stands for risk. The risk expectation decreases as the number of risks n decreases and as the P, R, and t of each risk decrease, which is why it is called the nprt formula for short.
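
As a small sketch of how the definitions combine, the following Python computes a risk expectation from a list of risks and shows the effect of lowering t or n. The `Risk` class, the function names, and all numbers are made up for illustration; they are not part of the original formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    name: str
    p: float  # P(R): probability that the risk turns into a fault
    r: float  # R(R): harm per unit time while the fault lasts
    t: float  # t(R): how long the fault lasts


def fault_surface(risk: Risk) -> float:
    """f(R) = R(R) * t(R): total harm of one fault."""
    return risk.r * risk.t


def risk_expectation(risks: List[Risk]) -> float:
    """e(R): sum over all risks of P(R) * f(R)."""
    return sum(risk.p * fault_surface(risk) for risk in risks)


# Made-up numbers, for illustration only.
risks = [
    Risk("bad config change", p=0.01, r=1000.0, t=30.0),
    Risk("disk failure", p=0.001, r=200.0, t=120.0),
]
baseline = risk_expectation(risks)

# Lowering t: halve the duration of every fault.
faster_recovery = [Risk(x.name, x.p, x.r, x.t / 2) for x in risks]
# Lowering n: remove one risk entirely.
fewer_risks = risks[:1]

print(baseline, risk_expectation(faster_recovery), risk_expectation(fewer_risks))
```

Each of the four factors appears as a separate knob here, which is the point of the formula: any one of them can be turned down independently.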

Note: if you want to quote this formula, please indicate the source.

3. The four risk control factors (nprt)

Reduce the number of risks, n

Stay away from the risk at its source, and make sure there is no connection or relationship with whatever carries the risk. The risk probability is then 0, and it no longer matters whether the fault influence surface would have been large or small.

  • For example, during major festival promotions a site-wide change freeze significantly reduces the number of changes, which is a typical way of reducing the number of risks.
  • For example, system A does not rely on Oracle at all, so system A does not need to care about any Oracle risk. Even if the U.S. president suddenly announced that Oracle was banned from use in China with immediate effect, system A would be unaffected.
  • For example, COVID-19 spreads alarmingly from person to person. If you choose not to go to work or go out today, you do not have to worry about being infected by pedestrians or colleagues today.

Reduce the probability that a risk turns into a fault (i.e. make the fault harder to trigger), P

Treat the risk as an adversary and set up checkpoints for it layer by layer, raising the threshold and difficulty of triggering a fault. Do not let the tragedy of “one accidental extra space or character and the system goes down” happen so easily.

  • For example, if person B wants to make a change to system C, you can require person B to pass a change certification test, require the change to be tested offline (or in simulation), and require a code review (CR) of the change. System C can provide a way to preview the effect of the change (similar to a monitor-only mode or a trial run). In case person B intends a malicious, destructive change, you can also add a review by a different person, and system C can add error-proofing for protection.
  • For example, with COVID-19, wearing a mask, washing hands frequently, and ventilating more reduce the probability of infection.

Reduce the influence range of fault, R

Divide the whole into n small units that are isolated from each other. A problem in one unit then affects only that unit, which keeps things small and beautiful.

  • For example, distributed architecture is a model of this. In a centralized system, one failure brings down everything; in a distributed system, one failure brings down only 1/N.
  • For example, with COVID-19, grid-based management restricts movement between provinces or cities, and people crossing provinces must quarantine for 14 days, which effectively limits how far the virus spreads.

Reduce the duration of fault impact, t

The duration of fault impact is determined by the time it takes to discover the fault and the time it takes to stop the bleeding, so both should happen as early as possible.

Discovery is split into early warning (before the fault) and alarms (after the fault). Rely on early warning as much as possible: it buys time to stop the bleeding and may even nip the risk in the bud.

Ways to stop the bleeding include switching, rollback, capacity expansion, degradation or rate limiting, and bug fixes. When a fault occurs, the first priority is to stop the bleeding quickly (for example by switching, rolling back, or expanding capacity); do not spend that time locating the root cause. When the bleeding cannot be stopped quickly, the second priority is to bleed less, for example through degradation and rate limiting.

Efficiency of stopping the bleeding: automatic vs. manual; one click vs. multi-step operation. Replace manual operation with automation wherever possible, and where manual operation remains, aim for a single click to speed things up.

  • For example, for capacity watermarks, you can draw an early warning line below the alarm line, so that the warning arrives early and can be handled calmly.
  • For example, in a distributed application cluster, when any application server has a problem, the load balancer automatically removes it based on heartbeat checks and forwards requests to the other (hot) redundant servers.
  • For example, with COVID-19, every life is unique, so there is no way to switch, roll back, or degrade (for humanitarian reasons). The only option is to treat each case according to its symptoms, slowly.
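
The heartbeat-based removal in the load balancing example above can be sketched roughly as follows. `ServerPool`, its methods, the server names, and the 3-second timeout are all hypothetical; a real load balancer does much more. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
import time
from typing import Dict, List, Optional


class ServerPool:
    """Hypothetical pool that stops routing to servers with stale heartbeats."""

    def __init__(self, servers: List[str], timeout_s: float = 3.0,
                 now: Optional[float] = None) -> None:
        self.timeout_s = timeout_s
        start = time.monotonic() if now is None else now
        self.last_seen: Dict[str, float] = {s: start for s in servers}
        self._counter = 0

    def heartbeat(self, server: str, now: Optional[float] = None) -> None:
        """Record that a server reported in."""
        self.last_seen[server] = time.monotonic() if now is None else now

    def healthy(self, now: Optional[float] = None) -> List[str]:
        """Servers whose last heartbeat is within the timeout."""
        now = time.monotonic() if now is None else now
        return [s for s, seen in self.last_seen.items()
                if now - seen <= self.timeout_s]

    def pick(self, now: Optional[float] = None) -> str:
        """Round-robin over healthy servers only."""
        candidates = self.healthy(now)
        if not candidates:
            raise RuntimeError("no healthy servers left")
        server = candidates[self._counter % len(candidates)]
        self._counter += 1
        return server


pool = ServerPool(["app-1", "app-2", "app-3"], timeout_s=3.0, now=0.0)
pool.heartbeat("app-1", now=5.0)
pool.heartbeat("app-3", now=5.0)
# app-2 has not reported since t=0, so at t=6 it is no longer routed to.
print(pool.healthy(now=6.0))
```

Because removal is automatic, the time to stop the bleeding is bounded by the heartbeat timeout rather than by how quickly a person reacts, which is the "automatic vs. manual" point made above.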

4. Seven core principles of high availability architecture design

According to the nprt formula, high availability architecture design has seven core principles:

Principle of less dependence: do not depend on anything you can avoid depending on; the fewer dependencies, the better (n)

Since nothing is 100% reliable, when two things are related they affect each other and each is a risk to the other: if one goes wrong, the other may be affected. We use “dependency” to refer to this relationship.

For example, a system relies on three relational databases at the same time: Oracle, MySQL, and OB. The principle of less dependence says to rely only on the most mature and stable one, OB, and drop Oracle and MySQL.

When are multiple dependencies appropriate?

When introducing a dependency (n becomes larger) reduces one or more of P, R, and t, and so reduces e(R) overall.

For example, to mitigate DB risk, a distributed cache is introduced; the system remains available as long as the DB and the cache are not both down at the same time.
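
A minimal sketch of that idea, assuming reads can be served from a cache copy when the DB is unreachable. The function name and the two reader callables are hypothetical, and the sketch ignores how the cache is kept fresh.

```python
from typing import Callable


def read_with_fallback(key: str,
                       read_db: Callable[[str], str],
                       read_cache: Callable[[str], str]) -> str:
    """Try the DB first; if it is unreachable, serve the cached copy."""
    try:
        return read_db(key)
    except ConnectionError:
        return read_cache(key)


# Illustrative stand-ins for a real DB client and cache client.
def db_down(key: str) -> str:
    raise ConnectionError("db unavailable")


cache = {"user:1": "Alice"}
print(read_with_fallback("user:1", db_down, cache.__getitem__))
```

The read fails only when both stores fail together, so adding the cache raises n but lowers P for this read path.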

Weak dependence principle: if you must depend on something, depend on it as weakly as possible; the weaker, the better (P)

If A strongly depends on B, then whenever B goes wrong, A goes wrong too: one falls, all fall.

Therefore, turn strong dependencies into weak ones wherever possible, which directly reduces the probability of problems.

  • For example, after a transaction succeeds, the core transaction link needs to issue points and benefits to the user, so the core trading system depends on the points and benefits system. A good approach is a weak, asynchronous dependency, so that when the points and benefits system is unavailable, the core transaction link is most likely unaffected.
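
One way to picture the weak, asynchronous dependency in this example is a hand-off to a queue that is allowed to fail. Everything here (`complete_trade`, `points_queue`, `pending_retry`) is a hypothetical stand-in for real messaging infrastructure.

```python
import queue
from typing import Callable, List

# Hypothetical in-process queue standing in for a real message queue.
points_queue: "queue.Queue[str]" = queue.Queue()
# Orders whose points could not be enqueued, kept for later compensation.
pending_retry: List[str] = []


def complete_trade(order_id: str,
                   enqueue: Callable[[str], None] = points_queue.put_nowait) -> str:
    """Finish the trade; issuing points is best-effort and asynchronous."""
    status = "SUCCESS"  # the core transaction step is assumed to have succeeded
    try:
        enqueue(order_id)  # hand off to the points system, do not wait for it
    except Exception:
        # Weak dependency: remember the order and move on, never fail the trade.
        pending_retry.append(order_id)
    return status


def broken_enqueue(order_id: str) -> None:
    raise ConnectionError("points system unavailable")


print(complete_trade("order-1"))
print(complete_trade("order-2", enqueue=broken_enqueue))
```

The trade result does not depend on the points system being up; the cost is that points may arrive late and need a compensation job.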

Dispersion principle: don’t put all your eggs in one basket; spread the risk (R)

Break the whole up and split it into N parts. Avoid having only one global copy, otherwise the influence range of any problem is 100%.

  • For example, all transaction data is placed in the same database and the same table. If that database goes down, all transactions are affected.
  • For example, you buy a single stock with all your money. If that stock is LeTV, the result is miserable.

Balance principle: spread risks evenly and avoid imbalance (R)

Ideally, each of the N parts is about the same size. Avoid any one part being too large; otherwise, if that part has a problem, the influence range will be too large.

  • For example, an XX application cluster has 1,000 servers, but because of a bug in the traffic routing component, all traffic is directed to 100 of them. The load becomes seriously unbalanced, those servers cannot carry it, and in the end everything crashes. Similar major faults have occurred many times.
  • For example, you buy 10 stocks with all your money, but one of them accounts for 99%. If that stock is LeTV, the result is miserable.

Isolation principle: contain the risk so that it neither spreads nor amplifies (R)

Keep the parts isolated from each other, so that a problem in one does not affect the others and widen the influence range.

  • For example, transaction data is split into 10 databases and 100 tables, but all of them are deployed on the same physical machine. If one large SQL statement on one table saturates the network card, all 10 databases and 100 tables are affected.
  • For example, you split all your money across 10 stocks at 10% each, but all 10 are LeTV.
  • For example, the ancient Battle of Chibi is a classic negative example. Chaining the ships together with iron destroyed the isolation between them, and a single fire burned an army of 800,000.

Isolation comes in levels. The higher the isolation level, the harder it is for a risk to spread, and the stronger the disaster tolerance.

  • For example, an application cluster is made up of N servers, which may be deployed on the same physical machine, on different physical machines in the same data center, in different data centers in the same city, or in different cities. Different deployments give different disaster recovery capabilities.
  • For example, humanity is made up of countless people living on different continents of the same Earth, which means humanity has no isolation at the planetary level. If something devastating happens to the Earth, humanity has no disaster tolerance.

Isolation is an extremely important principle and the precondition for the first four. Without good isolation, the first four principles are fragile: risk spreads and amplifies easily and undermines their effect. A large number of real system faults are caused by poor isolation, such as offline affecting online, pre-release affecting production, or one bad SQL statement affecting the whole database (or the whole cluster).

Dispersion, balance, and isolation are the three core principles for controlling the influence range of a risk: break the whole up into N parts, keep the parts balanced, and isolate them from each other. Then when one part has a problem, the influence range is 1/N.
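
The 1/N claim can be checked with a small sketch that hashes keys onto shards and measures the largest share any one shard holds, i.e. the worst case if a single (well-isolated) shard fails. The hashing scheme, the function names, and the `order-<i>` key format are assumptions chosen for illustration.

```python
import hashlib
from collections import Counter
from typing import Iterable


def shard_for(key: str, n_shards: int) -> int:
    """Map a key to a shard with a stable hash (same result on every run)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def worst_case_influence(keys: Iterable[str], n_shards: int) -> float:
    """Largest fraction of keys that one failed shard can affect."""
    counts = Counter(shard_for(k, n_shards) for k in keys)
    return max(counts.values()) / sum(counts.values())


keys = [f"order-{i}" for i in range(10_000)]
# One global copy: any problem affects 100% of the keys.
print(worst_case_influence(keys, 1))
# Ten shards: the worst case is close to 1/10 when the split is balanced.
print(worst_case_influence(keys, 10))
```

The same function also exposes imbalance: if one shard ends up holding most of the keys, the worst case stays high no matter how many shards exist, which is what the balance principle warns about.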

No-single-point principle: keep redundancy or other versions so that there is always somewhere to fall back to (t)

The ways to stop the bleeding quickly are switching, rollback, capacity expansion, and so on. Rollback and capacity expansion are special cases of switching: rollback means switching to a previous version, and capacity expansion means switching traffic to newly added machines.

Switching only works when there is somewhere to switch to, so there must be no single point (meaning a strongly depended-on single point; a weak dependency can be degraded instead), and there must be a redundant backup or another version. A single point caps the reliability of the whole.

Suppose the reliability of a single point is 99.99%. Raising it to 99.999% is very difficult. But if, instead of a single point, the system depends on two copies (one going down is fine, as long as both do not go down at the same time), the overall reliability becomes 99.999999%, a qualitative improvement.
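
The arithmetic above can be checked with a short sketch. It assumes the two copies fail independently, which the text does not state and which real systems often violate (shared power, shared network, shared bugs); the function names are illustrative.

```python
import math
from typing import Sequence


def redundant_availability(single: float, copies: int) -> float:
    """Availability when the service survives as long as one copy is up."""
    return 1.0 - (1.0 - single) ** copies


def chained_availability(parts: Sequence[float]) -> float:
    """Availability when every part is a strong dependency (all must be up)."""
    return math.prod(parts)


# Two redundant copies at 99.99% each: both must fail at once.
print(f"{redundant_availability(0.9999, 2):.8f}")
# Two strong dependencies at 99.99% each: either failure is fatal.
print(f"{chained_availability([0.9999, 0.9999]):.8f}")
```

Two redundant copies give roughly 99.999999%, while two strong dependencies in a chain give roughly 99.98%, lower than either part alone. This is the numerical reason behind both the no-single-point principle and the weak dependence principle.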

A single point makes it impossible to stop the bleeding quickly and prolongs the whole recovery, so removing single points is very important. “Single point” here covers not only system nodes but also people, such as the people who subscribe to alarms and the people who handle emergencies.

For (important) data nodes, the no-single-point principle is mandatory; otherwise, in extreme cases, data may be lost permanently and can never be recovered. Once an (important) data node has no single point, ensuring data consistency matters more than availability.

  • For example, a merchant supports only one payment channel, which is a typical single point. If that channel goes down, nobody can pay.
  • For example, a family’s entire income depends on the father’s salary. If the father falls ill, there is no income.

The difference between the no-single-point principle and the dispersion principle:

  • When the node is stateless, it is split into N parts that all have the same function and are redundant to each other. In other words, for stateless nodes the dispersion principle is equivalent to the no-single-point principle, and satisfying one satisfies both.
  • When the node is stateful, it is split into N parts that are all different, so there is no redundancy between them, and each part needs its own redundancy. In other words, stateful nodes must satisfy both the dispersion principle and the no-single-point principle.

Principle of self-protection: bleed less, sacrificing one part to protect another (P & R & t)

External input is not 100% reliable: sometimes it contains unintentional errors, sometimes it is deliberately malicious. So design error-proofing for external input to protect yourself.

In extreme cases it may be impossible to stop the bleeding (quickly). Then aim to bleed less by sacrificing one part to protect another, for example through rate limiting and degradation.

  • For example, during peak periods many functions are degraded in advance and rate limiting is applied at the same time, mainly to protect the transaction and payment experience of most users at the peak.
  • For example, when the human body loses too much blood or suffers too much pain, it goes into shock, which is also a typical self-protection mechanism.
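
Rate limiting with a degraded fallback can be sketched with a token bucket, one common technique among several; the text does not say which algorithm is used in practice. The class, the `handle` function, and the parameter values are illustrative, and the caller supplies the clock (for example `time.monotonic()`).

```python
class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        """Refill by elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle(bucket: TokenBucket, now: float) -> str:
    """Serve the full response when allowed, a degraded one otherwise."""
    return "full response" if bucket.allow(now) else "degraded response"


bucket = TokenBucket(rate=1.0, capacity=2.0, now=0.0)
# Three requests arrive at the same instant; the bucket only holds two tokens.
print([handle(bucket, 0.0) for _ in range(3)])
```

The request that exceeds the limit still gets an answer, just a cheaper one, which is the "sacrifice one part to protect another" idea in code form.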

Where is the software risk

The methods of risk control were introduced above. Coming back to software systems: where do their risks lie?

Taking the software system as the object: on the inside it consists of the computing system and the storage system; on the outside there are people, hardware, upstream systems, and downstream systems; and there is also (implicitly) time.

Since every object is made up of other objects, each can be decomposed further (in theory, without end). The decomposition above is mainly meant to keep things simple.

1. Sources of software system risk

Risk comes from (harmful) change. The risk to an object comes from the (harmful) changes of all the objects related to it. The sources of software system risk therefore fall into the following seven categories:

Computing system change: running slowly, running incorrectly

The load on the server resources (such as CPU, MEM, IO), application resources (RPC threads, DB connections, etc.), and business resources (business IDs used up, insufficient balance, insufficient business quota, etc.) that the system depends on affects the risk expectation of running the system.

Storage system change: running slowly, running incorrectly, data errors

The load on the server resources (such as CPU, MEM, IO), storage resources (concurrency, etc.), and data resources (capacity of a single database, capacity of a single table, etc.) that the system depends on, together with data consistency, affects the risk expectation of running the storage system.

Human change: change error

The number of people making changes, their awareness of production safety, their proficiency, the number of changes, and the way changes are made all affect the risk expectation of changes.

Because so many people make so many changes, change has become the number one source of faults at Ant, which is why the “three axes of change” are so well known.

The correct order of the “three axes of change” is “grayscale, monitoring, emergency response”: grayscale release addresses R, while monitoring and emergency response address t.

Something to think about: if you were to add a fourth axe to the three, what do you think it should be?

Hardware change: damage

The quantity, quality, service life, and maintenance of hardware affect the risk expectation of the hardware, and hardware damage can make the software systems running on it unavailable.

Upstream change: request becomes larger

Requests can be viewed at three levels: network traffic (made up of requests to countless APIs), an API (made up of requests for countless keys), and a key.

  • Excessive network traffic causes network congestion and affects all requests sharing that network path.
  • Excessive requests to one API overload the corresponding service cluster, affect all API requests served by those machines, and may even spread further.
  • Excessive requests for one key (commonly known as a “hot key”) overload a single machine, affect all key requests on that machine, and may even spread further.

Therefore, when preparing for big promotions, we should not only guarantee the capacity of the core APIs but also consider network traffic and hot keys.
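
As a rough sketch of spotting hot keys, the following counts requests per key over a sample window and reports keys that exceed a share of the traffic. The function name, the 20% default threshold, and the `sku-*` keys are made up for illustration; production systems usually rely on sampling or approximate counters instead of exact counts.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_hot_keys(requests: Iterable[str],
                  share_threshold: float = 0.2) -> List[Tuple[str, float]]:
    """Return (key, share of traffic) for keys at or above the threshold."""
    counts = Counter(requests)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(key, count / total)
            for key, count in counts.most_common()
            if count / total >= share_threshold]


# Illustrative request log: one key dominates the traffic.
sample = ["sku-1"] * 60 + ["sku-2"] * 30 + ["sku-3"] * 10
for key, share in find_hot_keys(sample, share_threshold=0.25):
    print(key, share)
```

Keys found this way are candidates for local caching or for splitting across machines before the promotion starts.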

Downstream change: slow response, wrong response

The number of downstream services, their service levels, and their availability affect the risk expectation of the downstream. A slow downstream response may slow down the upstream, and a wrong downstream response may corrupt the upstream’s results.

Time change: expiration

Expiration is often overlooked, yet it tends to be sudden and globally destructive. Once something expires and triggers a fault, you are left in a very passive position. It is therefore necessary to identify such deadlines in advance and set early warnings, for example for key expiration, certificate expiration, fee expiration, and time boundaries such as crossing time zones, years, months, or days.

  • For example, in December 2018, an expired certificate caused a communication outage of about 4 hours for 30 million users of SoftBank, the Japanese operator.
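
An early warning for expiration can be as simple as a scheduled check over a list of known deadlines. This sketch assumes such an inventory exists; the item names, the dates, and the 30-day window are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def expiring_soon(items: Dict[str, datetime],
                  now: datetime,
                  warn_within: timedelta = timedelta(days=30)) -> List[Tuple[str, int]]:
    """Return (name, days left) for items expiring within the window, soonest first.

    Items that have already expired are included with a negative day count.
    """
    due = []
    for name, expires_at in items.items():
        remaining = expires_at - now
        if remaining <= warn_within:
            due.append((name, remaining.days))
    return sorted(due, key=lambda pair: pair[1])


# Illustrative inventory of things that expire.
utc = timezone.utc
inventory = {
    "api-cert": datetime(2022, 5, 24, tzinfo=utc),
    "signing-key": datetime(2023, 1, 1, tzinfo=utc),
    "domain": datetime(2022, 5, 10, tzinfo=utc),
}
print(expiring_soon(inventory, now=datetime(2022, 5, 14, tzinfo=utc)))
```

The hard part in practice is keeping the inventory complete; the check itself is cheap.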

Each of the above categories of risk can be analyzed and handled one by one using the nprt formula.

2. Number of risks: one begets three, three begets all things

Every thing is made up of other things and is itself a part of other things, in an endless cycle; the number of risks is therefore endless.

Looking inward, things can be infinitely small. A problem at atomic granularity, once it spreads, may affect the availability of a software system, just as a novel coronavirus about 100 nm across can affect the availability of the human body.

Looking outward, things can be infinitely large. If the solar system were destroyed, the availability of software systems would naturally cease to exist.

Although risks are endless, the more we understand them and the more we apply the concepts and principles of risk control, the better we can reduce the risk expectation.

On awe:

  • Our knowledge of the world is limited, and knowing less makes us less afraid and less in awe.
  • What we should really fear is not the rules and penalties, but what we do not know, including what we do not know that we do not know.

Concluding remarks

  • Everything changes.
  • Nothing is 100% reliable.
  • Therefore risk exists. Risk is invisible; what becomes visible is the fault.
  • Risks cannot be eliminated, but they can be avoided and reduced.
  • Faults are inevitable, but they can be postponed, their influence range can be reduced, and their duration can be shortened.

The nprt formula applies not only to software system risk but also to other areas of risk. I hope it is useful to you.