MSHA x chaos disaster recovery high availability practice

Time: 2021-1-20

Authors: Yuan Zhi, Han LAN
Source: Alibaba Cloud official account

Preface

Because the external environment is complex and hardware is unreliable, keeping Internet services highly available is a major challenge, and major Internet companies have repeatedly suffered outages caused by network failures, power failures, and other accidents. When a business becomes unavailable, the consequences range from economic loss and damage to the company's reputation to, for national-scale applications such as WeChat and Alipay, an impact on the national economy and people's livelihood. Since natural and man-made disasters cannot be avoided, building a disaster recovery architecture has become an urgent need for digital enterprises.

In December 2020, Alibaba Cloud's Application High Availability Service (AHAS) released a new functional module, AHAS-MSHA, a multi-active disaster recovery architecture solution that evolved from Alibaba's e-commerce business environment. In this article, we first introduce several important concepts in the field of disaster recovery. Then, using an e-commerce microservice case, we share how the remote multi-active capability (AHAS-MSHA) and the chaos engineering capability (AHAS Chaos) of AHAS help a business build a highly available disaster recovery architecture.

Disaster recovery and evaluation indicators

1. What is disaster recovery?

Disaster recovery refers to establishing two or more systems with the same functions in geographically separate locations. The systems monitor each other's health and can take over from one another. When one system stops working because of an accident (such as fire, flood, earthquake, or sabotage), the whole application can switch to another location so that the system continues to work normally.

2. How to evaluate the disaster recovery capability?

A disaster recovery system exists mainly to avoid business interruption when a disaster occurs. How can disaster recovery capability be evaluated and quantified? The industry commonly uses the following two indicators:

  • RPO (Recovery Point Objective)

The recovery point objective is measured in time: it is the point in time to which the system and its data must be recoverable after a disaster. RPO expresses the maximum amount of data loss the system can tolerate. The less data loss the system can tolerate, the smaller the RPO.

  • RTO (Recovery Time Objective)

The recovery time objective is the required time from when an information system or business function stops after a disaster until it is restored. RTO expresses the longest service interruption the system can tolerate. The more urgent the service, the smaller the RTO.
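
As a rough illustration of how the two indicators can be checked for a single incident, the sketch below derives the achieved recovery point and recovery time from three timestamps and compares them with target values. The function name, the timestamps, and the targets are hypothetical and are not part of MSHA.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_synced: datetime, failed_at: datetime, restored_at: datetime):
    """Return (data loss window, downtime) for one incident.

    The data loss window is compared against the RPO,
    the downtime against the RTO.
    """
    data_loss_window = failed_at - last_synced
    downtime = restored_at - failed_at
    return data_loss_window, downtime


# Hypothetical incident: data was last synchronized 30 seconds before the
# failure, and service was restored 5 minutes after it.
loss, downtime = recovery_metrics(
    last_synced=datetime(2021, 1, 20, 10, 0, 0),
    failed_at=datetime(2021, 1, 20, 10, 0, 30),
    restored_at=datetime(2021, 1, 20, 10, 5, 30),
)

# Hypothetical targets: RPO of 1 minute, RTO of 10 minutes.
meets_targets = loss <= timedelta(minutes=1) and downtime <= timedelta(minutes=10)
```

A smaller RPO demands more frequent (or synchronous) data replication, while a smaller RTO demands faster switching to a healthy system.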

AHAS-MSHA

1. Introduction

MSHA (Multi-Site High Availability) is a multi-active disaster recovery architecture solution (solution = technical products + consulting services + ecosystem partners). It decouples business recovery from fault recovery, supports rapid business recovery in fault scenarios, and helps enterprises build disaster recovery stability.

1) Product architecture

MSHA uses a remote multi-active architecture. The core idea of a disaster recovery architecture is "isolated redundancy". We call each redundant logical data center a unit. MSHA keeps traffic closed within a unit and isolates units from one another, so that the blast radius of a fault is contained within one unit. This not only solves the disaster recovery problem and improves business continuity, but also makes capacity expansion possible.

2) Comparison of mainstream disaster recovery architectures

[Figure: comparison of mainstream disaster recovery architectures]

2. Functional characteristics

  • Fast fault recovery

Following the principle of "recover first, then locate", MSHA provides disaster recovery traffic switching. On the premise that data is protected, it decouples business recovery time from fault recovery time and so ensures business continuity.

  • Remote capacity expansion

As a business grows rapidly, it runs into the limited resources of a single location and problems such as database bottlenecks. With MSHA, business units can be added in other regions and data centers to scale out horizontally and quickly.

  • Traffic distribution and error correction

MSHA corrects and verifies traffic layer by layer, from the access layer to the application layer. Calls that do not conform to the traffic routing rules are forwarded again according to the rules, which keeps the blast radius of a fault within one unit.

  • Protection against dirty data writes

Writing data in multiple units can cause dirty writes that overwrite one another. MSHA prohibits writes when traffic enters the wrong unit, and prohibits writes and updates during the data synchronization delay of a traffic switch.
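
The two write protections described above can be sketched as a simple guard. This is a minimal illustration under our own assumptions, not MSHA's implementation: the class name `WriteGuard`, its fields, and the unit names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WriteGuard:
    """Hypothetical guard deciding whether a unit may accept a write."""

    local_unit: str
    # Users whose data is still being synchronized after a traffic switch.
    switching_users: set = field(default_factory=set)

    def allow_write(self, owner_unit: str, user_id: int) -> bool:
        # Protection 1: traffic entered the wrong unit, so writing here
        # could overwrite newer data held by the owning unit.
        if owner_unit != self.local_unit:
            return False
        # Protection 2: the user was just switched and data synchronization
        # has not caught up yet, so writes and updates are held back.
        if user_id in self.switching_users:
            return False
        return True


guard = WriteGuard(local_unit="hangzhou", switching_users={6000})
```

With this guard, a write for a user owned by another unit is rejected, and so is a write for user 6000 until that user is removed from `switching_users`.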

3. Application scenarios

MSHA can be used to build a multi-active disaster recovery architecture in the following typical business scenarios:

  • Read-more, write-less business

    • Business scenarios: typically information and shopping guide services (such as product browsing and news).
    • Data characteristics: reads far outnumber writes, the core is the read business, and temporary unavailability of the write business is acceptable.
  • Flow-document business

    • Business scenarios: typically e-commerce transactions and billing record services (such as orders and call records).
    • Data characteristics: the data can be partitioned along some dimension, and eventual consistency of the data is acceptable.

Business disaster recovery practice

Next, through an e-commerce microservice case, we introduce how to build disaster recovery architectures for different scenarios.

1. Business background of e-commerce

1) Business application

  • Frontend: the portal web application, responsible for interacting with users.
  • Cartservice: the shopping cart application. It records users' shopping cart data and uses a self-hosted Redis.
  • Product service: the commodity application. It provides commodity and inventory services and uses RDS MySQL.
  • Checkoutservice: the order application. It generates purchase orders from the goods in the shopping cart and uses RDS MySQL.

2) Technology stack

  • SpringBoot
  • RPC framework: Spring Cloud, with Eureka as the registry

3) E-commerce application architecture 1.0

In its early stage, the e-commerce business, like many Internet companies, did not consider disaster recovery and was deployed in a single region.

2. Case 1: disaster recovery for a read-more, write-less business

1) A fault occurs

The e-commerce business grew rapidly in its early days, and the small but elegant single-region deployment stayed unchanged until a fault in the commodity application paralyzed the business and left its pages inaccessible for a long time. The fault was eventually fixed, but the customer churn and reputational damage it caused hurt the rapidly growing business and forced us to consider building high availability.

The e-commerce business consists mainly of shopping guide, shopping cart, and trading scenarios, and shopping guide is the first in line. It is a typical read-more, write-less scenario whose core is displaying the shopping guide pages (the read link); temporary unavailability of commodity publishing and listing (the write link) is usually acceptable. Based on our own disaster recovery demands, we first set a modest improvement goal: "remote multi-read".

2) Building the remote multi-read disaster recovery architecture

Based on MSHA, the shopping guide business is transformed into "remote multi-read".

Multi-active transformation and MSHA access:

  • Partition dimension: use userid as the traffic identifier.
  • Transformation scope: deploy the portal web application and the commodity application, which are on the shopping guide link, in two regions.
  • Control configuration: in the MSHA console, configure the multi-active resources of each layer.

3) Reproducing the fault

Completing the disaster recovery architecture is not the end: we still need to verify that the disaster recovery capability meets expectations. Next, we reproduce the historical fault and verify the capability by injecting real faults.

[Drill preparation]

Business monitoring indicators: using MSHA traffic monitoring or other monitoring capabilities, determine the steady-state monitoring indicators of the business, so that the impact of a fault can be judged when it occurs and the actual recovery can be judged afterwards.

Drill expectations

  • The shopping guide link weakly depends on the shopping cart application (the shopping guide page shows the number of goods in the shopping cart). A failure of a weak dependency should not affect the business.
  • The shopping guide link strongly depends on the commodity application. A failure of a strong dependency makes the service unavailable, and the blast radius of the failure should be contained within the unit.
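
The difference between the two kinds of dependency can be sketched in a few lines. This is only an illustration of the expectation, not code from the demo application: `render_guide_page` and its two callables are hypothetical names.

```python
def render_guide_page(fetch_products, fetch_cart_count):
    """Build the shopping guide page from a strong and a weak dependency."""
    # Strong dependency: without commodity data there is no page to show,
    # so any error propagates to the caller.
    products = fetch_products()

    # Weak dependency: the cart count is decorative, so a failure is
    # swallowed and the page is rendered without it.
    try:
        cart_count = fetch_cart_count()
    except Exception:
        cart_count = None

    return {"products": products, "cart_count": cart_count}
```

A drill on the weak dependency should therefore leave the page available, while a drill on the strong dependency should make the request fail.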

[Fault drill]

The AHAS Chaos fault drill function makes it convenient to drill a variety of fault scenarios.

Stage 1: weak dependency fault drill
  • Fault injection: inject a fault into the shopping cart application

    • Expectation: the shopping guide business is not affected
    • Result: the shopping guide page opens normally, as expected

Stage 2: strong dependency fault drill

Before the drill, the routing rules are configured so that the result of userid % 10000 is matched against the routing range of each unit.

[Figure: routing range rules]

  • Fault injection: inject a fault into the commodity application in the Beijing unit

    • Expectation: a user with userid = 6000 is routed to the Beijing unit and is affected by the fault
    • Result: the shopping guide page cannot be accessed normally, as expected

  • Blast radius verification: verify whether the blast radius is contained within the faulty unit

    • Expectation: a user with userid = 50 is not affected by the fault in the Beijing unit
    • Result: the shopping guide page is accessed normally, as expected
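
The routing used in this drill can be sketched as follows. The concrete ranges are an assumption on our part, chosen only so that userid = 6000 lands in the Beijing unit and userid = 50 in the Hangzhou unit as in the drill; the rule format and function name are hypothetical as well.

```python
# Assumed ranges over userid % 10000 (not the actual configured values).
ROUTING_RULES = [
    (range(0, 5000), "hangzhou"),
    (range(5000, 10000), "beijing"),
]


def route_unit(user_id: int, rules=ROUTING_RULES) -> str:
    """Return the unit that owns this user's traffic."""
    slot = user_id % 10000
    for slots, unit in rules:
        if slot in slots:
            return unit
    raise ValueError(f"no routing rule covers slot {slot}")
```

Because every user maps to exactly one unit, a fault injected into the Beijing unit only affects the users whose slot falls in Beijing's range, which is what the blast radius check verifies.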

4) Recovery by traffic switching

In the fault scenario, the MSHA traffic switching function is used to verify the disaster recovery capability.

  • Disaster recovery switch verification: switch userid = 6000 to the Hangzhou unit

    • Expectation: after the traffic switch, the user is routed to the Hangzhou unit and is no longer affected by the fault in the Beijing unit.
    • Result: the shopping guide page is accessed normally (the original article shows the actual call chain of the shopping guide request in an animated diagram), and the disaster recovery capability meets expectations.

[Animated figure: actual call chain of the shopping guide request after the traffic switch]
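
Conceptually, a traffic switch changes which unit owns a routing range. The sketch below shows only that routing change, under assumed ranges and hypothetical function names; it leaves out what a real switch must also handle, such as write prohibition while data synchronization catches up.

```python
def route_unit(user_id: int, rules) -> str:
    """Return the unit that owns this user's traffic."""
    slot = user_id % 10000
    for slots, unit in rules:
        if slot in slots:
            return unit
    raise ValueError(f"no routing rule covers slot {slot}")


def switch_traffic(rules, from_unit: str, to_unit: str):
    """Return new rules with every range of from_unit reassigned to to_unit."""
    return [(slots, to_unit if unit == from_unit else unit) for slots, unit in rules]


# Assumed ranges, chosen so that userid = 6000 starts in the Beijing unit.
before = [(range(0, 5000), "hangzhou"), (range(5000, 10000), "beijing")]
after = switch_traffic(before, from_unit="beijing", to_unit="hangzhou")
```

After the switch, `route_unit(6000, after)` resolves to the Hangzhou unit, so the user no longer reaches the faulty Beijing unit. Switching the traffic back afterwards is the same operation in the opposite direction on the original rules.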

Follow-up: ending the drill

  • Terminate the fault injection
  • Report the drill results and record the risks identified during the drill
  • Switch the traffic back
  • Check whether the steady-state business indicators have recovered

3. Case 2: disaster recovery for a flow-document business

1) A new fault

After the transformation above, the shopping guide business can withstand region-level faults. But a widespread fault in the order application became the last straw that broke the order business, so building a high availability architecture for the order business was also put on the agenda.

Placing an order is a typical flow-document scenario and, compared with shopping guide, a more complex read-write business. Based on the business scenario and its disaster recovery demands, we chose a suitable disaster recovery scheme: "remote multi-active".

2) Building the remote multi-active disaster recovery architecture

Based on MSHA, the order business is transformed into "remote multi-active".

Note: the order link strongly depends on the shopping cart application. For complete multi-active disaster recovery, the shopping cart application should later also be transformed into "remote multi-active".

Multi-active transformation and MSHA access

  • Transformation scope: deploy the order application and the order database in two regions.
  • MSHA access: install the Agent on the applications of the order link, so that Spring Cloud RPC cross-unit routing and data dirty-write protection work without intrusive code changes.
  • Control configuration:

[Figure: control configuration]

3) Reproducing the fault

After the disaster recovery architecture is built, we reproduce the historical fault and verify the disaster recovery capability by injecting real faults.

[Drill preparation]

Business monitoring indicators: using MSHA traffic monitoring or other monitoring capabilities, determine the steady-state monitoring indicators of the business.

Drill expectations: the order link strongly depends on the order application. A failure of a strong dependency makes the service unavailable, and the blast radius of the failure should be contained within the unit.

[Fault drill]

Before the drill, the routing rules are configured so that the result of userid % 10000 is matched against the routing range of each unit.

[Figure: routing range rules]

  • Fault injection: inject a fault into the order application in the Beijing unit

    • Expectation: a user with userid = 6000 is routed to the Beijing unit and is affected by the fault
    • Result: placing an order fails, as expected

  • Blast radius verification: verify whether the blast radius is contained within the faulty unit

    • Expectation: a user with userid = 50 is not affected by the fault in the Beijing unit
    • Result: placing an order succeeds, as expected

4) Recovery by traffic switching

The MSHA traffic switching function is used to verify the disaster recovery switching capability in the fault scenario.

  • Disaster recovery switch verification: switch userid = 6000 to the Hangzhou unit

    • Expectation: after the traffic switch, the user is routed to the Hangzhou unit and is no longer affected by the fault in the Beijing unit.
    • Result: placing an order succeeds (the original article shows the actual call chain of the order request in an animated diagram), and the disaster recovery capability meets expectations.

Summary

In this article, we introduced MSHA, the multi-active disaster recovery solution that AHAS provides as a powerful tool for business disaster recovery. Using an e-commerce business, we covered two typical scenarios, "read-more, write-less" and "flow-document", and gave practical methods for building a disaster recovery architecture. We also used the AHAS Chaos fault drill function to simulate faults that could really occur and verify whether the disaster recovery capability meets expectations.

MSHA on the public cloud has entered public beta and offers an e-commerce demo of the two business scenarios in this article (which can be tried without activating the service). You are welcome to apply for a trial.

Finally, disaster recovery is a systematic project: it cannot be accomplished overnight, and it is not a one-off effort. An appropriate disaster recovery architecture scheme has to be evaluated and formulated according to the business scenario, disaster recovery demands, technology stack, disaster recovery budget, and so on. You are welcome to consult us and discuss your own disaster recovery demands and scenarios.

Extended reading

To join the MSHA discussion group, search for DingTalk group number 31623894.
