Multi-center disaster recovery in practice: how to achieve true cross-region multi-active?

Date: 2021-09-16

Introduction: The core technology behind a true cross-region multi-active architecture is bidirectional data synchronization among three or more centers. Based on a three-center, cross-border scenario, this article shares a multi-center disaster recovery architecture and its implementation, introduces several distributed ID generation algorithms, and walks through how eventual consistency is achieved during data synchronization.


I. Background

Why call it *true* cross-region multi-active? Multi-active is not a new term, but few systems seem to have achieved it in the real sense. It usually takes one of two forms. In the first, the application is deployed in two or more locations in the same city, with a single-writer, multiple-reader database setup (mainly to guarantee data consistency); when the primary write database goes down, traffic switches to the standby. In the second, unitized services, each unit holds only a subset of the data, so if one unit goes down, its traffic cannot be switched to another unit. We do also see dual-center deployments where both centers hold the full data set, but there is still a big gap between two centers and truly many. The limiting factor is mainly data synchronization capability: bidirectional synchronization among three or more centers is the core technology for true cross-region multi-active.

Speaking of data synchronization, we have to mention DTS (Data Transmission Service). Initially, Alibaba's DTS had no bidirectional synchronization capability; the later cloud version supports it, but only between two databases, not a topology like A <-> B <-> C. So we built our own data synchronization component. We did not want to reinvent the wheel, but there was no alternative. Some implementation details are introduced later.

Why do multi-center disaster recovery at all? Taking my CDN & video cloud team as an example: the first driver is overseas business. To let overseas users access our services from a nearby location, we need an overseas center. But most of the business is domestic, so we also build dual centers in China, preventing a failure of the core database from taking down the whole control plane. The overseas environment is also more complex; if the overseas center goes down, the domestic centers can take over. Another big advantage of domestic dual centers is that routing strategies can spread the load of a single center. This three-center, cross-border scenario is probably the hardest setting in which to achieve cross-region multi-active today.

II. System CAP

For a global, cross-region distributed system, we have to talk about the CAP theorem. To serve the full data set from multiple centers, partition tolerance must be handled; and per CAP, we can then fully satisfy only one of consistency and availability. For an online application, availability goes without saying, so eventual consistency is the best choice.


III. Design principles

1. Data partitioning

Choose a data dimension for sharding so that each business can be deployed independently in different data centers. Primary keys need to be generated as distributed IDs so that no primary key conflicts arise during data synchronization.

Several distributed ID generation algorithms are introduced below.

Snowflake algorithm

1) Algorithm description

+--------------------------------------------------------------------------+
| 1 Bit Unused | 41 Bit Timestamp |  10 Bit NodeId  |   12 Bit Sequence Id |
+--------------------------------------------------------------------------+
  • The highest bit is the sign bit, always 0, and is not used.
  • 41 bits of timestamp, at millisecond precision; 41 bits last for about 69 years. Another important property of the time bits is that IDs can be sorted by time.
  • 10 bits of machine ID, supporting up to 1024 deployed nodes.
  • 12 bits of sequence number, a self-incrementing counter that lets the same node generate multiple IDs within the same millisecond; 12 bits allow each node to generate 4096 IDs per millisecond.
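The bit layout above can be sketched as a minimal Python generator. The custom epoch is an assumption for illustration (the article does not specify one), and the sketch spin-waits when the 12-bit sequence overflows within a millisecond:

```python
import threading
import time

EPOCH_MS = 1609459200000  # assumed custom epoch: 2021-01-01 UTC


class Snowflake:
    """Sketch of a Snowflake ID: 41-bit ms timestamp | 10-bit node | 12-bit sequence."""

    def __init__(self, node_id: int):
        assert 0 <= node_id < 1024  # 10 bits -> at most 1024 nodes
        self.node_id = node_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12 bits: 4096 IDs per ms
                if self.seq == 0:
                    # sequence exhausted in this millisecond: wait for the next one
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.node_id << 12) | self.seq
```

Note the clock-rollback weakness described above: if `time.time()` jumps backwards, this sketch can emit duplicate IDs.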

2) Algorithm summary

Advantages:

  • Fully stateless and requires no network calls; efficient and reliable.

Disadvantages:

  • Depends on the machine clock; if the clock is wrong (e.g. clock rollback), duplicate IDs may be generated.
  • Capacity is limited: 41 bits last about 69 years, which is generally enough.
  • Concurrency is limited: a single machine can generate at most 4096 IDs per millisecond.
  • Only suitable for Int64 IDs; int32 IDs cannot be generated.

3) Applicable scenario

Generally suitable for Int64 IDs in non-web applications.

Why non-web applications? Because the largest integer JavaScript can represent exactly is 53 bits; beyond that, JavaScript loses precision.

Raindrop algorithm

1) Algorithm description

A 53-bit distributed ID generation algorithm adapted from Snowflake, designed to avoid JavaScript's precision loss.

+--------------------------------------------------------------------------+
| 11 Bit Unused | 32 Bit Timestamp |  7 Bit NodeId  |   14 Bit Sequence Id |
+--------------------------------------------------------------------------+
  • The highest 11 bits are always 0 and are not used, which keeps the ID within JavaScript's precision-safe range.
  • 32 bits of timestamp, at second precision; 32 bits last for about 136 years.
  • 7 bits of machine ID, supporting up to 128 deployed nodes.
  • 14 bits of sequence number, a self-incrementing counter that lets the same node generate multiple IDs within the same second; 14 bits allow a single node to generate 16384 IDs per second.
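A minimal sketch of the 53-bit variant, structured like the Snowflake sketch but at second granularity. The custom epoch is again an assumption:

```python
import threading
import time

EPOCH_S = 1609459200  # assumed custom epoch: 2021-01-01 UTC


class Raindrop:
    """Sketch of a 53-bit ID: 32-bit second timestamp | 7-bit node | 14-bit sequence."""

    def __init__(self, node_id: int):
        assert 0 <= node_id < 128  # 7 bits -> at most 128 nodes
        self.node_id = node_id
        self.last_s = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time()) - EPOCH_S
            if now == self.last_s:
                self.seq = (self.seq + 1) & 0x3FFF  # 14 bits: 16384 IDs per second
                if self.seq == 0:
                    while now <= self.last_s:  # wait for the next second
                        now = int(time.time()) - EPOCH_S
            else:
                self.seq = 0
            self.last_s = now
            return (now << 21) | (self.node_id << 14) | self.seq
```

Every ID stays below 2^53, so it survives a round trip through JavaScript's Number type.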

2) Algorithm summary

Advantages:

  • Fully stateless and requires no network calls; efficient and reliable.

Disadvantages:

  • Depends on the machine clock; if the clock is wrong (e.g. clock drift or rollback), duplicate IDs may be generated.
  • Capacity is limited: 32 bits last about 136 years, which is generally enough.
  • Concurrency limit is lower than Snowflake's.
  • Only suitable for Int64 IDs; int32 IDs cannot be generated.

3) Applicable scenario

Suitable for web applications using Int64-typed IDs; the capacity is generally sufficient.

Partition independent allocation algorithm

1) Algorithm description

ID segments are assigned to different units and managed independently by each unit. Machines within the same unit obtain IDs centrally through a shared Redis in that unit.

In other words, each unit is pre-allocated a batch of IDs, which are then allocated centrally within the unit.

For example, int32 ranges from -2147483648 to 2147483647. If we use the range [1, 2100000000) and let the leading digits denote the region, each region supports 100,000,000 (one hundred million) IDs; the ID format can be expressed as [0-20][00000000-99999999].

That is, an int32 can support 20 units, each with 100 million IDs.
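The scheme above can be sketched as follows. The in-process counter is a stand-in for the shared per-unit counter (e.g. a Redis `INCR`) that the article describes; the class name and capacity constant are illustrative:

```python
import itertools

REGION_CAPACITY = 100_000_000  # 8 decimal digits of ID space per region


class RegionAllocator:
    """Sketch: int32 IDs of the form <region 0-20><8-digit sequence>.

    The itertools counter stands in for a shared per-unit counter
    (e.g. Redis INCR) so the example is self-contained.
    """

    def __init__(self, region: int):
        assert 0 <= region <= 20
        self.region = region
        self.counter = itertools.count(0)

    def next_id(self) -> int:
        seq = next(self.counter)
        if seq >= REGION_CAPACITY:
            raise OverflowError("region ID space exhausted")
        return self.region * REGION_CAPACITY + seq
```

Region 20's largest ID is 2,099,999,999, which still fits below the int32 maximum of 2,147,483,647.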


2) Algorithm summary

Advantages:

  • Stateless across regions with no cross-region network calls; reliable and unique.

Disadvantages:

  • Partition capacity is limited; business volume must be estimated in advance.
  • The generation order cannot be inferred from the ID.

3) Applicable scenario

Suitable for int32 ID allocation, for businesses whose per-region capacity ceiling can be estimated in advance.

Centralized allocation algorithm

1) Algorithm description

The central service can be Redis, ZooKeeper, or a database auto-increment ID used for centralized allocation.
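As a minimal sketch of the database-auto-increment flavor, here is an allocator backed by an in-memory SQLite table (a stand-in for the real central database; with Redis the equivalent would be a single `INCR` call):

```python
import sqlite3


class CentralAllocator:
    """Sketch: centralized ID allocation via a database auto-increment column.

    Uses an in-memory SQLite database as a stand-in for the real
    central store mentioned in the article (Redis / ZooKeeper / DB).
    """

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE id_alloc (id INTEGER PRIMARY KEY AUTOINCREMENT, tag TEXT)"
        )

    def next_id(self, tag: str = "") -> int:
        # Every insert consumes one auto-increment value; lastrowid is the new ID.
        cur = self.db.execute("INSERT INTO id_alloc (tag) VALUES (?)", (tag,))
        self.db.commit()
        return cur.lastrowid
```

The trade-off is exactly the one listed below: globally increasing, reliably unique IDs, at the cost of a hard dependency on the central service.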

2) Algorithm summary

Advantages:

  • Globally increasing.
  • Reliably unique IDs.
  • No capacity or concurrency limits.

Disadvantages:

  • Increases system complexity and introduces a hard dependency on the central service.

3) Applicable scenario

Suitable when a reliable central service is available, and for int32 businesses that cannot use partition-independent allocation.

Summary

Each allocation algorithm has its applicable scenarios, and the right one should be chosen according to business needs, considering several factors:

  • Whether the ID type is Int64 or int32.
  • The business's capacity and concurrency requirements.
  • Whether the IDs need to interact with JavaScript.

2. Center closure

Keep calls within a single center as much as possible and avoid cross-data-center calls. This is partly for user experience, since local calls have lower RT, and partly to prevent the same data from being written in two centers at once, causing conflicting overwrites. One or more routing mechanisms can be chosen, such as ADNS routing by region, Tengine routing by user attribute, or sidecar-based routing. The specific implementations are not discussed here.

3. Eventual consistency

The two principles above in fact pave the way for eventual consistency: data synchronization sacrifices some real-time freshness, so we partition the data and close each center, guaranteeing timely responses to user requests and real-time accuracy of the data each center serves.

As mentioned earlier, since DTS support was not complete, we implemented data synchronization on top of DRC (an Alibaba-internal data subscription component, similar to canal). Below is how we achieved consistency, including some detours taken along the way.

Receiving DRC messages in order

To receive DRC messages in order, the first idea is single-machine consumption, but a single machine makes data transfer slow. Solving that requires concurrency. Table-level concurrency comes to mind, but if a single table changes heavily there is still a performance bottleneck. We therefore implemented concurrency at the primary-key level: ordering is strictly preserved for the same primary key, while different primary keys synchronize concurrently, improving throughput by orders of magnitude.
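One common way to get primary-key-level ordering with cross-key concurrency is to hash each key onto a fixed worker queue; the sketch below assumes that dispatch scheme (the article does not describe its internal mechanism), with class and function names chosen for illustration:

```python
import queue
import threading


class KeyOrderedExecutor:
    """Sketch: hash each message's primary key onto a fixed worker thread.

    Messages with the same key land on the same queue and are applied
    in arrival order; different keys run concurrently on other workers.
    """

    def __init__(self, workers: int, apply_fn):
        self.apply_fn = apply_fn
        self.queues = [queue.Queue() for _ in range(workers)]
        self.threads = [
            threading.Thread(target=self._run, args=(q,), daemon=True)
            for q in self.queues
        ]
        for t in self.threads:
            t.start()

    def submit(self, primary_key, message):
        # Same key -> same queue -> strict per-key ordering.
        self.queues[hash(primary_key) % len(self.queues)].put((primary_key, message))

    def _run(self, q):
        while True:
            pk, msg = q.get()
            if pk is None:  # shutdown sentinel
                break
            self.apply_fn(pk, msg)
            q.task_done()

    def close(self):
        for q in self.queues:
            q.put((None, None))
        for t in self.threads:
            t.join()
```

The same idea extends to process-level parallelism by sharding keys across consumer instances instead of threads.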

Single-machine consumption also has a second problem: it is a single point of failure, so we need failover. We use the Raft protocol for leader election and route requests to the leader; when the leader machine goes down, the remaining machines automatically elect a new leader to carry on the synchronization tasks.

Cross-unit message transport

To support cross-unit data synchronization, we use MNS (Alibaba Cloud Message Service). MNS itself is a distributed component and does not guarantee message order. At first, to ensure strong consistency, I used message coloring (tagging) with client-side reordering, implemented as shown in the figure below:

[Figure: message coloring and client-side reordering]

In practice, we found this client-side sorting unreliable: our system cannot wait indefinitely for one message. This is really an eventual consistency problem, discussed in point 3 below. RocketMQ does offer ordered messages, but it did not yet support cross-unit delivery. For data synchronization we only need eventual consistency, so there is no need to sacrifice performance for strong consistency. Also, if an MNS message is not consumed successfully it is not lost; a message is lost only when we explicitly delete it, so every message arrives eventually.

Eventual consistency

Since MNS cannot guarantee strict ordering, and what we do is data synchronization, it suffices to guarantee eventual consistency. In 2012, Eric Brewer, the originator of the CAP theorem, noted in his CAP retrospective that C and A are not completely mutually exclusive, and recommended using CRDTs to ensure consistency. A CRDT (Conflict-free Replicated Data Type) is a theoretical summary of eventual-consistency algorithms over basic data structures: it automatically merges replicas and resolves conflicts according to fixed rules, achieving strong eventual consistency. From the literature, CRDTs require the synchronized operations to satisfy commutativity, associativity, and idempotence. If the operations themselves satisfy these three laws, merging only needs to replay the update operations; this form is called an op-based CRDT. If the operations do not satisfy the laws on their own and extra metadata must be attached, the form is called a state-based CRDT.

Decomposing the DRC stream, there are three database operations: insert, update, and delete. No pair of these operations commutes, so conflicts are possible. We therefore attach extra information at the concurrency granularity (the primary key); here we use a serial number, i.e. the coloring process mentioned in point 2, which we kept. Primary keys are processed concurrently, and we no longer guarantee strict ordering on receipt. Instead we use LWW (last write wins): we execute the current SQL and discard the earlier SQL, so commutativity no longer matters. At the same time, we deduplicate each message by its unique identity (instance + unit + database + MD5(SQL)) so that no SQL is executed twice. For associativity, each operation needs to be analyzed separately.
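The LWW-plus-idempotency rule can be sketched over an in-memory row store. This is a simplified model, not the component's actual code; the fingerprint mirrors the instance + unit + database + MD5(SQL) identity described above:

```python
import hashlib


class LWWApplier:
    """Sketch: per-primary-key last-write-wins with message deduplication."""

    def __init__(self):
        self.rows = {}   # primary key -> (serial number, value)
        self.seen = set()  # fingerprints of messages already applied

    def apply(self, pk, seq, value, instance, unit, database, sql):
        # Idempotence: skip any message we have already executed.
        fp = hashlib.md5(f"{instance}|{unit}|{database}|{sql}".encode()).hexdigest()
        if fp in self.seen:
            return False
        self.seen.add(fp)
        # LWW: a write whose serial number is not newer than the current one loses.
        current = self.rows.get(pk)
        if current is not None and current[0] >= seq:
            return False
        self.rows[pk] = (seq, value)
        return True
```

With this rule, the replicas converge on the write with the highest serial number per key, regardless of arrival order.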

1)insert

Insert does not satisfy associativity by itself: there may be primary key conflicts. We change the statement to insert ignore. Receiving an insert implies either that the record does not yet exist, or that a preceding delete has not arrived yet. When insert ignore returns 0 affected rows, the existing record may differ from the data in this insert, so we convert the insert into an update and execute it again.

2)update

Update naturally satisfies associativity, but one special case must be handled: an execution result of 0 affected rows. That means an insert statement must precede this one but has not arrived yet; in that case, we use the data carried by this statement to turn the update into an insert and execute it again.

3)delete

Delete also naturally satisfies associativity: whatever operations came before, it can simply be executed.

Both the insert and update paths involve a conversion, which relies on the premise that every change received from DRC carries the full set of fields. Some might suggest replacing the conversion with replace into, so why not use it? First, out-of-order arrivals are a minority of cases. Second, we are not merely copying data: DRC resolves a replace into as an update or an insert, which breaks message uniqueness and defeats anti-circular replication, so it is not recommended. The flow chart below may make this clearer:
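The three rules above can be condensed into a small simulation over an in-memory table (a stand-in for the real database; the 0-affected-rows checks model the MySQL return values described in the text):

```python
class SyncApplier:
    """Sketch of the conversion rules for out-of-order changes:
    - INSERT IGNORE affecting 0 rows is retried as an UPDATE;
    - an UPDATE affecting 0 rows is retried as an INSERT;
    - DELETE is applied as-is.
    Assumes every change message carries the full set of fields.
    """

    def __init__(self):
        self.table = {}  # primary key -> row dict (stand-in for the database)

    def _insert_ignore(self, pk, row):
        if pk in self.table:
            return 0  # conflict: 0 rows affected
        self.table[pk] = dict(row)
        return 1

    def _update(self, pk, row):
        if pk not in self.table:
            return 0  # row missing: 0 rows affected
        self.table[pk] = dict(row)
        return 1

    def apply(self, op, pk, row=None):
        if op == "insert":
            if self._insert_ignore(pk, row) == 0:
                self._update(pk, row)        # record exists: overwrite with full fields
        elif op == "update":
            if self._update(pk, row) == 0:
                self._insert_ignore(pk, row)  # insert not arrived yet: create the row
        elif op == "delete":
            self.table.pop(pk, None)          # delete always just executes
```

Whatever order insert/update/delete for one key arrive in, the table converges to the state implied by the full-field payload of the surviving operation.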

[Figure: conversion flow for insert/update/delete]

IV. Disaster recovery architecture

With the above in place, let's look at the shape of the multi-center disaster recovery architecture. Two-level scheduling ensures center closure, and the self-developed synchronization component performs bidirectional multi-center synchronization. We can also build quick-recovery strategies, such as rapidly removing a center from service. Some details need care: for example, while a center is being removed, writes to it should be disabled until its data has finished synchronizing to the other centers, to prevent brief double writes. Since synchronization lag is in milliseconds, the impact is very small.

[Figure: multi-center disaster recovery architecture]

V. Conclusion

Architecture must evolve continuously; you still have to judge what fits your own situation. Comments and discussion on the multi-center architecture and implementation above are welcome.

Our data synchronization component, Hera DTS, is already in use within our BU. The synchronization logic is genuinely complex, especially bidirectional synchronization, which involves many details: resuming from breakpoints, failover, preventing data loss, preventing message re-delivery, preventing circular replication in bidirectional sync, and so on. The component went through a long period of optimization before reaching a stable version.

Author: Developer Assistant_LS
This article is original content from Alibaba Cloud and may not be reproduced without permission.