By Ran Xiaolong. This article was first published on the Tencent Cloud official account; please get in touch before reprinting. It introduces Apache Pulsar's cross-region replication solutions for different scenarios.
The following article comes from Tencent Cloud Middleware, by Ran Xiaolong.
About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. Its architecture separates compute from storage, and it supports multi-tenancy, persistent storage, and cross-region data replication across multiple data centers. As a streaming data store, it offers strong consistency, high throughput, low latency, and high scalability.
Apache Pulsar is a multi-tenant, high-performance solution for service-to-service messaging. Its features include multi-tenancy, low latency, read-write separation, cross-region replication, rapid scaling, and flexible fault tolerance. Pulsar natively supports cross-region replication, even across continents. Combined with its tenant-level and namespace-level abstractions, it can flexibly support several cross-region replication schemes for different scenarios.
Geo-replication brings two benefits. First, services can easily be distributed across multiple data centers. Second, it protects against data-center-level failures: when one data center becomes unavailable, services can be moved to another data center and continue to run.
Apache Pulsar has built-in multi-cluster, cross-region replication. Geo-replication means that clusters located in different physical regions can replicate data between one another once they are configured to do so.
Depending on whether messages are replicated synchronously or asynchronously, cross-region replication falls into the following two modes:
- Synchronous mode: if the disaster recovery requirements for the data are very high, a synchronous cross-city deployment can be used, in which copies of the data exist in different cities. The disadvantage is that fluctuations in the cross-city network have a large impact on performance, because a write must succeed in multiple cities before the result is returned to the client.
- Asynchronous mode: if the disaster recovery requirements are less strict, an asynchronous cross-city deployment can be used. For example, with two independent data centers in Shanghai and Toronto, a message written to Shanghai is written to Toronto asynchronously. The advantage is that the performance of the main write path is not affected; the cost is the storage overhead of one additional copy.
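The trade-off between the two modes can be illustrated with a toy latency model. This is only a sketch: the function names and the millisecond figures are invented for illustration and are not taken from Pulsar.

```python
def sync_ack_latency_ms(write_latencies_ms):
    """Synchronous mode: the client is acknowledged only after every
    city has persisted the message, so the slowest city dominates."""
    return max(write_latencies_ms.values())


def async_ack_latency_ms(write_latencies_ms, local_city):
    """Asynchronous mode: the client is acknowledged as soon as the
    local city has persisted the message; remote copies follow later."""
    return write_latencies_ms[local_city]


# Hypothetical per-city write latencies as seen by a client in Shanghai.
latencies = {"shanghai": 5, "toronto": 180}

sync_ms = sync_ack_latency_ms(latencies)
async_ms = async_ack_latency_ms(latencies, "shanghai")
print(f"sync ack: {sync_ms} ms, async ack: {async_ms} ms")
```

In this model, any jitter on the cross-city link shows up directly in the synchronous acknowledgement time, while the asynchronous acknowledgement time depends only on the local write.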
Next, we discuss Pulsar's cross-region replication schemes in asynchronous mode.
Pulsar currently supports the following three asynchronous cross-region replication schemes:
- Fully connected
- Unidirectional replication
- Failover mode
Depending on whether a configuration store (global ZooKeeper) is used, these asynchronous schemes fall into two groups:
- With configurationStoreServers
  - Fully connected
- Without configurationStoreServers
  - Unidirectional replication
  - Failover mode
A core question throughout cross-region replication is whether the clusters can reach one another. Their interaction depends mainly on the following configuration:
- cluster (the cluster name)
- zookeeper (the local cluster's ZooKeeper servers)
- configuration-store (the global ZooKeeper servers)
When initializing a Pulsar cluster, you can specify this information, for example:
bin/pulsar initialize-cluster-metadata \
  --cluster pulsar-cluster-1 \
  --zookeeper zk1.us-west.example.com:2181 \
  --configuration-store zk1.us-west.example.com:2181 \
  --web-service-url http://pulsar.us-west.example.com:8080 \
  --web-service-url-tls https://pulsar.us-west.example.com:8443 \
  --broker-service-url pulsar://pulsar.us-west.example.com:6650 \
  --broker-service-url-tls pulsar+ssl://pulsar.us-west.example.com:6651
The full-mesh (fully connected) form allows data to be shared among multiple clusters, as shown in the following figure:
- configurationStoreServers: stores the configuration of every cluster, so that clusters can discover one another's addresses. It also stores tenant and namespace information. Its main purpose is to simplify operations: when one cluster's information is updated, the other clusters pick up the change through the global ZooKeeper.
- tenant: the clusters in which the tenant is allowed to operate, specified when the tenant is created (--allowed-clusters)
- namespace: the clusters between which data in the namespace is replicated, specified when the namespace is created (--clusters)
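The way these three pieces of information fit together can be sketched with a small in-memory model. This is a simplified illustration, not Pulsar's implementation: `ConfigStore`, its methods, the cluster names, and the URLs are all hypothetical, and the rule that a namespace may only list clusters allowed for its tenant is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigStore:
    """Toy stand-in for the shared configuration store (global ZooKeeper)."""
    clusters: dict = field(default_factory=dict)    # cluster name -> service URL
    tenants: dict = field(default_factory=dict)     # tenant -> allowed clusters
    namespaces: dict = field(default_factory=dict)  # "tenant/ns" -> replication clusters

    def register_cluster(self, name, service_url):
        self.clusters[name] = service_url

    def create_tenant(self, tenant, allowed_clusters):
        unknown = set(allowed_clusters) - set(self.clusters)
        if unknown:
            raise ValueError(f"unknown clusters: {sorted(unknown)}")
        self.tenants[tenant] = set(allowed_clusters)

    def create_namespace(self, tenant, namespace, clusters):
        # Assumption of this sketch: a namespace can only replicate
        # between clusters that its tenant is allowed to use.
        not_allowed = set(clusters) - self.tenants[tenant]
        if not_allowed:
            raise ValueError(f"not allowed for tenant: {sorted(not_allowed)}")
        self.namespaces[f"{tenant}/{namespace}"] = set(clusters)

    def replication_targets(self, namespace, local_cluster):
        """Remote clusters, with addresses, that local_cluster replicates to."""
        members = self.namespaces[namespace]
        if local_cluster not in members:
            return {}
        return {name: self.clusters[name] for name in sorted(members - {local_cluster})}


store = ConfigStore()
store.register_cluster("beijing", "pulsar://bj.example.com:6650")
store.register_cluster("shanghai", "pulsar://sh.example.com:6650")
store.register_cluster("toronto", "pulsar://to.example.com:6650")
store.create_tenant("my-tenant", ["beijing", "shanghai"])
store.create_namespace("my-tenant", "my-ns", ["beijing", "shanghai"])

bj_targets = store.replication_targets("my-tenant/my-ns", "beijing")
sh_targets = store.replication_targets("my-tenant/my-ns", "shanghai")
print(bj_targets)
print(sh_targets)
```

Because every cluster reads the same store, Beijing and Shanghai each see the other as a replication target, which is what makes the topology fully connected.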
Replication among multiple clusters can be reduced to replication between two clusters. On that basis, the principle of geo-replication is shown in the figure below:
Suppose there are two clusters, deployed in Beijing and Shanghai. When a producer sends data to the Beijing cluster, the data is first written to the local topic (topic1) in the Beijing data center. The broker also creates a replication cursor, which is used to copy the data and records how far replication has progressed. In addition, it creates a replication producer, which reads the data from topic1 in Beijing and writes it to topic1 in Shanghai. When the broker in Shanghai receives the replication producer's request, it writes the data to its local topic of the same name (topic1). If a consumer is then started in the Shanghai data center, it receives the data produced by the producer in Beijing. The same holds in the opposite direction.
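The roles of the replication cursor and the replication producer can be sketched as follows. This is a minimal in-memory model under stated assumptions: the `Topic` and `Replicator` classes are hypothetical, a list index stands in for the cursor position, and the sketch replicates in one direction only.

```python
class Topic:
    """Minimal in-memory topic: an append-only list of messages."""

    def __init__(self, cluster, name):
        self.cluster = cluster
        self.name = name
        self.messages = []

    def publish(self, payload):
        self.messages.append(payload)


class Replicator:
    """Reads from a local topic and re-publishes to the topic of the
    same name in a remote cluster, tracking progress with a cursor."""

    def __init__(self, local_topic, remote_topic):
        self.local_topic = local_topic
        self.remote_topic = remote_topic
        self.cursor = 0  # index of the next local message to replicate

    def run_once(self):
        """Replicate everything published since the last run."""
        pending = self.local_topic.messages[self.cursor:]
        for payload in pending:
            self.remote_topic.publish(payload)  # plays the replication producer
            self.cursor += 1                    # advance only after the remote write
        return len(pending)


beijing = Topic("beijing", "topic1")
shanghai = Topic("shanghai", "topic1")
replicator = Replicator(beijing, shanghai)

beijing.publish("m1")
beijing.publish("m2")
first_batch = replicator.run_once()

beijing.publish("m3")
second_batch = replicator.run_once()

print(shanghai.messages, replicator.cursor)
```

Because the cursor only advances after a message has been written remotely, a replicator that restarts can resume from the cursor instead of copying the topic from the beginning.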
A few points need explaining:
- In the fully connected scenario, data from the Beijing data center is copied to the Shanghai cluster, and data from Shanghai is copied to Beijing. Could data from Beijing be copied to Shanghai and then copied back to Beijing, forming a loop? No. When a producer sends a message, it knows which cluster it belongs to. When the replication producer copies that message, it tags the message with a label, replication_from (named replicated_from in Pulsar's message metadata), which records where the message came from. A message that carries this tag is not replicated again, which prevents reverse replication.
- In the geo-replication scenario, exactly-once message semantics (at-least-once delivery plus broker-side deduplication based on producer name and sequence ID) can still be guaranteed.
- The replication delay depends on the network latency between the two data centers. If the delay is large, check the network link between them.
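The first two points, the origin tag that stops replication loops and broker-side deduplication on producer name plus sequence ID, can be sketched together. This is an illustrative model only: the `Broker` class is hypothetical, replication is performed inline rather than asynchronously, and producer names are assumed to be unique across clusters.

```python
class Broker:
    """Toy broker holding one topic and replicating to its peers."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.topic = []     # stored messages
        self.peers = []     # remote brokers to replicate to
        self.last_seq = {}  # producer name -> highest sequence ID stored

    def publish(self, producer, seq_id, payload, replicated_from=None):
        """Store a message; return False if it is dropped as a duplicate."""
        if seq_id <= self.last_seq.get(producer, -1):
            return False  # broker-side deduplication
        self.last_seq[producer] = seq_id
        self.topic.append(
            {"payload": payload, "producer": producer,
             "seq_id": seq_id, "replicated_from": replicated_from}
        )
        # Only locally produced messages are replicated. Messages that
        # arrived through replication carry an origin tag and stop here.
        if replicated_from is None:
            for peer in self.peers:
                peer.publish(producer, seq_id, payload, replicated_from=self.cluster)
        return True


beijing, shanghai = Broker("beijing"), Broker("shanghai")
beijing.peers, shanghai.peers = [shanghai], [beijing]

first = beijing.publish("producer-bj", 0, "hello")
retry = beijing.publish("producer-bj", 0, "hello")  # client retry of the same message
shanghai.publish("producer-sh", 0, "hi")

print([m["payload"] for m in beijing.topic])
print([m["payload"] for m in shanghai.topic])
```

Each message ends up stored once per cluster: the retry is rejected by the sequence ID check, and the copies that arrive through replication are not sent back to their origin.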
Once a global ZooKeeper is configured, replication is bidirectional, and data is shared among all clusters attached to that global ZooKeeper.
As mentioned above, once a global ZooKeeper is configured there is no way to replicate data in one direction only. In many scenarios, however, we do not need all data to be shared among every cluster, and one-way replication is a better fit. Note that unidirectional replication does not require a separate configurationStoreServers deployment: you simply set configurationStoreServers to the ZooKeeper address of the local cluster (zookeeperServers).
So how do we replicate across clusters without configuring a global ZooKeeper?
As mentioned above, the global ZooKeeper mainly stores the address information of the clusters and the corresponding namespace information; it holds no other metadata. In the one-way replication scenario, therefore, you have to provide this information to the clusters in the other data centers yourself, so that each cluster can read the namespace information it needs about the others.
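One way to picture this is a model in which every cluster keeps its own metadata and only replicates to the clusters listed in its local copy of the namespace. The sketch below is an assumption-laden illustration, not Pulsar's implementation: the `Cluster` class, its methods, and the names `cluster-a`, `cluster-b`, and `my-tenant/my-ns` are all hypothetical.

```python
class Cluster:
    """Toy cluster whose metadata lives only in its own local store."""

    def __init__(self, name):
        self.name = name
        self.known_clusters = {name}  # clusters registered in the local store
        self.namespace_clusters = {}  # namespace -> clusters listed locally

    def register_remote(self, remote_name):
        """Tell this cluster about another cluster by hand."""
        self.known_clusters.add(remote_name)

    def create_namespace(self, namespace, clusters):
        unknown = set(clusters) - self.known_clusters
        if unknown:
            raise ValueError(f"{self.name} does not know: {sorted(unknown)}")
        self.namespace_clusters[namespace] = set(clusters)

    def replication_targets(self, namespace):
        """Remote clusters this cluster replicates the namespace to."""
        listed = self.namespace_clusters.get(namespace, set())
        return sorted(listed - {self.name})


source = Cluster("cluster-a")
backup = Cluster("cluster-b")

# cluster-a is told about cluster-b and lists both for the namespace.
source.register_remote("cluster-b")
source.create_namespace("my-tenant/my-ns", ["cluster-a", "cluster-b"])

# cluster-b creates the same namespace but lists only itself.
backup.create_namespace("my-tenant/my-ns", ["cluster-b"])

a_targets = source.replication_targets("my-tenant/my-ns")
b_targets = backup.replication_targets("my-tenant/my-ns")
print(a_targets, b_targets)
```

Since the two clusters no longer share one view of the namespace, replication flows from `cluster-a` to `cluster-b` and nothing flows back.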
Failover mode is a special case of one-way replication.
In failover mode, the cluster in the remote data center is used only for data backup and has no producers or consumers of its own. Only after the active cluster goes down are the producers and consumers switched to the standby cluster, where they carry on. Because the subscription is replicated as well, the subscription state is copied to the standby data center along with the data.
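The switch-over can be sketched with a small model in which both messages and subscription positions are copied to the standby. This is a simplification for illustration: `ClusterState`, `replicate`, `consume`, and `pick_cluster` are hypothetical names, and state is copied in one step here, whereas a real system would likely copy it with some lag, so a consumer may see a few messages again after switching.

```python
class ClusterState:
    """Toy cluster: one topic plus per-subscription read positions."""

    def __init__(self, name):
        self.name = name
        self.messages = []
        self.subscriptions = {}  # subscription name -> index of next message
        self.up = True


def replicate(active, standby):
    """Copy messages and subscription positions to the standby."""
    standby.messages = list(active.messages)
    standby.subscriptions = dict(active.subscriptions)


def consume(cluster, subscription):
    """Return the next message for the subscription, or None if caught up."""
    position = cluster.subscriptions.get(subscription, 0)
    if position >= len(cluster.messages):
        return None
    cluster.subscriptions[subscription] = position + 1
    return cluster.messages[position]


def pick_cluster(active, standby):
    """Clients use the active cluster unless it is down."""
    return active if active.up else standby


active, standby = ClusterState("active"), ClusterState("standby")
active.messages = ["m1", "m2", "m3"]

before = [consume(active, "my-sub"), consume(active, "my-sub")]
replicate(active, standby)

active.up = False  # the active cluster goes down
target = pick_cluster(active, standby)
after = consume(target, "my-sub")

print(before, target.name, after)
```

The consumer resumes on the standby at the position it had reached on the active cluster, instead of starting again from the first message.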