In depth understanding of zookeeper core principles


The previous article, "Basic principle of zookeeper & detailed explanation of application scenarios", introduced ZooKeeper's basic principles and application scenarios in detail, including the underlying storage principle and how to use ZooKeeper to implement a distributed lock. But I think that only scratches the surface of ZooKeeper. So this article looks in detail at ZooKeeper's core underlying principles. Readers who are not yet familiar with ZooKeeper can go back to that article first.


The znode is the foundation of ZooKeeper and its smallest unit of data storage. ZooKeeper abstracts a storage structure similar to a file system into a tree, and each node in the tree is called a ZNode. Each znode maintains a data structure that records version numbers for changes to its data and to its ACL (access control list), along with timestamps.

With these version numbers and the update timestamps, ZooKeeper can check whether a client's cached data is still valid and can coordinate updates.

Moreover, when a ZooKeeper client performs an update or a delete, it must supply the version number of the data it wants to modify. If ZooKeeper finds that the supplied version number does not match the current one, the operation is not performed. If it matches, the version number is updated together with the data in the znode.
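The version check described above is a form of optimistic concurrency control. Here is a minimal Python sketch of the idea. It is not ZooKeeper's implementation, and the names `ZNode`, `set_data`, and `BadVersionError` are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass


class BadVersionError(Exception):
    """Raised when the caller's version does not match the node's version."""


@dataclass
class ZNode:
    data: bytes = b""
    version: int = 0  # number of data changes so far


def set_data(node: ZNode, data: bytes, expected_version: int) -> int:
    """Update the node only if the caller saw the latest version."""
    if expected_version != node.version:
        raise BadVersionError(
            f"expected version {expected_version}, current is {node.version}"
        )
    node.data = data
    node.version += 1  # data and version change together
    return node.version


node = ZNode()
set_data(node, b"first", expected_version=0)      # accepted, version becomes 1
try:
    set_data(node, b"stale", expected_version=0)  # stale version is rejected
except BadVersionError as err:
    print("rejected:", err)
```

A client whose update is rejected would normally re-read the node, pick up the new version number, and retry.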

This kind of version number logic is used by many frameworks. For example, in RocketMQ, when a broker registers with the NameServer, it also carries a version number of this kind, called DataVersion.

Next, let's take a closer look at the data structure that holds this version-related data. It is called the Stat structure, and its fields are:

Field           Interpretation
czxid           The zxid of the change that created this node
mzxid           The zxid of the change that last modified this node
pzxid           The zxid of the change that last modified the children of this node
ctime           The time, in milliseconds since the epoch, when this node was created
mtime           The time, in milliseconds since the epoch, when this node was last modified
version         Number of changes to the data of this node (i.e. the version number)
cversion        Number of changes to the children of this node
aversion        Number of changes to the ACL of this node
ephemeralOwner  Session id of the owner if this is an ephemeral (temporary) node; 0 otherwise
dataLength      The length of this node's data
numChildren     Number of children of this node

For example, with the stat command we can view the actual values of the Stat structure of a znode.


The zxid here is related to ZooKeeper clusters and will be introduced in detail later, together with the leader epoch (which is a different thing from the time epoch used by ctime and mtime).


The ACL (access control list) controls the permissions on a znode, and its permission control is similar to that in Linux. Linux has three types of permission: read, write, and execute, abbreviated r, w, and x. The permission granularity also falls into three kinds: owner permissions, group permissions, and other permissions. For example:

drwxr-xr-x  3 USERNAME  GROUP  1.0K  3 15 18:19 dir_name

What does granularity mean? Granularity is the classification of the objects that permissions apply to. Put another way, the three granularities above describe permissions for the user (owner), the group the user belongs to (group), and everyone else (other). This can be regarded as a standard model of permission control, a typical three-part scheme.

Although ZooKeeper's ACL is also three-part, its parts differ from those in Linux. In ZooKeeper the three parts are scheme, ID, and permissions, meaning the permission mechanism, the users allowed access, and the specific permissions.


Scheme represents a permission mode with the following five types:

  • world: under this scheme the ID can only be anyone, which means everyone can access
  • auth: represents any authenticated user
  • digest: verifies with a user name + password
  • ip: allows only certain IP addresses to access the znode
  • x509: authenticates through the client's certificate

At the same time, there are five types of permissions:

  • CREATE: can create child nodes
  • READ: can get a node's data and list its children
  • WRITE: can set a node's data
  • DELETE: can delete child nodes
  • ADMIN: can set permissions

As in Linux, these permissions are also abbreviated, each by its first letter. For example:


With the getAcl method, a user can view the permissions of a znode. The output comes in three parts, namely:

  • scheme: world is used
  • id: the value is anyone, which means all users have the permissions
  • permissions: the specific permissions are cdrwa, the abbreviations of CREATE, DELETE, READ, WRITE, and ADMIN
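To make the three parts concrete, here is a small Python sketch that splits an ACL written in the text form scheme:id:permissions and expands the abbreviations. The helper name `parse_acl` is hypothetical and is not part of any ZooKeeper client.

```python
# Maps each one-letter abbreviation to the permission it stands for.
PERMISSIONS = {
    "c": "CREATE",
    "d": "DELETE",
    "r": "READ",
    "w": "WRITE",
    "a": "ADMIN",
}


def parse_acl(acl: str) -> tuple[str, str, list[str]]:
    """Split 'scheme:id:permissions' into its three parts."""
    scheme, rest = acl.split(":", 1)
    # The id may itself contain ':' (e.g. user:hash under digest),
    # so the permissions are taken from the right-hand end.
    acl_id, perms = rest.rsplit(":", 1)
    return scheme, acl_id, [PERMISSIONS[p] for p in perms]


print(parse_acl("world:anyone:cdrwa"))
```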

Session mechanism

After understanding ZooKeeper's version mechanism, we can go on to explore its session mechanism.

As we know, there are four types of nodes in ZooKeeper: persistent, persistent sequential, ephemeral (temporary), and ephemeral sequential.

In the previous article, we said that if a client creates ephemeral nodes and then disconnects, all of those nodes are deleted. Actually, "disconnects" is not quite accurate. More precisely, all ephemeral nodes created by a client are deleted once the session it established has expired.

So how does zookeeper know which temporary nodes are created by the current client?

The answer is the ephemeralOwner (owner of the ephemeral node) field in the Stat structure.

As mentioned above, if the current node is an ephemeral node (sequential or not), ephemeralOwner stores the session id of the owner that created it. With the session id, the node can be matched to the corresponding client, so when the session expires, all ephemeral nodes created by that client can be deleted.
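The bookkeeping can be pictured with a toy Python model. This is only a sketch under simplifying assumptions: the class `TinyServer` and its methods are hypothetical, and a real server stores far more than a path-to-owner map.

```python
class TinyServer:
    """Toy model: maps each path to the session id that owns it (0 = persistent)."""

    def __init__(self) -> None:
        self.ephemeral_owner: dict[str, int] = {}

    def create(self, path: str, session_id: int = 0) -> None:
        # session_id 0 marks a persistent node, mirroring ephemeralOwner = 0
        self.ephemeral_owner[path] = session_id

    def expire_session(self, session_id: int) -> list[str]:
        """Delete every ephemeral node owned by the expired session."""
        doomed = [path for path, owner in self.ephemeral_owner.items()
                  if owner == session_id and owner != 0]
        for path in doomed:
            del self.ephemeral_owner[path]
        return doomed


server = TinyServer()
server.create("/config")                       # persistent
server.create("/locks/a", session_id=0x1001)   # ephemeral, owned by 0x1001
server.create("/locks/b", session_id=0x2002)   # ephemeral, owned by 0x2002
print(server.expire_session(0x1001))
```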


When creating a connection, the application must provide a connection string that lists every server with its port, comma separated, in the form host1:port1,host2:port2,host3:port3.

After receiving this string, the ZooKeeper client picks an arbitrary server from it and tries to connect. If the connection is later lost, the client picks the next server in the string and keeps trying until a connection is established.

In addition to this basic host + port form, ZooKeeper 3.2.0 and later also support appending a path to the connection string, in the form host1:port1,host2:port2,host3:port3/app/a.

With this, /app/a is treated as the root directory for the current client, and every node path it creates is prefixed with /app/a. For example, if I create a node /node_name, its complete path will be /app/a/node_name. This feature is especially suitable for multi-tenant environments, where each tenant believes it is working at the top-level root directory /.
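The following Python sketch shows how such a connection string could be split into a server list and a root path, and how client paths map onto real paths. The function names and the host names zk1.example and zk2.example are hypothetical, used only for illustration.

```python
def parse_connect_string(conn: str) -> tuple[list[tuple[str, int]], str]:
    """Return ([(host, port), ...], chroot) for 'h1:p1,h2:p2/chroot'."""
    slash = conn.find("/")
    if slash == -1:
        hosts_part, chroot = conn, ""
    else:
        hosts_part, chroot = conn[:slash], conn[slash:].rstrip("/")
    servers = []
    for item in hosts_part.split(","):
        host, port = item.rsplit(":", 1)
        servers.append((host, int(port)))
    return servers, chroot


def full_path(chroot: str, path: str) -> str:
    """Translate a client-visible path into the real path on the server."""
    if path == "/":
        return chroot or "/"
    return chroot + path


# Hypothetical hosts, for illustration only.
servers, chroot = parse_connect_string("zk1.example:2181,zk2.example:2181/app/a")
print(servers, chroot)
print(full_path(chroot, "/node_name"))
```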

After a ZooKeeper client and server have established a connection, the client gets a 64-bit session id and a password. What is this password for? We know that ZooKeeper can be deployed as multiple instances. If the client loses its connection and connects to another ZooKeeper server, it sends this password when establishing the new connection. The password is a security measure that any ZooKeeper server can verify. This way the session remains valid even when the client connects to a different server.

A session expires in two situations, namely:

  • The specified expiration time has elapsed
  • The client did not send heartbeat within the specified time

For the first case, the expiration time is sent to the server when the ZooKeeper client establishes a connection. At present, the expiration time must be between 2 × tickTime and 20 × tickTime.
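As a worked example of that range, the sketch below clamps a requested timeout into the allowed window. It assumes a tickTime of 2000 ms and assumes the server adjusts an out-of-range request instead of rejecting it; the function name is hypothetical.

```python
def negotiate_session_timeout(requested_ms: int, tick_time_ms: int = 2000) -> int:
    """Clamp the client's requested timeout to [2 * tickTime, 20 * tickTime]."""
    minimum = 2 * tick_time_ms
    maximum = 20 * tick_time_ms
    return max(minimum, min(requested_ms, maximum))


for requested in (1000, 30000, 120000):
    print(requested, "->", negotiate_session_timeout(requested))
```

With a tickTime of 2000 ms the window is 4000 ms to 40000 ms, so only the middle request is granted unchanged.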

tickTime is a configuration item of the ZooKeeper server. It is the basic time unit that ZooKeeper uses for heartbeats and timeouts. The sample configuration sets tickTime=2000, in milliseconds.

The expiration logic of the session is maintained by the ZooKeeper server. Once a session expires, the server immediately deletes all ephemeral nodes created by that client and then notifies all clients watching those nodes of the change.

For the second case, heartbeats in ZooKeeper are implemented through ping requests. When the connection is otherwise idle, the client periodically sends a ping request to the server, and that is all a heartbeat is. Heartbeats let the server know that the client is still alive, and they let the client confirm that its connection to the server is still valid. The ping interval is derived from the session timeout, which is itself bounded by multiples of tickTime.

Watch mechanism

After learning about znodes and sessions, we can finally move on to the next key feature, watch. The word "watch" (listen) has come up more than once in the content above. First, a one-sentence summary of what it does:

Register a listener for a node. Once the node is changed (such as updated or deleted), the listener will receive a watch event

Like znodes, watches come in several types, namely one-time watches and permanent watches.

  • One-time watch: removed after it is triggered
  • Permanent watch: retained after it is triggered, so it continues to listen for changes on the znode. It is a new feature in ZooKeeper 3.6.0

A one-time watch is set by passing it as a parameter to methods such as getData(), getChildren(), and exists(), while a permanent watch is set by calling addWatch().

One-time watches have a problem: there is a time gap between the watch event reaching the client and the client setting a new watch. If changes occur during this gap, the client cannot see them.
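The gap can be reproduced with a toy Python model. This is a simulation only, not the ZooKeeper client API: the class `WatchedNode` and its two callback lists are hypothetical.

```python
from typing import Callable

Callback = Callable[[str], None]


class WatchedNode:
    """Toy znode that supports one-time and permanent watches."""

    def __init__(self) -> None:
        self.one_time: list[Callback] = []
        self.permanent: list[Callback] = []

    def change(self, event: str) -> None:
        fired, self.one_time = self.one_time, []  # one-time watches are removed
        for callback in fired + self.permanent:   # permanent watches stay
            callback(event)


node = WatchedNode()
seen_once: list[str] = []
seen_always: list[str] = []
node.one_time.append(seen_once.append)
node.permanent.append(seen_always.append)

node.change("update-1")                 # both watches fire
node.change("update-2")                 # no one-time watch is registered now
node.one_time.append(seen_once.append)  # the client re-registers too late
node.change("update-3")

print(seen_once)    # the one-time watcher never saw update-2
print(seen_always)  # the permanent watcher saw every change
```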


Zookeeper cluster architecture

ZAB protocol

With the groundwork above in place, we can now understand ZooKeeper from the perspective of its overall architecture. To ensure high availability, ZooKeeper uses a master-slave architecture with read-write separation.

We know that in Redis Cluster, a comparable architecture, nodes communicate using the Gossip protocol. So what communication protocol does ZooKeeper use?

The answer is the ZAB (ZooKeeper Atomic Broadcast) protocol.

ZAB is an atomic broadcast protocol that supports crash recovery. It is used to pass messages between ZooKeeper servers so that all nodes stay synchronized. ZAB also offers high performance, high availability, ease of use, and ease of maintenance, and it supports automatic fault recovery.

The ZAB protocol divides the nodes in a ZooKeeper cluster into three roles, namely Leader, Follower, and Observer.


Generally speaking, this architecture is similar to Redis master-slave or MySQL master-slave (both have been covered in earlier articles).

The difference is that there are two roles in the general master-slave architecture: leader and follower (or master and slave), but there is an observer in zookeeper.

The question is, what is the difference between observer and follower?

In essence, the two do the same job. Both give ZooKeeper the ability to scale horizontally so that it can carry more concurrency. The difference lies in leader election: observers do not participate in voting.

Sequential consistency

As mentioned above, a ZooKeeper cluster separates reads from writes, and only the leader node can process write requests. If a follower node receives a write request, it forwards the request to the leader for processing. The follower itself does not process the write.

After receiving messages, the leader processes them one by one in strict request order. This is a major feature of ZooKeeper: it guarantees the sequential consistency of messages.

For example, if message A arrives earlier than message B, then A is ordered before B on every ZooKeeper node. ZooKeeper guarantees the global order of messages.


How does ZooKeeper ensure the order of messages? The answer is the zxid.

You can simply think of the zxid as the unique ID of a message in ZooKeeper. Nodes communicate and synchronize data by sending proposals (transaction proposals), and each proposal carries a zxid together with the specific data (the message). A zxid consists of two parts:


  • epoch: can be understood as a dynasty, or the generation of the leader. Each leader has a different epoch
  • counter: a counter that increases by one with each message

This is also how unique zxids are generated. Because the epoch used by each leader is unique, and different messages within the same epoch have different counter values, every proposal has a unique zxid across the ZooKeeper cluster.
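The two-part layout can be sketched with a little bit arithmetic in Python. The sketch assumes a 64-bit zxid with the epoch in the high 32 bits and the counter in the low 32 bits; the function names are hypothetical.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch: int, counter: int) -> int:
    """Pack epoch into the high 32 bits and counter into the low 32 bits."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def split_zxid(zxid: int) -> tuple[int, int]:
    """Return (epoch, counter)."""
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK


a = make_zxid(epoch=1, counter=7)
b = make_zxid(epoch=2, counter=0)  # a new leader: epoch + 1, counter reset
print(hex(a), hex(b))
print(a < b)  # a later epoch always sorts after an earlier one
```

Because the epoch occupies the high bits, comparing two zxids as plain integers orders them first by epoch and then by counter.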

Recovery mode

A normally running ZooKeeper cluster is in broadcast mode. Conversely, if the leader crashes or can no longer reach more than half of the nodes, the cluster enters recovery mode.

What is recovery mode?

A ZooKeeper cluster has two modes:

  • Recovery mode
  • Broadcast mode

When a ZooKeeper cluster fails, it enters recovery mode, also known as leader activation. As the name suggests, a leader is elected at this stage. Nodes generate zxids and proposals and then vote for one another. The voting follows two main principles:

  • The zxid of the elected leader must be the largest of all followers
  • And more than half of the followers have returned ACK, indicating that they recognize the elected leader
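The two principles can be sketched in Python as follows. This is a simplification and not ZooKeeper's election algorithm: breaking ties by the larger server id is an assumption of this sketch, and the server ids and zxid values are hypothetical.

```python
def elect_leader(last_zxid: dict[int, int]) -> int:
    """Pick the server with the largest zxid; break ties by larger server id."""
    return max(last_zxid, key=lambda sid: (last_zxid[sid], sid))


def has_quorum(acks: int, cluster_size: int) -> bool:
    """True when more than half of the voting nodes agree."""
    return acks > cluster_size // 2


# Hypothetical server ids mapped to the last zxid each one has logged.
last_zxid = {1: 0x100000005, 2: 0x100000007, 3: 0x100000007}
leader = elect_leader(last_zxid)
print(leader, has_quorum(acks=2, cluster_size=3))
```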

If an exception occurs during the election, ZooKeeper simply starts a new round of election. If everything goes well, a leader is elected, but at this point the cluster still cannot provide service normally, because a key step has not yet taken place between the new leader and the followers: data synchronization.

After that, the leader waits for the followers to connect and then sends each follower the data it is missing, in the form of proposals.

As for how the leader knows what data is missing: proposals are recorded in a log, so a diff can be made using the value of the low 32-bit counter in each proposal's zxid.

Of course, there is an optimization here. If too much data is missing, sending proposals one by one is too inefficient. So when the leader finds that a follower is missing too much data, it takes a snapshot of the current data, packages it, and sends it to the follower directly.
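The choice between replaying proposals and sending a snapshot can be sketched in Python. This is an illustration only: the function `plan_sync`, the labels "DIFF" and "SNAP", and the `snapshot_threshold` parameter are assumptions of this sketch, and the source does not say how the leader decides that too much data is missing.

```python
def plan_sync(leader_log: list[int], follower_last_zxid: int,
              snapshot_threshold: int = 3) -> tuple[str, list[int]]:
    """Return ('DIFF', missing zxids) or ('SNAP', []) for one follower."""
    missing = [zxid for zxid in leader_log if zxid > follower_last_zxid]
    if len(missing) > snapshot_threshold:
        return "SNAP", []       # too far behind: ship a full snapshot
    return "DIFF", missing      # replay only the missing proposals


# Hypothetical log: epoch 1, counters 1 to 6.
log = [(1 << 32) | counter for counter in range(1, 7)]
print(plan_sync(log, follower_last_zxid=(1 << 32) | 5))
print(plan_sync(log, follower_last_zxid=(1 << 32) | 1))
```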

The epoch of the newly elected leader is the previous value + 1, and the counter is reset to 0.

Do you think it’s over here? In fact, we still can’t provide normal services here

After data synchronization completes, the leader sends a NEW_LEADER proposal to the followers. Only after that proposal has been ACKed by more than half of the followers does the leader commit the NEW_LEADER proposal, and only then can the cluster work normally.

At this point, recovery mode ends and the cluster enters broadcast mode.

Broadcast mode

In broadcast mode, after receiving a message, the leader sends a proposal (transaction proposal) to all followers, and each follower returns an ACK to the leader after receiving the proposal. Once the leader has received ACKs from a quorum, the proposal is committed and applied to the node's memory. How many make a quorum?

The official ZooKeeper recommendation is a majority: more than half of the ZooKeeper nodes need to return an ACK. Assuming there are n ZooKeeper nodes, the formula is n/2 + 1, with n/2 rounded down.

This may not be very intuitive. In plain language: once more than half of the nodes have returned an ACK, the proposal can be committed and applied to the znodes in memory.
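A few worked values make the formula easier to read. The short Python sketch below applies n/2 + 1 with integer division; the function name `quorum_size` is hypothetical.

```python
def quorum_size(n: int) -> int:
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


for n in (3, 4, 5, 7):
    print(n, "nodes -> quorum of", quorum_size(n),
          "-> tolerates", n - quorum_size(n), "failures")
```

Note that a 4-node cluster tolerates no more failures than a 3-node cluster, which is why clusters are usually deployed with an odd number of nodes.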


ZooKeeper uses 2PC (two-phase commit) to keep data consistent between nodes. But since the leader would need to interact with all followers, the communication overhead grows and ZooKeeper's performance declines. So, to improve ZooKeeper's performance, the requirement that all follower nodes return an ACK is relaxed to more than half returning an ACK.
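The relaxed commit rule can be sketched in Python. This is a simplified model and not ZooKeeper's code: the function `can_commit` is hypothetical, and the sketch assumes that the leader counts its own vote toward the majority.

```python
def can_commit(follower_acks: list[bool]) -> bool:
    """Decide whether the leader may commit a proposal."""
    cluster_size = len(follower_acks) + 1   # followers plus the leader itself
    acks = 1 + sum(follower_acks)           # assumption: the leader votes too
    return acks > cluster_size // 2         # a majority, not unanimity


print(can_commit([True, False, True, False]))    # 3 of 5 nodes agree
print(can_commit([True, False, False, False]))   # only 2 of 5 nodes agree
```

Under strict 2PC the first case would have to wait for the two silent followers; with the majority rule the leader can commit as soon as the third vote arrives.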

Well, that is all for this post. You are welcome to search WeChat for 【SH’s full stack notes】 and reply 【queue】 to get MQ learning materials, including an analysis of basic concepts and a detailed source code analysis of RocketMQ, which are updated continuously.
