ZooKeeper in Simple Terms


ZooKeeper is a distributed coordination service maintained by Apache.

ZooKeeper can be regarded as a highly available file system.

ZooKeeper can be used for publish/subscribe, load balancing, naming services, distributed coordination and notification, cluster management, master election, distributed locks, and distributed queues.

1、 About ZooKeeper

1.1 What is ZooKeeper

ZooKeeper is a top-level Apache project. It provides efficient and reliable distributed coordination services for distributed applications, such as unified naming, configuration management, and distributed locks. ZooKeeper does not use the Paxos algorithm directly to solve the distributed data-consistency problem; instead, it uses a consensus protocol named Zab.

ZooKeeper is mainly used to solve consistency problems for applications in a distributed cluster. It provides data storage based on a directory-node tree similar to a file system. However, ZooKeeper is not meant to be a general-purpose data store: its main job is to maintain and monitor changes in the state of the stored data. Cluster management can then be built on top of monitoring those state changes.

Many well-known frameworks, such as Dubbo and Kafka, rely on ZooKeeper for distributed high availability.

1.2 Characteristics of ZooKeeper

ZooKeeper has the following characteristics:

  • Sequential consistency: transaction requests initiated by a client are applied to ZooKeeper strictly in the order in which they were initiated. See atomic broadcast below for the implementation.
  • Atomicity: the result of every transaction request is consistent across all machines in the cluster; the whole cluster either successfully applies a transaction or does not apply it at all. See transactions below for the implementation.
  • Single system image: no matter which ZooKeeper server a client connects to, the server data model it sees is consistent.
  • High performance: ZooKeeper stores all data in memory, so its performance is very high. Note that since all updates and deletions are transaction-based, ZooKeeper performs best in read-heavy scenarios; if writes are frequent, performance degrades considerably.
  • High availability: based on the replica mechanism; in addition, ZooKeeper supports failure recovery. See leader election below.

1.3 Design goals of ZooKeeper

  • A simple data model
  • Cluster support
  • Sequential access
  • High performance

2、 ZooKeeper core concepts

2.1 Data model

ZooKeeper's data model is a tree-structured file system.

The nodes in the tree are called znodes. The root node is /, and each node stores its own data and node metadata. A znode can hold data and has an ACL associated with it (see the ACL section for details). ZooKeeper is designed as a coordination service rather than a file store, so the data stored in a znode is limited to 1 MB.


ZooKeeper's data access is atomic: read and write operations either fully succeed or fully fail.

A znode is referenced by its path, which must be absolute.

There are two types of znodes:

  • Ephemeral: when the client session ends, ZooKeeper deletes the ephemeral znode.
  • Persistent: ZooKeeper does not delete a persistent znode unless the client deletes it explicitly.

2.2 Node information

A znode can carry a sequential flag. If a znode is created with the sequential flag, ZooKeeper appends a monotonically increasing counter value to its name. Separately, every change to the tree is assigned a monotonically increasing transaction id, the zxid, which ZooKeeper uses to enforce strict ordering of updates.

In addition to its data, each znode maintains a data structure called stat, which stores all of the node's state information.


2.3 Cluster roles

A ZooKeeper cluster is a highly available cluster based on master-slave replication. Each server plays one of the following three roles:

  • Leader: responsible for initiating and maintaining heartbeats with the followers and observers. All write operations must go through the leader, which then broadcasts them to the other servers. A ZooKeeper cluster has only one acting leader at any time.
  • Follower: responds to the leader's heartbeat. A follower can process a client's read request directly and return the result, forwards write requests to the leader, and votes on the request while the leader processes a write. A ZooKeeper cluster may have multiple followers at the same time.
  • Observer: similar to a follower, but without voting rights.

2.4 ACL

ZooKeeper uses ACL (access control list) policies to control permissions.

Each znode is created with an ACL that determines who may perform which operations on it.

ACLs rely on ZooKeeper's client-authentication mechanism. ZooKeeper provides the following authentication schemes:

  • digest: identifies the client by username and password
  • sasl: identifies the client through Kerberos
  • ip: identifies the client by IP address

ZooKeeper defines the following five permissions:

  • CREATE: allows creating child nodes;
  • READ: allows reading the node's data and listing its children;
  • WRITE: allows setting the node's data;
  • DELETE: allows deleting child nodes;
  • ADMIN: allows setting the node's permissions.
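The five permissions combine as a bitmask. A minimal sketch (the constant values mirror ZooKeeper's ZooDefs.Perms; the allows helper is a hypothetical illustration of the bit test):

```java
// Sketch of the five permissions as a bitmask; the constant values mirror
// org.apache.zookeeper.ZooDefs.Perms, while allows() is an illustrative helper.
class Perms {
    static final int READ   = 1 << 0; // 1
    static final int WRITE  = 1 << 1; // 2
    static final int CREATE = 1 << 2; // 4
    static final int DELETE = 1 << 3; // 8
    static final int ADMIN  = 1 << 4; // 16
    static final int ALL    = READ | WRITE | CREATE | DELETE | ADMIN; // 31

    // An ACL entry grants a combination of permissions; checking is a bit test.
    static boolean allows(int granted, int requested) {
        return (granted & requested) == requested;
    }
}
```

An ACL entry granting READ | WRITE, for example, permits reads and writes but not DELETE.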

3、 How ZooKeeper works

3.1 Read operations

A leader, follower, or observer can process a read request directly, reading the data from local memory and returning it to the client.

Because serving a read request involves no interaction between servers, the more followers and observers there are, the higher the read throughput of the whole system, that is, the better the read performance.


3.2 Write operations

All write requests are actually handled by the leader. The leader sends each write request to all followers as a transaction proposal and waits for ACKs. Once more than half of the ACKs are received, the write is considered successful.

3.2.1 Writing via the leader


Writing through the leader involves five steps:

  1. The client sends a write request to the leader.
  2. The leader sends the write request to all followers as a transaction proposal and waits for ACKs.
  3. Each follower returns an ACK after receiving the leader's transaction proposal.
  4. Once the leader has received more than half of the ACKs (the leader counts one ACK from itself by default), it sends a commit to all followers and observers.
  5. The leader returns the result to the client.

Note:

  • The leader does not need ACKs from observers; that is, observers have no voting rights.
  • The leader does not need ACKs from all followers, only from more than half of the voters, and the leader's own ACK counts. For example, with four followers, only two of them need to return an ACK, because $(2 + 1) / (4 + 1) > 1/2$.
  • Although observers have no voting rights, they still synchronize the leader's data so that they can return data that is as fresh as possible when processing read requests.
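The majority arithmetic in the note above can be sketched as follows (the Quorum class and its method names are hypothetical; only the leader and the followers count as voters, observers do not):

```java
// Minimal sketch of the majority-ACK rule: a write commits once the leader's
// own implicit ACK plus the follower ACKs exceed half of the voting members.
class Quorum {
    // votingMembers = leader + followers (observers are excluded)
    static int majority(int votingMembers) {
        return votingMembers / 2 + 1;
    }

    // The leader implicitly ACKs its own proposal, so it needs only
    // majority(votingMembers) - 1 follower ACKs.
    static boolean committed(int followerAcks, int votingMembers) {
        return followerAcks + 1 >= majority(votingMembers);
    }
}
```

With one leader and four followers (five voters), two follower ACKs suffice, matching the $(2 + 1) / (4 + 1) > 1/2$ example above.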

3.2.2 Writing via a follower / observer


A follower or observer can accept a write request but cannot process it directly; it must forward the write request to the leader.

Apart from this extra forwarding step, the process is no different from writing through the leader directly.

3.3 Transactions

For each client, ZooKeeper processes requests in strict order.

To guarantee the sequential consistency of transactions, ZooKeeper identifies each transaction with an incrementing transaction id (zxid).

The leader allocates a separate queue for each follower, places transaction proposals into the queues in turn, and sends them according to a FIFO (first in, first out) policy. After receiving a proposal, a follower writes it to its local disk and then replies to the leader with an ACK. When the leader has received ACK responses from more than half of the followers, it broadcasts a commit message to all followers to tell them to commit the transaction, and then commits the transaction itself. Each follower completes its commit after receiving the commit message.

Every proposal carries a zxid. The zxid is a 64-bit number: its upper 32 bits are the epoch, which identifies changes of leadership; every time a new leader is elected, a new epoch begins, marking that leader's term. The lower 32 bits are an incrementing counter.
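The epoch/counter layout of the zxid described above can be sketched with plain bit operations (the class and method names are illustrative, not ZooKeeper's API):

```java
// Sketch of the 64-bit zxid layout: upper 32 bits = epoch,
// lower 32 bits = counter within that epoch.
class Zxid {
    static long make(int epoch, int counter) {
        return ((long) epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static int epoch(long zxid) {
        return (int) (zxid >>> 32);
    }

    static int counter(long zxid) {
        return (int) (zxid & 0xFFFFFFFFL);
    }
}
```

Because the epoch occupies the high bits, any proposal from a newer epoch compares greater than every proposal from an older one, which is what makes the zxid globally ordered.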

The detailed process is as follows:

  • The leader waits for servers to connect;
  • A follower connects to the leader and sends it its largest zxid;
  • The leader determines the synchronization point from the follower's zxid;
  • After synchronization completes, the leader notifies the follower that it is up to date;
  • Once the follower receives the up-to-date notification, it can serve client requests again.

3.4 Watches

A client can listen for state changes on a node, and ZooKeeper notifies the client when such a change occurs.

There are two general ways to keep a client informed of server-side state:

  • the client continuously polls the server;
  • the server pushes state changes to the client.

ZooKeeper chooses server-side push, which is its watch mechanism.

ZooKeeper's watch mechanism allows a client to register interest in events on a specified node; when the event occurs, the listener is triggered and the event information is pushed to the client.

When a client calls an interface such as getData to read a znode's state, it can pass in a callback to handle node changes, and the server will actively push node changes to the client.

The watcher object passed into such a method implements the corresponding process method. Whenever the state of the node changes, the WatchManager invokes the watcher roughly as follows:

Set<Watcher> triggerWatch(String path, EventType type, Set<Watcher> supress) {
    WatchedEvent e = new WatchedEvent(type, KeeperState.SyncConnected, path);
    Set<Watcher> watchers;
    synchronized (this) {
        // Watches are one-shot: remove them from the table before triggering.
        watchers = watchTable.remove(path);
    }
    for (Watcher w : watchers) {
        if (supress == null || !supress.contains(w)) {
            w.process(e); // push the event to the registered watcher
        }
    }
    return watchers;
}

In fact, all data in ZooKeeper is managed by a data structure called DataTree. All read and write requests ultimately modify the contents of this tree. A read request may carry a watcher that registers a callback, a write request may trigger the corresponding callbacks, and the WatchManager notifies the clients of the data change.

The notification mechanism is thus fairly simple: a read request sets a watcher listening for an event, and when a write request triggers that event, a notification is sent to the registered clients.

3.5 Sessions

A ZooKeeper client connects to the ZooKeeper cluster over a long-lived TCP connection. The session is established on the first connection and is then kept alive by a heartbeat mechanism. Through this connection the client can send requests and receive responses, as well as receive watch-event notifications.

Each ZooKeeper client is configured with the list of servers in the cluster. At startup, the client walks through the list trying to establish a connection; if one server cannot be reached, it tries the next one, and so on.

Once a client establishes a connection with a server, the server creates a new session for the client. Each session has a timeout; if the server receives no request from the client within that timeout, the session is considered expired. An expired session cannot be reopened, and any ephemeral znodes associated with it are deleted.

In general a session should live for a long time, and it is the client's responsibility to keep it alive: the client can send pings to keep the session from expiring.


ZooKeeper's session has four attributes:

  • sessionID: uniquely identifies a session. Each time a client creates a new session, ZooKeeper assigns it a globally unique session id.
  • TimeOut: the session timeout. When constructing a ZooKeeper instance, the client sets the sessionTimeout parameter to specify the timeout; after the client sends this value to the server, the server determines the effective timeout according to its own limits.
  • TickTime: the next point in time at which the session expires. To let ZooKeeper manage sessions with the "bucket strategy" and check and clean up expirations efficiently and cheaply, ZooKeeper marks each session with a next-expiration point, whose value is roughly the current time plus TimeOut.
  • isClosing: marks whether a session has been closed. When the server detects that a session has expired, it marks the session's isClosing flag, ensuring that no new requests from that session are processed.

ZooKeeper manages sessions mainly through the SessionTracker, which adopts a bucket strategy (sessions with similar expiration points are managed in the same bucket) so that ZooKeeper can isolate sessions in different buckets and process the sessions in the same bucket together.
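The bucket strategy can be sketched as rounding each session's deadline up to a fixed interval boundary, in the spirit of ZooKeeper's SessionTrackerImpl (class and field names here are illustrative):

```java
// Sketch of the "bucket strategy": each session's next expiration time is
// rounded up to a tick boundary, so all sessions expiring in the same
// interval land in the same bucket and can be expired together in one pass.
class SessionBuckets {
    final long expirationInterval; // e.g. the tick time, in milliseconds

    SessionBuckets(long expirationInterval) {
        this.expirationInterval = expirationInterval;
    }

    long roundToInterval(long time) {
        return (time / expirationInterval + 1) * expirationInterval;
    }

    // Two sessions whose raw deadlines fall in the same interval share a bucket.
    long bucketFor(long now, long sessionTimeout) {
        return roundToInterval(now + sessionTimeout);
    }
}
```

The expiry checker then only has to wake up once per interval and drop one whole bucket, instead of scanning every session individually.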

4、 Zab protocol

ZooKeeper does not use the Paxos algorithm directly; instead, it uses a consensus protocol named Zab. The Zab protocol is not the Paxos algorithm, and although the two are similar, they do not operate in the same way.

The Zab protocol is an atomic broadcast protocol with crash-recovery support, designed specifically for ZooKeeper.

The Zab protocol is ZooKeeper's solution for data consistency and high availability.

The Zab protocol defines two processes that may alternate indefinitely:

  • Leader election: used for failure recovery, thus ensuring high availability.
  • Atomic broadcast: used for master-slave synchronization, ensuring data consistency.

4.1 Leader election

Failure recovery in ZooKeeper:

A ZooKeeper cluster adopts a one-master (leader), many-slaves (followers) mode, and the master and slave nodes keep data consistent through the replica mechanism.

  • If a follower goes down: each node in the ZooKeeper cluster maintains its own state in memory, and the nodes communicate with one another. As long as more than half of the machines in the cluster work normally, the whole cluster can keep providing service.
  • If the leader goes down: the cluster cannot work properly, and Zab's leader-election mechanism is needed for failure recovery.

Zab's leader election, in short: a new leader is chosen by a majority vote, then the other machines synchronize their state from the new leader. Once more than half of the machines have completed state synchronization, the cluster exits leader-election mode and enters atomic-broadcast mode.

4.1.1 Terminology

myid: each ZooKeeper server creates a file named myid under its data directory, containing the server's unique integer id within the cluster.

zxid: similar to a transaction id in an RDBMS, it identifies the proposal of an update operation. To guarantee ordering, the zxid must increase monotonically. ZooKeeper therefore represents it as a 64-bit number: the upper 32 bits are the leader's epoch, which starts from 1 and is incremented each time a new leader is elected; the lower 32 bits are the sequence number within that epoch and are reset whenever the epoch changes. This guarantees that zxids increase globally.

4.1.2 Server states

  • LOOKING: undecided about the leader. The server believes the cluster currently has no leader.
  • FOLLOWING: follower state. The server's role is follower, and it knows who the leader is.
  • LEADING: leading state. The server's role is leader, and it maintains heartbeats with the followers.
  • OBSERVING: observer state. The server's role is observer; the only difference from a follower is that it participates neither in elections nor in voting on cluster writes.

4.1.3 Vote data structure

During leader election, each server sends the following key information:

  • logicClock: a self-incrementing integer maintained by each server, indicating the number of voting rounds the server has initiated.
  • state: the current state of the server.
  • self_id: the myid of the current server.
  • self_zxid: the largest zxid of the data stored on the current server.
  • vote_id: the myid of the recommended server.
  • vote_zxid: the largest zxid of the data stored on the recommended server.

4.1.4 Voting process

(1) Increment the election round

ZooKeeper requires all valid votes to belong to the same round. When a server starts a new round of voting, it first increments the logicClock it maintains.

(2) Initialize the vote

Each server empties its ballot box before broadcasting its own vote. The ballot box records the votes received. For example, if server 2 votes for server 3 and server 3 votes for server 1, then server 1's ballot box is (2, 3), (3, 1), (1, 1). Only each voter's latest vote is recorded: if a voter updates its vote, the other servers update that server's entry in their own ballot boxes when they receive the new vote.

(3) Send the initial vote

Each server initially votes for itself and broadcasts this vote.

(4) Receive external votes

Each server tries to receive votes from the other servers and records them in its own ballot box. If it cannot receive any external votes, it checks whether it has valid connections to the other servers in the cluster; if so, it sends its own vote again; if not, it establishes a connection immediately.

(5) Compare election rounds

After receiving an external vote, the server processes it differently depending on the logicClock it contains:

  • The external vote's logicClock is greater than its own: the server's election round lags behind the others. It immediately empties its ballot box and updates its logicClock to the received one, then compares its previous vote with the received vote to decide whether it needs to change its vote, and finally broadcasts its vote again.
  • The external vote's logicClock is less than its own: the server simply ignores the vote and processes the next one.
  • The external vote's logicClock equals its own: proceed to the vote PK.

(6) Vote PK

The vote PK compares the server's own (self_id, self_zxid) and vote against the received (vote_id, vote_zxid):

  • If the external vote's logicClock is greater than its own, the server updates its own logicClock and that of its vote to the received logicClock.
  • If the logicClocks are equal, the vote_zxids are compared. If the external vote's vote_zxid is larger, the server updates the vote_zxid and vote_myid in its own vote to those of the received vote, broadcasts it, and puts both the received vote and its updated vote into its own ballot box (overwriting an existing (self_myid, self_zxid) entry if present).
  • If the vote_zxids are also equal, the vote_myids are compared. If the external vote's vote_myid is larger, the server updates the vote_myid in its own vote to the received one, broadcasts it, and puts both the received vote and its updated vote into its own ballot box.
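The PK rules above amount to a three-level comparison, sketched below (the Vote class and its field names are illustrative, following the vote structure in section 4.1.3):

```java
// Sketch of the "vote PK" ordering: a candidate wins on a higher logicClock;
// on a tie, on a higher vote_zxid; on a further tie, on a higher vote_myid.
class Vote {
    final long logicClock, voteZxid, voteMyid;

    Vote(long logicClock, long voteZxid, long voteMyid) {
        this.logicClock = logicClock;
        this.voteZxid = voteZxid;
        this.voteMyid = voteMyid;
    }

    // true if the external vote should replace the current one
    static boolean externalWins(Vote current, Vote external) {
        if (external.logicClock != current.logicClock) {
            return external.logicClock > current.logicClock;
        }
        if (external.voteZxid != current.voteZxid) {
            return external.voteZxid > current.voteZxid;
        }
        return external.voteMyid > current.voteMyid;
    }
}
```

Comparing zxids first means the server with the most up-to-date data tends to win, so the new leader never lacks committed transactions.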

(7) Count the votes

If the server determines that more than half of the servers have approved its vote (possibly the updated vote), it terminates the voting. Otherwise, it continues to receive votes from the other servers.

(8) Update the server state

Once voting terminates, the server updates its state: if more than half of the votes are for itself, it changes its state to LEADING; otherwise, it changes its state to FOLLOWING.

From the analysis above it is not hard to see that, for the leader to obtain the support of a majority of servers, a ZooKeeper cluster is usually deployed with an odd number of nodes, 2N + 1, and at least N + 1 nodes must survive for the cluster to keep working.

The process above is repeated every time a server starts. In recovery mode, a server that has just recovered from a crash or has just started also restores data and session information from disk snapshots; ZooKeeper records a transaction log and takes snapshots periodically to facilitate state recovery.

4.2 Atomic broadcast

ZooKeeper achieves high availability through the replica mechanism.

So how does ZooKeeper implement the replica mechanism? The answer: the atomic broadcast of the Zab protocol.


The atomic broadcast of the Zab protocol works as follows:

All write requests are forwarded to the leader, which notifies the followers by atomic broadcast. Once more than half of the followers have persisted the change, the leader commits the update, and the client then receives a success response. This is somewhat similar to a two-phase commit protocol in a database.

Throughout the message broadcast, the leader generates a proposal for each transaction request, assigns it a globally unique, incrementing transaction id (zxid), and then broadcasts it.

5、 ZooKeeper applications

ZooKeeper can be used for publish/subscribe, load balancing, naming services, distributed coordination and notification, cluster management, master election, distributed locks, and distributed queues.

5.1 Naming service

In a distributed system, a globally unique name is often needed, for example to generate a globally unique order id. Using the properties of sequential nodes, ZooKeeper can generate globally unique ids and thus provide a naming service for the distributed system.


5.2 Configuration management

Using ZooKeeper's watch mechanism, it can serve as a highly available configuration store, allowing the participants of a distributed application to retrieve and update configuration files.

5.3 Distributed locks

Distributed locks can be implemented with ZooKeeper's ephemeral nodes and the watcher mechanism.

For example, suppose a distributed system has three nodes, A, B, and C, that try to acquire a distributed lock through ZooKeeper.

(1) Each node accesses /locks (the directory path is chosen by the application) and creates an ephemeral node with a sequence number.


(2) When trying to acquire the lock, each node fetches all child nodes under /locks (id_0000, id_0001, id_0002) and determines whether the node it created is the smallest.

  • If so, it acquires the lock.

    To release the lock, it deletes the node it created once its work is done.

  • If not, it watches the node whose sequence number is one less than its own.
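The decision in step (2) can be sketched as pure list logic, independent of the ZooKeeper client API (the class and method names are hypothetical):

```java
import java.util.Collections;
import java.util.List;

// Pure-logic sketch of step (2): given the children of /locks, decide whether
// our node holds the lock and, if not, which preceding node to watch.
class LockOrder {
    // Returns null if `mine` is the smallest child (we hold the lock),
    // otherwise the name of the immediately preceding child to watch.
    static String nodeToWatch(List<String> children, String mine) {
        List<String> sorted = new java.util.ArrayList<>(children);
        Collections.sort(sorted); // sequence numbers sort lexicographically
        int i = sorted.indexOf(mine);
        return i == 0 ? null : sorted.get(i - 1);
    }
}
```

Watching only the immediately preceding node (rather than the whole /locks directory) avoids the "herd effect" of waking every waiter on each release.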


(3) Releasing the lock means deleting the node you created.


For example, when node A deletes the node id_0000 that it created, node B is notified, finds that its own node is now the smallest, and acquires the lock.

5.4 Cluster management

ZooKeeper can also solve problems common to most distributed systems:

  • For example, a heartbeat-detection mechanism can be built on ephemeral nodes: if a service node of the distributed system goes down, the session it holds times out, the ephemeral node is deleted, and the corresponding watch event fires.
  • Each service node of the distributed system can also write its own status into an ephemeral node to report its state or work progress.
  • Through data subscription and publishing, ZooKeeper can decouple modules and schedule tasks in the distributed system.
  • Through the watch mechanism, service nodes of the distributed system can join and leave dynamically, enabling dynamic scaling of services.

5.5 Leader election

An important pattern in distributed systems is the master/slave mode, and ZooKeeper can be used for master election: all service nodes competitively create the same znode. Since ZooKeeper does not allow two znodes with the same path, only one service node can succeed in creating it, and that node becomes the master.

5.6 queue management

ZooKeeper can handle two types of queues:

  • A queue that becomes usable only when all of its members have gathered; otherwise it waits for all members to arrive. This is a synchronous queue.
  • A queue that operates in FIFO fashion, as in the producer-consumer model.

The idea for implementing a synchronous queue with ZooKeeper is as follows:

Create a parent directory /synchronizing, and have every member watch for the existence of the flag directory /synchronizing/start. Each member joins the queue by creating /synchronizing/member_i, then fetches all child nodes of the /synchronizing directory, that is, the member_i nodes. The member checks whether the number of nodes already equals the number of members: if it is less, it waits for /synchronizing/start to appear; if it is equal, it creates /synchronizing/start.
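The per-member decision can be sketched as follows (the class, enum, and method names are illustrative):

```java
// Pure-logic sketch of the synchronous-queue decision: after creating its
// member_i node, each member counts the children of /synchronizing and
// either waits for /synchronizing/start or creates it.
class SyncQueue {
    enum Action { WAIT_FOR_START, CREATE_START }

    static Action onJoin(int memberNodesSeen, int requiredMembers) {
        return memberNodesSeen < requiredMembers
                ? Action.WAIT_FOR_START
                : Action.CREATE_START;
    }
}
```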

Author: Zhang Peng