In our last article, we analyzed how consumers join a consumer group. That article stayed at a high level and focused mainly on how the ConsumerCoordinator communicates with the GroupCoordinator. Wait, Lao Zhou, what are ConsumerCoordinator and GroupCoordinator? They are the coordinators on the consumer side and on the Kafka broker side respectively. To put it bluntly, each acts as a facade in the sense of the facade design pattern. For details, please review the previous article. Today's article goes one level deeper and covers the specific details of the rebalance mechanism that runs when consumers join a consumer group.
If you are an experienced programmer, you will know that the rebalance mechanism is a common interview topic, and a fairly difficult one. But don't sell yourself short. Follow Lao Zhou's article and I believe you can master it.
Still, some readers find it difficult. Don't worry. First look at the Kafka topology diagram below; its structure is very clear. If you don't understand Kafka's topology yet, I suggest you stop here and get it clear first, or read Lao Zhou's earlier articles, before continuing. I think you will get more out of this one that way.
This article mainly discusses the rebalance mechanism from the following points:
- What is the rebalance mechanism?
- Timing of triggering rebalance mechanism
- Group status change
- Problems with legacy consumer clients
- Principle of rebalance mechanism
- Broker side rebalancing scenario
2、 What is the rebalance mechanism?
Rebalance is essentially a protocol that specifies how all consumers in a consumer group reach agreement on how to allocate the partitions of the subscribed topics.
How should consumption be redistributed when new members join the group or partitions are added to a subscribed topic? This is where rebalancing comes in. Let me explain what Kafka's rebalance mechanism is.
The figure shows several properties of the consumer group model:
- Within a consumer group, each partition is assigned to exactly one consumer, while one consumer may be assigned multiple partitions. Each message is therefore consumed by only one consumer in the group, so members of the same group do not consume the same messages in parallel;
- A partition can be consumed by several consumer groups, each of which receives every message independently. A special case: if every consumer group has only one consumer, every message reaches all consumers, which implements broadcast consumption.
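To make the two rules concrete, here is a minimal in-memory sketch. It does not use any Kafka client; the helper names `assign_round_robin` and `deliveries` are hypothetical, and round-robin is just one possible assignment strategy chosen for illustration.

```python
from collections import defaultdict

def assign_round_robin(partitions, consumers):
    """Hypothetical helper: give each partition to exactly one consumer of ONE group."""
    members = sorted(consumers)
    if not members:
        return {}
    assignment = defaultdict(list)
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return dict(assignment)

def deliveries(message_partition, groups):
    """Return (group, consumer) pairs that receive a message from `message_partition`.

    `groups` maps group id -> assignment (consumer -> partitions).
    Each group delivers the message to exactly one of its members.
    """
    receivers = []
    for group_id, assignment in groups.items():
        for consumer, parts in assignment.items():
            if message_partition in parts:
                receivers.append((group_id, consumer))
    return receivers

partitions = [0, 1, 2, 3]
# Rule 1: inside one group, each partition has exactly one owner.
group_a = assign_round_robin(partitions, ["c1", "c2", "c3"])
# Rule 2 (broadcast): every consumer sits in its own single-member group.
broadcast = {f"solo-{c}": assign_round_robin(partitions, [c]) for c in ["x", "y"]}

print(group_a)  # {'c1': [0, 3], 'c2': [1], 'c3': [2]}
print(deliveries(2, {"group-a": group_a, **broadcast}))
# [('group-a', 'c3'), ('solo-x', 'x'), ('solo-y', 'y')]
```

A message in partition 2 reaches exactly one member of `group-a` but every single-member group, which is the broadcast case described above.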
To maintain this consumer group model, the group has to adjust dynamically when its environment changes, for example when partitions are added to a topic or new members join the group. That work is handled by Kafka's rebalance mechanism.
As the figure shows, a Kafka rebalance is caused by external events. Let's look at exactly when a rebalance is triggered.
3、 Timing of triggering rebalance mechanism
- A new consumer has joined the consumer group
- A consumer goes down or offline. The consumer does not actually have to be offline: if it fails to send a HeartbeatRequest to the GroupCoordinator for a long time, for example because of a long GC pause or network delay, the GroupCoordinator will consider it offline.
- A consumer voluntarily leaves the consumer group by sending a LeaveGroupRequest, for example when the client calls unsubscribe() to drop its topic subscriptions.
- Consumption times out: if a consumer does not call poll() again within max.poll.interval.ms, it is considered failed and leaves the group, so it also cannot commit its offsets in time.
- The GroupCoordinator node for the consumer group changes.
- The set of topics the group subscribes to changes, or the number of partitions of a subscribed topic changes.
4、 Group status change
4.1 Consumer side
On the consumer side, the facade ConsumerCoordinator extends the abstract class AbstractCoordinator. In recent Kafka versions, the inner enum MemberState of AbstractCoordinator defines four states: UNJOINED (not yet part of a group), PREPARING_REBALANCE (JoinGroup request sent, no response received yet), COMPLETING_REBALANCE (JoinGroup response received, but no assignment yet), and STABLE.
The transitions between these four states are shown in the figure below:
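These four states and their transitions can be modeled as a small state machine. This is a simplified sketch: the state names follow the MemberState enum of recent Kafka versions, while the event names (`send_join_group`, `join_group_response`, and so on) are hypothetical labels, not Kafka identifiers.

```python
from enum import Enum

class MemberState(Enum):
    UNJOINED = "UNJOINED"                          # not part of a group
    PREPARING_REBALANCE = "PREPARING_REBALANCE"    # JoinGroup sent, no response yet
    COMPLETING_REBALANCE = "COMPLETING_REBALANCE"  # JoinGroup response received, no assignment yet
    STABLE = "STABLE"                              # assignment received, sending heartbeats

# (current state, event) -> next state; event names are hypothetical labels.
TRANSITIONS = {
    (MemberState.UNJOINED, "send_join_group"): MemberState.PREPARING_REBALANCE,
    (MemberState.PREPARING_REBALANCE, "join_group_response"): MemberState.COMPLETING_REBALANCE,
    (MemberState.COMPLETING_REBALANCE, "sync_group_response"): MemberState.STABLE,
    (MemberState.STABLE, "rebalance_needed"): MemberState.PREPARING_REBALANCE,
}

def next_state(state, event):
    # Simplification: leaving or a failed request drops the member back to UNJOINED.
    if event in ("leave_group", "request_failed"):
        return MemberState.UNJOINED
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None

state = MemberState.UNJOINED
for event in ["send_join_group", "join_group_response", "sync_group_response"]:
    state = next_state(state, event)
print(state)  # MemberState.STABLE
```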
4.2 Broker side
The GroupCoordinator on the Kafka server has five states: Empty, PreparingRebalance, CompletingRebalance, Stable and Dead. Their transitions are shown in the figure below:
|State|Description|
|---|---|
|Empty|There are no members in the group, but the group may still have committed offsets that have not expired yet.|
|Dead|There are no members in the group either, and the group's metadata has already been removed on the coordinator side. The coordinator stores information about every group registered with it; this metadata is essentially that registration information.|
|PreparingRebalance|The consumer group is preparing to start a rebalance. All members must send requests to rejoin the group.|
|CompletingRebalance|All members have joined and are waiting for the assignment. In older versions this state was called AwaitingSync; it is equivalent to CompletingRebalance.|
|Stable|The steady state of the consumer group. The rebalance has completed and all members can consume data normally.|
- A consumer group starts in Empty.
- When a rebalance starts, the group moves to PreparingRebalance and waits for members to join.
- It then moves to CompletingRebalance and waits for the assignment.
- Finally, it moves to Stable, which completes the rebalance.
When membership changes, the group's state moves from Stable to PreparingRebalance.
- At this point, all existing members must rejoin the group.
- When all members have left the group, the group's state becomes Empty.
- While the group is Empty, Kafka periodically deletes its expired offsets.
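The broker-side rules above can be checked with a similar sketch. The table of allowed previous states reflects my reading of the group metadata logic in the Kafka broker; treat it as an approximation rather than a reference. The `Group` class and `VALID_PREVIOUS` table are hypothetical names.

```python
from enum import Enum

class GroupState(Enum):
    EMPTY = "Empty"
    PREPARING_REBALANCE = "PreparingRebalance"
    COMPLETING_REBALANCE = "CompletingRebalance"
    STABLE = "Stable"
    DEAD = "Dead"

# target state -> states it may be entered from (approximation of the broker rules)
VALID_PREVIOUS = {
    GroupState.EMPTY: {GroupState.PREPARING_REBALANCE},
    GroupState.PREPARING_REBALANCE: {GroupState.EMPTY, GroupState.STABLE,
                                     GroupState.COMPLETING_REBALANCE},
    GroupState.COMPLETING_REBALANCE: {GroupState.PREPARING_REBALANCE},
    GroupState.STABLE: {GroupState.COMPLETING_REBALANCE},
    GroupState.DEAD: set(GroupState),  # metadata can be removed from any state
}

class Group:
    def __init__(self):
        self.state = GroupState.EMPTY  # every group starts out Empty

    def transition(self, target):
        if self.state not in VALID_PREVIOUS[target]:
            raise ValueError(f"{self.state.value} -> {target.value} is not allowed")
        self.state = target

g = Group()
# Normal path: Empty -> PreparingRebalance -> CompletingRebalance -> Stable
for s in (GroupState.PREPARING_REBALANCE, GroupState.COMPLETING_REBALANCE, GroupState.STABLE):
    g.transition(s)
# Membership changes: Stable -> PreparingRebalance
g.transition(GroupState.PREPARING_REBALANCE)
# Every member leaves during the rebalance: PreparingRebalance -> Empty
g.transition(GroupState.EMPTY)
print(g.state.value)  # Empty
```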
5、 Problems with legacy consumer clients
The ConsumerCoordinator and GroupCoordinator concepts apply to the consumer client introduced in Kafka 0.9.0 and later. For convenience, we call the consumer client before Kafka 0.9.0 the old consumer client. The old consumer client relied on ZooKeeper watchers to implement these functions.
For each consumer group <group>, ZooKeeper maintains a path /consumers/<group>/ids, under which ephemeral nodes record the unique IDs of the consumers belonging to that group. This consumerIdString is created when the consumer starts. A consumer's unique ID is composed of consumer.id + hostname + timestamp + part of a UUID, where consumer.id is a configuration of the old consumer client, equivalent to client.id in the new client. For example, if a consumer's unique ID is consumerId_localhost-1510734527562-64b377f5, then consumerId is the specified consumer.id, localhost is the host name of the machine, 1510734527562 is a timestamp, and 64b377f5 is part of the UUID.
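The ID format can be illustrated by splitting the example ID back into its parts. The helper below is hypothetical and simply follows the format as described above (`<consumer.id>_<hostname>-<timestamp>-<UUID fragment>`); it is not code from the old consumer client.

```python
from typing import NamedTuple

class OldConsumerId(NamedTuple):
    prefix: str        # the configured consumer.id part
    hostname: str
    timestamp_ms: int
    uuid_part: str

def parse_old_consumer_id(consumer_id_string: str) -> OldConsumerId:
    # Split on the LAST underscore: valid hostnames contain no '_', the prefix might.
    prefix, sep, rest = consumer_id_string.rpartition("_")
    if not sep:
        raise ValueError("missing '_' separator")
    # Split the remainder from the right, because hostnames may contain '-'.
    hostname, timestamp, uuid_part = rest.rsplit("-", 2)
    return OldConsumerId(prefix, hostname, int(timestamp), uuid_part)

print(parse_old_consumer_id("consumerId_localhost-1510734527562-64b377f5"))
# OldConsumerId(prefix='consumerId', hostname='localhost', timestamp_ms=1510734527562, uuid_part='64b377f5')
```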
As the figure below shows, there are two more nodes at the same level as /consumers/<group>/ids:
- /consumers/<group>/owners records which consumer owns each partition;
- /consumers/<group>/offsets records the group's consumed offset for each partition.
Each broker, topic and partition also has a path in ZooKeeper:
- /brokers/ids/<id> records the broker's host, port and other registration information;
- /brokers/topics/<topic> records the replica assignment of each partition of the topic;
- /brokers/topics/<topic>/partitions/<partition>/state records the partition's current leader replica, leader epoch, ISR set and other information.
At startup, every consumer registers watchers on the /consumers/<group>/ids and /brokers/ids paths. When the child nodes under /consumers/<group>/ids change, the membership of the consumer group has changed; when the child nodes under /brokers/ids change, a broker has been added or removed. In this way, through ZooKeeper watchers, each consumer can monitor the state of both the consumer group and the Kafka cluster.
Because each consumer watches these ZooKeeper paths on its own, when a rebalance is triggered all consumers in the group rebalance at the same time without knowing the results of each other's operations, which can leave Kafka in an incorrect state. In addition, this heavy reliance on the ZooKeeper cluster causes two serious problems:
Herd effect: when a watched node in ZooKeeper changes, a large number of watcher notifications are sent to clients, delaying other operations while the notifications are being delivered. Deadlock-like situations may also occur.
Split brain: during a rebalance, each consumer talks to ZooKeeper on its own to determine changes in consumers or brokers. Because ZooKeeper does not guarantee that different clients see the same state at the same instant, consumers may act on inconsistent views, which leads to abnormal behavior.
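The split-brain problem can be shown with a toy simulation. Each consumer computes its own partitions from its own snapshot of the member list, the way every old consumer ran the assignment independently. The range-style split, the helper name `my_partitions` and the snapshots are made up for illustration; the point is only that inconsistent views produce conflicting claims.

```python
def my_partitions(me, members, partitions):
    """Partitions `me` claims, computed only from its own view of `members` (range-style split)."""
    members = sorted(members)
    idx = members.index(me)
    per, extra = divmod(len(partitions), len(members))
    start = idx * per + min(idx, extra)
    count = per + (1 if idx < extra else 0)
    return set(partitions[start:start + count])

partitions = list(range(6))
# c3 has just joined. c1 and c3 already see it, but c2's view is stale.
views = {
    "c1": ["c1", "c2", "c3"],
    "c2": ["c1", "c2"],
    "c3": ["c1", "c2", "c3"],
}
claims = {c: my_partitions(c, view, partitions) for c, view in views.items()}
print(claims)  # c1 -> {0, 1}, c2 -> {3, 4, 5}, c3 -> {4, 5}

overlap = claims["c2"] & claims["c3"]
unowned = set(partitions) - set().union(*claims.values())
print("claimed twice:", overlap, "claimed by nobody:", unowned)
```

Partitions 4 and 5 end up claimed twice while partition 2 is not consumed at all, even though each consumer ran a correct algorithm on the data it had.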
6、 Principle of rebalance mechanism
The consumer client introduced in Kafka 0.9.0 was redesigned: all consumer groups are divided into multiple subsets, and the groups in each subset are managed by a corresponding GroupCoordinator on the server side. The GroupCoordinator is the component in the Kafka server that manages consumer groups, and the ConsumerCoordinator component in the consumer client is responsible for interacting with it.
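How is a group mapped to its GroupCoordinator? As far as I know, the broker hashes the group ID onto a partition of the internal __consumer_offsets topic (50 partitions by default, controlled by offsets.topic.num.partitions), and the broker hosting the leader of that partition acts as the group's coordinator. The sketch below reproduces that mapping under this assumption, including a reimplementation of Java's String.hashCode(); the function names are hypothetical.

```python
def java_string_hash(s: str) -> int:
    """Reimplementation of Java's String.hashCode(): signed 32-bit, over UTF-16 code units."""
    data = s.encode("utf-16-be")
    h = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        h = (31 * h + unit) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h

def coordinator_partition(group_id: str, num_partitions: int = 50) -> int:
    """Assumed mapping: partition of __consumer_offsets whose leader coordinates `group_id`."""
    h = java_string_hash(group_id)
    # Map Integer.MIN_VALUE to 0 instead of letting abs() overflow in Java.
    h_abs = 0 if h == -(1 << 31) else abs(h)
    return h_abs % num_partitions

print(java_string_hash("abc"))       # 96354
print(coordinator_partition("abc"))  # 4
```

Because the mapping depends only on the group ID, every member of a group independently arrives at the same coordinator, which is what lets one broker own each subset of groups.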
- A complete rebalance is carried out jointly by the consumers and the coordinator
Consumer rebalance steps
- Join the group: corresponds to the JoinGroup request
- Wait for the leader consumer's assignment: corresponds to the SyncGroup request
- When a member joins a group, it sends a JoinGroup request to the coordinator.
- Each consumer reports the topics it subscribes to.
After collecting all JoinGroup requests, the coordinator selects one of these members as the leader of the consumer group.
- Usually, the first member to send a JoinGroup request becomes the leader.
- The leader consumer's task is to collect the subscriptions of all members and, based on that information, work out a concrete partition assignment.
- After choosing the leader, the coordinator packs the subscription information of all members into the JoinGroup response and sends it to the leader.
- The leader consumer computes the assignment and moves on to the SyncGroup request.
- The leader consumer sends a SyncGroup request carrying the assignment to the coordinator.
- The other members also send SyncGroup requests.
- The coordinator sends each member its part of the assignment in the SyncGroup response.
- Once all members have received their assignments, the consumer group enters the Stable state and normal consumption begins.
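The JoinGroup/SyncGroup exchange above can be simulated in a single process. This is a simplified sketch with hypothetical class and method names: no networking, generations or timeouts, and a round-robin split standing in for whichever assignor the group actually uses. It keeps the key points: the first joiner becomes leader, only the leader receives everyone's subscriptions, and each member gets back only its own partitions.

```python
class Coordinator:
    """Toy GroupCoordinator: collects joins, picks a leader, hands out assignments."""

    def __init__(self):
        self.subscriptions = {}  # member id -> subscribed topics, in join order
        self.leader = None
        self.assignment = None

    def join_group(self, member_id, topics):
        self.subscriptions[member_id] = topics
        if self.leader is None:
            self.leader = member_id  # first member to join becomes leader

    def join_group_response(self, member_id):
        # Only the leader receives the full member/subscription list.
        members = dict(self.subscriptions) if member_id == self.leader else {}
        return {"leader": self.leader, "members": members}

    def sync_group(self, member_id, assignment=None):
        if member_id == self.leader:
            self.assignment = assignment  # only the leader's SyncGroup carries a plan
        if self.assignment is None:
            # A real coordinator would hold this request until the leader syncs.
            raise RuntimeError("leader has not synced yet (toy model)")
        return self.assignment.get(member_id, [])

def leader_assign(members, partitions_by_topic):
    """Round-robin each topic's partitions over the members subscribed to it."""
    plan = {m: [] for m in members}
    for topic, count in partitions_by_topic.items():
        subscribers = sorted(m for m, topics in members.items() if topic in topics)
        if not subscribers:
            continue
        for p in range(count):
            plan[subscribers[p % len(subscribers)]].append(f"{topic}-{p}")
    return plan

coord = Coordinator()
for member in ["c1", "c2", "c3"]:
    coord.join_group(member, ["orders"])
resp = coord.join_group_response("c1")          # c1 joined first, so it is the leader
plan = leader_assign(resp["members"], {"orders": 4})
results = {"c1": coord.sync_group("c1", plan)}  # leader syncs with the plan
for member in ["c2", "c3"]:
    results[member] = coord.sync_group(member)  # followers sync with no plan
print(results)  # {'c1': ['orders-0', 'orders-3'], 'c2': ['orders-1'], 'c3': ['orders-2']}
```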
For the source code analysis, see my previous article on how a consumer joins a consumer group.
7、 Broker side rebalancing scenario
7.1 New members join
- A new member joins after the consumer group has already reached Stable, which starts a new rebalance
7.2 Members leave voluntarily
- Voluntary departure: the consumer instance calls close(), which notifies the coordinator that it is leaving
- This scenario involves a third request: the LeaveGroup request
7.3 Members crash
- The coordinator needs some time to detect the crash
- This period is controlled by the consumer-side parameter session.timeout.ms
- Kafka does not detect the crash until that timeout has elapsed
- After detection, the rebalance flow is the same as in the previous scenarios
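Crash detection is essentially a heartbeat timeout. In this sketch, the coordinator records when it last heard from each member and treats any member whose last heartbeat is older than session.timeout.ms as dead, which would then trigger a rebalance. Only session.timeout.ms is a real Kafka setting here; the `HeartbeatTracker` class, its methods and the 10-second value are assumptions for illustration, and time is passed in explicitly to keep the example deterministic.

```python
class HeartbeatTracker:
    """Hypothetical coordinator-side bookkeeping of member heartbeats."""

    def __init__(self, session_timeout_ms):
        self.session_timeout_ms = session_timeout_ms
        self.last_heartbeat_ms = {}

    def heartbeat(self, member_id, now_ms):
        self.last_heartbeat_ms[member_id] = now_ms

    def expired_members(self, now_ms):
        # A member counts as dead only after a full session timeout passes
        # with no heartbeat, so a crash goes unnoticed until then.
        return sorted(m for m, t in self.last_heartbeat_ms.items()
                      if now_ms - t > self.session_timeout_ms)

tracker = HeartbeatTracker(session_timeout_ms=10_000)
tracker.heartbeat("c1", now_ms=0)
tracker.heartbeat("c2", now_ms=0)
tracker.heartbeat("c1", now_ms=8_000)         # c2 has crashed and stopped heartbeating
print(tracker.expired_members(now_ms=9_000))   # [] : c2 is still within its session
print(tracker.expired_members(now_ms=10_500))  # ['c2'] : timeout exceeded, rebalance needed
```

A shorter session timeout detects crashes sooner but makes a long GC pause more likely to be mistaken for a crash, which is the trade-off behind the heartbeat trigger listed in section 3.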
7.4 Members commit offsets during a rebalance
- When a rebalance starts, the coordinator gives members a grace period, within which each member must quickly commit its offsets
- After that, the normal JoinGroup/SyncGroup requests begin
Well, that’s all for the rebalancing mechanism. The next article will talk about how to avoid rebalancing.