Let's meet the Kafka consumer

Time: 2019-12-05

We have already introduced the overall framework of Kafka. Where do the messages Kafka producers send finally end up? They get consumed, of course. If we compare Kafka to a restaurant, the producer is the chef and the consumer is the guest: if the chef cooks and nobody eats, the dishes are wasted, and if there are guests but no chef, there is nothing to eat. So if you have read the previous article, feel free to continue; if you haven't, it is worth catching up first.

Kafka consumer concepts

An application uses KafkaConsumer to subscribe to topics in Kafka and receive messages from them. The application first creates a KafkaConsumer object, subscribes to a topic, and starts receiving messages, validating each one and saving the result. After a while, suppose the producers write to the topic faster than the application can validate the data. What then? With only a single consumer, the application cannot keep up with the rate at which messages are produced. Just as multiple producers can write messages to the same topic, multiple consumers are needed to read from it, splitting the topic's messages among themselves.

Kafka consumers belong to consumer groups. The consumers in a group subscribe to the same topic, and each consumer receives messages from a subset of the topic's partitions. Here is a diagram of Kafka partition consumption:

[Figure: consumer group 1 with a single consumer reading all four partitions of topic T1]

Topic T1 in the figure above has four partitions: partition 0, partition 1, partition 2, and partition 3. We create consumer group 1, which has only one consumer; it subscribes to topic T1 and receives all of T1's messages. Since this one consumer has to process the messages producers send to all four partitions, the pressure is a little high. It needs help sharing the work, so the picture turns into the following:

[Figure: consumer group 1 with two consumers, each reading two partitions]

In this way, the group's consumption capacity improves greatly. In some environments, though, producers generate messages so fast that consumers still cannot keep up, so we keep adding consumers:

[Figure: consumer group 1 with four consumers, one per partition]

As shown in the figure above, each partition's messages are consumed by exactly one consumer in the group. If more consumers are added to the group than there are partitions, the extra consumers sit idle, as shown below:

[Figure: consumer group 1 with five consumers; the fifth consumer is idle]

Adding consumers to the group is the main way of scaling consumption capacity. In short, we can scale horizontally and improve consumption capacity by increasing the number of consumers in the consumer group. This is also why it is recommended to create topics with a generous number of partitions: when consumption load is high, more consumers can be added. Note that having more consumers than partitions is pointless, because the extra consumers simply idle.

An important feature of Kafka is that a message only needs to be written once yet can be read by any number of applications. In other words, every application can read the full set of messages. For each application to read all messages, each needs its own consumer group. Continuing the example above, if we add a new consumer group G2 with two consumers, it evolves into the following figure:

[Figure: consumer groups G1 and G2 both reading the full topic T1]

In this scenario, both consumer group G1 and consumer group G2 receive the full set of messages of topic T1; logically, they are two different applications.

In summary: if an application needs to read the full set of messages, give it its own consumer group; if an application's consumption capacity is insufficient, consider adding consumers to its group.

Consumer groups and partition rebalancing

What is a consumer group?

A consumer group is a group of one or more consumer instances, and it provides scalability and fault tolerance. Consumers in a consumer group share a consumer group ID, also called the group ID, and the consumers in a group subscribe to and consume a topic together. Each partition is consumed by only one consumer in the group; redundant consumers are idle and contribute nothing.

We have now seen two modes of consumption:

  • A topic consumed by a single consumer group. This mode is also called point-to-point; point-to-point consumption is also known as message queuing.
  • A topic consumed by multiple consumer groups. This mode is also known as the publish/subscribe pattern (see the sketch after this list).
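
To make the publish/subscribe mode concrete, here is a minimal sketch under assumed names (the broker address, topic name, and group IDs are illustrative, not from the original): two consumers whose only difference is their group.id each receive the full set of messages from the same topic, because each consumer group tracks its own offsets.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "192.168.1.9:9092");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

// A consumer in group G1 receives every message of the topic
props.put("group.id", "G1");
KafkaConsumer<String, String> consumerG1 = new KafkaConsumer<>(props);
consumerG1.subscribe(Collections.singletonList("customerTopic"));

// A consumer in a different group, G2, also receives every message:
// this is exactly the publish/subscribe pattern described above
props.put("group.id", "G2");
KafkaConsumer<String, String> consumerG2 = new KafkaConsumer<>(props);
consumerG2.subscribe(Collections.singletonList("customerTopic"));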

Consumer rebalancing

From the consumer evolution diagrams above we can see the process: first one consumer subscribes to a topic and consumes the messages of all its partitions; then a second consumer joins the group, then more. Each new consumer instance takes over a share of the partitions previously owned by the original consumers. This transfer of partition ownership from one consumer to another is called rebalancing (in English, a Rebalance), as shown in the figure below:

[Figure: partition ownership moving between consumers during a rebalance]

Rebalancing matters because it brings high availability and scalability: we can safely add or remove consumers. But under normal circumstances we do not want it to happen. During a rebalance, consumers cannot read messages, so the whole consumer group is unavailable until the rebalance completes. In addition, when a partition is reassigned to another consumer, the current read state is lost, and caches may need refreshing, slowing the application until it recovers.

Consumers maintain their membership in the group and their ownership of partitions by sending heartbeats to the group coordinator (a Kafka broker). Different consumer groups may have different coordinators. As long as a consumer sends heartbeats at regular intervals, it is considered alive and allowed to keep processing messages from its partitions. Heartbeats are sent when the consumer polls for records or commits the records it has consumed.

If a consumer stops sending heartbeats for long enough, its session expires, the group coordinator considers it dead, and a rebalance is triggered. If a consumer crashes and stops processing messages, the coordinator waits a few seconds without heartbeats before declaring it dead and triggering the rebalance; during those seconds, the dead consumer's partitions process no messages. When shutting down cleanly, a consumer notifies the coordinator that it is leaving the group, and the coordinator triggers a rebalance immediately, minimizing the processing pause.

Rebalancing is a double-edged sword. It brings high availability and scalability to consumer groups, but it also has some obvious drawbacks that the community has so far been unable to eliminate.

The rebalancing process has a great impact on consumer groups, because every rebalance brings everything to a standstill. Compare the JVM's garbage collection mechanism, i.e. stop-the-world (STW); quoting the description of the Serial collector on page 76 of Understanding the Java Virtual Machine:

More importantly, it must pause all other worker threads while it collects garbage, until collection finishes. "Stop The World" sounds impressive, but this work is initiated and completed automatically by the virtual machine in the background: all the user's normal worker threads are stopped while the user cannot see it, which is unacceptable for many applications.

In other words, during a rebalance, the consumer instances in the group stop consuming and wait for the rebalance to complete. And rebalancing is slow.
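
If your application needs to react to a rebalance, for example to commit offsets or clean up state before partitions are taken away, the Kafka client API provides ConsumerRebalanceListener. Here is a minimal sketch (the topic name is illustrative; `consumer` is assumed to be an existing KafkaConsumer):

import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

consumer.subscribe(Collections.singletonList("customerTopic"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before the rebalance starts, after the consumer stops reading:
        // a good place to commit offsets or flush state for the partitions being lost
        System.out.println("Lost partitions: " + partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called after partitions have been reassigned, before consumption resumes
        System.out.println("Assigned partitions: " + partitions);
    }
});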

Creating a consumer

That is enough theory. Now let's see, in code, how a consumer consumes.

Before reading messages, you need to create a KafkaConsumer object. Creating a KafkaConsumer is very similar to creating a KafkaProducer: put the attributes that need to be passed to the consumer into a Properties object. We will look at Kafka's configuration in depth later; for now, we simply set three properties: bootstrap.servers, key.deserializer, and value.deserializer.

We have used these three attributes many times already; if they are unclear, refer to the previous article introducing the Kafka producer.

Another attribute is group.id. It is not required; it specifies which consumer group the KafkaConsumer belongs to. It is also possible to create a consumer that does not belong to any group.

Properties properties = new Properties();
properties.put("bootstrap.servers", "192.168.1.9:9092");
properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);

Topic subscription

Once the consumer is created, the next step is to subscribe to a topic. The subscribe() method takes a list of topics as its parameter and is easy to use:

consumer.subscribe(Collections.singletonList("customerTopic"));

For simplicity, we subscribe to only one topic, customerTopic. subscribe() can also take a regular expression as its parameter, which can match multiple topics. If someone creates a new topic whose name matches the regular expression, a rebalance is triggered immediately and the consumers in the group start reading the new topic.

To subscribe to all test-related topics, do this:

consumer.subscribe(Pattern.compile("test.*"));

Polling

We know Kafka supports the publish/subscribe pattern: producers send data to the Kafka broker, but how does the consumer find out about it? The consumer does not actually know when producers produce data. Instead, KafkaConsumer polls the broker at regular intervals to retrieve data: if there is data, it is consumed; if not, the consumer keeps polling and waiting. Here is a concrete implementation of the polling loop:

Map<String, Integer> map = new HashMap<>();
try {
  while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
      // count how many times each value has been seen
      int updateCount = 1;
      if (map.containsKey(record.value())) {
        updateCount = map.get(record.value()) + 1;
      }
      map.put(record.value(), updateCount);
    }
  }
} finally {
  consumer.close();
}
  • This is an infinite loop. A consumer is a long-running application that continuously requests data from Kafka by polling.
  • The poll() call is crucial. The consumer must poll Kafka periodically, or Kafka will consider it dead, trigger a rebalance, and hand its partitions to other consumers in the group. The parameter passed to poll() is a timeout, expressed with the java.time.Duration class. If the parameter is set to 0, poll() returns immediately; otherwise it waits up to the specified time for the broker to return data.
  • poll() returns a list of records. Each record contains the topic it belongs to, the partition it came from, its offset within the partition, and its key-value pair. We typically iterate over the list and process the records one by one.
  • Call close() before exiting the application. It closes the network connections and sockets and immediately triggers a rebalance, rather than waiting for the group coordinator to notice the missing heartbeats and declare the consumer dead.

Thread safety

Within the same group, you cannot have one thread run multiple consumers, nor can multiple threads safely share one consumer. The rule is one consumer per thread: to run multiple consumers of the same group, each must run in its own thread. You can use Java's ExecutorService to launch multiple consumer threads, as in the sketch below.
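
Here is a minimal sketch of the one-consumer-per-thread rule using ExecutorService. The thread count, broker address, topic, and group name are illustrative; the key point is that each task creates and uses its own KafkaConsumer, never sharing it across threads.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

int workerCount = 4; // at most the number of partitions; extra consumers would idle
ExecutorService executor = Executors.newFixedThreadPool(workerCount);
for (int i = 0; i < workerCount; i++) {
    executor.submit(() -> {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.1.9:9092");
        props.put("group.id", "G1");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // each thread owns its own consumer instance; consumers are never shared
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("customerTopic"));
            while (!Thread.currentThread().isInterrupted()) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
            }
        }
    });
}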

Consumer configuration

So far we have learned how to use the consumer API, but we have only covered a few of the most basic properties. The Kafka documentation lists all consumer-related configuration options. Most parameters have reasonable defaults and generally do not need modification. Let's go through the important ones.

  • fetch.min.bytes

This property specifies the minimum number of bytes the consumer fetches from the server. When a broker receives a data request from a consumer and the amount of available data is smaller than fetch.min.bytes, it waits until enough data is available before returning it. This reduces the workload of both consumers and brokers, since they do not need to shuttle messages back and forth when the topic is not very busy. If there is not much data available but the consumer's CPU usage is high, set this property higher than the default. If the number of consumers is large, increasing this value also reduces the broker's workload.

  • fetch.max.wait.ms

fetch.min.bytes tells Kafka to wait until there is enough data before returning it to the consumer, while fetch.max.wait.ms specifies how long the broker will wait, 500 milliseconds by default. If not enough data is flowing into Kafka to satisfy the consumer's minimum fetch size, the result is up to 500 milliseconds of extra latency; to reduce potential latency, set this parameter lower. If fetch.max.wait.ms is set to 100 ms and fetch.min.bytes to 1 MB, then after receiving a consumer request Kafka returns either 1 MB of data, or all available data after 100 ms, whichever condition is satisfied first.

  • max.partition.fetch.bytes

This property specifies the maximum number of bytes the server returns to the consumer per partition. Its default is 1 MB, meaning KafkaConsumer.poll() returns at most max.partition.fetch.bytes from each partition. If a topic has 20 partitions and there are 5 consumers, each consumer needs at least 4 MB of available memory to receive records. When allocating memory for consumers, allocate extra, because if a consumer in the group crashes, the remaining consumers must handle more partitions. The value of max.partition.fetch.bytes must be greater than the largest message the broker will accept (configured via the broker's message.max.bytes property); otherwise the consumer may be unable to read those messages and will hang, retrying forever. Another consideration when setting this property is the time consumers spend processing data: consumers must call poll() frequently to avoid session expiration and partition rebalancing. If a single call to poll() returns too much data, the consumer needs more time to process it and may not poll again soon enough to avoid session expiration. In that case, reduce max.partition.fetch.bytes or lengthen the session timeout.

  • session.timeout.ms

This property specifies how long a consumer can go without contact with the server before being considered dead; the default is 3 seconds. If the consumer does not send a heartbeat to the group coordinator within the time specified by session.timeout.ms, it is considered dead and the coordinator triggers a rebalance, assigning its partitions to other consumers in the group. This property is closely related to heartbeat.interval.ms, which specifies how frequently the poll() method sends heartbeats to the group coordinator, while session.timeout.ms specifies how long the consumer may go without sending one. The two are therefore usually modified together: heartbeat.interval.ms must be smaller than session.timeout.ms, typically one third of it. If session.timeout.ms is 3 s, heartbeat.interval.ms should be 1 s. Setting session.timeout.ms lower than the default detects and recovers from failures faster, but long polls or garbage collection may then cause unwanted rebalances. Setting it higher reduces accidental rebalances, but it takes longer to detect a real crash.

  • auto.offset.reset

This property specifies what the consumer should do when it reads a partition that has no committed offset, or whose committed offset is invalid. The default is latest, meaning that when the offset is invalid, the consumer starts reading from the newest records. The alternative is earliest, meaning that when the offset is invalid, the consumer reads the partition from the beginning.

  • enable.auto.commit

We will cover the different ways of committing offsets later. This property specifies whether the consumer commits offsets automatically; the default is true. To minimize duplicates and data loss, set it to false and control when offsets are committed yourself. If it is set to true, auto.commit.interval.ms controls the commit frequency.

  • partition.assignment.strategy

We know that partitions are assigned to the consumers in a group. A PartitionAssignor decides, given the consumers and the topics they subscribe to, which partitions go to which consumer. Kafka has two default assignment strategies: Range and RoundRobin.

  • client.id

This property can be any string. The broker uses it to identify messages sent from the client, and it typically appears in logs, metrics, and quotas.

  • max.poll.records

This property controls the number of records a single call to poll() can return. It helps you control the amount of data processed per polling cycle.

  • receive.buffer.bytes and send.buffer.bytes

These set the sizes of the TCP buffers the socket uses when reading and writing data. If they are set to -1, the operating system defaults are used. If the producer or consumer is in a different data center from the broker, these values can be increased, since cross-data-center links generally have higher latency and lower bandwidth.
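
As a sketch, here is how several of the properties above might be set together when creating a consumer. The values are illustrative, not recommendations, and the broker address and group name are assumptions:

Properties props = new Properties();
props.put("bootstrap.servers", "192.168.1.9:9092");
props.put("group.id", "G1");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("fetch.min.bytes", "1024");              // wait until at least 1 KB is available...
props.put("fetch.max.wait.ms", "100");             // ...or until 100 ms have passed
props.put("max.partition.fetch.bytes", "1048576"); // at most 1 MB per partition per poll
props.put("auto.offset.reset", "earliest");        // read from the beginning if no valid offset
props.put("enable.auto.commit", "false");          // commit offsets manually
props.put("max.poll.records", "500");              // cap the records returned by one poll
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);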

The concept of commits and offsets

Special offsets

As mentioned above, each time a consumer calls poll() it returns records that producers have written to Kafka but that the consumers in the group have not yet read. This makes it possible to track which records were read by which consumer in the group. Consumers use Kafka to track their position (offset) in each partition.

The action of updating the current position is called a commit: consumers commit offsets by sending messages to a special topic named __consumer_offsets, which stores the committed offset for each partition. While all consumers are up and running, committed offsets have no effect and this topic does nothing. But when a rebalance is triggered, consumers stop working and may be assigned different partitions afterwards. This topic exists so that, after the rebalance, each consumer can look up the last committed offset for each of its partitions and continue processing from there.

If the committed offset is smaller than the offset of the last message the client processed, the messages between the two offsets will be processed again:

[Figure: committed offset smaller than the last processed offset causes reprocessing]

If the committed offset is larger than the offset of the last message actually consumed, the messages between the two offsets will be lost:

[Figure: committed offset larger than the last processed offset causes message loss]

Since __consumer_offsets is so important, how are offsets committed? Let's go through the options.

Ways to commit

The KafkaConsumer API provides several ways to commit offsets:

Automatic commit

The easiest way is to let the consumer commit offsets automatically. If enable.auto.commit is set to true, the consumer automatically commits the largest offset returned by poll() every 5 seconds. The commit interval is controlled by auto.commit.interval.ms and defaults to 5 seconds. Like everything else in the consumer, automatic commits happen during polling: on every poll, the consumer checks whether it is time to commit and, if so, commits the offsets returned by the previous poll.
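
A sketch of the relevant configuration; the values shown are the defaults described above:

properties.put("enable.auto.commit", "true");      // commit offsets automatically
properties.put("auto.commit.interval.ms", "5000"); // every 5 seconds (the default)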

Committing the current offset

Set enable.auto.commit to false to let the application decide when to commit offsets, and use commitSync() to commit them. This API commits the latest offsets returned by poll(); it returns as soon as the commit succeeds and throws an exception if it fails.

commitSync() commits the latest offset returned by poll(), so make sure you have finished processing all the records before calling it; otherwise you risk losing messages. If a rebalance happens, all messages from the beginning of the most recent batch up to the rebalance will be processed twice. A sketch of the pattern follows.
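
A minimal sketch of a synchronous-commit loop, assuming the consumer from the earlier examples and that records are fully processed before the commit:

import java.time.Duration;
import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

try {
  while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
      System.out.printf("partition=%d, offset=%d, value=%s%n",
          record.partition(), record.offset(), record.value());
    }
    try {
      consumer.commitSync(); // commit only after the whole batch is processed
    } catch (CommitFailedException e) {
      // commitSync retries transient errors itself; reaching here means an unrecoverable failure
      System.err.println("commit failed: " + e.getMessage());
    }
  }
} finally {
  consumer.close();
}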

Asynchronous commit

Asynchronous commit, commitAsync(), differs from commitSync() most importantly in that it does not retry failed commits, whereas commitSync() keeps retrying until it succeeds or hits an unrecoverable error.
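
commitAsync() also accepts an optional callback that is invoked when the broker responds, which is useful for logging failures. A sketch:

import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetCommitCallback;
import org.apache.kafka.common.TopicPartition;

consumer.commitAsync(new OffsetCommitCallback() {
  @Override
  public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception exception) {
    if (exception != null) {
      // no retry here: a later commit will commit a newer offset anyway
      System.err.println("async commit failed for " + offsets + ": " + exception);
    }
  }
});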

Combining synchronous and asynchronous commits

In general, the occasional failed commit without retry is not a big problem: if the failure was caused by a transient issue, a subsequent commit will succeed. But if it is the last commit before the consumer closes, or before a rebalance, you must make sure it succeeds.

Therefore, commitAsync() and commitSync() are usually used together, with a synchronous commit just before the consumer shuts down:
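
A sketch of the combined pattern: commit asynchronously inside the loop, then commit synchronously one last time on shutdown. Here `running` is a hypothetical shutdown flag and process() a hypothetical record handler, not part of the original:

try {
  while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
      process(record); // hypothetical processing method
    }
    consumer.commitAsync(); // fast, no retry; a failure is fixed by a later commit
  }
} finally {
  try {
    consumer.commitSync(); // final commit before closing: retries until success or fatal error
  } finally {
    consumer.close();
  }
}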

Committing a specific offset

The consumer API also lets you pass a map of partitions and offsets to commitSync() and commitAsync(), committing specific offsets rather than the latest batch.
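
A sketch of committing specific offsets: build a map of TopicPartition to OffsetAndMetadata while processing, and pass it to commitSync(). Note the offset committed is record offset + 1, the position of the next message to read:

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

Map<TopicPartition, OffsetAndMetadata> currentOffsets = new HashMap<>();
while (true) {
  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
  for (ConsumerRecord<String, String> record : records) {
    // remember, per partition, the offset of the next message to read
    currentOffsets.put(
        new TopicPartition(record.topic(), record.partition()),
        new OffsetAndMetadata(record.offset() + 1));
  }
  if (!currentOffsets.isEmpty()) {
    consumer.commitSync(currentOffsets); // commit exactly these partition offsets
  }
}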

Article references:

Interpretation of Kafka Consumers

Kafka: The Definitive Guide

Geektime: Kafka Core Technology and Practice

https://docs.confluent.io/cur…

KafkaConsumer