How Does Kafka Handle Message Loss?


Figure: a schematic diagram

Kafka can lose messages at three points: the broker, the producer, and the consumer.


To achieve higher performance and throughput, Kafka persists data to disk asynchronously and in batches. To improve performance and reduce the number of disk writes, Kafka flushes in batches: data is written to disk once a certain number of messages has accumulated or a time interval has elapsed. This mechanism comes from the Linux operating system: when data is stored, it first lands in the page cache and is flushed to disk (from page cache to file) based on time or other conditions, or forced to disk with the fsync command. While the data sits only in the page cache, a system failure loses it.

Figure: the broker reads and writes at high speed on the Linux server and synchronizes to replicas

The figure above outlines how the broker writes and synchronizes data. The broker writes data only to the page cache, which lives in memory; that data is lost on power failure. Page-cache data is flushed to disk by the Linux flusher threads. There are three trigger conditions for flushing:

  • Calling the sync or fsync function explicitly
  • Available memory falling below a threshold
  • Dirty data exceeding a time threshold. "Dirty" is a flag on a page-cache page: when data is written to the page cache, the page is marked dirty; after the data is flushed, the dirty flag is cleared.
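For reference, Linux exposes these thresholds through the vm.dirty_* sysctls. The values below are illustrative (close to common defaults), not tuning recommendations:

```ini
# /etc/sysctl.conf (illustrative values)
# start background writeback when dirty pages exceed 10% of memory
vm.dirty_background_ratio = 10
# block writing processes when dirty pages exceed 20% of memory
vm.dirty_ratio = 20
# flush a dirty page once it has been dirty for 30 s (value in centiseconds)
vm.dirty_expire_centisecs = 3000
```

Until one of these conditions fires, the "written" data exists only in memory.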

The broker controls this flush mechanism through configuration, calling fsync to take over the flush action. From the perspective of a single broker, page-cache data can still be lost.

Kafka does not provide a synchronous flush. Synchronous flushing is implemented in RocketMQ: the idea is to block the asynchronous flush process and wait for its response, similar to an Ajax callback or a Java Future. Below is the relevant RocketMQ source code:

```java
GroupCommitRequest request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes());
service.putRequest(request);
// block until the flush completes or the sync-flush timeout elapses
boolean flushOK = request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());
```

In other words, in theory a single Kafka broker cannot guarantee that no message is lost. We can only mitigate the situation by tuning the flush parameters, for example shortening the flush interval or reducing the flush batch size. The shorter the interval, the worse the performance and the better the reliability (as reliable as possible). It is a trade-off.
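As a sketch, the batch thresholds just mentioned map to broker settings like the ones below. The values are illustrative only; by default Kafka leaves both unset and lets the OS decide when to flush, which its documentation recommends:

```ini
# server.properties (illustrative values, not defaults)
# force an fsync after this many messages have accumulated in a partition
log.flush.interval.messages=10000
# force an fsync once a message has sat unflushed for this long (ms)
log.flush.interval.ms=1000
```

Lowering either value narrows the window of unflushed data at the cost of throughput.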

To address this, Kafka handles single-broker loss through cooperation between the producer and the broker. Once the producer detects that a broker has lost a message, it retries automatically. A message is lost only if the number of retries exceeds the (configurable) threshold, at which point the producer client must handle the failure itself. So how does the producer detect the loss? Through the ack mechanism, similar to TCP's three-way handshake.

The number of acknowledgments the producer requires the leader to have received before considering a request complete. This controls the durability of records that are sent. The following settings are allowed:

acks=0 If set to zero then the producer will not wait for any acknowledgment from the server at all. The record will be immediately added to the socket buffer and considered sent. No guarantee can be made that the server has received the record in this case, and the retries configuration will not take effect (as the client won't generally know of any failures). The offset given back for each record will always be set to -1.

acks=1 This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers. In this case should the leader fail immediately after acknowledging the record but before the followers have replicated it then the record will be lost.

acks=all This means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee. This is equivalent to the acks=-1 setting.

Source: http://kafka.apache.org/20/do… (Kafka official documentation)

The quotation above is Kafka's official documentation for the acks parameter (in older versions the parameter was request.required.acks).

  • acks=0: the producer does not wait for any response from the broker. This is the most efficient setting, but messages are easily lost.
  • acks=1: after receiving the message, the leader broker returns an ack without waiting for responses from the other followers; the number of acks is 1. If the leader fails before the followers have replicated the message, the message is lost. In terms of the figure above: the leader receives the message, writes it to its page cache successfully, and returns an ack, so the producer considers the send successful. But the data has not yet been synchronized to the followers, so if the leader loses power at this point, the data is lost.
  • acks=-1: after receiving the message, the leader broker waits until all followers in the ISR list have returned results before returning an ack (-1 is equivalent to all). With this configuration, the leader merely writing the data to its page cache does not produce an ack; only when every ISR member reports success is the ack returned. If the power fails before then, the producer knows the message was not sent successfully and resends it. If the ack was returned after the followers received the data and the leader then loses power, the data still exists on the followers; after re-election, the new leader holds the data. Synchronizing data from leader to follower takes two steps:

    1. The data is flushed from the leader's page cache to disk, because only data on disk can be synchronized to a replica.
    2. The data is synchronized to the replica, and the replica writes it to its own page cache. Once the producer has the ack, even if every machine loses power, the data still exists at least on the leader's disk.

The third setting above relies on the followers in the ISR list, and it needs another parameter to make the ack meaningful. The ISR is the "reliable follower list" maintained by the broker, the in-sync replica list. The broker configuration includes the parameter min.insync.replicas, which specifies the minimum number of replicas in the ISR. If it is not set, the follower list in the ISR may be empty, which is then equivalent to acks=1.
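A minimal sketch of the durable producer-side settings discussed above, using only java.util.Properties (the broker address and retry count are illustrative; min.insync.replicas is a broker/topic setting and is shown here only as a comment):

```java
import java.util.Properties;

public class AckConfig {
    // Build producer settings that favor durability over latency.
    public static Properties durableProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address
        // wait for all in-sync replicas to acknowledge before a send succeeds
        props.put("acks", "all");
        // retry transient failures instead of silently dropping the message
        props.put("retries", "3");
        // broker/topic side, pair with: min.insync.replicas >= 2
        return props;
    }

    public static void main(String[] args) {
        Properties props = durableProducerProps();
        System.out.println("acks=" + props.getProperty("acks"));
    }
}
```

These properties would be passed to a KafkaProducer constructor; with acks=all and min.insync.replicas of at least 2, a send fails loudly rather than losing data when too few replicas are in sync.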

Figure: time consumption under each acks setting

As shown in the figure above:

  • acks=0: total time f(t) = f(1).
  • acks=1: total time f(t) = f(1) + f(2).
  • acks=-1: total time f(t) = f(1) + max(f(A), f(B)) + f(2).

From top to bottom, performance decreases while reliability increases.


Producer-side message loss occurs on the producer client.

To improve efficiency and reduce I/O, the producer can merge multiple requests before sending. The merged requests are cached in a local buffer, much like the disk-flush batching described earlier: the producer packs requests into "blocks" or sends out the buffered data at a time interval. The buffer turns the producer into an asynchronous sender, which improves sending efficiency.

However, data in the buffer is at risk. Under normal circumstances, an asynchronous client can handle send failures or timeouts through callbacks. But if the producer is stopped abnormally, the data in the buffer is lost and never reaches the broker. Likewise, if the producer client runs out of memory and its policy is to discard messages (the alternative policy is to block), messages are also lost. Or messages may be generated (asynchronously) too fast, causing too many suspended threads and memory exhaustion, crashing the program and losing the messages.

Figure: producer

According to the figure above, we can think of several solutions:

  • Send messages synchronously instead of asynchronously, or have the service use a blocking thread pool with a bounded number of threads when generating messages. The overall idea is to control the rate of message generation.
  • Enlarge the buffer capacity. This mitigates the problem but cannot eliminate it.
  • Instead of sending messages directly to the buffer (memory), have the service write messages to local disk (a database or file) and let another (or a small number of) sender threads do the sending. This effectively inserts a larger buffer layer between the service and the buffer.
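The first two ideas amount to putting a bounded, blocking buffer in front of the sender so that generation is throttled instead of messages being dropped. A minimal sketch with a hypothetical class (the names and batch scheme are not from any Kafka API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical bounded staging buffer: service threads block when it is
// full, throttling message generation instead of discarding or crashing.
public class BoundedSendBuffer {
    private final BlockingQueue<String> queue;

    public BoundedSendBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Called by the service thread; blocks while the buffer is full.
    public void submit(String message) {
        try {
            queue.put(message);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while buffering", e);
        }
    }

    // Called by the sender thread; drains up to batchSize messages
    // for one batched send to the broker.
    public List<String> drainBatch(int batchSize) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, batchSize);
        return batch;
    }

    public static void main(String[] args) {
        BoundedSendBuffer buffer = new BoundedSendBuffer(100);
        buffer.submit("m1");
        buffer.submit("m2");
        System.out.println(buffer.drainBatch(10).size()); // prints 2
    }
}
```

The third idea replaces the in-memory queue with a durable one (a local file or database), so buffered messages survive a process crash.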


There are several steps for a consumer to consume a message:

  1. Receive the message
  2. Process the message
  3. Send back the "commit"

Consumers have two main consumption modes:

  • Automatic offset committing
  • Manual offset control

With automatic commits, the consumer commits received messages at a fixed time interval. The commit process is asynchronous with respect to message processing; that is, processing may fail (for example, by throwing an exception) after the offset has already been committed, and the message is then lost.

```java
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
// enable automatic commits
props.put("enable.auto.commit", "true");
// automatic commit interval, here 1 s
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
    // about 1000 ms after poll returns, the offsets are committed automatically
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        insertIntoDB(record); // storing in the database may take longer than 1000 ms
}
```
The above is an auto-commit example. If insertIntoDB(record) throws an exception at this point, the message is lost. Next is a manual-commit example:

```java
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
// turn off automatic commits; commit manually instead
props.put("enable.auto.commit", "false");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
final int minBatchSize = 200;
List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
while (true) {
    // with auto-commit off, poll does not commit offsets
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        buffer.add(record);
    }
    if (buffer.size() >= minBatchSize) {
        // commit only after the batch has been fully processed
        insertIntoDb(buffer);
        consumer.commitSync();
        buffer.clear();
    }
}
```
Switching to manual commits guarantees that each message is "consumed at least once." However, this can lead to duplicate consumption, which is beyond the scope of this article.
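Although deduplication is out of scope here, the usual companion to at-least-once delivery is an idempotent consumer. A minimal in-memory sketch (the class and key scheme are hypothetical; a real system would persist the processed keys):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical idempotent handler: processes each message key at most once,
// so messages redelivered after a failed manual commit are skipped.
public class IdempotentHandler {
    private final Set<String> processedKeys = new HashSet<>();

    // Returns true if the message should be processed,
    // false if it is a duplicate and must be skipped.
    public boolean handle(String messageKey) {
        // Set.add returns false when the key was already present
        return processedKeys.add(messageKey);
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        System.out.println(handler.handle("topic-0-42")); // true: first delivery
        System.out.println(handler.handle("topic-0-42")); // false: redelivery, skipped
    }
}
```

A natural key here is topic-partition-offset, since the broker assigns a unique offset to each record within a partition.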

Both examples above use the consumer's high-level API, where offset management is transparent to the client. You can also use the low-level API to control offsets manually and ensure messages are not lost, at the cost of extra complexity:

```java
try {
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Long.MAX_VALUE);
        for (TopicPartition partition : records.partitions()) {
            List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
            for (ConsumerRecord<String, String> record : partitionRecords) {
                System.out.println(record.offset() + ": " + record.value());
            }
            long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
            // precise control of the offset: commit exactly past the last processed record
            consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
        }
    }
} finally {
    consumer.close();
}
```
Author: Info
