Interview | A summary of common Kafka interview questions


Nowadays, Kafka is no longer a simple message queue. It is a distributed stream processing platform used by more and more companies, suitable for high-performance data pipelines, streaming analytics, data integration, and other scenarios. This article summarizes several common Kafka interview questions and will hopefully be useful. It covers the following:

  • How does Kafka protect against data loss?
  • How to solve the problem of Kafka data loss?
  • Can Kafka guarantee that data is never lost?
  • How to ensure that messages in Kafka are ordered?
  • How to determine the number of Kafka topic partitions?
  • How to adjust the number of Kafka topic partitions in a production environment?
  • How to rebalance a Kafka cluster?
  • How to check whether a consumer group is lagging?

Q1: How does Kafka protect against data loss?

This question comes up in Kafka interviews almost as reliably as Java's HashMap does in Java interviews. So how should we read it? The question is really asking: what mechanisms does Kafka's broker provide to ensure that data is not lost?

For the broker side, Kafka's replication mechanism and multi-replica partition architecture are the core of its reliability guarantee. Writing a message to multiple replicas allows Kafka to preserve it even when a broker crashes.

With the core of the question clear, the answer covers three broker-side configurations:

1. Topic replication factor: replication.factor >= 3

2. Minimum in-sync replicas (ISR): min.insync.replicas = 2

3. Disable unclean leader election: unclean.leader.election.enable = false
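As a quick reference, the three settings above can be collected in one map. The keys are real Kafka config names; the values follow this article (note that replication.factor is specified at topic creation time, while the other two are broker/topic configs):

```python
# The three broker-side reliability settings from the list above,
# collected in one place. Values follow the text of this article.
broker_reliability_config = {
    "replication.factor": 3,                  # topic replication factor (set at creation)
    "min.insync.replicas": 2,                 # minimum in-sync replicas required for a write
    "unclean.leader.election.enable": False,  # never elect an out-of-sync replica as leader
}
```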

The following sections analyze these three configurations one by one.

  • Replication factor

A Kafka topic is divided into partitions, and each partition can be configured with multiple replicas via the replication.factor parameter. Kafka has two kinds of partition replicas: the leader replica and follower replicas. When a partition is created, one replica is elected as the leader and the rest automatically become followers. In Kafka, follower replicas do not serve client traffic: no follower can answer read or write requests from consumers or producers. All requests must be sent to the broker hosting the leader replica, which handles them. A follower's only task is to asynchronously pull messages from the leader and append them to its own log, staying in sync with the leader replica.

Generally speaking, a replication factor of 3 covers most use cases; systems with stricter requirements (such as banks) may use 5. With a replication factor of n, data can still be read from and written to the topic even if n-1 brokers fail, so a higher replication factor means higher availability and reliability. On the other hand, a replication factor of n requires at least n brokers and stores n copies of the data, occupying n times the disk space. In a real production environment there is a trade-off between availability and storage cost.

In addition, the placement of replicas affects availability. By default, Kafka places each replica of a partition on a different broker, but if those brokers share a rack, the partition becomes unavailable as soon as the rack's switch fails. It is therefore recommended to spread brokers across racks and use the broker.rack parameter to configure the name of the rack each broker sits in.

  • In-sync replica (ISR) list

ISR stands for in-sync replicas. Every replica in the ISR is synchronized with the leader, so a follower that is not in the list is considered out of sync. Which replicas are in the ISR, then? First, the leader replica is always in the ISR. Whether a follower replica is in the ISR depends on whether it is "synchronized" with the leader.

Kafka's broker side has a parameter, replica.lag.time.max.ms, which sets the maximum interval by which a follower replica may lag behind the leader; the default is 10 seconds. As long as a follower has fully caught up with the leader at some point within the last 10 seconds, it is considered in sync. So even if a follower is currently a few messages behind the leader, it will not be kicked out of the ISR as long as it catches up within 10 seconds.
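The ISR rule can be modeled in a few lines. This is a toy sketch, not Kafka's implementation: each follower records when it last fully caught up with the leader, and it stays in the ISR while that moment is within replica.lag.time.max.ms of now.

```python
def in_sync_replicas(leader, followers, now_ms, max_lag_ms=10_000):
    """Toy model of ISR membership under replica.lag.time.max.ms (default 10 s).

    `followers` maps replica id -> timestamp (ms) when that replica last
    fully caught up with the leader's log end offset.
    """
    isr = [leader]  # the leader replica is always in the ISR
    for replica, last_caught_up_ms in followers.items():
        if now_ms - last_caught_up_ms <= max_lag_ms:
            isr.append(replica)  # caught up recently enough: still in sync
    return isr
```

With `now_ms=100_000`, a follower that last caught up at 95 s stays in the ISR, while one that last caught up at 80 s is evicted.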

The ISR is therefore dynamic: even if a partition is configured with three replicas, the ISR may shrink to a single replica (the others are removed because they cannot keep up with the leader in time). If that last in-sync replica becomes unavailable, we must choose between availability and consistency (the CAP theorem).

By Kafka's definition of reliability, a message is considered committed only after it has been written to all in-sync replicas. But if "all in-sync replicas" means just one replica, the data is lost as soon as that replica fails. To ensure that committed data is written to more than one replica, set the minimum number of in-sync replicas higher. For a topic partition with three replicas, min.insync.replicas=2 requires at least two in-sync replicas before data can be written to the partition.

With this configuration in place, at least two replicas must be in the ISR. If the ISR drops below two replicas, the broker stops accepting produce requests; producers attempting to send data receive a NotEnoughReplicasException, while consumers can still read the existing data.

  • Disable unclean leader election

Electing a new leader from the in-sync replica list is called a clean leader election; this must be distinguished from electing a leader from an out-of-sync replica, which is an unclean leader election. Because the ISR is adjusted dynamically, it can become empty, leaving only out-of-sync replicas, which typically lag far behind the leader. Electing one of them as the new leader can therefore lose data, since the messages it holds fall well short of the old leader's. The broker-side parameter unclean.leader.election.enable controls whether unclean leader election is allowed. Enabling it may lose data, but it keeps a leader replica available at all times, so the partition never stops serving requests, which improves availability. Conversely, disabling unclean leader election preserves data consistency and avoids message loss, at the expense of availability. This is exactly the trade-off the CAP theorem describes for distributed systems.

Unfortunately, even a clean leader election can still produce data inconsistency, because in-sync replicas are not perfectly synchronized: replication is asynchronous, so there is no guarantee that a follower holds the very latest messages. For example, the last message in the leader partition may have offset 100 while a follower's last offset is not yet 100. How far behind a follower can be is bounded by two parameters:

  • the time an in-sync replica may lag behind the leader (replica.lag.time.max.ms)
  • the session timeout with ZooKeeper

In short, if we allow out-of-sync replicas to become leaders, we accept the risk of data loss and data inconsistency; if we forbid it, we accept lower availability, because we must wait for the original leader to come back online.

Different scenarios call for different settings. Systems with strict data-quality and data-consistency requirements (such as banks) disable unclean leader election; systems that prioritize availability, such as real-time clickstream analysis, may leave it enabled.

Q2: How to solve the problem of Kafka data loss?

You may ask how this differs from Q1. In an interview the two can be treated as one question; they are separated here because the answers differ: Q1 is answered from the perspective of Kafka's broker, while Q2 is answered from the perspective of Kafka's producers and consumers.

First, let’s take a look at how to answer this question

  • Producer
  • retries=Long.MAX_VALUE

    Set retries to a large value. retries is a producer parameter that enables the automatic retries mentioned above: if a send fails due to transient network jitter, a producer with retries > 0 retries automatically, avoiding message loss.

  • acks=all

    Set acks = all. acks is a producer parameter that defines when you consider a message "committed". Set to all, it means every in-sync replica must receive the message before it counts as committed; this is the strictest definition of "committed".


  • max.in.flight.requests.per.connection

    This parameter specifies how many requests the producer may send without waiting for a broker response. Higher values use more memory but increase throughput. Setting it to 1 guarantees that messages reach the broker in the order they were sent, even when retries occur.

  • Use the producer API with a callback, that is, producer.send(msg, callback) rather than producer.send(msg).
  • Other error handling

    The producer's built-in retry mechanism handles most errors easily without losing messages, but other kinds of errors, such as oversized messages or serialization failures, still have to be handled in application code.
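Put together, the producer-side settings might look like the following sketch. The broker address is a placeholder, the parameter names follow kafka-python's snake_case spelling, and the callback signature is illustrative rather than any specific client's API:

```python
# Producer settings for loss prevention, as a kafka-python-style config dict.
# Actually constructing KafkaProducer(**producer_config) needs a running
# broker, so only the configuration itself is shown here.
producer_config = {
    "bootstrap_servers": "localhost:9092",  # placeholder broker address
    "acks": "all",         # all in-sync replicas must receive the message
    "retries": 2**31 - 1,  # retry transient send failures (Long.MAX_VALUE analogue)
}

def on_send_result(record_metadata=None, exc=None):
    """Illustrative send callback: surface failures instead of dropping them."""
    if exc is not None:
        print(f"send failed, handle or re-queue: {exc}")
```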

  • Consumer
  • Disable auto commit: enable.auto.commit = false
  • Commit the offset only after the message has been processed
  • Configure auto.offset.reset

    This parameter specifies what the consumer does when it has no committed offset (for example, on first startup) or when the requested offset no longer exists on the broker (for example, the data was deleted).

    There are two settings. With earliest, the consumer reads from the beginning of the partition regardless of whether a valid offset exists; this can make the consumer reprocess a lot of duplicate data, but it minimizes data loss. With latest (the default), the consumer starts reading from the end of the partition; this reduces duplicate processing but is likely to miss some messages.
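The "process first, commit after" rule can be illustrated without a broker. In this toy model, a crash before the commit only means the message is processed again later (at-least-once delivery), never skipped:

```python
def consume(messages, committed_offset, process):
    """Toy at-least-once loop: commit the offset only AFTER processing.

    `messages` is the partition log as a list; `committed_offset` is the
    last committed position (the next offset to read).
    """
    for offset in range(committed_offset, len(messages)):
        process(messages[offset])      # handle the message first
        committed_offset = offset + 1  # then "commit" the new offset
    return committed_offset
```

Starting from committed offset 1 over `["a", "b", "c"]` processes "b" and "c" and leaves the committed offset at 3.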

Q3: Can Kafka guarantee that data is never lost?

The measures analyzed above avoid data loss to a certain extent, but note: Kafka only guarantees limited persistence for "committed" messages. Kafka therefore cannot completely guarantee that data is never lost; some trade-offs are required.

First, understand what a committed message is. When one or more Kafka brokers have successfully received a message and written it to their log files, they tell the producer that the message has been successfully committed; at that point, as far as Kafka is concerned, the message officially becomes "committed". So whether acks = all or acks = 1, Kafka only guarantees persistence for committed messages; that never changes.

Second, understand what limited persistence guarantee means: Kafka cannot promise that no message is ever lost under any circumstances; its brokers must remain available. In other words, if a message is stored on N Kafka brokers, the precondition is that at least one of those N brokers stays alive. As long as that condition holds, Kafka guarantees the message is never lost.

To sum up, Kafka can avoid losing messages, but only messages that have been committed, and only when certain conditions are met.

Q4: How to ensure that messages in Kafka are ordered?

First, be clear that Kafka guarantees order only within a partition, not across a whole topic. If a topic has multiple partitions, Kafka routes each record to a partition according to its key, so for a given key the corresponding records are ordered within their partition.

Kafka can ensure that messages within the same partition are ordered: the producer sends messages in a certain order, the broker writes them to the partition in that order, and consumers read them in that same order.
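The key-to-partition idea can be sketched as follows. Kafka's default partitioner actually uses murmur2 hashing; crc32 stands in here purely for illustration:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Same key -> same partition, so per-key order is preserved.

    Kafka's default partitioner uses murmur2; crc32 is a stand-in here.
    """
    return zlib.crc32(key) % num_partitions
```

Every record keyed, say, `b"acct-42"` lands in the same partition, so all records for that key are consumed in the order they were produced.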

In some scenarios the order of messages matters a great deal. For example, deposit then withdraw and withdraw then deposit produce two very different results.

The parameter max.in.flight.requests.per.connection mentioned above governs write order when the retry count is greater than or equal to 1. If the parameter is greater than 1 and the first batch fails to write while the second batch succeeds, the broker retries the first batch; if the retry then succeeds, the two batches end up in reversed order.

Generally speaking, if message order matters, then to avoid losing data you should set retries > 0 and at the same time set max.in.flight.requests.per.connection to 1, so that no other batches are sent to the broker while the producer retries the first batch. Throughput suffers, but message order is guaranteed.
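The reordering risk can be played out in a toy simulation. This is not producer code; it just enumerates the two cases described above for two batches where the first attempt of batch 0 fails:

```python
def delivered_order(batches, max_in_flight):
    """Toy model: batch 0 fails on its first attempt and is retried.

    With max_in_flight == 1, batch 1 is not sent until batch 0 succeeds,
    so order is preserved. With more batches in flight, batch 1 is already
    on the wire and commits before batch 0's retry, reversing the order.
    """
    if max_in_flight == 1:
        return list(batches)         # retry finishes before the next send
    return [batches[1], batches[0]]  # batch 1 lands first, then the retry
```

`delivered_order(["batch-0", "batch-1"], 5)` comes back reversed, while `max_in_flight=1` keeps the send order.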

Alternatively, a single-partition topic can be used, but throughput suffers severely.

Q5: How to determine the appropriate number of Kafka topic partitions?

Choosing an appropriate partition count enables highly parallel reads and writes and balanced load; balancing load across partitions is the key to throughput. The count should be estimated from the expected per-partition throughput of producers and consumers.

For example, suppose the expected read rate (throughput) is 1 GB/s and a single consumer reads at 50 MB/s; then you need at least 20 partitions and 20 consumers (one consumer group). Likewise, if the expected produce rate is 1 GB/s and each producer writes at 100 MB/s, you need at least 10 partitions. In this case, setting up 20 partitions satisfies both the 1 GB/s produce rate and the consumers' throughput. In general, the partition count should be at least the larger of the required producer and consumer counts; only then can both sides reach their target throughput.

A simple formula: partitions = max(number of producers, number of consumers), where

  • number of producers = total produce throughput / maximum throughput one producer writes to a single partition
  • number of consumers = total consume throughput / maximum throughput one consumer reads from a single partition
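The formula above, in code (the 1 GB/s example is taken as 1000 MB/s, matching the arithmetic in the text):

```python
import math

def required_partitions(target_mb_s, per_producer_mb_s, per_consumer_mb_s):
    """partitions = max(producers needed, consumers needed) for a target rate."""
    producers = math.ceil(target_mb_s / per_producer_mb_s)
    consumers = math.ceil(target_mb_s / per_consumer_mb_s)
    return max(producers, consumers)
```

For the example above, `required_partitions(1000, 100, 50)` gives 20, matching the 20 partitions in the text.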

Q6: How to adjust the number of Kafka topic partitions in a production environment?

Note that increasing the number of partitions for a topic breaks the guarantee that records with the same key go to the same partition. One approach is to create a new topic with more partitions, pause the producers, copy the data from the old topic into the new one, and then switch consumers and producers over to the new topic; this is an intricate operation to carry out.
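Why growing the partition count breaks key placement can be seen with the toy crc32 partitioner again (Kafka really uses murmur2; this is only an illustration): most keys map to a different partition once the count changes.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's default (murmur2-based) partitioner.
    return zlib.crc32(key) % num_partitions

keys = [f"key-{i}".encode() for i in range(100)]
# Count keys whose partition changes when the topic grows from 3 to 4 partitions.
moved = sum(partition_for(k, 3) != partition_for(k, 4) for k in keys)
```

On average roughly three quarters of the keys move, so records for a given key written before and after the change can end up in different partitions, breaking per-key ordering.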

Q7: How to rebalance a Kafka cluster?

You need to rebalance the cluster when:

  • Topic partitions are unevenly distributed across the cluster, unbalancing the load.
  • A broker goes offline, leaving partitions under-replicated.
  • A newly added broker needs to take on load from the cluster.

Use the kafka-reassign-partitions.sh tool to rebalance.

Q8: How to check whether a consumer group is lagging?

We can check with the kafka-consumer-groups.sh tool, for example:

$ bin/kafka-consumer-groups.sh --bootstrap-server cdh02:9092 --describe --group my-group
## The output includes columns such as:
TOPIC  PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST  CLIENT-ID

In a healthy group, the CURRENT-OFFSET value stays very close to the LOG-END-OFFSET value. With this command you can see which partitions are lagging.
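The LAG column is simply the distance between the two offsets, per partition; a sketch with hypothetical offset values:

```python
def consumer_lag(log_end_offset: int, current_offset: int) -> int:
    """LAG = LOG-END-OFFSET - CURRENT-OFFSET for one partition."""
    return log_end_offset - current_offset

# Hypothetical per-partition offsets: partition -> (LOG-END-OFFSET, CURRENT-OFFSET).
offsets = {0: (1500, 1500), 1: (1500, 900)}
lagging = {p for p, (leo, cur) in offsets.items() if consumer_lag(leo, cur) > 0}
```

Here partition 1 lags by 600 messages while partition 0 is fully caught up.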


This article covered eight common Kafka interview questions and gave an answer for each. Working through them should leave you with a deeper understanding of Kafka.

WeChat official account: search for "big data technology and data warehouse".