Kafka Principles and Practice


This article was first published on the vivo Internet Technology WeChat official account: https://mp.weixin.qq.com/s/bV8AhqAjQp4a_iXRfobkCQ.
About the author: Zheng Zhibin graduated from the Computer Science and Technology program (bilingual class) at South China University of Technology. He has worked on development and architecture for e-commerce, open platforms, mobile browsers, recommendation and advertising, big data, and artificial intelligence. He currently works on the AI middle platform and the advertising recommendation business at the vivo Intelligent Platform Center, and specializes in business architecture, platformization, and solutions for various business forms.
Blog address: http://arganzheng.life.


Recently, we needed to migrate our centralized log monitoring system. The original pipeline was: Log Agent => Log Server => ElasticSearch => Kibana. The Log Agent and the Log Server communicate over Thrift RPC, and we implemented a simple load balancer (WRB) ourselves.

In fact, the original scheme ran well, and the asynchronous agent had little impact on application performance. It supported our applications, with tens of millions of page views (PV) per day, without strain. It has one drawback, however: if error logs surge and the Log Server cannot keep up, messages are lost. We have not reached that level yet, and we could introduce a queue as a buffer to handle it. But it is now simpler to use a message queue directly, since RPC, load balancing, and load buffering are all built in. Another option is to read the log files directly, in the manner of Logstash or Flume. Considering flexibility, however, we decided to use a message queue, and we already have ZooKeeper deployed anyway. After some investigation, Kafka turned out to be the best fit for this kind of data transfer and buffering. We therefore plan to change the pipeline to: Log Agent => Kafka => ElasticSearch => Kibana.

Kafka introduction

I. Basic concepts of Kafka

  • Broker: a Kafka cluster contains one or more servers, which are called brokers.
  • Topic: each message published to Kafka cluster has a category, which is called topic.
  • Message

    • Message is the basic unit of Kafka communication, which consists of a fixed length message header and a variable length message body (payload). It is also called record in Java client.
    • Each part of the message structure is described as follows:

      • CRC32: CRC32 checksum, 4 bytes.
      • Magic: message format version number, used for compatibility. 1 byte.
      • Attributes: 1 byte. The lowest three bits indicate the compression codec, the fourth bit indicates the timestamp type (0 for CreateTime, 1 for LogAppendTime), and the upper four bits are reserved and currently have no meaning.
      • Timestamp: message timestamp, 8 bytes. The message header must contain this field when magic > 0.
      • Key length: message key length, 4 bytes.
      • Key: the actual data of the message key.
      • Payload length: the actual data length of the message, 4 bytes.
      • Payload: Message actual data
    • When a message is actually stored, it also carries 12 bytes of extra overhead (LogOverhead):

      • Message offset: 8 bytes, similar to the message ID.
      • Total message length: 4 bytes
  • Partition:

    • Partition is a physical concept. Each topic contains one or more partitions.
    • Each partition is composed of a series of ordered immutable messages, which is an ordered queue.
    • Each partition physically corresponds to a folder named ${topicName}-${partitionId}, for example __consumer_offsets-0.
    • The partition directory stores the log segments of the partition, including log data files and two index files.
    • Each message is appended to the end of its partition. This is a sequential disk write, so it is very efficient, which is an important guarantee of Kafka’s high throughput.
    • Kafka can only guarantee the ordering of messages in a partition, but not the ordering of messages across partitions.
  • LogSegment:

    • The log file is divided into one or more log segments by size or by time. The segment size is specified by the log.segment.bytes configuration item and defaults to 1GB. The time span is set by the log.roll.ms or log.roll.hours configuration item. The segment currently being written is called the active segment (activeSegment).
    • Unlike ordinary log files, a Kafka log segment has two auxiliary index files in addition to the log data file itself:

      • data file

        • The data file is a message set file (FileMessageSet) with the .log suffix. It stores the actual message data.
        • Naming rule: the file name is the offset of the first message in the data file, also called the base offset (BaseOffset), left-padded with zeros to 20 digits.
        • The base offset of each data file is the offset of the last message in the previous data file plus 1 (the first data file starts at 0).
      • Offset index file

        • The file name is the same as that of the data file, but with the .index suffix. Its purpose is to quickly locate a message by its offset.
        • First, Kafka keeps every log segment in a ConcurrentSkipListMap skip list keyed by BaseOffset. When looking up a message with a given offset, a binary-search-style lookup in the skip list quickly locates the data file and index file that contain the message.
        • Then a binary search in the index file finds the largest indexed offset that is less than or equal to the target offset. Finally, the data file is scanned sequentially from the position of that index entry until the message whose offset equals the target offset is found.
        • Note that not every message has an index entry. The index is sparse: an entry is created after every certain number of bytes of data. The index interval can be set with index.interval.bytes.
      • Timestamp index file

        • Later versions of Kafka introduced a timestamp-based index file. The file name is the same as that of the data file, but with the .timeindex suffix. Its purpose is to quickly locate a message by its timestamp.
        • The Kafka API provides the offsetsForTimes(Map<TopicPartition, Long> timestampsToSearch) method, which returns the offset and timestamp of the first message whose timestamp is greater than or equal to the queried time. This is very useful. To start consuming from a certain point in time, call offsetsForTimes() to locate the offset of the first message at or after that time, then call seek(TopicPartition, long offset) to move the consumer position there, and then call poll() to long-poll for messages.
  • Producer:

    • Responsible for publishing messages to Kafka brokers.
    • Some important configuration items of the producer:

      • request.required.acks: Kafka provides three message acknowledgment mechanisms (acks) for producers. They configure how the broker acknowledges a message to the producer after receiving it, so that the producer can act on the acknowledgment. The mechanism is set through the request.required.acks property. The value can be 0, -1, or 1, and the default is 1.

        • acks = 0: the producer does not wait for an acknowledgment from the broker and keeps sending messages.
        • acks = 1: the producer waits until the leader replica has successfully written the message to its log file. This reduces the chance of data loss to some extent, but cannot rule it out, because it does not wait for the follower replicas to finish synchronizing.
        • acks = -1: an acknowledgment is sent to the producer only after the leader replica and all replicas in the ISR list have stored the data. To ensure that no data is lost, the number of in-sync replicas must be greater than 1, which is set through the min.insync.replicas parameter. When the number of in-sync replicas falls below the configured value, the producer throws an exception. This mode also reduces the producer’s sending speed and throughput.
      • message.send.max.retries: the number of times the producer retries before giving up on a message. The default is 3.
      • retry.backoff.ms: the time to wait before each retry, in milliseconds. The default is 100.
      • queue.buffering.max.ms: in asynchronous mode, the maximum time messages are buffered. When this time is reached, the messages are sent as a batch. If batch.num.messages (the maximum number of buffered messages per batch) is also configured in asynchronous mode, reaching either threshold triggers a batch send. The default is 1000ms.
      • queue.buffering.max.messages: in asynchronous mode, the maximum number of unsent messages that can be buffered in the queue. The default is 10000.
      • queue.enqueue.timeout.ms:

        • =0: enqueue immediately if the queue is not full, and discard the message immediately if it is full
        • <0: block unconditionally and never discard
        • >0: block for up to this long, then throw a QueueFullException
      • batch.num.messages: Kafka supports sending messages to a specific partition of a broker in batches. The batch size is set by the batch.num.messages property, which is the maximum number of messages sent in each batch. This item has no effect when the producer sends messages in synchronous mode. The default is 200.
      • request.timeout.ms: how long the producer waits for a broker response when acks are required. The default is 1500ms.
      • send.buffer.bytes: socket send buffer size. The default is 100KB.
      • topic.metadata.refresh.interval.ms: the interval at which the producer periodically requests updated topic metadata. If set to 0, the metadata is refreshed after every message sent. The default is 5min.
      • client.id: producer ID, mainly used by the business side to trace calls and locate problems. The default is console-producer.
  • Consumer & Consumer Group & Group Coordinator:

    • Consumer: the message consumer, a client that reads messages from Kafka brokers. Kafka 0.9 released a new consumer rewritten in Java, which no longer depends on the Scala runtime or on ZooKeeper.
    • Consumer group: each consumer belongs to a specific consumer group, which can be specified through the group.id configuration item. If no group name is specified, the default value is test-consumer-group.
    • Group coordinator: for each consumer group, one broker is selected as the coordinator of that group.
    • Each consumer also has a globally unique ID, which can be specified through the client.id configuration item. If it is not specified, Kafka automatically generates a globally unique ID for the consumer in the format ${groupId}-${hostName}-${timestamp}-${first 8 characters of UUID}.
    • Kafka provides two ways to commit consumer offsets: Kafka commits automatically, or the client calls the corresponding KafkaConsumer API to commit manually.

      • Auto commit: this is not a timer-driven periodic commit. Instead, when certain events occur, the client checks whether the time since the last commit exceeds auto.commit.interval.ms.

        • enable.auto.commit=true
        • auto.commit.interval.ms
      • Manual commit

        • enable.auto.commit=false
        • commitSync(): synchronous commit
        • commitAsync(): asynchronous commit
    • Some important configuration items of consumers:

      • group.id: A unique string that identifies the consumer group this consumer belongs to.
      • client.id: The client id is a user-specified string sent in each request to help trace calls. It should logically identify the application making the request.
      • bootstrap.servers: A list of host/port pairs to use for establishing the initial connection to the Kafka cluster.
      • key.deserializer: Deserializer class for key that implements the org.apache.kafka.common.serialization.Deserializer interface.
      • value.deserializer: Deserializer class for value that implements the org.apache.kafka.common.serialization.Deserializer interface.
      • fetch.min.bytes: The minimum amount of data the server should return for a fetch request. If insufficient data is available the request will wait for that much data to accumulate before answering the request.
      • fetch.max.bytes: The maximum amount of data the server should return for a fetch request.
      • max.partition.fetch.bytes: The maximum amount of data per-partition the server will return.
      • max.poll.records: The maximum number of records returned in a single call to poll().
      • heartbeat.interval.ms: The expected time between heartbeats to the consumer coordinator when using Kafka’s group management facilities.
      • session.timeout.ms: The timeout used to detect consumer failures when using Kafka’s group management facility.
      • enable.auto.commit: If true the consumer’s offset will be periodically committed in the background.
  • ISR: Kafka dynamically maintains an ISR (in-sync replicas) list in ZK, that is, the list of replicas that keep in sync with the leader. It stores the broker IDs of all replicas that keep their messages in sync with the leader replica. If a follower replica goes down or falls too far behind, it is removed from the ISR list.
  • Zookeeper:

    • Kafka uses ZK to store its metadata, including broker information, Kafka cluster information, old-version consumer information and consumed offsets, topic information, partition state information, partition assignment information, dynamic configuration information, and so on.
    • Description of Kafka registering node in ZK:

      • /consumers: after an old-version consumer starts, it creates a consumer node under this ZK node
      • /brokers/seqid: used to help generate broker IDs. When the user does not configure broker.id, ZK automatically generates a globally unique ID.
      • /brokers/topics: each time a topic is created, a node with the same name as the topic is created under this directory.
      • /brokers/ids: each time Kafka starts a KafkaServer, it creates a child node named {broker.id} under this directory
      • /config/topics: stores dynamically modified topic-level configuration
      • /config/clients: stores dynamically modified client-level configuration
      • /config/changes: stores the corresponding change information when configuration is modified dynamically
      • /admin/delete_topics: stores the topics marked for deletion while a topic is being deleted
      • /cluster/id: stores the cluster ID
      • /controller: stores the broker ID of the current controller
      • /isr_change_notification: the path where notifications are stored when the ISR list of a Kafka replica changes
    • During startup and operation, Kafka creates the corresponding nodes in ZK to store metadata, and registers listeners on these nodes through the watch mechanism to monitor metadata changes.
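
The offset lookup described under LogSegment above (find the segment by base offset, binary-search the sparse index, then scan the data file) can be sketched roughly as follows. This is a minimal illustration in plain Python, not Kafka's implementation: the function names, the in-memory lists standing in for the skip list, index file and data file, and the sample data are all assumptions made for the example.

```python
import bisect


def segment_file_name(base_offset, suffix=".log"):
    """File name = base offset left-padded with zeros to 20 digits + suffix."""
    return f"{base_offset:020d}{suffix}"


def find_segment(base_offsets, target_offset):
    """Return the base offset of the segment that contains target_offset.

    base_offsets is sorted ascending; bisect stands in for the floor
    lookup that a skip list keyed by base offset would provide.
    """
    i = bisect.bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise KeyError(f"offset {target_offset} is below the first segment")
    return base_offsets[i]


def find_start_position(sparse_index, target_offset):
    """Return the file position of the largest indexed offset <= target_offset.

    sparse_index is a sorted list of (offset, position) pairs; only some
    messages have an entry.
    """
    offsets = [offset for offset, _ in sparse_index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return sparse_index[i][1] if i >= 0 else 0


def scan(messages, start_position, target_offset):
    """Scan (offset, position, payload) tuples sequentially from start_position."""
    for offset, position, payload in messages:
        if position < start_position:
            continue
        if offset == target_offset:
            return payload
    return None


# Hypothetical data: three segments, and the contents of the segment with
# base offset 100, where every message is assumed to take 45 bytes.
base_offsets = [0, 100, 250]
messages = [(100 + i, 45 * i, f"msg-{100 + i}") for i in range(150)]
sparse_index = [(100, 0), (120, 900), (140, 1800)]

base = find_segment(base_offsets, 130)
start = find_start_position(sparse_index, 130)
print(segment_file_name(base), start, scan(messages, start, 130))
```

The sparse index trades a short sequential scan for a much smaller index: in this sample, looking up offset 130 starts scanning at the entry for offset 120 instead of at the beginning of the segment.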


By analogy with ElasticSearch (ES), a broker corresponds to a node, a topic to an index, a message to a document, and a partition to a shard. A LogSegment corresponds to an ES segment.

How to view message contents (dump log segments)

When using Kafka, we sometimes need to inspect the messages we have produced. They are stored in Kafka’s log files, and because of the special format of those files we cannot read their contents directly. Kafka provides a command that dumps binary log segment files in readable text form:

$ bin/kafka-run-class.sh kafka.tools.DumpLogSegments
Parse a log file and dump its contents to the console, useful for debugging a seemingly corrupt log segment.
Option                                  Description
------                                  -----------
--deep-iteration                        Use deep iteration instead of shallow iteration.
--files <file1, file2, ...>             Required. The log segment files to dump, comma separated.
--key-decoder-class                     Custom key deserializer. Must implement the kafka.serializer.Decoder trait. The jar must be placed in the kafka/libs directory. (Default: kafka.serializer.StringDecoder)
--max-message-size <Integer: size>      Maximum message size in bytes. (Default: 5242880)
--print-data-log                        Also print the message contents when dumping data logs.
--value-decoder-class                   Custom value deserializer. Must implement the kafka.serializer.Decoder trait. The jar must be placed in the kafka/libs directory. (Default: kafka.serializer.StringDecoder)
--verify-index-only                     Only verify the index.
$ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /tmp/kafka-logs/test-0/00000000000000000000.log --print-data-log 
Dumping /tmp/kafka-logs/test-0/00000000000000000000.log
Starting offset: 0
offset: 0 position: 0 CreateTime: 1498104812192 isvalid: true payloadsize: 11 magic: 1 compresscodec: NONE crc: 3271928089 payload: hello world
offset: 1 position: 45 CreateTime: 1498104813269 isvalid: true payloadsize: 14 magic: 1 compresscodec: NONE crc: 242183772 payload: hello everyone

Note: without the --print-data-log option, you only see the message headers and no payload.
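
If you want to post-process the dump, each record line in the output above can be split into fields. The following is only a sketch: it assumes the "key: value" layout shown in the sample output, with the payload as the last field, and the function name parse_dump_line is made up for this example. The output format may differ between Kafka versions.

```python
import re

# One "name: value" pair; values in the header part contain no spaces.
FIELD = re.compile(r"(\w+): (\S+)")

# Header fields that hold integers in the sample output above.
INT_FIELDS = ("offset", "position", "CreateTime", "payloadsize", "magic", "crc")


def parse_dump_line(line):
    """Parse one record line of DumpLogSegments output into a dict."""
    # The payload may contain spaces, so split it off before matching fields.
    head, sep, payload = line.partition(" payload: ")
    record = dict(FIELD.findall(head))
    for key in INT_FIELDS:
        if key in record:
            record[key] = int(record[key])
    if sep:
        record["payload"] = payload
    return record


line = ("offset: 0 position: 0 CreateTime: 1498104812192 isvalid: true "
        "payloadsize: 11 magic: 1 compresscodec: NONE crc: 3271928089 "
        "payload: hello world")
record = parse_dump_line(line)
print(record["offset"], record["CreateTime"], record["payload"])
```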

It can also be used to view the index file:

$ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /tmp/kafka-logs/test-0/00000000000000000000.index  --print-data-log 
Dumping /tmp/kafka-logs/test-0/00000000000000000000.index
offset: 0 position: 0

The same works for the timeindex file:

$ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /tmp/kafka-logs/test-0/00000000000000000000.timeindex  --print-data-log 
Dumping /tmp/kafka-logs/test-0/00000000000000000000.timeindex
timestamp: 1498104813269 offset: 1
Found timestamp mismatch in :/tmp/kafka-logs/test-0/00000000000000000000.timeindex
  Index timestamp: 0, log timestamp: 1498104812192
Found out of order timestamp in :/tmp/kafka-logs/test-0/00000000000000000000.timeindex
  Index timestamp: 0, Previously indexed timestamp: 1498104813269

Consumer rebalance process

Consumer rebalancing is the process in which consumers rejoin the consumer group and partitions are reassigned among them. A rebalance is triggered when:

  • New consumers join the consumer group
  • An existing consumer leaves the consumer group (whether by crashing or by a normal shutdown)
  • A consumer unsubscribes from a topic
  • The number of partitions of a subscribed topic increases (the number of Kafka partitions can be increased dynamically but not decreased)
  • A broker goes down and a new coordinator is elected
  • A consumer has not sent a heartbeat request within ${session.timeout.ms}, so the group coordinator considers it to have left.

Automatic rebalancing gives consumers high availability and scalability: when we add or remove consumers or partitions, we do not need to care about how partitions are mapped to consumers underneath. Note, however, that during a rebalance consumers cannot fetch messages for a short period, because partitions are being reassigned to them.


Pay special attention to the last case, the so-called slow consumer. If the coordinator receives no heartbeat request within session.timeout.ms, it can remove the slow consumer from the group. Typically, a consumer becomes slow when processing a batch of messages takes longer than session.timeout.ms, so that the interval between calls to poll() exceeds session.timeout.ms. Because the heartbeat is only sent when poll() is called (in later versions, the client sends heartbeats asynchronously in the background), the coordinator marks the slow consumer as dead.

If a heartbeat request is not received within session.timeout.ms, the coordinator marks the consumer as dead and disconnects from it. At the same time, it triggers a rebalance by returning the ILLEGAL_GENERATION error code in the heartbeat responses of the other consumers in the group.

Pay particular attention to this problem when committing offsets manually; otherwise the commit fails, which leads to repeated consumption.
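
To make the slow-consumer case concrete, here is a toy model of the coordinator's bookkeeping. It is a sketch under simplifying assumptions, not the real group protocol: the class name, the explicit timestamps passed in by the caller, and the use of a single generation counter are all invented for illustration.

```python
class GroupCoordinatorSketch:
    """Toy coordinator that tracks the last heartbeat time of each consumer."""

    def __init__(self, session_timeout_ms):
        self.session_timeout_ms = session_timeout_ms
        self.last_heartbeat_ms = {}
        self.generation = 0

    def join(self, consumer_id, now_ms):
        """A consumer joins: membership changes, so the generation is bumped."""
        self.last_heartbeat_ms[consumer_id] = now_ms
        self.generation += 1
        return self.generation

    def heartbeat(self, consumer_id, generation, now_ms):
        """Accept a heartbeat only from a member of the current generation."""
        if consumer_id not in self.last_heartbeat_ms or generation != self.generation:
            return "ILLEGAL_GENERATION"  # the consumer has to rejoin the group
        self.last_heartbeat_ms[consumer_id] = now_ms
        return "OK"

    def expire(self, now_ms):
        """Remove consumers whose last heartbeat is older than the session timeout."""
        dead = [c for c, t in self.last_heartbeat_ms.items()
                if now_ms - t > self.session_timeout_ms]
        for consumer_id in dead:
            del self.last_heartbeat_ms[consumer_id]
        if dead:
            self.generation += 1  # triggers a rebalance for the survivors
        return dead


coordinator = GroupCoordinatorSketch(session_timeout_ms=10_000)
coordinator.join("c1", now_ms=0)
generation = coordinator.join("c2", now_ms=0)

# c1 keeps polling (and therefore heartbeating); c2 is stuck processing.
print(coordinator.heartbeat("c1", generation, now_ms=9_000))
print(coordinator.expire(now_ms=12_000))
print(coordinator.heartbeat("c1", generation, now_ms=12_500))
```

In this model c2 is expired at 12 seconds, and the next heartbeat from c1 is answered with ILLEGAL_GENERATION, which is the signal for c1 to rejoin and take part in the rebalance.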

II. Characteristics of Kafka

  1. Message order: ordering is guaranteed within each partition, but not globally across partitions. If global ordering is required, the topic can have only one partition.
  2. Consumer group: consumers in a consumer group fetch messages concurrently, but to preserve the order of messages in a partition, each partition is consumed by only one consumer in the group. The number of consumers in a group should therefore be less than or equal to the number of partitions of the topic. (If global ordering is required, there can be only one partition and one consumer.)
  3. A message in a topic can be consumed by only one consumer within a consumer group, but multiple consumer groups can consume the same message at the same time. This is how Kafka implements both broadcast (to all consumers) and unicast (to a single consumer) of a topic’s messages. A topic can have multiple consumer groups. To broadcast, give each consumer its own group. To unicast, put all consumers in the same group.
  4. Producers push messages and consumers pull them: some logging-centric systems, such as Facebook’s Scribe and Cloudera’s Flume, use the push model. Push and pull each have advantages and disadvantages. The push model adapts poorly to consumers with different consumption rates, because the sending rate is decided by the broker. Its goal is to deliver messages as fast as possible, which can easily leave consumers unable to keep up; the typical symptoms are denial of service and network congestion. With the pull model, a consumer fetches messages at a rate that matches its own capacity. Pull also simplifies the broker design, and the consumer controls its own consumption rate. The consumer also controls how it consumes, in batches or one message at a time, and can choose different commit strategies to achieve different delivery semantics.
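
Points 2 and 3 above can be illustrated with a small simulation. It is only a sketch: the round-robin assignment is an assumption chosen for simplicity (the article does not say which assignment strategy is used), and the function, group and consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer of a group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


def receivers(message_partition, groups, partitions):
    """Return, for every group, the single consumer that gets the message."""
    result = {}
    for group, consumers in groups.items():
        for consumer, owned in assign_partitions(partitions, consumers).items():
            if message_partition in owned:
                result[group] = consumer
    return result


partitions = [0, 1, 2]

# Unicast inside a group, broadcast across groups: a message written to
# partition 1 reaches one consumer in each group.
groups = {"analytics": ["a1", "a2"], "backup": ["b1"]}
print(receivers(1, groups, partitions))

# More consumers than partitions: the extra consumer gets nothing to do.
print(assign_partitions(partitions, ["c1", "c2", "c3", "c4"]))
```

The second call shows why the number of consumers in a group should not exceed the number of partitions: the fourth consumer ends up with an empty assignment.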

In fact, one of Kafka’s design goals is to serve both offline and real-time processing. Thanks to this, you can use a real-time stream processing system such as Storm or Spark Streaming to process messages online, use a batch processing system such as Hadoop to process them offline, and also back up the data to another data center in real time. You only need to make sure that the consumers used for these three purposes belong to different consumer groups.

III. High availability (HA) of Kafka

Before version 0.8, Kafka had no high availability mechanism. Once one or more brokers went down, none of the partitions on them could serve requests during the downtime. If a broker could never be recovered, or its disk failed, the data on it was lost. One of Kafka’s design goals is data persistence. Moreover, in a distributed system, especially once the cluster grows to a certain scale, the probability that one or more machines go down increases considerably, so failover becomes a strong requirement. Kafka therefore introduced a high availability mechanism in 0.8, which mainly consists of data replication and leader election.

Data Replication

Starting from 0.8, Kafka provides replication at the partition level. The number of replicas can be configured in $KAFKA_HOME/config/server.properties:

default.replication.factor = 1

This replication, together with leader election, provides automatic failover. Replication has some impact on Kafka’s throughput but greatly improves availability. By default, Kafka’s replication factor is 1. Each partition has a unique leader, and all reads and writes go through the leader; followers pull data from the leader in batches. Generally, the number of partitions is greater than or equal to the number of brokers, and the leaders of all partitions are evenly distributed across the brokers. The log on a follower is exactly the same as the log on its leader.

Note that the replication factor does not affect consumer throughput tests, because a consumer only reads from the leader of each partition, regardless of the replication factor. Likewise, consumer throughput does not depend on whether replication is synchronous or asynchronous.

Leader Election

After replication is introduced, the same partition may have multiple replicas, and a leader must be chosen among them. Producers and consumers interact only with the leader replica; the other replicas act as followers and replicate data from the leader. Note that only the leader handles reads and writes. Followers just fetch data from the leader in order (N channels) and do not serve any reads or writes, which keeps the system simpler and more efficient.

Consider this: why do follower replicas not serve reads or writes, acting only as cold standbys?

It is easy to see why follower replicas do not serve writes. If they did, all replicas would have to synchronize with one another, and N replicas would need N×N channels to synchronize data. With asynchronous synchronization, data consistency and ordering are hard to guarantee. With synchronous synchronization, write latency is effectively multiplied by N, which defeats the purpose.

Then why not let follower replicas serve reads to relieve the read pressure on the leader? Besides the data inconsistency caused by replication lag, there is another reason: unlike other storage services (such as ES or MySQL), reading from Kafka is essentially ordered message consumption, and consumption progress depends on an offset that has to be saved. If reads were load-balanced across multiple replicas, that offset would become uncertain.


The leader replica in Kafka is similar to the primary shard in ES, and the follower replica corresponds to the replica shard in ES. In ES, an index also has multiple shards (analogous to a Kafka topic with multiple partitions), divided into primary shards and replica shards. Primary shards serve reads and writes (sharding works much like in MySQL: shard = hash(routing) % number_of_primary_shards; ES, however, introduces the coordinating node role, which makes this transparent to the client), while replica shards serve reads only (like Kafka, ES waits for the replica shards to return successfully before finally returning to the client).

Readers with experience in traditional MySQL database and table sharding will find this very familiar: it is a sharding + replication data architecture, made transparent to you by the client (SDK) or the coordinator.

Propagate message

When a producer publishes a message to a partition, it first finds the leader of that partition through ZooKeeper. Then, regardless of the topic’s replication factor (that is, how many replicas the partition has), the producer sends the message only to the leader of the partition. The leader writes the message to its local log, and each follower pulls data from the leader, so a follower stores data in the same order as the leader. After receiving the message and writing it to its log, the follower sends an ACK to the leader. Once the leader has received ACKs from all replicas in the ISR (in-sync replicas), the message is considered committed; the leader then advances the HW (high watermark) and sends an ACK to the producer.

To improve performance, each follower sends an ACK to the leader as soon as it receives the data, rather than waiting until the data is written to its log. Therefore, for a committed message, Kafka can only guarantee that it is held in the memory of multiple replicas, not that it has been persisted to disk, so it cannot fully guarantee that the message will still be consumable after a failure. Since this scenario is very rare, however, the approach can be seen as a good balance between performance and durability. Kafka will consider providing stronger durability in future releases.

Consumers also read messages from the leader. Only committed messages (messages with an offset lower than the HW) are exposed to consumers.

The data flow of Kafka replication is as follows:

[Figure: data flow of Kafka replication]

This topic is extensive and complicated, so we will not expand on it here. Interested readers can refer to this well-written article: Kafka Design Analysis (II): Kafka High Availability (I).

Kafka’s cursors (offsets)

The figure below is simple and shows all of Kafka’s cursors:


[Figure: Kafka’s cursors (offsets)]

Here is a brief description:


The ISR (in-sync replicas) list, as the name implies, is the list of replicas that stay in sync with the leader. In version 0.9, the broker parameter replica.lag.time.max.ms defines what counts as in sync: if the leader has not received a fetch request from a follower for that long, or the follower has not caught up to the leader’s log end offset for that long, the leader removes the follower from the ISR. The ISR is a very important indicator, and the controller uses it when electing the leader replica of a partition. The leader maintains the ISR list, so after choosing the ISR it records the result in ZooKeeper.
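
The ISR rule just described can be written down in a few lines. This is a simplified sketch, not broker code: the function name and the last_caught_up_ms map (the last time each follower had fetched up to the leader's log end offset) are assumptions made for illustration.

```python
def shrink_isr(leader_id, isr, last_caught_up_ms, now_ms, replica_lag_time_max_ms):
    """Return the replicas that remain in the ISR at time now_ms.

    A follower stays in the ISR only if it caught up to the leader's log
    end offset within the last replica_lag_time_max_ms milliseconds.
    """
    kept = []
    for broker_id in isr:
        if broker_id == leader_id:
            kept.append(broker_id)  # the leader is always in sync with itself
        elif now_ms - last_caught_up_ms[broker_id] <= replica_lag_time_max_ms:
            kept.append(broker_id)
    return kept


# Broker 1 is the leader. Broker 2 caught up 2.5 s ago, broker 3 has been
# lagging for 12 s, and the hypothetical limit is 10 s.
new_isr = shrink_isr(
    leader_id=1,
    isr=[1, 2, 3],
    last_caught_up_ms={2: 9_500, 3: 0},
    now_ms=12_000,
    replica_lag_time_max_ms=10_000,
)
print(new_isr)
```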

When a leader has to be elected, the leader and the ISR are determined by the controller. Once the leader has been elected, the ISR is decided by the leader. If the leader and ISR information lived only in ZooKeeper, every broker would have to watch ZooKeeper for leader and ISR changes of every partition it hosts, which is inefficient. If it were not stored in ZooKeeper at all, then after a controller failure the information would have to be collected again from all brokers, which is unreliable given what can go wrong in that process. So the leader and ISR information is stored in ZooKeeper, but when the leader changes, the controller first applies the change in ZooKeeper and then sends a LeaderAndIsrRequest to the affected brokers. This way, all the changed partitions on a broker can be included in a single LeaderAndIsrRequest, so the new information is delivered to the broker in batches, which is more efficient. In addition, when the leader changes the ISR, it first applies the change in ZooKeeper and then updates the ISR in local memory.

1、Last Committed Offset

The position the consumer last committed. It is saved in a special topic: __consumer_offsets.

2、Current Position

The position the consumer has currently read up to, which has not yet been committed to the broker. Once committed, it becomes the Last Committed Offset.
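
The difference between these two cursors is what causes repeated consumption after a failure. The sketch below models it with a plain Python list standing in for a partition; the class and its method names are hypothetical and only loosely mirror the consumer API.

```python
class ConsumerOffsetsSketch:
    """Toy consumer that tracks its current position and last committed offset."""

    def __init__(self):
        self.committed = 0  # Last Committed Offset: where a restart resumes
        self.position = 0   # Current Position: next offset to read

    def poll(self, log, max_records):
        """Read up to max_records messages and advance the current position."""
        records = log[self.position:self.position + max_records]
        self.position += len(records)
        return records

    def commit(self):
        """Commit the current position."""
        self.committed = self.position

    def restart(self):
        """After a crash or rebalance, reading resumes from the committed offset."""
        self.position = self.committed


log = ["m0", "m1", "m2", "m3", "m4"]
consumer = ConsumerOffsetsSketch()

first = consumer.poll(log, 2)   # reads m0, m1
consumer.commit()
second = consumer.poll(log, 2)  # reads m2, m3, but the commit never happens
consumer.restart()
again = consumer.poll(log, 2)   # m2, m3 are delivered a second time
print(first, second, again)
```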

3. High Watermark (HW)

This offset is the minimum LEO across all replicas in the ISR of this partition. The consumer cannot read messages beyond the HW, because that would mean reading messages that are not yet fully replicated (and therefore not fully backed up). In other words, the HW marks the messages that all nodes in the ISR have copied. It is also the maximum offset that consumers can fetch (note that not all replicas must have these messages, only those in the ISR).

The HW changes continuously as the followers' fetch progress changes. A follower always asks the leader for data starting at the offset following the messages it already has. Therefore, when a follower sends a fetch request for offset A, the leader knows that the follower's log end offset is at least A. At this point the leader can check whether the LEO of every replica in the ISR has moved past the HW, and if so, it advances the HW. When the leader reads its local log to answer the follower, it also attaches its own HW to the response. In this way the follower learns the leader's HW (in the implementation, however, the follower only gets the HW as it was when the leader's local log was read, which is not guaranteed to be the latest HW). The HW of the leader and the follower are therefore not synchronized, and the HW recorded by the follower may lag behind the leader's.
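A minimal sketch of the HW rule described above, assuming a simplified model in which the leader keeps one known LEO per replica. The function name advance_high_watermark and the sample offsets are hypothetical, and the rule that the HW never moves backwards is a modelling assumption of this sketch.

```python
def advance_high_watermark(current_hw, leo_by_replica, isr):
    """Advance the HW to the minimum LEO among ISR members (never backwards)."""
    candidate = min(leo_by_replica[replica] for replica in isr)
    return max(current_hw, candidate)


# broker id -> log end offset known to the leader (broker 1 is the leader)
leo_by_replica = {1: 10, 2: 8, 3: 5}

# With all three replicas in the ISR, the slowest one (broker 3) bounds the HW.
hw = advance_high_watermark(5, leo_by_replica, isr=[1, 2, 3])

# If broker 3 is removed from the ISR, the HW can move up to broker 2's LEO.
hw_after_shrink = advance_high_watermark(hw, leo_by_replica, isr=[1, 2])
print(hw, hw_after_shrink)
```

The example shows why a slow follower holds back what consumers can read until it either catches up or is dropped from the ISR.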

High Watermark Checkpoint

Because the HW changes all the time, writing every update to ZooKeeper would be inefficient. Yet the HW is important enough that it has to be persisted. The ReplicaManager therefore starts a separate thread that periodically writes the HW values of all partitions to a file, which is the high watermark checkpoint.

4. Log End Offset (LEO)

This one is easy to understand: it is the position up to which the log has currently been written (or synchronized).

IV. Kafka client

Kafka supports JVM languages (Java, Scala), and also provides a high-performance C/C++ client, librdkafka, with clients for various languages built on top of it. For example, the Python client confluent-kafka-python. There is also a pure Python implementation: kafka-python.

The following is a Python example (using confluent-kafka-python):


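from confluent_kafka import Producer
# Broker names and the topic are placeholders.
p = Producer({'bootstrap.servers': 'mybroker,mybroker2'})
some_data_source = ['message 1', 'message 2']  # placeholder data
for data in some_data_source:
    p.produce('mytopic', data.encode('utf-8'))
p.flush()  # wait until all queued messages have been delivered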


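from confluent_kafka import Consumer, KafkaError
c = Consumer({'bootstrap.servers': 'mybroker', 'group.id': 'mygroup',
              'auto.offset.reset': 'earliest'})
c.subscribe(['mytopic'])  # nothing is delivered without a subscription
running = True
while running:
    msg = c.poll(1.0)  # wait up to 1 second; returns None on timeout
    if msg is None:
        continue
    if not msg.error():
        print('Received message: %s' % msg.value().decode('utf-8'))
    elif msg.error().code() != KafkaError._PARTITION_EOF:
        running = False  # stop on any error other than end of partition
c.close()  # leave the consumer group cleanly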

It is basically the same as ordinary message queuing.

V. offset management of Kafka

Kafka reads messages based on the offset. If the offset is wrong, messages may be read repeatedly or unread messages may be skipped. Before 0.8.2, Kafka saved offsets in ZooKeeper, but ZK write operations are expensive and do not scale linearly, so frequent writes to ZK become a performance bottleneck. Offset management was therefore introduced in 0.8.2, and offsets are saved in a compacted Kafka topic (__consumer_offsets). The consumer commits offsets to a designated broker (the offset manager) by sending an OffsetCommitRequest. This request contains a list of partitions and their consumption positions (offsets). The offset manager appends key-value messages to the designated topic (__consumer_offsets). The key is composed of consumer group, topic and partition, and the value is the offset. To improve performance, the most recent records are also kept in memory, so that an OffsetFetchRequest for a given key can be answered quickly without scanning the whole offsets topic log. If the offset manager fails for some reason, a new broker becomes the offset manager and rebuilds the offset cache by scanning the offsets topic.
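The commit, fetch and recovery flow described above can be modelled in a few lines of Python. This is a toy model under simplifying assumptions, not the broker code: the class OffsetManagerModel and its methods are hypothetical, and a plain list stands in for the compacted offsets topic.

```python
class OffsetManagerModel:
    """Toy model of an offset manager: an append-only log plus an in-memory cache."""

    def __init__(self):
        self.log = []    # stands in for the offsets topic
        self.cache = {}  # (group, topic, partition) -> latest committed offset

    def commit(self, group, topic, partition, offset):
        key = (group, topic, partition)
        self.log.append((key, offset))  # durable record
        self.cache[key] = offset        # fast lookup for later fetches

    def fetch(self, group, topic, partition):
        return self.cache.get((group, topic, partition))

    @classmethod
    def recover(cls, log):
        """Rebuild the cache on a new offset manager by replaying the log."""
        manager = cls()
        for (group, topic, partition), offset in log:
            manager.commit(group, topic, partition, offset)
        return manager


old = OffsetManagerModel()
old.commit('mygroup', 'mytopic', 0, 5)
old.commit('mygroup', 'mytopic', 0, 9)

# The old manager fails; a new one replays the log and serves the latest value.
new = OffsetManagerModel.recover(old.log)
print(new.fetch('mygroup', 'mytopic', 0))
```

Because later records for the same key overwrite earlier ones during replay, only the latest offset per key matters, which is also why a compacted topic is a natural fit here.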

How to view consumption offset

Kafka versions before 0.9 provide the kafka-consumer-offset-checker.sh script, which shows the consumption offsets of a consumer group for one or more topics. The script calls kafka.tools.ConsumerOffsetChecker. From version 0.9 on this script is no longer recommended; use the kafka-consumer-groups.sh script instead, which calls kafka.admin.ConsumerGroupCommand. That script manages consumer groups in general and does more than show their offsets. Only the newer kafka-consumer-groups.sh script is described here.

With the ConsumerGroupCommand tool, we can list, describe, or delete consumer groups.

For example, to list all consumer groups, use the --list parameter:

$ bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --list

To view the current consumption offsets of a consumer group, use the --describe parameter:

$ bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --group test-consumer-group
GROUP                          TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
test-consumer-group            test-foo                       0          1               3               2               consumer-1_/
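In the output above, LAG is simply LOG-END-OFFSET minus CURRENT-OFFSET. The following sketch parses one data row of that output and checks the relationship; parse_describe_row is a hypothetical helper, and the column layout is assumed to match the sample shown here (it differs between Kafka versions).

```python
def parse_describe_row(row):
    """Split one data row of the --describe output into named fields."""
    group, topic, partition, current, log_end, lag, owner = row.split()
    return {
        'group': group,
        'topic': topic,
        'partition': int(partition),
        'current_offset': int(current),
        'log_end_offset': int(log_end),
        'lag': int(lag),
        'owner': owner,
    }


row = parse_describe_row(
    'test-consumer-group  test-foo  0  1  3  2  consumer-1_/')

# The reported lag is the distance between the log end and the committed offset.
computed_lag = row['log_end_offset'] - row['current_offset']
print(computed_lag, row['lag'])
```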


The delete operation of this script only supports consumer groups that do not contain any active consumers, and only groups belonging to old-version consumers can be deleted (that is, groups whose metadata is stored in ZooKeeper). This is because the delete operation essentially removes the corresponding consumer group nodes and their children in ZK.

How to manage consumption offset

The above describes how to query Kafka consumption offset by script tool. In fact, we can also query consumption offset through API.

The Kafka consumer API provides two methods for querying consumer consumption offsets:

  1. committed(TopicPartition partition): this method returns an OffsetAndMetadata object, from which you can get the committed offset of the specified partition.
  2. position(TopicPartition partition): this method returns the position of the next record to be fetched.

Besides viewing the consumption offset, we sometimes need to set the offset manually, for example to skip some messages or to reprocess some messages. Before 0.8.2, offsets were stored in ZK and could be manipulated with a ZK client such as zkCli. Since 0.8.2, however, offsets are stored in Kafka's __consumer_offsets topic by default and can only be modified through the API:

Class KafkaConsumer<K,V>: Kafka allows specifying the position using seek(TopicPartition, long) to specify the new position. Special methods for seeking to the earliest and latest offset the server maintains are also available (seekToBeginning(TopicPartition…) and seekToEnd(TopicPartition…) respectively).

Reference: Kafka consumer offset management

The Kafka consumer API provides the following methods to reset consumption offsets:

  1. seek(TopicPartition partition, long offset): resets the consumption starting position to the specified offset.
  2. seekToBeginning(): consumption starts from the beginning of the partition, which matches the offset reset strategy that starts from the earliest available offset.
  3. seekToEnd(): consumption starts from the position after the latest message, that is, fetching only begins once new messages are written. This matches the offset reset strategy that starts from the latest offset.

Of course, seek() requires that you know the offset to reset to. One way to find it is to look up the offset that corresponds to a timestamp and then seek to it.
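The timestamp lookup mentioned above can be pictured as a search for the first message whose timestamp is not earlier than the target. The sketch below models this with an in-memory list; offset_for_timestamp and the sample records are hypothetical. With a real cluster the lookup would go through the client API instead (the Java consumer, for example, offers offsetsForTimes), followed by seek().

```python
from bisect import bisect_left


def offset_for_timestamp(records, target_ts):
    """Return the first offset whose timestamp is >= target_ts, or None.

    records is a list of (offset, timestamp_ms) pairs sorted by timestamp.
    """
    timestamps = [ts for _, ts in records]
    index = bisect_left(timestamps, target_ts)
    return records[index][0] if index < len(records) else None


# (offset, timestamp in ms) for one partition; values are made up.
records = [(0, 1000), (1, 1500), (2, 1500), (3, 2200)]

start = offset_for_timestamp(records, 1600)
print(start)  # the consumer would then seek to this offset
```

If no message is new enough, the function returns None, in which case a caller might fall back to seeking to the end.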

Deployment and configuration

Kafka is written in Scala, so it runs easily wherever a JRE is installed. Download the official pre-built package, unpack it, configure it, and it is ready to run.

I. Kafka configuration

The configuration file is server.properties in the config directory. The key settings are as follows (some properties are not present in the file by default and need to be added by yourself):

broker.id: each machine in the Kafka cluster (called a broker) needs a unique ID
port: listening port
delete.topic.enable: if set to true, topics can be deleted; otherwise deletion is not allowed
message.max.bytes: the maximum message size allowed, 1000012 (about 1 MB) by default. It is recommended to raise it to 10000012 (about 10 MB).
replica.fetch.max.bytes: same purpose as above. The default value is 1048576. It is recommended to set it to 10048576.
log.dirs: the directory where Kafka data files are stored. Note that these are not application log files. It can be set to, for example, /home/work/kafka/data/kafka-logs
log.cleanup.policy: the policy for clearing expired data. The default is delete; it can also be set to compact
log.retention.hours: data retention time in hours. The default is 168, i.e. one week. Expired data is cleared according to log.cleanup.policy. Use log.retention.minutes to configure retention at minute granularity.
log.segment.bytes: the size at which data files are split into segments. The default is 1073741824 (1 GB).
log.retention.check.interval.ms: the interval at which the cleaning thread checks whether data has expired, in ms. The default is 300000, i.e. 5 minutes.
zookeeper.connect: the ZooKeeper cluster that manages Kafka, given as host:port pairs separated by commas

Tip: sending and receiving large messages

The following parameters need to be modified:

  • broker: message.max.bytes and replica.fetch.max.bytes
  • consumer: fetch.message.max.bytes
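These three parameters have to be raised together: a message that the broker accepts must also fit into a replica fetch and a consumer fetch. A minimal consistency check might look like this, assuming the settings have already been loaded into plain dictionaries; check_large_message_config and the sample values are hypothetical.

```python
def check_large_message_config(broker, consumer):
    """Return a list of problems with the large-message settings (empty if none)."""
    problems = []
    max_message = broker['message.max.bytes']
    if broker['replica.fetch.max.bytes'] < max_message:
        problems.append(
            'replica.fetch.max.bytes is smaller than message.max.bytes')
    if consumer['fetch.message.max.bytes'] < max_message:
        problems.append(
            'fetch.message.max.bytes is smaller than message.max.bytes')
    return problems


broker = {'message.max.bytes': 10000012, 'replica.fetch.max.bytes': 10048576}
consumer = {'fetch.message.max.bytes': 1048576}  # forgot to raise this one

problems = check_large_message_config(broker, consumer)
print(problems)
```

In this example the broker side is consistent, but the consumer would be unable to fetch the largest messages the broker accepts.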

For more details on these parameters, see the official Kafka documentation.


II. ZK configuration and startup

Next, make sure that ZK is configured and started correctly. Kafka ships with its own ZK service. Its configuration file is config/zookeeper.properties. The key configuration is as follows:


Note: ZooKeeper cluster deployment

Two things need to be done for a ZK cluster deployment:

  1. Assign a server ID: create a myid file in the dataDir directory. The file contains only a number from 1 to 255, which is the server ID of this ZK node.
  2. Configure the cluster: the format is server.{ID}={host}:{port}:{port}, where {ID} is the server ID mentioned above.
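The two steps above can be sketched as a small generator. The function zk_cluster_config and the host names zk1 to zk3 are hypothetical; the ports 2888 and 3888 are the conventional ZooKeeper peer and election ports and are assumed here, not taken from the original setup.

```python
def zk_cluster_config(hosts, peer_port=2888, election_port=3888):
    """Build the server.N lines and the myid file content for each host."""
    lines = []
    myids = {}
    for server_id, host in enumerate(hosts, start=1):
        lines.append('server.%d=%s:%d:%d'
                     % (server_id, host, peer_port, election_port))
        myids[host] = str(server_id)  # content of dataDir/myid on that host
    return lines, myids


lines, myids = zk_cluster_config(['zk1', 'zk2', 'zk3'])
print('\n'.join(lines))
print(myids)
```

The same server.N lines go into the configuration file on every node, while each node gets its own myid file.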

Then start:

bin/zookeeper-server-start.sh -daemon config/zookeeper.properties

III. Start Kafka

Then you can start Kafka, which is very simple:

JMX_PORT=8999 bin/kafka-server-start.sh -daemon config/server.properties

We added the JMX_PORT=8999 environment variable to the startup command. It exposes the JMX metrics and makes monitoring easier.

Kafka monitoring and management

Unlike RabbitMQ or ActiveMQ, Kafka does not come with a web management interface by default, only command-line tools, which is not very convenient. However, you can install one, such as Yahoo's Kafka Manager: a tool for managing Apache Kafka. It supports many functions:

  • Manage multiple clusters
  • Easy inspection of cluster state (topics, consumers, offsets, brokers, replica distribution, partition distribution)
  • Run preferred replica election
  • Generate partition assignments with option to select brokers to use
  • Run reassignment of partition (based on generated assignments)
  • Create a topic with optional topic configs ( has different configs than 0.8.2+)
  • Delete topic (only supported on 0.8.2+ and remember set delete.topic.enable=true in broker config)
  • Topic list now indicates topics marked for deletion (only supported on 0.8.2+)
  • Batch generate partition assignments for multiple topics with option to select brokers to use
  • Batch run reassignment of partition for multiple topics
  • Add partitions to existing topic
  • Update config for existing topic
  • Optionally enable JMX polling for broker level and topic level metrics.
  • Optionally filter out consumers that do not have ids/ owners/ & offsets/ directories in zookeeper.

The installation process is quite simple, but it downloads a lot of dependencies and therefore takes a long time. For details, see: Kafka Manager installation. Note that none of these management platforms has a permission management function.

It should be noted that kafka-manager.zkhosts in Kafka Manager's conf/application.conf file is used for Kafka Manager's own high availability; it does not point to the ZK hosts of the Kafka cluster to be managed. So don't forget to manually add the Kafka clusters to be managed (mainly a name and the ZK address). Reference: Install and Evaluation of Yahoo's Kafka Manager.

Kafka Manager mainly provides a management interface; monitoring depends on other applications, such as:

  1. Burrow: Kafka consumer lag checking. LinkedIn's open-source consumer lag monitor, written in Go. It appears to have no UI, only an HTTP API, and can be configured to send email alerts.
  2. Kafka Offset Monitor: A little app to monitor the progress of kafka consumers and their lag wrt the queue.

The purpose of both applications is to monitor the offset of Kafka.

Delete a topic

There are two ways to delete a Kafka topic:

1. Manually delete the topic's partition folders under the ${log.dir} directory of each node, then log in with a ZK client and delete the nodes that correspond to the topic. The topic metadata is stored under the /brokers/topics and /config/topics nodes.

2. Run the kafka-topics.sh script. To delete the topic completely through the script, the server.properties file loaded when Kafka starts must contain "delete.topic.enable=true"; the default is false. Otherwise the script does not actually delete the topic; it only creates a node named after the topic under /admin/delete_topics in ZK, which marks the topic for deletion.

bin/kafka-topics.sh --delete --zookeeper server-1:2181,server-2:2181 --topic test

Execution result:

Topic test is marked for deletion.
Note: This will have no impact if delete.topic.enable is not set to true.

In that case, to delete the topic completely you still have to remove the corresponding files and ZK nodes by hand. When the configuration item is true, all directories and metadata that belong to the topic are deleted.

Automatic clearing of expired data

Traditional message queues usually delete messages once they have been consumed, while a Kafka cluster retains all messages whether or not they have been consumed. Of course, because of disk limits it is impossible (and in practice unnecessary) to keep all data forever, so Kafka provides two strategies for deleting old data: one based on time, the other based on partition size. You can configure $KAFKA_HOME/config/server.properties to let Kafka delete data older than a week, or to delete old data when a partition exceeds 1 GB:

############################# Log Retention Policy #############################
# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.
# The minimum age of a log file to be eligible for deletion
# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes.
# The maximum size of a log segment file. When this size is reached a new log segment will be created.
# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
# By default the log cleaner is disabled and the log retention policy will default to
# just delete segments after their retention expires.
# If log.cleaner.enable=true is set the cleaner will be enabled and individual logs
# can then be marked for log compaction.
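The two retention rules in the comments above can be modelled as follows. This is a simplified sketch, not the broker's algorithm: segments_to_delete and the sample segments are hypothetical, the newest segment is assumed to be the active one and is never deleted, and deletion is assumed to proceed strictly from the oldest segment.

```python
def segments_to_delete(segments, now_ms, retention_ms=None, retention_bytes=None):
    """Return the names of segments eligible for deletion, oldest first.

    segments is a list of (name, last_modified_ms, size_bytes), oldest first.
    """
    deletable = []
    total = sum(size for _, _, size in segments)
    for name, modified, size in segments[:-1]:  # keep the active segment
        expired = retention_ms is not None and now_ms - modified > retention_ms
        # Size rule: only prune while the remainder stays at or above the limit.
        oversize = retention_bytes is not None and total - size >= retention_bytes
        if not (expired or oversize):
            break
        deletable.append(name)
        total -= size
    return deletable


segments = [('seg-a', 0, 100), ('seg-b', 50, 100), ('seg-c', 90, 100)]

by_time = segments_to_delete(segments, now_ms=100, retention_ms=60)
by_size = segments_to_delete(segments, now_ms=100, retention_bytes=150)
print(by_time, by_size)
```

A segment is removed as soon as either rule applies, which mirrors the "either of these criteria" wording in the configuration comments.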

It should be noted that the time complexity of reading a specific message in Kafka is O(1), i.e. independent of file size, so deleting files here has no bearing on Kafka's performance. The choice of deletion strategy depends only on the available disk space and on your specific requirements.

Some problems of Kafka

1. Messages are only ordered within a single partition of a topic; ordering across all partitions of a topic is not guaranteed. Kafka may not be appropriate if the application strictly requires globally ordered messages.
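A common way to work within this limit is to give related messages the same key, so that they land in the same partition and keep their relative order. The sketch below shows the idea with a CRC32 hash; pick_partition is a hypothetical helper, and real Kafka clients use their own hash functions, so the actual partition numbers will differ.

```python
import zlib


def pick_partition(key, num_partitions):
    """Map a key to a partition deterministically (illustrative hash only)."""
    return zlib.crc32(key.encode('utf-8')) % num_partitions


# All events of one user map to one partition, so their order is preserved.
first = pick_partition('user-42', 6)
second = pick_partition('user-42', 6)
print(first == second)
```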

2. The consumption offset is tracked and committed by the consumer, but the consumer does not write this offset to Kafka very frequently, because maintaining these updates is costly for the broker. As a result, under abnormal conditions a message may be consumed more than once or not at all.

The specific analysis is as follows: a message may already have been processed, but the consumer has not yet confirmed this by committing the offset to the broker. If another consumer then takes over the same partition, it starts from the last committed offset, so some messages are consumed again. Conversely, if the consumer commits the offset before processing a batch of messages and then crashes while processing them, those messages are effectively "lost". In general, it is difficult to make processing a message and committing its offset one atomic operation, so it cannot always be guaranteed that every message is processed exactly once.
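The two failure cases can be reproduced with a tiny simulation. It assumes a consumer that handles one offset at a time and crashes between its two steps; run_until_crash is a hypothetical function, and real consumers usually commit in batches rather than per message.

```python
def run_until_crash(start, crash_at, commit_first):
    """Simulate a consumer that crashes between its two steps at crash_at.

    Returns (offsets processed, committed offset left behind).
    """
    processed = []
    committed = start
    offset = start
    while True:
        if commit_first:
            committed = offset + 1
            if offset == crash_at:
                return processed, committed  # committed but never processed
            processed.append(offset)
        else:
            processed.append(offset)
            if offset == crash_at:
                return processed, committed  # processed but never committed
            committed = offset + 1
        offset += 1


# Process first, commit afterwards: offset 2 will be processed again.
done_a, resume_a = run_until_crash(start=0, crash_at=2, commit_first=False)

# Commit first, process afterwards: offset 2 is skipped by the next consumer.
done_b, resume_b = run_until_crash(start=0, crash_at=2, commit_first=True)
print(done_a, resume_a, done_b, resume_b)
```

In the first case the next consumer resumes at offset 2, which was already processed (a duplicate); in the second it resumes at offset 3, although offset 2 was never processed (a loss).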

3. Limited number of topics and partitions

The number of topics that a Kafka cluster can handle is limited. At about 1000 topics, performance begins to decline. These problems are basically consequences of Kafka's fundamental implementation decisions. In particular, as the number of topics increases, the amount of random IO on the broker increases dramatically, because writes to each topic partition are in effect separate file append operations. As the number of partitions increases, the problem becomes more and more serious. Unless Kafka takes over IO scheduling, the problem is difficult to solve.

Of course, typical applications do not have such a large number of topics and partitions. However, if a single Kafka cluster is used as a multi-tenant resource, this problem will surface.

4. Manually balance the partition load

Kafka's model is very simple. A topic partition is stored entirely on one broker, with possibly several other brokers holding replicas of it. A single partition is never split across multiple machines. As the number of partitions grows, some unlucky machines in the cluster end up with several large partitions. Kafka has no mechanism for migrating these partitions automatically, so you have to do it yourself. Monitoring disk space, diagnosing which partition is causing the problem, and then finding a suitable place to migrate the partition to are all manual management tasks that cannot be ignored in a Kafka cluster environment.

If the cluster scale is small and the space required for data is small, this management method is barely effective. However, if the traffic increases rapidly or there is no first-class system administrator, then the situation is completely out of control.

Note: if you add new nodes to the cluster, you must also manually migrate the data to these new nodes. Kafka will not automatically migrate the partition to balance the load or storage space.
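Manual migration is typically driven by a JSON plan that lists the desired replicas for each partition and is then passed to Kafka's partition reassignment tool. The sketch below only builds such a plan; reassignment_plan is a hypothetical helper, the topic name and broker ids are made up, and the exact tool options should be checked against the documentation of your Kafka version.

```python
import json


def reassignment_plan(topic, assignments):
    """Build reassignment JSON from {partition: [broker ids]}."""
    return json.dumps({
        'version': 1,
        'partitions': [
            {'topic': topic, 'partition': partition, 'replicas': replicas}
            for partition, replicas in sorted(assignments.items())
        ],
    }, indent=2)


# Move one replica of each partition onto the newly added broker 4.
plan = reassignment_plan('mytopic', {0: [4, 1], 1: [2, 4]})
print(plan)
```

Generating the plan from a script makes it easier to review the intended moves before any data is actually migrated.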

5. Follower replicas only serve as a cold standby (to solve the HA problem) and cannot serve reads

In ES, replica shards also serve reads to relieve the read pressure on the primary. In Kafka, because the read service is stateful (the committed offset has to be maintained), follower replicas do not take part in serving reads or writes. They act only as a cold standby to solve the single-point-of-failure problem.

6. Messages can only be consumed in sequence and cannot be looked up randomly, so when something goes wrong it is hard to locate the problem quickly

This is in fact a common problem of all message systems used as asynchronous RPC. Suppose the sender has sent a message but the consumer says it never received it: how do you check? Message queues lack a mechanism for random access to messages, such as fetching a message by its key, which makes this kind of problem hard to troubleshoot.

Recommended reading

  1. Centralized Logging Solutions Overview
  2. Logging and Aggregation at Quora
  3. Elk application in advertising system monitoring and its introduction to elastic search
  4. Centralized Logging
  5. Centralized Logging Architecture 


Note: for reprints, please contact WeChat: labs2020.