Week 17, 2018 – RabbitMQ patterns and Kafka design


A way to understand concepts

As I said before, the core of learning something new is mastering its concepts. How do you grasp a concept? One of my ways is to compare two similar, easily confused concepts, so that I can understand both more quickly.

RabbitMQ patterns

RabbitMQ has the following patterns:
1. Work queues
Both sending and receiving go through a queue. For time-consuming work, we put each task into the queue, and each worker takes tasks from it and processes them; for this reason work queues are also called task queues. In this way the resource-intensive task is decoupled from the application that generates it.
The main feature of this pattern is that each task is delivered to only one worker.
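The "each task goes to exactly one worker" behaviour can be sketched in a few lines of plain Python (the names here are illustrative, not the RabbitMQ client API):

```python
import queue

# One shared queue, several workers: each task is handed to exactly one worker.
tasks = queue.Queue()
for i in range(6):
    tasks.put(f"task-{i}")

# Round-robin dispatch: whichever worker asks next gets the next task.
processed = {"worker-0": [], "worker-1": []}
turn = 0
while not tasks.empty():
    worker = f"worker-{turn % 2}"
    processed[worker].append(tasks.get())
    turn += 1

print(processed["worker-0"])  # each task appears under exactly one worker
```

No task ever appears in two workers' lists, which is the defining property of the pattern.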


2. Publish / subscribe
This publish/subscribe pattern is similar to the observer pattern, but they are not the same thing; see the section below on the difference between publish/subscribe and observer.
Here RabbitMQ introduces the concept of an exchange. The producer does not interact with the queue directly, but through an exchange (via a binding); that is, the producer only ever talks to the exchange. Once exchanges are introduced, the message middleware can play more tricks, and publish/subscribe is one of them. The exchange used here is the fanout exchange.
The main feature of this pattern is that it behaves like a broadcast: the same message can be sent to different queues, and the fanout exchange does not care which queues they are. As long as a queue is bound to the fanout exchange, the message is delivered to every bound queue.
The difference from the work-queue pattern is that what publish/subscribe delivers is called a message rather than a task, so the same message can be placed into several queues.
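A minimal sketch of the fanout behaviour (queue names are made up for illustration):

```python
# Every queue bound to a fanout exchange receives a copy of every
# message; routing keys are ignored entirely.
bound_queues = {"email_queue": [], "sms_queue": [], "audit_queue": []}

def fanout_publish(message):
    for q in bound_queues.values():
        q.append(message)  # the same message, delivered to each queue

fanout_publish("user-registered")
print(bound_queues["email_queue"])  # ['user-registered'] in every queue
```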


3. Routing
The routing pattern is another trick made possible by the exchange concept. The exchange used here is the direct exchange.
This pattern adds two new concepts: the binding key and the routing key. The binding key belongs to a queue: it is specified when the queue is bound to the direct exchange. The routing key belongs to a message: it is specified when the message is sent to the direct exchange. When a message's routing key matches a binding key (that is, the values are equal), the direct exchange delivers the message to the queue bound with that key.
The main feature of this pattern is that it controls messages more precisely, specifying which messages go to which queues.
The difference from the publish/subscribe pattern is that publish/subscribe is a broadcast, sending messages to every queue bound to the exchange, so it has no ability to select messages. The routing pattern requires the routing key to match a binding key before a message is delivered, so it can select messages.
Like publish/subscribe, the same message can still be delivered to several queues.
Note: a queue can be bound with multiple binding keys.
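Direct-exchange routing can be sketched as an exact-match lookup (queue and key names are hypothetical):

```python
# A direct exchange delivers a message to every queue whose binding
# key equals the message's routing key.
bindings = {"q_error": ["error"], "q_all": ["info", "warning", "error"]}
queues = {name: [] for name in bindings}

def direct_publish(routing_key, message):
    for name, keys in bindings.items():
        if routing_key in keys:  # exact equality, unlike topic exchanges
            queues[name].append(message)

direct_publish("error", "disk full")   # matches both queues
direct_publish("info", "started")      # matches only q_all
```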

4. Topics
The topic pattern is, of course, yet another trick made possible by exchanges. The exchange used here is the topic exchange.
Binding keys and routing keys are used here too, but the routing key can no longer be an arbitrary word: it must be a dot-delimited list of words, such as "stock.usd.nyse", "nyse.vmw", or "quick.orange.rabbit". In this pattern the binding key can be specified more loosely, for example "*.orange.*", "*.*.rabbit", or "lazy.#", where * (asterisk) stands for exactly one word and # (hash) stands for zero or more words. As with routing, whenever a message's routing key matches a binding key (here matching the pattern rather than comparing values), the topic exchange delivers the message to the queue bound with that key.

For example, suppose queue Q1 is bound with "*.orange.*" and Q2 with "*.*.rabbit" and "lazy.#". A message with routing key "quick.orange.rabbit" is delivered to both Q1 and Q2. A routing key of "quick.orange.fox" goes only to Q1. A routing key of "lazy.pink.rabbit" is delivered to Q2 only once, even though it matches two of Q2's bindings, and a routing key of "quick.brown.fox" matches no binding key, so the message is discarded.
Note: a queue can be bound with multiple binding keys.
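The wildcard matching rules above are small enough to implement directly; this sketch reproduces the Q1/Q2 example from the text:

```python
def topic_match(binding_key, routing_key):
    """'*' matches exactly one dot-delimited word; '#' matches zero or more."""
    def match(pat, words):
        if not pat:
            return not words
        head, rest = pat[0], pat[1:]
        if head == "#":
            # '#' may consume zero or more leading words.
            return any(match(rest, words[i:]) for i in range(len(words) + 1))
        if not words:
            return False
        return (head == "*" or head == words[0]) and match(rest, words[1:])
    return match(binding_key.split("."), routing_key.split("."))

bindings = {"Q1": ["*.orange.*"], "Q2": ["*.*.rabbit", "lazy.#"]}

def topic_publish(routing_key):
    # A queue receives the message once, even if several of its
    # binding keys match.
    return sorted(q for q, keys in bindings.items()
                  if any(topic_match(k, routing_key) for k in keys))

print(topic_publish("quick.orange.rabbit"))  # ['Q1', 'Q2']
print(topic_publish("lazy.pink.rabbit"))     # ['Q2'] — delivered once
print(topic_publish("quick.brown.fox"))      # [] — discarded
```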

5. Remote procedure call (RPC)
With RPC, a client can call a function on a remote machine and wait for the server to return the result.

A note on RPC: although RPC is widely used, it has a shortcoming: developers cannot easily tell whether the function they are calling is a fast local function or a slow RPC. This confusion easily leads to an unpredictable system, adds unnecessary complexity, and makes problems hard to locate. Misused, instead of simplifying a program, RPC can produce unmaintainable spaghetti code.
There are three suggestions on this issue:

  • Make it easy to distinguish whether a function call is local or remote.
  • Document your system, clearly recording the dependencies between components.
  • Handle the error cases the network introduces, such as timeouts.

As for whether RPC is necessary at all: when in doubt, prefer an asynchronous pipeline over a blocking RPC.

RabbitMQ can be used to build an RPC system: a client and a scalable RPC server. However, this feature is not very commonly used, so I won't spend much space on it. The general principle is to add message properties that correlate each response with its request.
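The "correlate request and response" idea can be sketched without any broker at all; the property names below mirror common messaging conventions but the queues here are plain lists, not a real RabbitMQ setup:

```python
import uuid

# The client tags each request with a correlation id and a reply queue,
# then matches the server's response against that id.
requests, replies = [], []

def rpc_call(payload):
    corr_id = str(uuid.uuid4())
    requests.append({"correlation_id": corr_id,
                     "reply_to": "client-reply-queue",
                     "body": payload})
    return corr_id

def rpc_server_step():
    req = requests.pop(0)
    # Stand-in for the remote function: uppercase the payload.
    replies.append({"correlation_id": req["correlation_id"],
                    "body": req["body"].upper()})

corr = rpc_call("ping")
rpc_server_step()
response = next(r for r in replies if r["correlation_id"] == corr)
```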

The difference between observer mode and publish / subscribe mode

Observer pattern
The definition of the observer pattern: a one-to-many relationship between objects, such that when one object's state changes, all objects that depend on it are notified.
For instance:

Suppose you are looking for a job as a software engineer and are interested in Banana Company. So you contact their HR and give him your phone number, and he promises to let you know if a vacancy opens up. Several other candidates are just as interested in the company as you are, so everyone will be told about the vacancy, and if you respond to the notice, they will contact you for an interview.
This pattern involves two roles: the observer and the subject being observed. Banana Company is the subject, and you (along with the other candidates) are the observers. When the subject's state changes (a job vacancy opens), the observers are notified, provided they have registered with the subject; that is, Banana Company's HR must have your phone number.

Publish / subscribe mode
The subject in the observer pattern is like a publisher, and the observer can be seen as a subscriber: when a subject notifies its observers, it is like a publisher notifying its subscribers. That is why many books and articles use the publish-subscribe vocabulary to explain the observer design pattern. But there is another popular pattern called the publish-subscribe design pattern, whose concept is very similar to the observer pattern. The biggest difference is:
In the publish-subscribe pattern, the sender of a message is called a publisher, and messages are not sent directly to specific receivers (subscribers).
This means publishers and subscribers do not know of each other's existence. A third-party component, called message middleware, connects subscribers and publishers, filtering and distributing all incoming messages. In other words, the publish/subscribe pattern handles the exchange of information between system components even when those components do not know of each other's existence.

Kafka design


We designed Kafka in the hope that it could become a unified platform for handling all the real-time data streams a large company might have. To do this, we had to consider a fairly broad set of use cases:

  • It needs high throughput to support large-volume event streams, such as real-time log aggregation.
  • It needs to deal gracefully with large data backlogs to support periodic data loads into offline systems.
  • It needs to handle low-latency delivery to support traditional messaging use cases.

We also wanted to support partitioned, distributed, real-time processing of these streams to create new, derived streams and to carry streams between systems. These motivations led to Kafka's partitioning and consumer model.
Finally, since streams may be fed into other data systems that serve external traffic, Kafka needs to guarantee fault tolerance even in the presence of machine failures.

To support all this, we designed a number of unique elements, making Kafka more akin to a database log than to a traditional messaging system.

We’ll outline some of the elements of design in the following sections.


Don’t be afraid of the file system

Kafka relies heavily on the file system to store and cache messages. People assume that "disks are slow", which makes them doubt that a persistent architecture can offer competitive performance. In fact disks are both faster and slower than people expect, depending on how they are used; a properly designed disk layout can often be as fast as the network. (It seems the author's network is very fast.)
The key fact about disk performance is that the throughput of hard drives has been diverging from the latency of disk seeks for the past decade. As a result, the linear write performance of a JBOD configuration of six 7200rpm SATA drives in a RAID-5 array is about 600MB/s, while the random write performance is only about 100KB/s, a difference of more than 6000x. Linear reads and writes are the most predictable of all usage patterns and are heavily optimized by the operating system: modern operating systems provide read-ahead and write-behind techniques that prefetch data in large blocks and group small logical writes into large physical writes. A deeper discussion of this issue can be found in this ACM Queue article; the authors actually found that sequential disk access can in some cases be faster than random memory access.

To compensate for this performance divergence, modern operating systems have become increasingly aggressive about using main memory as disk cache. A modern OS will happily divert all free memory to disk caching, with only a small performance penalty when that memory is reclaimed. All disk reads and writes go through this unified cache, and the feature cannot easily be bypassed without using direct I/O. So even if a process maintains its own in-process cache of the data, the data will likely be duplicated in the OS page cache, effectively storing everything twice.

Furthermore, Kafka is built on top of the JVM, and anyone who has spent time studying Java memory usage knows two things:
1. The memory overhead of objects is very high, often doubling (or more) the size of the data stored.
2. Java garbage collection becomes increasingly fiddly and slow as the heap grows.

As a result of all this, using the file system and page cache is superior to maintaining an in-memory cache or other structure: we at least double the available cache by having automatic access to all free memory, and likely double it again by storing compact byte structures rather than individual objects. Doing so lets us use 28-30GB of cache on a 32GB machine without worrying about GC problems. Moreover, this cache stays warm even if the service is restarted, whereas an in-process cache would need to be rebuilt after a restart (which could take 10 minutes for a 10GB cache) or else start completely cold (which likely means terrible initial performance). This also greatly simplifies the code, because all the logic for maintaining coherence between the cache and the file system is now in the operating system, which tends to do it more efficiently and more correctly than one-off in-process attempts. If your disk usage favors linear reads, read-ahead effectively pre-populates this cache with useful data on each read.

This suggests a very simple design: rather than keeping as much as possible in memory and flushing it all out to the file system in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the file system without necessarily flushing to disk; in effect this just means it is transferred into the kernel's page cache.
This style of page-cache-centric design is described in an article about the design of Varnish.

Constant time suffices

In messaging systems, the persistent data structure is often a per-consumer queue with an associated BTree or other general-purpose random-access data structure to maintain message metadata. BTrees are versatile and can support a wide variety of transactional and non-transactional semantics in a messaging system, but they come at a fairly high cost: BTree operations are O(log N). Normally O(log N) is considered essentially equivalent to constant time, but this is not true for disk operations. Disk seeks come at 10ms a pop, and each disk can do only one seek at a time, so parallelism is limited. Hence even a handful of disk seeks leads to very high overhead. Since storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures is often superlinear as data grows with a fixed cache: doubling your data makes things much worse than twice as slow.
Intuitively, a persistent queue can instead be built on simple reads and appends to files, as is common in logging solutions. This structure has the advantage that all operations are O(1), and reads do not block writes or each other. This is an obvious advantage because performance is completely decoupled from the data size: a server can now take full advantage of large, cheap, low-RPM SATA drives. Though these drives have poor seek performance, they have acceptable performance for large reads and writes, and come at three times the capacity for a third of the price.
Having access to virtually unlimited disk space without any performance penalty means we can provide features not usually found in messaging systems. For example, in Kafka, instead of deleting each message as soon as it is consumed, we can retain messages for a relatively long period (say, a week). This gives consumers a great deal of flexibility.
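The log structure described above can be sketched in a few lines; this is a toy in-memory stand-in for an on-disk segment file, not Kafka's actual storage code:

```python
class PartitionLog:
    """Sketch of a log-structured queue: appends and reads are O(1),
    and reading never blocks writing."""
    def __init__(self):
        self._messages = []

    def append(self, message) -> int:
        self._messages.append(message)   # O(1) amortized append
        return len(self._messages) - 1   # the message's offset

    def read(self, offset, max_messages=100):
        # O(1) positioning: the offset indexes directly into the log.
        return self._messages[offset:offset + max_messages]

log = PartitionLog()
for i in range(5):
    log.append(f"m{i}")
print(log.read(2, 2))  # ['m2', 'm3'] — old messages stay readable
```

Because nothing is deleted on read, a second consumer can call `read(0, …)` later and still see every message, which is exactly the retention property the text describes.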


Efficiency

We have put a lot of effort into efficiency. One of our primary use cases is handling web activity data, which is very high volume: each page view can generate many writes. Furthermore, we assume each published message is read by at least one consumer (often many), so we strive to make consumption as cheap as possible.
We have also found, from building and running a number of similar systems, that efficiency is the key to effective multi-tenant operation.
We discussed disk efficiency in the previous section. Once poor disk access patterns have been eliminated, there are two common causes of inefficiency in this type of system: too many small I/O operations and excessive byte copying.
The small-I/O problem occurs both between client and server and in the server's own persistence operations.
To avoid this, our protocol is built around a "message set" abstraction that naturally groups messages together. This allows network requests to group messages and amortize the overhead of the network round trip, rather than sending a single message at a time. The server in turn appends chunks of messages to its log in one go, and the consumer fetches large linear chunks at a time.
This simple optimization produces orders-of-magnitude speedups. Batching leads to larger network packets, larger sequential disk operations, contiguous memory blocks, and so on, all of which allow Kafka to turn a bursty stream of random message writes into linear writes that flow to the consumers.
The other inefficiency is byte copying. At low message rates this is not a problem, but under load the impact is significant. To avoid it we employ a standardized binary message format that is shared by producers, brokers, and consumers (so data chunks can be transferred between them without modification).
The message log maintained by the broker is itself just a directory of files, each populated by a sequence of message sets that have been written to disk in the same format used by producers and consumers. Maintaining this common format allows optimizing the most important operation: network transfer of persistent log chunks. Modern UNIX operating systems offer a highly optimized code path for transferring data out of the page cache to a socket; in Linux this is done with the sendfile system call.
To understand the impact of sendfile, it is important to understand the common data path for transferring data from a file to a socket:
1. The operating system reads data from the disk into the page cache in kernel space.
2. The application reads the data from kernel space into a user-space buffer.
3. The application writes the data back into kernel space, into a socket buffer.
4. The operating system copies the data from the socket buffer to the NIC buffer, where it is sent over the network.
This is clearly inefficient: four copies and two system calls. Using sendfile avoids this re-copying by allowing the operating system to send the data from the page cache to the network directly. In this optimized path, only the final copy from the page cache to the NIC buffer is needed — zero copy.
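Python exposes the same system call as `os.sendfile`, so the optimized path can be demonstrated directly; here a socketpair stands in for a real network connection, and the file is a stand-in for a log segment:

```python
import os, socket, tempfile

# Bytes move from the page cache straight to the socket; the
# application never copies them through user space.
payload = b"log-segment-bytes"
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

server_side, client_side = socket.socketpair()
with open(path, "rb") as segment:
    # The kernel performs the transfer; we only pass file descriptors.
    sent = os.sendfile(server_side.fileno(), segment.fileno(), 0, len(payload))

received = client_side.recv(64)
server_side.close(); client_side.close(); os.unlink(path)
```

After this runs, `received` holds the segment bytes even though the application never called `read` on the file.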
We expect a common use case to be multiple consumers on a topic. Using the zero-copy optimization above, data is copied into the page cache exactly once and reused on each consumption, instead of being stored in memory and copied out to user space every time it is read. This allows messages to be consumed at a rate that approaches the limit of the network connection.
This combination of page cache and sendfile means that on a Kafka cluster, on machines where the consumers are mostly caught up, you will see no read activity on the disks at all, as they serve data entirely from cache.
For more on sendfile and zero-copy support in Java, see this article.

End to end batch compression

In some cases the real bottleneck is not CPU or disk but network bandwidth. This is especially true for data pipelines that need to send messages between data centers over a wide-area network. Of course, users can compress messages themselves without any support from Kafka, but this can lead to very poor compression ratios, especially when there is a lot of redundancy between messages (such as field names in JSON, or user agents and other common strings in web logs). Efficient compression requires compressing multiple messages together rather than compressing each message independently.
Kafka supports this with an efficient batching format. A batch of messages can be grouped together, compressed, and sent to the server in this form. The batch is written in compressed form, remains compressed in the log, and is only decompressed by the consumer.
Kafka supports the GZIP, Snappy, and LZ4 compression protocols. More details about compression can be found here.

The producer

Load balancing

The producer sends data directly to the broker that is the leader for the partition, without any intermediate routing layer. To help the producer do this, all Kafka nodes can answer a request for metadata about which servers are alive and where the leaders for a topic's partitions are at any given time, allowing the producer to direct its requests appropriately.
The client controls which partition it publishes messages to. This can be done at random, implementing a kind of random load balancing, or by some semantic partitioning function. We expose an interface for semantic partitioning that lets the user specify a key to partition by; this key is hashed to a partition (the partitioning function can be overridden if needed). For example, if the key chosen were a user ID, then all data for a given user would be sent to the same partition. This in turn allows consumers to make locality assumptions about their consumption; this style of partitioning is explicitly designed to allow locality-sensitive processing in consumers.
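The key-to-partition mapping is just a hash modulo the partition count. This sketch uses `crc32` purely for illustration (the real Kafka client uses a different hash), but the property it demonstrates — the same key always lands on the same partition — is the one the text relies on:

```python
from zlib import crc32

NUM_PARTITIONS = 4  # illustrative partition count

def partition_for(key: str) -> int:
    # Hash the record key, then map into the partition range.
    return crc32(key.encode()) % NUM_PARTITIONS

# The same key always maps to the same partition, so all of one
# user's records stay together.
p1 = partition_for("user-42")
p2 = partition_for("user-42")
```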

Asynchronous send

Batching is one of the big drivers of efficiency. To enable batching, the Kafka producer tries to accumulate data in memory and then send larger batches in a single request. Batching can be configured to accumulate no more than a fixed amount of data and to wait no longer than some fixed latency bound (say, 64K or 10ms). This allows more bytes to accumulate to be sent out, so that only a few large I/O operations hit the servers. This buffering is configurable and provides a mechanism to trade a small amount of additional latency for better throughput.
Details on configuration and the producer API can be found elsewhere in the documentation.
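The "flush on size or on delay, whichever comes first" policy can be sketched as follows; the limits and class name are illustrative, not the actual producer configuration:

```python
import time

class BatchingProducer:
    """Sketch of producer-side batching: flush when the buffer reaches
    a byte limit or when a linger delay has expired."""
    def __init__(self, max_bytes=64 * 1024, linger_s=0.010):
        self.max_bytes = max_bytes
        self.linger_s = linger_s
        self.buffer, self.buffered_bytes = [], 0
        self.first_append = None
        self.sent_batches = []   # each entry is one large "I/O operation"

    def send(self, payload: bytes):
        if self.first_append is None:
            self.first_append = time.monotonic()
        self.buffer.append(payload)
        self.buffered_bytes += len(payload)
        if (self.buffered_bytes >= self.max_bytes or
                time.monotonic() - self.first_append >= self.linger_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sent_batches.append(self.buffer)
            self.buffer, self.buffered_bytes = [], 0
            self.first_append = None

# 30 messages of 10 bytes, with a 100-byte threshold: three batches of 10.
producer = BatchingProducer(max_bytes=100, linger_s=10.0)
for _ in range(30):
    producer.send(b"0123456789")
producer.flush()
```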

The consumer

Kafka consumers work by issuing "fetch" requests to the leaders of the partitions they want to consume. In each request the consumer specifies its offset in the log, and it receives back a chunk of log beginning at that offset. The consumer thus has significant control over this position and can rewind it to re-consume data if need be.

Push and pull

Our first question was whether consumers should pull messages from brokers or brokers should push messages to consumers. In this respect Kafka follows the more traditional design, shared by most messaging systems, where data is pushed from the producer to the broker and pulled from the broker by the consumer. Some log-centralization systems, such as Scribe and Apache Flume, follow a very different push-based path where data is pushed downstream. Both approaches have pros and cons. In a push-based system, the broker controls the rate of data transfer, but different consumers may want different rates. The goal is generally for the consumer to consume at the maximum possible rate, but in a push-based system the broker does not know what to do when the consumption rate falls below the production rate (essentially this becomes a denial-of-service attack). A pull-based system has the nicer property that the consumer simply controls its own rate, falling behind and catching up when it can.

Another advantage of a pull-based system is that it lends itself to aggressive batching of data sent to the consumer. A push-based system must choose between sending each request immediately and accumulating more data before sending, without knowing whether the downstream consumer can process it immediately. If tuned for low latency, this results in sending a single message at a time only for it to end up buffered anyway, which is wasteful. A pull-based design fixes this, since the consumer always pulls all available messages after its current position in the log (up to some configurable maximum size). So one gets optimal batching without introducing unnecessary latency.

The drawback of a naive pull-based system is that if the broker has no data, the consumer may end up polling in a tight loop. To avoid this, we provide parameters on the pull request that allow the consumer to block in a "long poll" until data arrives (and optionally until a given number of bytes is available, to ensure large transfer sizes).
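The long-poll idea can be sketched with a blocking queue standing in for the broker's partition; the function names are illustrative:

```python
import queue, threading, time

# Long polling: the fetch blocks until data arrives or a timeout
# expires, so an idle consumer does not spin in a tight loop.
partition = queue.Queue()

def long_poll_fetch(timeout_s=1.0):
    try:
        return partition.get(timeout=timeout_s)  # blocks, no busy loop
    except queue.Empty:
        return None

def produce_later():
    time.sleep(0.05)
    partition.put("new-record")

t = threading.Thread(target=produce_later)
t.start()
msg = long_poll_fetch(timeout_s=1.0)  # wakes as soon as data arrives
t.join()
```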

You could imagine other possible designs, such as pull end-to-end: the producer would write to a local log, brokers would pull from those logs, and consumers would pull from the brokers. A similar type of "store-and-forward" producer is often proposed. This is intriguing, but we felt it was not suitable for our target use cases, which involve thousands of producers. Our experience running persistent data systems at scale made us feel that involving thousands of disks across many applications would not actually make things more reliable and would be a nightmare to operate. In practice we have found that we can run pipelines with strong SLAs at large scale without requiring producer persistence.

Consumer position

Keeping track of what has been consumed is, surprisingly, one of the key performance points of a messaging system.
Many messaging systems keep metadata about which messages have been consumed on the broker: when a message is handed out to a consumer, the broker either records that fact locally immediately or waits for acknowledgement from the consumer. This is a fairly intuitive choice, and for a single-machine server it is otherwise unclear where this state could go. Since the data structures used for storage in many messaging systems scale poorly, this is also a pragmatic choice: because the broker knows what has been consumed, it can delete it immediately and keep the data size small.
What is less obvious is that getting the broker and consumer to agree on what has been consumed is no small problem. If the broker records a message as consumed as soon as it goes out over the network, then if the consumer fails to process it (because the consumer crashes, the request times out, and so on), the message is lost. To solve this, many messaging systems add an acknowledgement mechanism: when a message goes out it is marked only as sent, not consumed; the broker waits for a specific acknowledgement from the consumer before marking it consumed. Although this strategy fixes the message-loss problem, it creates new ones. First, if the consumer processes the message but crashes before sending the acknowledgement, the message is processed twice. Second, there is the performance cost: the broker must keep multiple states for every message (first locking it so it is not handed out a second time, then marking it consumed so it can be deleted). Tricky problems remain, such as what to do with messages that are sent but never acknowledged.

Kafka handles this differently. Our topic is divided into a set of ordered partitions, each of which is consumed by exactly one consumer within each subscribing consumer group at any given time. This means the position of a consumer in each partition is just a single integer: the offset of the next message to consume. This makes the state about what has been consumed very small, just one number per partition, and this state can be checkpointed periodically. The equivalent of message acknowledgement therefore becomes very cheap.
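"One integer per partition" is easy to see in a sketch; the topic name and batch contents below are made up:

```python
# Consumed state is just one offset per (topic, partition),
# checkpointed periodically, rather than per-message flags.
positions = {("clicks", 0): 0, ("clicks", 1): 0}  # -> next offset to read

def consume(topic, partition, batch):
    key = (topic, partition)
    for _ in batch:
        positions[key] += 1  # acknowledging = advancing one number
    return positions[key]

consume("clicks", 0, ["a", "b", "c"])
consume("clicks", 1, ["x"])
print(positions)  # {('clicks', 0): 3, ('clicks', 1): 1}
```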

There is a side benefit to this decision: a consumer can deliberately rewind to an old position and consume the data again. This violates the common contract of a queue, but turns out to be an essential feature for many consumers. For example, if there is a bug in the consumer's code and it is discovered only after some messages were consumed, the consumer can re-consume those messages once the bug is fixed.

Offline data load

Scalable persistence also allows for consumers that only consume periodically in batches, such as those that bulk-load data into an offline system (like Hadoop or a relational data warehouse) at regular intervals.

Message delivery semantics

Now that we understand a little about how producers and consumers work, let's discuss the semantic guarantees Kafka provides between producer and consumer. Clearly, there are multiple possible message delivery guarantees that could be provided:

  • At most once: messages may be lost, but are never redelivered.
  • At least once: messages are never lost, but may be redelivered.
  • Exactly once: this is what people actually want; each message is delivered once and only once.

It is worth noting that this breaks down into two problems: the durability guarantees for publishing a message and the guarantees when consuming a message.
Many systems claim to provide "exactly once" delivery semantics, but when you read the fine print you find that most of these claims are misleading (they do not account for cases where consumers or producers can fail, where there are multiple consumer processes, or where data written to disk can be lost).
Kafka's semantics are straightforward. When publishing a message, we have a notion of the message being "committed" to the log. Once a published message is committed, it will not be lost as long as one broker that replicates the partition it was written to remains alive. The definitions of a committed message and an alive partition, as well as the types of failures we attempt to handle, are described in more detail in the next section (on replicas). For now, let's assume a perfect, lossless broker and try to understand the guarantees to the producer and consumer. If a producer attempts to publish a message and experiences a network error, it cannot be sure whether the error happened before or after the message was committed. This is similar in semantics to inserting into a database table with an autogenerated primary key.
Previously, if a producer did not receive a response saying the message was committed, it had little choice but to resend the message. This provides "at least once" delivery semantics, since the message may be written to the log again during the resend if the original request had in fact succeeded. Since then, the Kafka producer has also supported an idempotent delivery option, which guarantees that resending will not result in duplicate entries in the log. To achieve this, the broker assigns each producer an ID, and the producer sends a sequence number along with every message, so the broker can deduplicate using the ID and sequence number. Likewise, the producer now supports sending messages to multiple topic partitions with transaction-like semantics: either all messages are successfully written or none are. The main use case for this is exactly-once processing between Kafka topics (described below).
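The deduplication trick behind the idempotent producer can be sketched as follows; this toy broker only models the (producer ID, sequence number) bookkeeping, nothing else:

```python
# The broker remembers the highest sequence number seen per producer
# ID and silently drops re-sends with an already-seen sequence.
class Broker:
    def __init__(self):
        self.log = []
        self._last_seq = {}  # producer_id -> last sequence appended

    def append(self, producer_id, seq, message):
        if self._last_seq.get(producer_id, -1) >= seq:
            return False  # duplicate caused by a retry; ignore it
        self._last_seq[producer_id] = seq
        self.log.append(message)
        return True

broker = Broker()
broker.append("p1", 0, "order-created")
broker.append("p1", 1, "order-paid")
broker.append("p1", 1, "order-paid")  # retry after a lost acknowledgement
print(broker.log)  # ['order-created', 'order-paid'] — no duplicate
```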
Not all use cases require such strong guarantees. For latency-sensitive uses, we allow the producer to specify the durability level it desires. If the producer wants to wait for the message to be committed, this can take on the order of 10ms. However, the producer can also specify that it wants to send completely asynchronously, or that it wants to wait only until the leader (but not necessarily the followers) has the message.
Now let's describe the semantics from the consumer's point of view. All replicas have exactly the same log with the same offsets, and the consumer controls its position in this log. If the consumer never crashed, it could simply keep this position in memory; but if the consumer crashes and we want another process to take over its partitions, the new process must choose an appropriate position from which to start processing.
Say the consumer reads some messages: it has several options for processing the messages and updating its position.

  1. The first is that it reads the messages, then saves its position in the log, and finally processes the messages. In this case, the consumer process may crash after saving its position but before saving the output of its message processing. The process taking over would then start from the saved position, even though a few messages before that position had not been processed. This corresponds to "at most once" semantics: in the case of a consumer failure, messages may not be processed.
  2. The second is that it reads the messages, then processes the messages, and finally saves its position. In this case, the consumer process may crash after processing messages but before saving its position. The process taking over would then re-receive a few messages that had already been processed. This corresponds to "at least once" semantics in the case of a consumer failure. In many cases messages have a primary key, so the updates are idempotent (receiving the same message twice just overwrites a record with another copy of itself).
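The two orderings can be contrasted with a small simulation of a crash mid-batch; the crash point and batch contents are made up for illustration:

```python
# Contrast the two orderings when a crash hits partway through a batch.
def run(order, batch, crash_after):
    position, processed = 0, []
    for i, msg in enumerate(batch):
        if order == "save-first":       # option 1: at-most-once
            position = i + 1
            if i == crash_after:
                break                   # crashed before processing msg
            processed.append(msg)
        else:                           # option 2: at-least-once
            processed.append(msg)
            if i == crash_after:
                break                   # crashed before saving position
            position = i + 1
    return position, processed

# Save-first: the position moved past a message that was never processed.
pos, done = run("save-first", ["a", "b", "c"], crash_after=1)
# Process-first: restarting from `pos2` re-processes a finished message.
pos2, done2 = run("process-first", ["a", "b", "c"], crash_after=1)
print(pos, done)    # 2 ['a']      — 'b' is lost (at most once)
print(pos2, done2)  # 1 ['a', 'b'] — 'b' will be re-read (at least once)
```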

So what about "exactly once" semantics? When consuming from a Kafka topic and producing to another topic (as in a Kafka Streams application), we can leverage the transactional producer capabilities mentioned above. The consumer's position is stored as a message in a topic, so we can write the offset to Kafka in the same transaction as the output topics receiving the processed data. If the transaction is aborted, the consumer's position reverts to its old value, and the data produced to the output topics will not be visible to other consumers, depending on their "isolation level". In the default "read_uncommitted" isolation level, all messages are visible to consumers even if they were part of an aborted transaction, but in "read_committed" the consumer will only return messages from committed transactions (and any messages that were not part of a transaction).

When writing to an external system, the limitation lies in coordinating the consumer's position with what is actually stored as output. The classic way of achieving this would be to introduce a two-phase commit between the storage of the consumer position and the storage of the consumer output. But this can be handled more simply, and generally, by having the consumer store its offsets in the same place as its output. This is better because many of the output systems a consumer might want to write to will not support a two-phase commit. As an example, consider a Kafka Connect connector that populates HDFS with both the data it reads and the offsets of that data, so that it is guaranteed that either both data and offsets are updated or neither is. We follow a similar pattern for many other data systems that require these stronger semantics and for which the messages have no primary key to allow for deduplication.
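A minimal sketch of this idea, with a hypothetical `OutputStore` standing in for a real transactional store: the offset lives next to the output, and a single atomic update moves both together, so rerunning after a crash cannot double-apply a batch:

```python
# Hypothetical sketch: keep the consumer offset in the same store as the
# output, so one atomic update commits both (or neither).

class OutputStore:
    def __init__(self):
        self.state = {"offset": 0, "rows": []}  # offset lives beside the data

    def commit(self, new_rows, new_offset):
        # One atomic replacement stands in for a real store's transaction.
        self.state = {"offset": new_offset,
                      "rows": self.state["rows"] + new_rows}

def run(store, log):
    start = store.state["offset"]            # resume from the stored offset
    batch = [m.upper() for m in log[start:]]  # toy "processing"
    store.commit(batch, len(log))

store, log = OutputStore(), ["a", "b", "c"]
run(store, log)
run(store, log)  # a rerun is a no-op: offset and output moved together
assert store.state == {"offset": 3, "rows": ["A", "B", "C"]}
```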

So effectively Kafka supports "exactly once" delivery in Kafka Streams, and the transactional producer/consumer can be used generally to provide "exactly once" delivery when transferring and processing data between Kafka topics. "Exactly once" delivery to other destination systems generally requires cooperation with such systems, but Kafka provides the offsets which make implementing this feasible (see Kafka Connect). Otherwise, Kafka guarantees "at least once" delivery by default, and allows the user to implement "at most once" delivery by disabling retries on the producer and committing offsets in the consumer prior to processing a batch of messages.


Replication

Kafka replicates the log for each topic's partitions across a configurable number of servers (you can set this replication factor on a topic-by-topic basis). This allows automatic failover to these replicas when a server in the cluster fails, so messages remain available in the presence of failures.
Other messaging systems provide some replication-related features, but in our (admittedly biased) opinion this appears to be a bolt-on, not heavily used, and with large downsides: slaves are inactive, throughput is heavily impacted, recovery requires fiddly manual configuration, and so on. Kafka is meant to be used with replication by default; in fact, topics with a replication factor of 1 are simply treated as replicated topics whose replication factor is one.
The unit of replication is the topic partition. Under non-failure conditions, each partition in Kafka has a single leader and zero or more followers. The total number of replicas including the leader constitutes the replication factor. All reads and writes go through the leader of the partition. Typically there are many more partitions than brokers, and leaders are evenly distributed among brokers. The logs on the followers are identical to the leader's log: they all have the same offsets and messages in the same order (though, of course, at any given time the leader may have a few as-yet-unreplicated messages at the end of its log).
Followers consume messages from the leader just as an ordinary Kafka consumer would, and apply them to their own log. Having the followers pull from the leader has the nice property of letting a follower naturally batch the log entries it applies to its own log.
As with most distributed systems, automatically handling failures requires a precise definition of what it means for a node to be "alive". For Kafka, node liveness has two conditions:
1. The node must be able to maintain its session with ZooKeeper (via ZooKeeper's heartbeat mechanism).
2. If it is a follower, it must replicate the writes happening on the leader and not fall "too far" behind.
We refer to nodes satisfying these two conditions as being "in sync", to avoid the vagueness of "alive" or "failed". The leader keeps track of the set of "in sync" nodes. If a follower dies, gets stuck, or falls behind, the leader will remove it from the list of in-sync replicas. The determination of how long a replica may be stuck or lagging is controlled by the replica.lag.time.max.ms configuration.
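The leader's bookkeeping can be sketched as follows (a toy model, not Kafka's implementation; the 10-second threshold stands in for `replica.lag.time.max.ms`):

```python
# Toy model: a leader dropping followers from the ISR once they have not
# fully caught up within a lag window (in the spirit of
# replica.lag.time.max.ms; names and units here are illustrative).

LAG_MS = 10_000  # stand-in for replica.lag.time.max.ms

def update_isr(isr, last_caught_up_ms, now_ms, lag_ms=LAG_MS):
    """Keep only replicas that were fully caught up within the lag window."""
    return {r for r in isr if now_ms - last_caught_up_ms[r] <= lag_ms}

isr = {"broker-1", "broker-2", "broker-3"}
caught_up = {"broker-1": 50_000, "broker-2": 49_500, "broker-3": 30_000}
# broker-3 last caught up 20s ago -> removed from the ISR.
assert update_isr(isr, caught_up, now_ms=50_000) == {"broker-1", "broker-2"}
```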
In distributed systems terminology we only attempt to handle a "fail/recover" model of failures, where nodes suddenly cease working and then later recover (perhaps without knowing that they have died). Kafka does not handle so-called "Byzantine" failures, in which nodes produce arbitrary or malicious responses (perhaps due to bugs or foul play).

We can now define a message commit more precisely: a message is considered committed when all in-sync replicas for that partition have applied it to their log. Only committed messages are ever given out to consumers, so consumers need not worry about seeing a message that could be lost if the leader fails. Producers, on the other hand, can choose whether to wait for the message to be committed or not, depending on their preference for the trade-off between latency and durability; this preference is controlled by the producer's acks setting. Note that topics have a setting for the "minimum number" of in-sync replicas that is checked when the producer requests acknowledgment that a message has been written to the full set of in-sync replicas. If a less stringent acknowledgement is requested by the producer, the message can be committed and consumed even if the number of in-sync replicas is lower than this minimum (it can be as low as one, i.e. just the leader).

The guarantee Kafka offers is that a committed message will not be lost, as long as at least one in-sync replica is alive at all times.
Kafka remains available in the presence of node failures after a short fail-over period, but may not remain available in the presence of network partitions.

Replicated logs: quorum, ISR, and state machines (oh my!)

A Kafka partition is essentially a replicated log. The replicated log is one of the most basic primitives in distributed data systems, and there are many approaches to implementing one. A replicated log can be used by other systems as a primitive for implementing distributed systems in the state-machine style.
A replicated log models the process of coming to consensus on the order of a series of values (generally numbering the log entries 0, 1, 2, …). There are many ways to implement this, but the simplest and fastest is to have a leader that chooses the ordering of the values; as long as the leader remains alive, all the followers need only copy the values in the order the leader chooses.
Of course, if the leader never failed we wouldn't need followers at all. When the leader does die, we need to choose a new leader from among the followers. But followers themselves may fall behind or crash, so we must ensure we choose an up-to-date follower. If we tell the client a message is committed and the leader then fails, the new leader we elect must also have that message. This yields a trade-off: if the leader waits for more followers to acknowledge a message before declaring it committed, then there will be more potentially electable leaders, at the cost of higher latency.

If you choose the number of acknowledgements required and the number of logs that must be compared to elect a leader such that there is guaranteed to be an overlap between the two, then this is called a quorum.

A common approach to this trade-off is to use a majority vote for both the commit decision and the leader election. This is not what Kafka does, but let's explore it anyway to understand the trade-offs. Say we have 2f+1 replicas. If f+1 replicas must receive a message before the leader declares a commit, and if we elect a new leader by choosing the follower with the most complete log from among at least f+1 replicas, then with no more than f failures the new leader is guaranteed to have all committed messages. This is because among any f+1 replicas, at least one must contain all committed messages, and the replica with the most complete log will be chosen as the new leader. There are many remaining details each algorithm must handle (such as precisely defining what makes a log more complete, ensuring log consistency during leader failure, and changing the set of servers in the cluster), which we will ignore for now.
The majority-vote approach has a very nice property: latency depends only on the fastest servers. That is, with a replication factor of three, the latency is determined by the faster follower, not the slower one (the leader plus the faster follower already form a quorum).
There is a rich family of algorithms in this vein, including ZooKeeper's Zab, Raft, and Viewstamped Replication. The most similar academic publication we are aware of to Kafka's actual implementation is PacificA from Microsoft.
The downside of majority vote is that it doesn't take many failures to leave you with no electable leaders. To tolerate one failure requires three copies of the data, and to tolerate two failures requires five. In our experience, having only enough redundancy to tolerate a single failure is not enough for a practical system, but doing every write five times, with 5x the disk space requirement and 1/5th the throughput, is not very practical for large-volume data problems. This is likely why quorum algorithms more commonly appear for shared cluster configuration, such as in ZooKeeper, and are rarer for primary data storage. For example, in HDFS the NameNode's high-availability feature is built on a majority-vote journal, but this more expensive approach is not used for the data itself.
Kafka takes a slightly different approach to choosing its quorum set. Instead of majority vote, Kafka dynamically maintains a set of in-sync replicas (ISR) that are caught up to the leader. Only members of this set are eligible for election as leader. A write to a Kafka partition is not considered committed until all in-sync replicas have received the write. The ISR set is persisted in ZooKeeper whenever it changes. This is an important factor for Kafka's usage model, where there are many partitions and ensuring leadership balance is important. With this ISR model and f+1 replicas, a Kafka topic can tolerate f failures without losing committed messages.
For most use cases we hope to handle, we think this trade-off is a reasonable one. In practice, to tolerate f failures, both the majority-vote and the ISR approach wait for the same number of replicas to acknowledge before committing a message (e.g. to survive one failure, a majority quorum needs three replicas and one acknowledgement, while the ISR approach requires two replicas and one acknowledgement). The ability to commit without waiting on the slowest servers is an advantage of the majority-vote approach. However, we think it is ameliorated by allowing the client to choose whether it blocks on the message commit or not, and the additional throughput and disk space gained from the lower required replication factor is worth it.
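The replica arithmetic above can be checked with a tiny sketch:

```python
# Sketch of the replica-count arithmetic described above: to tolerate f
# failures, a majority-vote quorum needs 2f+1 replicas, while Kafka's ISR
# approach needs only f+1 (every ISR member must acknowledge each write).

def majority_vote_replicas(f):
    return 2 * f + 1  # commit needs acks from a quorum of f+1 of these

def isr_replicas(f):
    return f + 1      # commit needs acks from all f+1 ISR members

assert majority_vote_replicas(1) == 3 and isr_replicas(1) == 2
assert majority_vote_replicas(2) == 5 and isr_replicas(2) == 3
```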
Another important design distinction is that Kafka does not require crashed nodes to recover with all their data intact. It is not uncommon for replication algorithms in this space to depend on the existence of "stable storage" that cannot be lost in any failure-recovery scenario without potential consistency violations. There are two primary problems with this assumption. First, disk errors are the most common problem we observe in real operation of persistent data systems, and they often do not leave data intact. Second, even if they were not a problem, we do not want to require the use of fsync on every write, since this can reduce performance by two to three orders of magnitude. Our protocol for allowing a replica to rejoin the ISR ensures that before rejoining it must fully re-sync again, even if it lost unflushed data in its crash.

Unclean leader election: what if they all hang up? (Unclean leader election: What if they all die?)

Note that Kafka's guarantee against data loss is predicated on at least one replica remaining in sync. If all the replicas of a partition die, this guarantee no longer holds.
However, a practical system needs to do something reasonable when all the replicas die. If you are unlucky enough to have this occur, it is important to consider what will happen. There are two behaviors that could be implemented:
1. Wait for a replica in the ISR to come back to life and choose it as the leader (hopefully it still has all its data).
2. Choose the first replica to come back to life (not necessarily one in the ISR) as the leader.

This is a simple trade-off between availability and consistency. If we wait for replicas in the ISR, we may remain unavailable for as long as those replicas are down. If such replicas were destroyed or their data lost, we are permanently down. If, on the other hand, a non-in-sync replica comes back to life and we allow it to become leader, then its log becomes the source of truth even though it is not guaranteed to contain every committed message. By default (since version 0.11.0.0), Kafka chooses the first strategy, preferring consistency over availability by waiting for a consistent replica. This behavior can be changed: if uptime is preferable to consistency, set the unclean.leader.election.enable configuration.
This dilemma is not specific to Kafka; it exists in any quorum-based scheme. For example, in a majority-vote scheme, if a majority of servers suffer a permanent failure, you must either choose to lose 100% of your data or violate consistency by taking what remains on a surviving server as your new source of truth.
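The two recovery policies can be sketched as a toy election function (illustrative only; Kafka's controller logic is more involved):

```python
# Toy model of the two recovery policies after all ISR members fail:
# wait for an ISR replica (consistent, but offline for longer), or elect
# the first replica to come back (available, but possibly lossy).

def elect(revived_order, isr, unclean=False):
    """revived_order: replicas in the order they come back online."""
    if unclean:
        return revived_order[0]      # first survivor wins, data may be stale
    for r in revived_order:
        if r in isr:
            return r                 # wait for an ISR member to return
    return None                      # partition stays offline meanwhile

isr = {"b1"}                    # only b1 held every committed message
revived = ["b3", "b2", "b1"]    # a stale replica recovers first
assert elect(revived, isr, unclean=True) == "b3"   # available, may lose data
assert elect(revived, isr, unclean=False) == "b1"  # consistent, waits longer
```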

Availability and durability guarantees

When writing to Kafka, a producer can choose whether it waits for the message to be acknowledged by 0, 1, or all replicas. Note that "acknowledgement by all replicas" does not guarantee that the full set of assigned replicas has received the message. By default, with acks=all, acknowledgement happens as soon as all the current in-sync replicas have received the message. For example, if a topic is configured with two replicas and one fails (leaving only one in-sync replica), then writes with acks=all will still succeed; if the remaining replica then also fails, the message is lost. Although this ensures maximum availability of the partition, this behavior may be undesirable for users who prefer durability over availability. Therefore, we provide two topic-level configurations that favor message durability over availability:
1. Disable unclean leader election: if all replicas become unavailable, the partition will remain unavailable until the most recent leader becomes available again. This effectively prefers unavailability over the risk of message loss. See the previous section on unclean leader election.
2. Specify a minimum ISR size: the partition will only accept writes if the size of the ISR is above this minimum, in order to prevent the loss of messages that were written to only a single replica which subsequently becomes unavailable. This setting only takes effect if the producer uses acks=all and thereby guarantees that the message is acknowledged by at least this many in-sync replicas. It offers a trade-off between consistency and availability: a higher minimum ISR size guarantees better consistency, but it reduces availability, since the partition is unavailable for writes if the number of in-sync replicas drops below the minimum.
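A toy sketch of the acknowledgement decision under acks=all with a minimum ISR size (illustrative; not the broker's actual code path, and the parameter names are made up):

```python
# Toy model of the commit decision for acks=all combined with a minimum
# ISR size (in the spirit of min.insync.replicas).

def ack_write(isr_size, acks_received, min_isr):
    """acks=all: every current ISR member must ack, and the ISR itself
    must be at least min_isr large, otherwise the write is rejected."""
    if isr_size < min_isr:
        return "rejected"            # too few in-sync replicas to be safe
    return "committed" if acks_received == isr_size else "pending"

assert ack_write(isr_size=3, acks_received=3, min_isr=2) == "committed"
assert ack_write(isr_size=1, acks_received=1, min_isr=2) == "rejected"
assert ack_write(isr_size=3, acks_received=2, min_isr=2) == "pending"
```

The "rejected" branch is exactly the availability cost described above: with the minimum set to 2, a partition down to one in-sync replica stops accepting acks=all writes rather than risk losing them.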

Replica management

The replicated log discussed above really describes a single log, i.e. one topic partition. However, a Kafka cluster manages hundreds or thousands of these partitions. We attempt to balance partitions within a cluster in a round-robin fashion, to avoid clustering all the partitions of high-volume topics on a small number of nodes. Likewise, we try to balance leadership so that each node is the leader for a proportional share of its partitions.
It is also important to optimize the leader election process, as that is the critical window of unavailability. A naive implementation would run one election per partition for all the partitions a failed node hosted. Instead, we elect one of the brokers as the "controller". This controller detects failures at the broker level and is responsible for changing the leader of all partitions affected by the failure. The result is that we can batch together many of the required leadership changes, which makes the election process far cheaper and faster for a large number of partitions. If the controller itself fails, one of the surviving brokers will become the new controller.

Log compaction

Log compaction ensures that Kafka always retains at least the last known value for each message key within the log of a single topic partition. This addresses use cases such as restoring state after an application or system crash and restart.
So far we have described only the simpler approach to data retention, where old log data is discarded after a fixed period of time or when the log reaches some predetermined size. This works well for relatively independent messages, such as temporal event data. However, there is another important class of data streams: keyed, mutable data, i.e. changes to records identified by a key (for example, changes to rows in a database table).
Let's discuss a concrete example of such a stream. Say we have a topic containing user email addresses; every time a user updates their email address, we send a message to this topic keyed by their user ID. Here are the messages sent for the user with ID 123, with each message corresponding to a change of email address (messages for other user IDs are omitted):

123 => [email protected]
123 => [email protected]
123 => [email protected]

Log compaction gives us a finer-grained retention mechanism that guarantees we retain at least the last update for each key (e.g. only the last `123 => [email protected]` message above is retained). By doing this, we guarantee that the log contains a snapshot of the final value for every key. This means downstream consumers can rebuild their own state off this topic without us having to retain a complete log of all changes.
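The retention rule is easy to sketch (a toy compactor, not Kafka's cleaner; the keys and values below are made up, and `None` plays the role of a delete marker discussed later):

```python
# Toy compactor: keep only the latest value per key, preserving each
# surviving message's original offset. A None value is a delete marker,
# which eventually removes the key entirely.

def compact(log):
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later writes win
    return sorted((off, k, v) for k, (off, v) in latest.items()
                  if v is not None)    # drop deleted keys

log = [
    ("123", "email-v1"),
    ("123", "email-v2"),
    ("456", "email-a"),
    ("123", "email-v3"),
    ("456", None),        # delete marker for key 456
]
# Only the last value per key survives; its original offset is preserved.
assert compact(log) == [(3, "123", "email-v3")]
```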
Let's look at a few useful scenarios for log compaction, and then see how it can be used.
1. Database change subscription. It is common to have a data set replicated across multiple data systems, where one of these systems is some kind of database (either an RDBMS or perhaps a trendy new key-value store). For example, you might have a database, a cache, a search cluster, and a Hadoop cluster. Each change to the database then needs to be reflected in the cache and the search cluster, and eventually in Hadoop. In the steady state, a real-time feed of the latest changes is all you need. But to reload the cache or restore a failed search node, you may need a complete data set.
2. Event sourcing. This is a style of application design that co-locates query processing with application design, using a log of changes as the primary store for the application.
3. Journaling for high availability. A process that does local computation can be made fault tolerant by logging out the changes to its local state, so that another process can reload these changes and carry on if it fails. A concrete example is handling counts, aggregations, and other "group by"-like processing in a stream query system. Samza, a real-time stream-processing framework, uses this feature for exactly this purpose.
In each of these cases, one primarily needs to handle the real-time feed of changes; but occasionally, when reloading or reprocessing is needed, one needs to do a full load. Log compaction allows feeding both of these use cases off the same backing topic.

The general idea is quite simple. If we kept an infinite log and recorded every change in the scenarios above, we would have captured the state of every system at every point in time from when it first began. Using this complete log, we could restore to any point in time. This hypothetical complete log is not very practical, however: for systems where the same row is changed many times, the log will grow without bound even if the data set itself is small. We could simply discard the old logs; although this bounds the space, it makes it impossible to rebuild the current state, because the discarded old logs may hold all that was recorded about some keys.
Compared with the coarser-grained time-based retention policy, log compaction is a finer-grained, per-record retention mechanism. The idea is to selectively remove records where we have a more recent update with the same key. This way the log is guaranteed to contain at least the last state for each key.
This retention policy can be set per topic, so a single cluster can have some topics where retention is enforced by size or time, and other topics where retention is enforced by compaction.
This functionality is inspired by one of LinkedIn's oldest and most successful pieces of infrastructure: a database changelog caching service called Databus.
Unlike most log-structured storage systems, Kafka is built for subscription and organizes data for fast linear reads and writes. Unlike Databus, Kafka acts as a source-of-truth store, which is useful even in situations where the upstream data source would not otherwise be replayable.

Whether in a traditional RDBMS or a distributed NoSQL store, the data kept in a database is always updated over time. There are two broad ways to handle an update carrying the same key as an existing record:
1. Update in place (find the existing record in the database and overwrite the old value with the new one).
2. Append the new record (keep the old values and merge them at query time, or have a background process merge them periodically).
The append approach can be used to recover data after a node crash, and has the additional advantage of very high write performance, since writes are purely sequential.
The table below summarizes how several data systems handle updates:

data system | where updates are appended | data file | needs compaction?
ZooKeeper | log | snapshot | no (the data volume is small)
Redis | aof | rdb | no (it is an in-memory database)
Cassandra | commit log | data.db | as needed (data lives in local files)
HBase | commit log | HFile | yes (data lives in HDFS)
Kafka | commit log | commit log | as needed (data lives in partition segments)

Log compaction basics

Here is a high-level picture of the logical structure of a Kafka log, with the offset of each message shown in the boxes:

The head of the log is identical to a traditional Kafka log: it has dense, sequential offsets and retains all messages. The tail of the log is the compacted portion. Note that the messages in the tail retain the original offset assigned when they were first written; that offset never changes. Note also that all offsets remain valid positions in the log, even if a message with that offset has been compacted away; in this case the position is indistinguishable from the next highest offset that does appear in the log. In the picture above, for example, offsets 36, 37, and 38 are all equivalent positions, and a read beginning at any of them would return a message set beginning with 38.
So far we have only described compaction of updates, but compaction also supports deletes. A message whose latest version for a given key has a null payload is treated as a delete for that key (in a sense this is just another kind of update, such as setting the email in the example above to null). Such a message is also called a delete marker, and it removes any prior message with the same key. Delete markers are special in that they are themselves cleaned out of the log after a period of time, to free up space. The point in time at which deletes are no longer retained is marked as the "delete retention point" in the figure above. Note that the two points in the figure are not pointers to particular messages; they sit between messages, indicating points in time rather than message positions.

Compaction is done in the background by periodically recopying log segments. Cleaning does not block reads, and can be throttled to use no more than a configurable amount of I/O throughput to avoid impacting producers and consumers. The actual process of compacting a log segment looks something like this:

What guarantees does log compaction provide?

Log compaction guarantees the following:
1. Any consumer that stays caught up to the head of the log will see every message that is written; these messages have sequential offsets and are never deleted from the head. The topic's min.compaction.lag.ms parameter can be used to guarantee that a message remains uncompacted for a minimum length of time after being written. This provides a lower bound on how long each message remains in the (uncompacted) head.
2. Ordering of messages is always maintained. Compaction never reorders messages, it only removes some.
3. The offset of a message never changes; it is the permanent identifier for a message's position in the log.
4. Any consumer progressing from the start of the log will see at least the final state of every key, in the order the records were written. Additionally, the consumer will see all delete markers for deleted records, provided it reaches the head of the log within a period shorter than the topic's delete.retention.ms setting (the default is 24 hours). In other words, since the removal of delete markers happens concurrently with reads, a consumer that lags by more than delete.retention.ms may miss delete markers.

Log compaction details

Log compaction is handled by the log cleaner, a pool of background threads that recopy log segment files, removing records whose key appears again in the head of the log. Each compactor thread works as follows:
1. It chooses the log with the highest ratio of log head to log tail.
2. It creates a succinct summary of the last offset for each key in the head of the log.
3. It recopies the log from beginning to end, removing keys which have a later occurrence in the log. New, clean segments are swapped into the log immediately, so the additional disk space required is just one extra log segment (not a full copy of the log).
The summary of the log head is essentially a space-compact hash table, using just 24 bytes per entry. As a result, with 8GB of cleaner buffer, one cleaner iteration can clean around 366GB of log head (assuming 1KB messages).
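The memory arithmetic above can be checked directly (assuming binary units for the buffer and ~1KB messages):

```python
# Checking the arithmetic: with 24 bytes per offset-map entry, an 8GB
# cleaner buffer indexes ~358 million unique keys; at ~1KB per message
# that covers roughly 366GB of log head.

buffer_bytes = 8 * 1024**3            # 8 GiB of cleaner dedupe buffer
entries = buffer_bytes // 24          # one 24-byte entry per unique key
log_head_bytes = entries * 1024       # assume each message is ~1 KiB
assert entries == 357_913_941
assert log_head_bytes // 10**9 == 366   # ~366 GB of log head, as in the text
```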

Configuring the log cleaner

The log cleaner is enabled by default, which starts the pool of background cleaner threads. To enable log cleaning on a particular topic, set the log-specific property cleanup.policy=compact.

This can be specified either at topic creation time or when altering an existing topic.

The log cleaner can be configured to retain a minimum amount of the uncompacted "head" of the log. This is enabled by setting the compaction time lag, min.compaction.lag.ms.

This can be used to prevent messages newer than a minimum age from being compacted. If not set, all log segments are eligible for compaction except for the last segment, i.e. the one currently being written to. The active segment will not be compacted even if all of its messages are older than the minimum compaction time lag.


Quotas

Kafka clusters have the ability to enforce quotas that control the broker resources used by clients. Two types of client quotas can be enforced:
1. Network bandwidth quotas, defined in byte-rate thresholds (since version 0.9).
2. Request rate quotas, defined in CPU utilization thresholds as a percentage of network and I/O threads.

Why is quota necessary? (Why are quotas necessary?)

Producers and consumers can produce/consume very large volumes of data or generate requests at a very high rate, monopolizing broker resources, saturating the network, and generally denying service to other clients and to the brokers themselves. Having quotas protects against these issues, and is all the more important in large multi-tenant clusters, where a small set of badly behaved clients can degrade the user experience for the well-behaved ones. In fact, when running Kafka as a service, quotas even make it possible to enforce API limits according to an agreed-upon contract.

Client groups

The identity of a Kafka client is the user principal, which represents an authenticated user in a secure cluster. When there is no authentication, the broker provides the user principal via a configurable principal builder, as a grouping of unauthenticated users. The client ID is a logical grouping of clients, with a meaningful name chosen by the client application. The tuple (user, client ID) defines a secure logical group of clients that share both the user principal and the client ID.
Quotas can be applied to (user, client ID), user, or client ID groups. For a given connection, the most specific quota matching the connection is applied, and all connections of a quota group share the quota configured for the group. For example, if (user="test-user", client-id="test-client") has a produce quota of 10MB/s, this 10MB/s is shared across all producer instances of user "test-user" with client ID "test-client".

Quota configuration

Quota configuration may be defined for (user, client ID), user, or client ID groups. It is possible to override the default quota at any of these levels, with a mechanism similar to the per-topic configuration overrides. Quota overrides written under /config/users in ZooKeeper override the user and (user, client ID) quotas, and those under /config/clients override the client ID quotas. These overrides are read by all brokers and take effect immediately, which lets us change quotas without having to do a rolling restart of the entire cluster. See the official documentation for details.
The order of precedence for quota configuration is as follows (highest first):
1. /config/users/&lt;user&gt;/clients/&lt;client-id&gt;
2. /config/users/&lt;user&gt;/clients/&lt;default&gt;
3. /config/users/&lt;user&gt;
4. /config/users/&lt;default&gt;/clients/&lt;client-id&gt;
5. /config/users/&lt;default&gt;/clients/&lt;default&gt;
6. /config/users/&lt;default&gt;
7. /config/clients/&lt;client-id&gt;
8. /config/clients/&lt;default&gt;
The broker properties (quota.producer.default, quota.consumer.default) can be used to set defaults for the network bandwidth quota of client ID groups; these properties are deprecated and will be removed in a later release.
Default quotas for client ID groups can instead be configured in ZooKeeper, like the other quota overrides and defaults.

Network bandwidth quotas

Network bandwidth quotas are defined as byte-rate thresholds shared by each group of clients. By default, each unique client group receives a fixed byte-rate quota as configured by the cluster. This quota is defined on a per-broker basis.

Request rate quotas

Request rate quotas are defined as the percentage of time a client may spend on the request handler I/O threads and network threads of each broker within a quota window. A quota of n% represents n% of one thread, so the total capacity is ((num.io.threads + num.network.threads) × 100)%. Each client group may use up to n% of the I/O and network thread time within a quota window. Since the number of threads allocated for I/O and networking is typically based on the number of cores available on the broker host, request rate quotas represent a share of CPU.


By default, each unique client group receives a fixed quota as configured by the cluster, defined on a per-broker basis. We decided to define these quotas per broker, rather than having the cluster enforce a single uniform quota per client, because the latter would require a mechanism to share quota usage among all the brokers, which is much harder to get right.
How does a broker react when it detects a quota violation? In our solution, the broker slows the client down rather than returning an error. It computes the amount of delay needed to bring the violating client under its quota and delays the response for that time. This approach keeps the quota violation transparent to clients and keeps them from having to implement any special back-off or retry behavior, which, if done poorly, could make the overload problem worse.
Both byte rate and thread utilization are measured over multiple small windows (e.g. 30 windows of 1 second each) in order to detect and correct quota violations quickly. Typically, large measurement windows (say, 10 windows of 30 seconds each) lead to large bursts of traffic followed by long delays, which is not great in terms of user experience.
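The delay computation can be sketched as follows (a simplified model of the broker's throttling behavior, with made-up numbers):

```python
# Simplified model of throttling by delay rather than by error: when a
# client's observed byte rate exceeds its quota, the broker holds the
# response just long enough to bring the average rate back under quota.

def throttle_delay_s(bytes_in_window, window_s, quota_bytes_per_s):
    """Extra time the window must span so bytes/time <= quota (0 if under)."""
    needed_s = bytes_in_window / quota_bytes_per_s
    return max(0.0, needed_s - window_s)

# 30 MB observed over a 1 s window against a 10 MB/s quota: the response
# is delayed 2 s, so that 30 MB / (1 s + 2 s) == 10 MB/s.
assert throttle_delay_s(30_000_000, 1.0, 10_000_000) == 2.0
assert throttle_delay_s(5_000_000, 1.0, 10_000_000) == 0.0
```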

Reference and translation:
RabbitMQ official docs: https://www.rabbitmq.com
Kafka official docs: http://kafka.apache.org/docum…