3 cheap machines write 2 million per second! Why is Kafka so fast?


Author: Bing Yue, Senior Engineer at Alibaba, focusing on distributed systems and high-availability architecture, methodology and cognitive upgrading, practice and continuous learning. Source: www.cnblogs.com/binyue/p/10308754.html

Kafka in 30 seconds

Let’s take a quick look at Kafka and some details of how it works. Kafka is a distributed messaging system. It was originally developed at LinkedIn; it is now part of the Apache Software Foundation and is used by a wide variety of companies.

Kafka messages are stored or cached on disk. It is generally believed that reading and writing data on disk hurts performance, because seeking takes time. In fact, high throughput is one of Kafka’s defining characteristics. Even on ordinary servers, Kafka can easily support millions of write requests per second, surpassing most message-oriented middleware. This is why Kafka is widely used for log processing and other high-volume data scenarios.

The general setup is simple. Producers send records to the cluster, which stores them and hands them to consumers.


The key abstraction in Kafka is the topic. Producers publish records to topics, and consumers subscribe to one or more topics. A Kafka topic is just a sharded write-ahead log. Producers append records to these logs and consumers subscribe to the changes. Each record is a key/value pair. The key is used to assign the record to a specific log partition (unless the publisher specifies the partition directly).
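
To make the key-to-partition step concrete, here is a minimal sketch. The helper names choose_partition and route are hypothetical, and zlib.crc32 is only a stand-in hash chosen for illustration; it is not the hash Kafka's own partitioner uses. The point is simply that the same key always lands in the same partition log.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index (hypothetical helper).

    zlib.crc32 is a stand-in for illustration only; it is not the hash
    function used by Kafka's partitioner.
    """
    return zlib.crc32(key) % num_partitions


def route(records, num_partitions):
    """Append each (key, value) record to the log of its partition."""
    logs = [[] for _ in range(num_partitions)]
    for key, value in records:
        logs[choose_partition(key, num_partitions)].append((key, value))
    return logs


records = [(b"user-1", b"login"), (b"user-2", b"login"), (b"user-1", b"logout")]
logs = route(records, num_partitions=2)
```

Because both records with key user-1 go to the same partition, a consumer of that partition sees them in the order they were published.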

Here is a simple example of a single producer and a single consumer writing to and reading from a two-partition topic.


This figure shows a producer process appending to the logs of two partitions, and a consumer reading from the same logs. Each record in the log has an associated entry number, which we call the offset. The consumer uses the offset to describe its position in each log.

These partitions are spread across the cluster, which allows a topic to hold more data than fits on any one machine.

Note that, unlike in most messaging systems, the log is always persistent. When a message is received, it is written straight to the file system. Messages are not deleted when they are read; they are retained according to a configurable SLA (say, a few days or a week). This allows consumers to reload data when they need to. It also makes publish-subscribe space-efficient: however many consumers there are, there is only a single shared log, whereas in a traditional messaging system each consumer usually has its own queue, so adding a consumer can double your data volume. This makes Kafka a good fit for uses outside normal messaging, such as a pipeline into offline data systems like Hadoop. These offline systems may load data only at intervals as part of a periodic ETL cycle, or may be down for several hours for maintenance. During that time, Kafka can buffer terabytes of unconsumed data if necessary.

Kafka also replicates its logs across multiple servers for fault tolerance. An important architectural difference from other messaging systems is that replication is not an add-on that needs complex configuration and is used only in very special cases. Instead, replication is assumed to be the default: unreplicated data is treated as a special case in which the replication factor happens to be one.

When a producer publishes a message, it receives an acknowledgement containing the record’s offset. The first record published to a partition gets offset 0, the second gets offset 1, and so on in sequence. Consumers read data from the position specified by an offset, and save their position in the log by committing it periodically: the saved offset means that if a consumer instance crashes, another instance can resume from that position.
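
The offset mechanics can be illustrated with a small in-memory model. PartitionLog and Consumer below are hypothetical classes written for this sketch, not Kafka client APIs; they only show how append returns an offset and how a committed offset lets a replacement consumer resume.

```python
class PartitionLog:
    """In-memory stand-in for one partition's append-only log."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset, as in the acknowledgement."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Hypothetical consumer that tracks and commits its position."""

    def __init__(self, log, committed=0):
        self.log = log
        self.position = committed   # next offset to read
        self.committed = committed  # last saved position

    def poll(self, max_records=10):
        batch = self.log.read_from(self.position, max_records)
        self.position += len(batch)
        return batch

    def commit(self):
        self.committed = self.position
        return self.committed


log = PartitionLog()
offsets = [log.append(r) for r in ("a", "b", "c", "d")]

first = Consumer(log)
first.poll(max_records=2)   # reads "a" and "b"
saved = first.commit()      # the position after "b" is saved
first.poll(max_records=1)   # reads "c", then crashes before committing

replacement = Consumer(log, committed=saved)
resumed = replacement.poll()  # starts again from the saved position
```

In this model the record read after the last commit is delivered a second time to the replacement instance, which is the price of committing only periodically.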

I hope this helps (if not, you can read a more complete introduction to Kafka here).

For benchmark figures, see the Apache Kafka benchmark “2 million writes per second (on three cheap machines)”.


Next, we analyze why Kafka is so fast from two angles: writing data and reading data.

Data writing

Kafka writes every message it receives to disk, so that data is not lost. To optimize write speed, Kafka uses two techniques: sequential writes and memory-mapped files (MMAP).

Sequential write

The speed of disk reads and writes depends on how the disk is used, that is, on whether access is sequential or random. With sequential access, disk throughput can be comparable to that of random access to memory.

Because a hard disk is a mechanical device, every read or write involves a seek followed by the transfer itself. Seeking is a mechanical action, and it is the most time-consuming part.

So hard disks hate random I/O the most and like sequential I/O the most. To improve disk read and write speed, Kafka uses sequential I/O.

Moreover, Linux applies many optimizations to disk reads and writes, including read-ahead, write-behind, and the disk cache.

If these operations were done in memory instead, there would be two problems: the memory overhead of Java objects is very large, and as the amount of data on the heap grows, Java GC pauses become very long.

There are several benefits to using disk operations:

  • Sequential disk reads and writes can be faster than random reads and writes in memory.
  • JVM garbage collection is inefficient on large heaps, and the memory footprint is large. Using the disk avoids this problem.
  • After the service is restarted, the disk cache is still warm and available.

The following figure shows how Kafka writes data. Each partition is in effect a file. When Kafka receives a message, it appends the data to the end of the file (the empty boxes in the figure).
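
As a rough illustration of append-only writing, the sketch below appends length-prefixed records to a file and reads them back in order. The 4-byte length prefix and the file name partition-0.log are assumptions made for this example; this is not Kafka's actual on-disk format.

```python
import os
import struct
import tempfile


def append_record(path, payload: bytes) -> int:
    """Append a length-prefixed record; return the byte position it starts at."""
    with open(path, "ab") as f:
        f.seek(0, os.SEEK_END)
        start = f.tell()
        # 4-byte big-endian length followed by the payload (illustrative framing)
        f.write(struct.pack(">I", len(payload)) + payload)
    return start


def read_records(path):
    """Scan the file front to back, yielding each payload in append order."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack(">I", header)
            yield f.read(length)


segment = os.path.join(tempfile.mkdtemp(), "partition-0.log")
positions = [append_record(segment, m) for m in (b"first", b"second", b"third")]
stored = list(read_records(segment))
```

Every write lands at the current end of the file, so the disk head (or the page cache) only ever moves forward.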


This approach has a drawback: there is no way to delete individual records, so Kafka does not delete messages as they are consumed; it keeps all the data. Each consumer keeps an offset for each partition it reads to indicate how far it has read.


Two consumers:

  • Consumer1 has two offsets, one for partition0 and one for partition1 (assuming each topic has one partition).
  • Consumer2 has an offset corresponding to partition2.

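This offset is saved by the client SDK; Kafka’s broker is not aware of it at all.

In general, the SDK saves it to ZooKeeper, so you need to give the consumer the ZooKeeper address. (This describes older clients; newer Kafka consumers commit their offsets to the brokers, in an internal Kafka topic, and are given the broker address instead.)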

If data were never deleted, the disk would eventually fill up, so Kafka provides two strategies for deleting data:

  • Time based
  • Based on partition file size
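
The two policies can be sketched as follows. The Segment class and both function names are hypothetical, and the thresholds are arbitrary illustration values; the sketch only shows the selection logic, not how Kafka implements retention.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical description of one log file belonging to a partition."""
    name: str
    size_bytes: int
    last_modified: float  # seconds since the epoch


def expired_by_time(segments, now, retention_seconds):
    """Time-based policy: select segments older than the retention window."""
    return [s for s in segments if now - s.last_modified > retention_seconds]


def expired_by_size(segments, max_total_bytes):
    """Size-based policy: select the oldest segments until the total fits."""
    ordered = sorted(segments, key=lambda s: s.last_modified)
    total = sum(s.size_bytes for s in ordered)
    doomed = []
    for s in ordered:
        if total <= max_total_bytes:
            break
        doomed.append(s)
        total -= s.size_bytes
    return doomed


DAY = 24 * 3600
now = 10 * DAY
segments = [
    Segment("seg-0", 400, last_modified=1 * DAY),
    Segment("seg-1", 400, last_modified=5 * DAY),
    Segment("seg-2", 400, last_modified=9 * DAY),
]
by_time = expired_by_time(segments, now, retention_seconds=7 * DAY)
by_size = expired_by_size(segments, max_total_bytes=500)
```

Both policies remove whole files from the old end of the log, so deletion never disturbs the sequential append path.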

Please refer to the Kafka configuration documentation for the specific settings.

Memory Mapped Files

Even with sequential writes, disk access cannot keep up with memory. So Kafka does not write data to disk in real time; it makes full use of the paging mechanism of modern operating systems to improve I/O efficiency.

Memory-mapped files are referred to below as MMAP. On a 64-bit operating system, a mapping can generally cover data files of about 20 GB. MMAP works by using the operating system’s pages to map a file directly onto physical memory.

Once the mapping is in place, operations on that memory are synchronized to the hard disk at a time the operating system considers appropriate.

Through MMAP, a process reads and writes the file as if it were memory (virtual memory, of course), and it does not have to care how much physical memory there is; virtual memory takes care of that.

This can greatly improve I/O performance, because it saves the cost of copying between user space and kernel space. (An ordinary read call on a file first places the data in kernel-space memory and then copies it into user-space memory.)

But there is also an obvious drawback: unreliability. Data written through MMAP has not necessarily reached the hard disk yet. It is guaranteed to be on disk only after the program explicitly calls flush; until then, the operating system writes it back whenever it sees fit.

Kafka provides a parameter, producer.type, to control whether it flushes actively:

  • If Kafka flushes immediately after writing to MMAP and only then returns to the producer, this is called sync.
  • If Kafka returns to the producer immediately after writing to MMAP, without calling flush, this is called async.
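
The difference between the two modes can be sketched with Python's standard mmap module. The function write_mapped and its flush argument are hypothetical names for this example; the sketch shows only the mechanism of writing through a mapping and optionally forcing write-back, not Kafka's own code.

```python
import mmap
import os
import tempfile


def write_mapped(path, data: bytes, flush: bool) -> None:
    """Write data through a memory mapping, optionally forcing it to disk."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:  # length 0 maps the whole file
            mm[:len(data)] = data             # an ordinary memory write
            if flush:
                mm.flush()                    # "sync": request write-back now
            # Without flush ("async"), the OS writes the page back later.


path = os.path.join(tempfile.mkdtemp(), "segment.log")
with open(path, "wb") as f:
    f.write(b"\x00" * mmap.PAGESIZE)  # pre-size the file; an empty file cannot be mapped

write_mapped(path, b"hello kafka", flush=True)
with open(path, "rb") as f:
    head = f.read(11)
```

Calling write_mapped with flush=False returns sooner, at the cost that the data may still be only in the page cache if the machine loses power.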

Data reading

What optimizations does Kafka make when reading disks?

Zero copy based on sendfile

In the traditional mode, transferring a file over the network involves the following steps:

  • The read function is called, and the file data is copied into a kernel buffer.
  • The read function returns, and the file data is copied from the kernel buffer into the user buffer.
  • The write function is called, and the file data is copied from the user buffer into the kernel’s socket buffer.
  • The data is copied from the socket buffer to the protocol engine.

This is the traditional read/write way of transferring a file over the network. In this process, the file data actually goes through four copy operations:

Hard disk > kernel buffer > user buffer > socket buffer > protocol engine

The sendfile system call provides a way to reduce the number of copies and improve the file transfer performance.

Linux kernel version 2.1 introduced the sendfile system call to simplify transferring data over the network and between two local files.

The introduction of sendfile not only reduces data replication, but also reduces context switching.

sendfile(socket, file, len);

The operation process is as follows:

  • The sendfile system call copies the file data into a kernel buffer.
  • The data is then copied from the kernel buffer into the kernel’s socket buffer.
  • Finally, the data is copied from the socket buffer to the protocol engine.

Compared with the traditional read/write mode, the sendfile introduced in the 2.1 kernel eliminates the copy from the kernel buffer to the user buffer and the copy from the user buffer back to the socket buffer; the data is copied from the kernel buffer to the socket buffer inside the kernel instead.
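
The two transfer paths can be compared side by side in a short sketch. The function names and the local socket pair standing in for a broker-to-consumer connection are assumptions for this example. Python's socket.sendfile uses os.sendfile where the platform provides it and otherwise falls back to ordinary send calls, so the sketch runs either way.

```python
import os
import socket
import tempfile


def send_with_copy(sock, path, chunk_size=4096):
    """Traditional path: read() into a user-space buffer, then send()."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # kernel buffer -> user buffer
            if not chunk:
                break
            sock.sendall(chunk)         # user buffer -> socket buffer
            sent += len(chunk)
    return sent


def send_zero_copy(sock, path):
    """sendfile path: the file data does not pass through a user buffer."""
    with open(path, "rb") as f:
        return sock.sendfile(f)  # returns the number of bytes sent


def recv_exact(sock, n):
    """Read exactly n bytes from the socket (or fewer if it closes)."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            break
        data += chunk
    return data


payload = b"".join(b"record-%d\n" % i for i in range(50))
path = os.path.join(tempfile.mkdtemp(), "segment.log")
with open(path, "wb") as f:
    f.write(payload)

broker, consumer = socket.socketpair()  # stand-in for a network connection
copied = send_with_copy(broker, path)
via_copy = recv_exact(consumer, copied)
zero = send_zero_copy(broker, path)
via_sendfile = recv_exact(consumer, zero)
broker.close()
consumer.close()
```

The receiver gets the same bytes in both cases; what changes is how many times the data is copied, and how many user/kernel transitions happen, on the sending side.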

From kernel version 2.4 onward, the descriptor structure was changed so that sendfile works in an even simpler way and eliminates one more copy operation.

In Apache, nginx, lighttpd and other web servers, there is a sendfile related configuration. Using sendfile can greatly improve the file transfer performance.

Kafka stores messages in files. When a consumer needs data, Kafka sends the file content directly to the consumer: with MMAP as the way of reading and writing the file, it hands the data straight to sendfile.

Batch compression

In many cases the bottleneck of the system is not the CPU or the disk but network I/O, especially for data pipelines that need to send messages between data centers over a WAN.

Compressing data costs a small amount of CPU, but for Kafka network I/O matters more:

  • Compressing each message individually gives a relatively low compression ratio, so Kafka uses batch compression: multiple messages are compressed together rather than one at a time.
  • Kafka allows recursive message sets, so a batch of messages can be transmitted in compressed form and stay compressed in the log until a consumer decompresses it.
  • Kafka supports several compression codecs, including gzip and snappy.
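
The benefit of batching before compressing can be seen in a small experiment with Python's gzip module. The message contents and the newline-delimited batch framing are made up for this sketch and are not Kafka's wire format.

```python
import gzip
import json

# 100 small, similar messages, as is typical of event or log data.
messages = [
    json.dumps({"user": "user-%d" % i, "event": "page_view", "page": "/home"}).encode()
    for i in range(100)
]

# Compress every message on its own.
individual_size = sum(len(gzip.compress(m)) for m in messages)

# Compress the same messages as one newline-delimited batch.
batch = b"\n".join(messages)
compressed_batch = gzip.compress(batch)
batch_size = len(compressed_batch)

# The consumer decompresses the batch and splits it back into messages.
restored = gzip.decompress(compressed_batch).split(b"\n")
```

Small messages share most of their structure with their neighbours, so the compressor finds far more redundancy across a batch than inside any single message, and the per-message gzip header overhead is paid only once.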


The secret of Kafka’s speed is that it turns all messages into batched files, compresses them in reasonably sized batches to reduce network I/O, and improves I/O speed through MMAP.

When writing, data is only ever appended to the end of a single partition, so write speed is optimal; when reading, data is sent straight out with sendfile.