Three cheap machines write 2 million per second! Why is Kafka so fast?


Author: Bing Yue. Profile: senior engineer at Ali, focusing on distributed systems and high-availability architecture, methodology and cognitive growth, practice and continuous learning.

A brief review of Kafka in 30 seconds

Let's take a quick look at Kafka and some details of how it works. Kafka is a distributed messaging system that was originally built at LinkedIn, is now part of the Apache Software Foundation, and is used by a wide range of companies.

Kafka's messages are stored or cached on disk. It is commonly believed that reading and writing data on disk hurts performance because seeking takes time. In fact, high throughput is one of Kafka's defining features. Even on ordinary servers, Kafka can readily handle millions of writes per second, more than most message-oriented middleware. This is also why Kafka is widely used in massive-data scenarios such as log processing.

The general setup is very simple. Producers send records to the cluster, which stores them and passes them on to consumers.


Kafka's central abstraction is the topic. Producers publish records to topics, and consumers subscribe to one or more topics. A Kafka topic is just a sharded write-ahead log. Producers append records to these logs, and consumers subscribe to the changes. Each record is a key/value pair. The key is used to assign the record to a specific log partition (unless the publisher specifies the partition directly).
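The key-to-partition assignment described above can be sketched in a few lines of Python. This is only an illustration: `choose_partition` is a hypothetical helper, not part of any Kafka client, and CRC32 is used as a stand-in hash (real clients use their own hash function).

```python
import zlib


def choose_partition(key: bytes, num_partitions: int, explicit_partition=None) -> int:
    """Pick a partition for a record (hypothetical helper, not a Kafka API)."""
    if explicit_partition is not None:
        # The publisher named the partition directly, so the key is ignored.
        return explicit_partition
    # Stand-in hash: CRC32 is stable across runs, unlike Python's built-in hash().
    return zlib.crc32(key) % num_partitions


# The same key always lands in the same partition of a two-partition topic.
first = choose_partition(b"user-42", 2)
second = choose_partition(b"user-42", 2)
forced = choose_partition(b"user-42", 2, explicit_partition=1)
```

Because the mapping depends only on the key and the partition count, all records with the same key end up in the same log and keep their relative order.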

Here is a simple example of a single producer and a single consumer writing to and reading from a two-partition topic.


The figure shows a producer process appending to the logs of two partitions, and a consumer reading from the same logs. Each record in a log has an associated entry number, called the offset. The consumer uses this offset to describe its position in each log.

These partitions are spread across the cluster, allowing a topic to hold more data than fits on any single machine.

Note that, unlike most messaging systems, the log is always persistent. When a message is received, it is written straight to the file system. Messages are not deleted when they are read; they are retained according to a configurable SLA (say, a few days or a week). This makes Kafka usable in situations where consumers may need to reload data. It also makes publish-subscribe space-efficient, because there is a single shared log no matter how many consumers there are. In traditional messaging systems each consumer usually has its own queue, so adding a consumer can double your data volume. This makes Kafka well suited to uses beyond normal messaging, such as a pipeline into offline data systems like Hadoop. Such offline systems may load data only as part of a periodic ETL cycle, or may be down for several hours of maintenance, during which Kafka can buffer terabytes of unconsumed data if needed.

Kafka also achieves fault tolerance by replicating logs to multiple servers. One important architectural difference from other messaging systems is that replication is not an add-on that requires complex configuration and is used only in very special cases. Replication is assumed to be the default: unreplicated data is treated as a special case in which the replication factor happens to be one.

When a producer publishes a message, it receives an acknowledgement containing the record's offset. The first record published to a partition gets offset 0, the second gets offset 1, and so on in increasing sequence. A consumer reads data from the position given by an offset and periodically commits that position: the offset is saved so that, if the consumer instance crashes, another instance can resume from it.
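As a rough illustration of offsets, acknowledgements, and committed positions, here is a toy in-memory model. `PartitionLog`, `Consumer`, and the dictionary used as an offset store are all hypothetical names made up for this sketch; they do not correspond to Kafka's actual classes.

```python
class PartitionLog:
    """Append-only log for one partition (illustrative, held in memory)."""

    def __init__(self):
        self.records = []

    def append(self, key, value):
        """Append a record and return its offset as the acknowledgement."""
        self.records.append((key, value))
        return len(self.records) - 1

    def read_from(self, offset):
        """Return all records at or after the given offset."""
        return self.records[offset:]


class Consumer:
    """Tracks its own position and commits it to a shared store."""

    def __init__(self, log, committed):
        self.log = log
        self.committed = committed  # hypothetical offset store (a plain dict)
        self.position = committed.get("offset", 0)

    def poll(self):
        batch = self.log.read_from(self.position)
        self.position += len(batch)
        return batch

    def commit(self):
        self.committed["offset"] = self.position


log = PartitionLog()
acks = [log.append("k", v) for v in ("a", "b", "c")]  # offsets grow from 0

store = {}
first = Consumer(log, store)
first.poll()
first.commit()  # the position after the third record is saved
log.append("k", "d")

# A replacement instance resumes from the committed offset, not from 0.
second = Consumer(log, store)
resumed = second.poll()
```

Reading never removes anything from the log; the only state that changes on the consumer side is the saved position.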

I hope this helps (if not, you can read a more complete introduction to Kafka here).

For the benchmark of Kafka, you can refer to Apache Kafka benchmark “write 2 million per second (on three cheap machines)”:…

Next, we analyze why Kafka is so fast from two angles: writing data and reading data.

Data writing

Kafka writes every message it receives to the hard disk, so that messages are persisted rather than held only in memory. To optimize write speed, Kafka uses two techniques: sequential writes and MMFile (memory-mapped files).

Sequential write

Disk read and write speed depends on how the disk is used, that is, sequentially or randomly. With sequential access, disk throughput can be comparable to that of memory.

A hard disk is a mechanical device: every read or write first has to seek to the target location and only then transfers data. The seek is a "mechanical action" and is the most time-consuming part.

So hard disks are at their worst with random I/O and at their best with sequential I/O. To speed up disk reads and writes, Kafka uses sequential I/O.

Moreover, Linux applies many optimizations to disk reads and writes, including read-ahead, write-behind, and the disk cache.

If these operations were done in memory instead, two problems would arise: the memory overhead of Java objects is very high, and as the amount of data on the heap grows, Java GC pauses become very long.

There are several benefits of using disk operations:

  • Sequential disk reads and writes can be faster than random memory access.
  • JVM garbage collection is inefficient and memory-hungry with large heaps; using the disk avoids this problem.
  • The operating system's disk cache stays warm even after the Kafka process restarts, whereas an in-process cache would have to be rebuilt.

The following figure shows how Kafka writes data. Each partition is in effect a file. After receiving a message, Kafka appends the data to the end of the file (the dashed box in the figure).


This approach has a drawback: individual records cannot be deleted, so Kafka does not delete a message once it has been read; it keeps the data. Each consumer has an offset for each partition of each topic, indicating how far it has read.


Two consumers:

  • Consumer1 has two offsets corresponding to partition0 and partition1 respectively (assuming one partition for each topic).
  • Consumer2 has an offset corresponding to partition2.

This offset is saved by the client SDK; in this model the Kafka broker is completely unaware of it.

Generally, the SDK saves it to ZooKeeper, which is why the consumer must be given the ZooKeeper address. (Newer Kafka clients instead commit offsets to an internal Kafka topic.)

If data were never deleted, the hard disk would eventually fill up. Kafka therefore provides two strategies for deleting data:

  • Time based
  • Based on partition file size

For the specific settings, please refer to Kafka's configuration documentation.
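The two strategies can be pictured with a small sketch. It assumes, purely for illustration, that a partition is stored as a list of segment files ordered oldest first; `Segment`, `segments_to_delete`, the file names, and the limits are hypothetical and are not Kafka's real data structures or configuration keys.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One file of a partition log (hypothetical model)."""
    name: str
    size_bytes: int
    last_modified: float  # seconds since the epoch


def segments_to_delete(segments, now, max_age_seconds=None, max_total_bytes=None):
    """Return the oldest segments that violate either retention limit.

    `segments` must be ordered oldest first.
    """
    doomed = []
    remaining = list(segments)
    # Time based: drop segments whose last write is older than the limit.
    if max_age_seconds is not None:
        while remaining and now - remaining[0].last_modified > max_age_seconds:
            doomed.append(remaining.pop(0))
    # Size based: drop the oldest segments until the partition fits the budget.
    if max_total_bytes is not None:
        total = sum(s.size_bytes for s in remaining)
        while remaining and total > max_total_bytes:
            oldest = remaining.pop(0)
            total -= oldest.size_bytes
            doomed.append(oldest)
    return doomed


DAY = 86400
now = 1_000_000.0
segs = [
    Segment("00000000.log", 100, now - 10 * DAY),
    Segment("00001000.log", 100, now - 3 * DAY),
    Segment("00002000.log", 100, now - 60),
]
by_time = segments_to_delete(segs, now, max_age_seconds=7 * DAY)
by_size = segments_to_delete(segs, now, max_total_bytes=150)
```

Deleting whole files from the old end of the log keeps the write path purely sequential, which is why retention works at this granularity rather than per message.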

Memory Mapped Files

Even with sequential writes, hard disk access cannot catch up with memory. Kafka therefore does not write data to the hard disk in real time; it makes full use of the paged memory of modern operating systems to improve I/O efficiency.

Memory-mapped files (hereinafter MMAP) work by using the operating system's pages to map a file directly into memory. On a 64-bit operating system, a mapping can generally cover data files of around 20 GB.

Once the mapping is set up, your operations on that memory are synchronized to the hard disk by the operating system at an appropriate time.

Through MMAP, a process can read and write the file as if it were reading and writing memory (virtual memory, of course), without caring how much physical memory there is. Virtual memory takes care of that for us.

This greatly improves I/O performance, because the overhead of copying between user space and kernel space is saved. (A regular read call on a file first places the data in kernel-space memory and then copies it into user-space memory.)

However, MMAP has an obvious drawback: it is unreliable. Data written through MMAP has not necessarily reached the hard disk yet. It is only guaranteed to be on disk once the program explicitly calls flush; otherwise the operating system writes it back when it sees fit.

Kafka provides a parameter, producer.type, to control whether it flushes actively:

  • If Kafka flushes immediately after writing to MMAP and only then returns to the producer, this is called sync.
  • If Kafka returns to the producer immediately after writing to MMAP without calling flush, this is called async.
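The difference between the two modes can be demonstrated with Python's standard `mmap` module. This is a minimal sketch of the general technique, not Kafka's code: `write_via_mmap`, the file name, and the one-page file size are assumptions made for the example.

```python
import mmap
import os
import tempfile


def write_via_mmap(path, payload, flush):
    """Write payload at the start of a pre-sized file through a memory map."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:  # length 0 maps the whole file
            mm[0:len(payload)] = payload      # a plain memory write, no write() call
            if flush:
                # "sync" style: force the dirty pages out before returning.
                mm.flush()
            # "async" style: return now and let the OS write back later.


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "segment.log")
    with open(path, "wb") as f:
        f.write(b"\x00" * mmap.PAGESIZE)      # a mapping needs a non-empty file

    write_via_mmap(path, b"sync-record", flush=True)
    with open(path, "rb") as f:
        after_sync = f.read(11)

    write_via_mmap(path, b"async-record", flush=False)
    with open(path, "rb") as f:
        after_async = f.read(12)
```

In both cases a later reader sees the new bytes, because reads are served from the same cached pages. What flush adds is the guarantee that the data has been pushed toward the disk before the call returns, which is the part that matters if the machine loses power.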

Data reading

What optimizations does Kafka make when reading disks?

Zero copy based on sendfile

In the traditional mode, transferring a file over the network proceeds as follows:

  • The read function is called, and the file data is copied into a kernel buffer.
  • The read function returns, and the file data is copied from the kernel buffer to the user buffer.
  • The write function is called, and the file data is copied from the user buffer to the kernel's socket-related buffer.
  • The data is copied from the socket buffer to the relevant protocol engine.

This is the traditional read/write approach to sending a file over the network. In this process the file data actually goes through four copy operations:

Hard disk > kernel buffer > user buffer > socket-related buffer > protocol engine

The sendfile system call provides a method to reduce the above multiple copies and improve the file transfer performance.

The sendfile system call was introduced in kernel version 2.1 to simplify data transfer over the network and between two local files.

The introduction of sendfile not only reduces data replication, but also reduces context switching.

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

The operation process is as follows:

  • The sendfile system call is made, and the file data is copied into a kernel buffer.
  • The data is then copied from the kernel buffer to the socket-related buffer, still inside the kernel.
  • Finally, the data is copied from the socket-related buffer to the protocol engine.

Compared with the traditional read/write approach, the sendfile introduced in the 2.1 kernel eliminates the copy from the kernel buffer to the user buffer and the copy from the user buffer to the socket-related buffer.
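To make the comparison concrete, the sketch below sends the same file over a local socket pair in both ways. It relies on Python's `socket.sendfile`, which uses `os.sendfile` where the platform provides it and otherwise falls back to plain `send()`. The helper names, file name, and payload are assumptions for this example, and the payload is kept small so it fits in the socket buffer.

```python
import os
import socket
import tempfile


def send_with_read_write(path, sock, chunk_size=4096):
    """Traditional path: every chunk passes through a user-space buffer."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # kernel buffer -> user buffer
            if not chunk:
                break
            sock.sendall(chunk)         # user buffer -> socket buffer
            sent += len(chunk)
    return sent


def send_with_sendfile(path, sock):
    """sendfile path: the kernel moves file data to the socket itself."""
    with open(path, "rb") as f:
        return sock.sendfile(f)         # returns the number of bytes sent


def recv_exactly(sock, n):
    """Read n bytes from the socket, or fewer if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            break
        buf.extend(part)
    return bytes(buf)


payload = b"kafka-record\n" * 256
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "segment.log")
    with open(path, "wb") as f:
        f.write(payload)

    left, right = socket.socketpair()
    with left, right:
        n_copy = send_with_read_write(path, left)
        got_copy = recv_exactly(right, n_copy)
        n_zero = send_with_sendfile(path, left)
        got_zero = recv_exactly(right, n_zero)
```

The receiver gets identical bytes either way. The difference is that the second function never holds the file contents in a Python-level buffer, so the two user-space copies and their context switches are avoided.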

From kernel version 2.4 onward, the file descriptor structure was changed, and sendfile was implemented in a simpler way that removes one more copy operation.

In Apache, nginx, lighttpd and other web servers, there is a sendfile related configuration. Using sendfile can greatly improve the file transfer performance.

Kafka stores all messages in a series of files. When a consumer needs data, Kafka sends the file contents to the consumer directly. With MMAP as the file read/write mechanism, the data is handed straight to sendfile.

Batch compression

In many cases, the bottleneck of the system is not CPU or disk, but network IO, especially for the data pipeline that needs to send messages between data centers on WAN.

Compressing data costs a small amount of CPU, but for Kafka the savings in network I/O are what matter:

  • Compressing each message individually gives a relatively low compression ratio, so Kafka uses batch compression: multiple messages are compressed together rather than one at a time.
  • Kafka allows recursive message sets, so a batch of messages can be transmitted in compressed form and kept compressed in the log until it is decompressed by the consumer.
  • Kafka supports several compression codecs, including gzip and snappy.
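The effect of batching on compression ratio is easy to check with the standard `gzip` module. The message shape below is made up for the example, and joining messages with newlines is just a simple stand-in for a real batch format.

```python
import gzip
import json

# 200 small, similar messages, as is typical for event or log data.
messages = [
    json.dumps({"user_id": i, "event": "page_view", "path": "/products/list"}).encode()
    for i in range(200)
]

raw_size = sum(len(m) for m in messages)

# Per-message compression: each message pays the gzip header and trailer,
# and the codec cannot exploit repetition across messages.
individual_size = sum(len(gzip.compress(m)) for m in messages)

# Batch compression: the shared field names are encoded once and reused.
batch_compressed = gzip.compress(b"\n".join(messages))
batch_size = len(batch_compressed)

# The consumer decompresses the batch and recovers the original messages.
restored = gzip.decompress(batch_compressed).split(b"\n")

print(f"raw={raw_size} individual={individual_size} batch={batch_size}")
```

With messages this small, compressing them one by one can even make them larger than the raw data, while the batch shrinks considerably because the repeated structure is shared.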


The secret of Kafka's speed is that it treats messages as batches in files: it applies sensible batch compression to reduce network I/O, and it improves disk I/O through MMAP.

When writing, speed is optimal because each partition is only ever appended to at the end; when reading, data is sent out directly with sendfile.