Talk about Kafka: why is Kafka so fast?

Time: 2021-09-10

Welcome to my official account, Lao Zhou Chats Architecture: principles of the Java back-end technology stack, source-code analysis, architecture, and all kinds of Internet solutions for high concurrency, high performance, and high availability.

1、 Foreword

We all know that Kafka stores its data on disk, yet Kafka officially claims high performance, high throughput, and low latency, with throughput often reaching tens of millions of messages. Are you a little confused? It is generally believed that reading and writing on disk hurts performance, because seeking costs a lot of time. So how does Kafka achieve that kind of throughput?

Kafka's high performance is the result of many cooperating design choices, including its macro architecture, distributed partitioned storage, ISR data replication, and efficient use of disk and operating-system characteristics.

Don't worry. Next, Lao Zhou will take you through it along two dimensions: writing data and reading data.

2、 Sequential write

Disks can be read and written in two ways: sequentially or randomly. In the sequential case, disk read-write speed can be comparable to that of memory.

Because a disk is a mechanical device, every read or write involves seek -> write, where seeking is a "mechanical action". To improve disk read-write speed, Kafka uses sequential I/O.

Kafka uses a segmented, append-only log, which largely limits its reads and writes to sequential I/O and makes it fast on a wide variety of storage media. There is a widespread misconception that disks are slow; in fact, the performance of storage media (especially rotating mechanical hard disks) depends heavily on the access pattern. On a 7200 RPM SATA hard disk, random I/O is roughly 3 to 4 orders of magnitude slower than sequential I/O. Furthermore, modern operating systems provide read-ahead and write-behind techniques: they preload data in multiples of large blocks, and merge many small logical writes into one large physical write. Because of this, the gap between sequential and random I/O remains noticeable even on flash and other solid-state non-volatile media, although it is far less dramatic than on rotating media.

A performance comparison chart can be found in the well-known journal ACM Queue: https://queue.acm.org/detail.cf


Here is how Kafka writes data: each partition is effectively a file, and after receiving a message, Kafka appends the data to the end of that file.

This design is append-only and immutable: Kafka does not modify or delete data in place, and it retains all data (subject to the configured retention policy). Each consumer maintains an offset per partition that indicates how far it has read.

Sequential disk read-write is the most predictable disk access pattern, and operating systems are heavily optimized for it. Kafka improves performance by reading and writing sequentially: messages are continually appended to the end of local disk files rather than written at random positions, which significantly improves Kafka's write throughput.
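
To make the append-only pattern concrete, here is a minimal Java NIO sketch (not Kafka's actual implementation; the segment file name mimics Kafka's naming, and the record format is made up):

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendOnlyLog {
    public static void main(String[] args) throws Exception {
        // APPEND guarantees every write lands at the current end of the file,
        // so the disk access pattern stays strictly sequential.
        try (FileChannel log = FileChannel.open(Path.of("00000000000000000000.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer record = ByteBuffer.wrap("key=1,value=hello\n".getBytes(StandardCharsets.UTF_8));
            log.write(record); // appended at the tail; earlier records are never overwritten
        }
    }
}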

3、 Page cache

Even with sequential writes, a hard disk's access speed still cannot catch up with memory. Therefore, Kafka does not write data to the hard disk in real time. Instead it makes full use of the paged memory management of modern operating systems, using memory to improve I/O efficiency: data bound for disk is cached in memory, turning disk access into memory access.

Kafka receives network data from the socket buffer; the broker process can persist it without intermediate processing, which can be done with mmap memory-mapped files.

3.1 Memory-Mapped Files (mmap)

Abbreviated mmap, its function can be summarized briefly: map a disk file into memory, so that users can modify the disk file by modifying the memory.

It works by using the operating system's paging to establish a direct mapping from a disk file to physical memory. Once the mapping is complete, your operations on that memory are synchronized to the hard disk by the operating system at an appropriate time.

Through mmap, a process can read and write the hard disk as if it were ordinary memory (virtual memory, of course). This brings a large I/O improvement and saves the overhead of copying between user space and kernel space.

mmap also has an obvious drawback: it is unreliable. Data written through mmap is not actually on the hard disk yet; the operating system writes it to disk only when the program actively calls flush.

Kafka provides a parameter, producer.type, to control whether it flushes actively:

  • sync: Kafka flushes immediately after writing to the mmap region, and only then returns to the producer;
  • async: Kafka returns to the producer immediately after writing to the mmap region, without calling flush.

3.2 Java NIO support for file mapping

Java NIO provides a MappedByteBuffer class that can be used to implement memory mapping.

A MappedByteBuffer can only be obtained by calling FileChannel's map() method; there is no other way.

FileChannel.map() is an abstract method; the concrete implementation is FileChannelImpl.map() (you can check the JDK source yourself). Its map0() method calls the Linux kernel's mmap API.
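
Here is a minimal sketch of the API, assuming an illustrative file name demo.log (not taken from Kafka's source):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("demo.log", "rw");
             FileChannel channel = file.getChannel()) {
            // Map the first 1 KiB of the file into memory, read-write.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
            // put() writes into the page cache; the OS decides when to flush to disk.
            buffer.put("hello kafka".getBytes());
            // force() is the explicit flush mentioned above.
            buffer.force();
        }
    }
}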

3.3 Precautions for using the MappedByteBuffer class

An mmap file mapping is released only during a full GC. When closing, you need to clean up the memory-mapped file manually, which can be done through sun.misc.Cleaner.

When a process is ready to read the contents of a file on disk:

  • The operating system first checks whether the page containing the requested data is in the page cache. If it is (a hit), the data is returned directly, avoiding I/O against the physical disk;
  • If it is not (a miss), the operating system issues a read request to the disk, stores the page that was read in the page cache, and then returns the data to the process.

If a process needs to write data to disk:

  • The operating system likewise checks whether the page corresponding to the data is in the page cache. If not, the page is first added to the page cache, and then the data is written into that page.
  • A modified page becomes a dirty page; the operating system writes the data in dirty pages to disk at an appropriate time to keep the data consistent.

A process caches the data it needs in its own memory, but that same data may also be cached in the operating system's page cache, so the same data can end up cached twice. Moreover, unless direct I/O is used, the page cache is hard to bypass.

With the page cache, the cached data stays valid even if the Kafka service is restarted, whereas an in-process cache would have to be rebuilt. It also greatly simplifies the code: keeping the page cache consistent with the files is the operating system's responsibility, which is safer and more efficient than maintaining consistency inside the process.

The page cache is used extensively in Kafka and is one of the important factors behind Kafka's high throughput.

Messages are first written to the page cache, and the operating system takes care of flushing them to disk.

4、 Zero copy

A typical source of application inefficiency is copying byte data between buffers. Kafka uses a binary message format that is shared by producers, brokers, and consumers, so data blocks can flow end to end without modification, even in compressed form. Eliminating structural differences between the communicating parties is an important step, but it does not by itself avoid copying data.

Kafka uses Java's NIO framework, in particular the transferTo() method of java.nio.channels.FileChannel, to solve the data-copy problem just mentioned on UNIX-like systems such as Linux. This method transmits bytes from a source channel directly to a receiving channel, without the application acting as an intermediary. To appreciate the improvement NIO brings, consider the traditional approach as two separate operations: the data in the source channel is read into a byte buffer, then written to the receiving channel:

File.read(fileDesc, buf, len);
Socket.send(socket, buf, len);

Although the process above looks simple enough, it requires 4 context switches between user mode and kernel mode, and 4 data copies, to complete the operation.

Let's walk through the details:

  • The initial read() call causes a context switch from user mode to kernel mode. The DMA (Direct Memory Access) engine reads the file and copies its contents into a buffer in the kernel address space. This buffer is not the same as the one used in the code snippet above.
  • Before read() returns, the data in the kernel buffer is copied into the user-mode buffer. At this point, our program can read the contents of the file.
  • The subsequent send() call switches back to kernel mode and copies the user-mode buffer into the kernel address space, this time into a different buffer associated with the target socket. In the background, the DMA engine takes over and asynchronously copies the data from the kernel buffer to the protocol stack, where the network card transmits it. send() does not wait for this to finish before returning.
  • The send() call returns, switching back to user mode.

Although mode switching is inefficient and requires extra copies, in many cases the intermediate kernel buffer actually improves performance: for example, it can serve as a read-ahead cache that asynchronously preloads data blocks, letting requests run ahead of the application. However, when the amount of data requested greatly exceeds the kernel buffer size, the kernel buffer becomes a bottleneck: rather than the data being transferred directly, the system is forced to shuttle between user mode and kernel mode until all the data has been transmitted.

In contrast, the zero-copy approach handles this in a single operation. The code snippet from the previous example can be rewritten as a one-liner:

fileDesc.transferTo(offset, len, socket);

The zero-copy method works as follows.

In this mode, the read/send pair is replaced by a single call. Specifically, the transferTo() method instructs the block device to read data into the read buffer via the DMA engine; the data in that buffer is then copied to another kernel buffer associated with the socket; finally, DMA copies the data from the socket buffer to the NIC buffer.

As a result, we have reduced the number of copies from 4 to 3, and only one of those copies uses the CPU. The number of context switches has also dropped from 4 to 2.

On the broker, FileChannel reads the disk file into the OS kernel buffer and transfers it directly to the SocketChannel for sending; the underlying mechanism is sendfile. This is how consumers read data from brokers.

Specifically, Kafka's data transmission is done through TransportLayer; its subclass PlaintextTransportLayer implements zero copy via the transferTo() and transferFrom() methods of Java NIO's FileChannel.

Note: transferTo() and transferFrom() do not guarantee that zero copy is actually used; that depends on operating system support.
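
As an illustration of the API (a minimal sketch, not Kafka's broker code; the file name, host, and port are made up), FileChannel.transferTo() can send a file straight to a socket:

import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = FileChannel.open(Path.of("00000000000000000000.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9092))) {
            long position = 0;
            long remaining = file.size();
            // transferTo() may send fewer bytes than requested, so loop until done.
            // On Linux this maps to sendfile(2): no copy through user space.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}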

This is a big improvement, but it is not yet complete "zero copy". With Linux kernel 2.4 or later and a network card that supports the gather operation, we can optimize further and achieve true "zero copy".

Calling transferTo() causes the device to read data into the kernel read buffer through the DMA engine, as before. However, with the gather operation, the copy between the read buffer and the socket buffer no longer happens. Instead, the NIC is given a descriptor pointing to the read buffer, along with the offset and length, and DMA pulls the data from there and copies it into the NIC buffer. In this process, copying data between buffers consumes no CPU at all.

Performance comparisons between the traditional method and the zero-copy method, over file sizes ranging from a few MB up to a GB, show that zero copy improves performance by a factor of 2 to 3. Even more impressive, Kafka achieves this in a plain JVM, without native libraries or JNI code.

5、 Broker performance

5.1 Log record batching

Sequential I/O is very fast on most storage media, almost comparable to the peak performance of network I/O. In practice, this means a well-designed log-structured persistence layer can keep pace with network traffic; indeed, Kafka's bottleneck is usually the network rather than the disk. Therefore, beyond the low-level batching provided by the operating system, Kafka's clients and brokers combine multiple log records into a batch before sending them over the network. Batching amortizes the cost of network round trips by using larger packets, improving bandwidth efficiency.

5.2 Batch compression

The impact of batching is especially visible when compression is enabled, because compression efficiency usually increases with data size. With text-based formats such as JSON in particular, the effect can be dramatic, with compression ratios typically reaching 5x to 7x. Moreover, record batching is largely done on the client side, which shifts the load onto clients; this not only improves network bandwidth efficiency but also greatly improves the brokers' disk I/O utilization. Both knobs live in the producer configuration, as sketched below.
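
A minimal producer sketch, assuming a local broker at localhost:9092 and an illustrative topic name; batch.size, linger.ms, and compression.type are the standard producer settings involved:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", 65536);       // accumulate up to 64 KiB per partition batch
        props.put("linger.ms", 10);           // wait up to 10 ms for more records to join a batch
        props.put("compression.type", "lz4"); // compress whole batches, not individual records
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
        }
    }
}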

5.3 Non-forced flushing of buffered writes

Another low-level contributor to Kafka's high performance deserves a closer look: before acknowledging a write, Kafka does not actually call the fsync command; it only requires that the log record be written to the I/O buffer before replying with an ACK to the client. This little-known but crucial fact is what lets Kafka behave like an in-memory message queue: Kafka is, in effect, a disk-backed in-memory message queue (bounded by the size of the buffer/page cache).

On the other hand, this form of writing is unsafe, because a replica write failure can lose data even though the record appeared to be acknowledged. In other words, unlike a relational database, acknowledging a write does not mean it has been persisted. What actually makes Kafka durable is its design of running multiple in-sync replicas: even if one replica fails to write, the others (assuming there are several) remain available, provided the failures are uncorrelated (that is, not multiple replicas failing at once because of a common upstream fault). The combination of non-blocking I/O without fsync and redundant in-sync replicas is what gives Kafka high throughput, durability, and availability all at once.
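
On the client side, this trade-off surfaces through the producer's acks setting; continuing the producer sketch above (acks is a standard producer config, and log.flush.interval.messages / log.flush.interval.ms are the broker-side flush settings, normally left at their defaults):

// acks=all: the leader replies only after all in-sync replicas have the record
// (in their page cache; the acknowledgement still does not force an fsync).
props.put("acks", "all");
// Brokers can be forced to flush via log.flush.interval.messages /
// log.flush.interval.ms, but the Kafka docs recommend relying on replication
// and the operating system instead.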

6、 Stream data parallelism

The efficiency of log-structured I/O is a key performance factor, mainly affecting writes; the parallelism of Kafka's topic structure and consumer groups is the basis of its read performance. Together they produce very high end-to-end messaging throughput. Concurrency is baked into Kafka's partitioning scheme and the operation of consumer groups: an effective load-balancing mechanism that distributes data partitions approximately evenly across the consumer instances in a group. Compare this with a more traditional MQ: in an equivalent RabbitMQ setup, multiple concurrent consumers may read from the queue in round-robin fashion, but in doing so the ordering of message consumption is lost.

The partitioning mechanism also lets Kafka brokers scale horizontally. Each partition has a dedicated leader, so any non-trivial topic (one with multiple partitions) can spread write load across the entire broker cluster. This is another difference between Kafka and traditional message queues: the latter use clustering to gain availability, whereas Kafka genuinely balances load across brokers to gain availability, durability, and throughput.

The producer determines the partition when publishing a record. Suppose you are publishing to a topic with multiple partitions (single-partition topics exist too, and pose no problem here). This can be done directly, by specifying a partition index, or indirectly, via a record key, which is deterministically hashed to a consistent (i.e., the same every time) partition index. Records with the same hash value are stored in the same partition. Assuming the topic has multiple partitions, records with different hash values will likely land in different partitions; however, because of hash collisions, records with different hash values may still end up in the same partition. That is the essence of hashing; if you understand how a hash table works, this should be obvious. A sketch of key-based publishing follows.
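
Continuing the producer sketch from section 5 (topic and key names are illustrative):

// Records with the same key always hash to the same partition,
// so per-key ordering is preserved.
producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
producer.send(new ProducerRecord<>("orders", "customer-42", "order paid"));
// A different key may hash to a different partition.
producer.send(new ProducerRecord<>("orders", "customer-7", "order created"));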

The actual processing of records is done by consumers operating within (optional) consumer groups. Kafka guarantees that a partition is assigned to at most one consumer in its consumer group. (We say "at most" because all the consumers could be offline.) When the first consumer in a group subscribes to a topic, it consumes the data of all partitions of that topic; when a second consumer then joins, it takes over roughly half of the partitions, halving the first consumer's load. This lets you process event streams in parallel and add consumers as needed (ideally with an auto-scaling mechanism), provided you have partitioned the streams appropriately.

The control of logging throughput is generally achieved in the following two ways:

  • The topic's partitioning scheme. Topics should be partitioned to maximize the number of independent sub-streams of events. In other words, record order should be preserved only where absolutely necessary; if two records are not meaningfully related, they should not be bound to the same partition. This implies using distinct key values, because Kafka uses a record's key as the hash source for its consistent partition mapping.
  • The number of consumers in the group. You can increase the number of consumers in a consumer group to balance the load of inbound records; the upper limit is the number of partitions in the topic. (You can add more consumers if you like, but the partition count caps the number of active consumers, each of which is assigned at least one partition; the surplus consumers stay idle.) Note that a consumer can be a process or a thread; depending on the workload, you can run multiple independent consumer threads or process records in a thread pool. A consumer-group sketch follows this list.
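
A minimal consumer-group sketch (group id, topic, and address are illustrative); running two copies of this program with the same group.id splits the topic's partitions between them:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors"); // members of one group share the topic's partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Each instance sees only the partitions assigned to it.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}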

If you have always wondered whether Kafka is fast, how it earned its well-recognized high-performance reputation, or whether it can meet your use cases, I believe you now have the answer you need.

To be completely clear, it must be noted that Kafka is not the fastest (that is, highest-throughput) messaging middleware; other platforms offer greater throughput, some implemented in software, some in hardware. Apache Pulsar is a promising technology: it is scalable and can achieve a better throughput-latency profile while providing the same ordering and durability guarantees. The fundamental reason to use Kafka is that, as a complete ecosystem, it remains unmatched: it delivers excellent performance and provides a rich, mature, and continually evolving environment. Although Kafka is already large, it is still growing at an enviable pace.

Kafka's designers and maintainers have done a great job creating a performance-first solution. Most of its design elements were conceived early on; almost nothing was an afterthought, and nothing feels bolted on. From offloading work to clients, to log-structured persistence on the broker, to batching, compression, zero-copy I/O, and stream-level parallelism, Kafka challenges virtually all other message-oriented middleware, commercial or open source. Most impressively, it does all this without sacrificing durability, record ordering, or at-least-once delivery semantics.

7、 Summary

7.1 mmap and sendfile

  • The Linux kernel provides and implements these zero-copy APIs.
  • mmap maps a disk file into memory, supports reads and writes, and memory operations are reflected in the disk file.
  • sendfile transfers data that has been read into kernel space directly to the socket buffer for network transmission.
  • RocketMQ uses mmap when consuming messages; Kafka uses sendfile.

7.2 Why is Kafka so fast?

  • Partitions are read and written sequentially, making full use of disk characteristics; this is the foundation.
  • Data produced by producers is persisted to brokers using mmap file mapping, achieving fast sequential writes.
  • Consumers read data from brokers via sendfile: the disk file is read into the OS kernel buffer and then moved directly to the socket buffer for network transmission.
  • Broker performance optimizations: log record batching, batch compression, non-forced flushing of buffered writes, and so on.
  • Stream data parallelism.