1、 Overview of Kafka persistence
Kafka relies on the file system to store and cache messages. The conventional wisdom is that hard disks are always slow, so can an architecture built on the file system deliver good performance? In fact, the speed of a hard disk depends largely on how it is used. At the same time, because Kafka runs on the JVM, caching data in heap memory has the following disadvantages:
- The memory overhead of objects is very high, often twice the size of the stored data or more
- As the amount of data in the heap grows, garbage collection (GC) becomes slower and slower
In fact, linear (sequential) disk writes are far faster than random writes. The operating system heavily optimizes linear reads and writes (with read-ahead, write-behind, and similar techniques), and in some cases they are even faster than random memory access. Therefore, unlike the common design of caching data in memory and then flushing it to the hard disk, Kafka writes data directly to a log on the file system:
- Write operation: append data sequentially to a file
- Read operation: read from the file
The benefits are as follows:
- Read operations do not block writes or other operations, and performance does not depend on the size of the data
- Hard disk space is far less constrained than memory
- Linear disk access is fast and stable, and data can be retained for longer
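The write and read operations described above can be illustrated with a minimal sketch. It is not Kafka's implementation: the class name AppendOnlyLog and the record layout (a 4-byte length followed by the payload) are assumptions chosen only to show sequential appends and reads by position.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Toy append-only log (hypothetical): 4-byte length prefix + payload per record."""

    def __init__(self, path):
        self.path = path
        # Create the file if it does not exist yet.
        open(path, "ab").close()

    def append(self, payload: bytes) -> int:
        """Append a record at the end of the file and return its byte position."""
        with open(self.path, "ab") as f:
            position = f.seek(0, os.SEEK_END)
            f.write(struct.pack(">I", len(payload)))
            f.write(payload)
        return position

    def read(self, position: int) -> bytes:
        """Read the record that starts at the given byte position."""
        with open(self.path, "rb") as f:
            f.seek(position)
            (length,) = struct.unpack(">I", f.read(4))
            return f.read(length)


def demo():
    # Write two records to a temporary file, then read them back by position.
    with tempfile.TemporaryDirectory() as directory:
        log = AppendOnlyLog(os.path.join(directory, "demo.log"))
        first = log.append(b"hello")
        second = log.append(b"kafka")
        return first, second, log.read(first), log.read(second)
```

Writes only ever touch the end of the file, and a read needs nothing more than a position, which is why readers in this sketch never have to coordinate with the writer over existing data.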
2、 Analysis of Kafka’s persistence principle
A topic is divided into multiple partitions. At the storage level, each partition is an append-only log file. Messages belonging to a partition are appended directly to the tail of its log file. The position of each message in the file is called its offset.
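As a rough illustration of the relationship between topics, partitions, and offsets, the sketch below models each partition as an in-memory list. The Topic class and its methods are hypothetical names for this example. The directory naming (topic name, a hyphen, then the partition number) follows the layout Kafka uses in its log directory.

```python
class Topic:
    """Toy model of a topic: a fixed number of partitions, each an append-only list."""

    def __init__(self, name: str, num_partitions: int):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def directory_names(self):
        """Directory name for each partition, in the form <topic>-<partition>."""
        return [f"{self.name}-{i}" for i in range(len(self.partitions))]

    def append(self, partition: int, message: bytes) -> int:
        """Append to the tail of one partition and return the message's offset."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1

    def read(self, partition: int, offset: int) -> bytes:
        """Fetch a message by partition and offset."""
        return self.partitions[partition][offset]


topic = Topic("mytopic1", 3)
first_offset = topic.append(0, b"a")   # first message in partition 0
second_offset = topic.append(0, b"b")  # next message in the same partition
other_offset = topic.append(2, b"c")   # offsets are counted per partition
```

Note that offsets are independent per partition: the first message in partition 2 gets offset 0 even though partition 0 already holds two messages.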
As shown in the following figure, we previously created mytopic1 with three partitions. We can go to the corresponding log directory to view them.
Kafka's logs are divided into index files and log files (as shown in the figure above). The two always appear in pairs: the index file stores metadata, and the log file stores the messages. The metadata in the index file points to the offset address of a message within the corresponding log file. For example, the entry 2, 128 refers to the second message in the log file, whose offset address is 128. The physical address (specified in the index file) together with the offset address locates the message.
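The lookup can be sketched as follows, reading the example entry as the pair (2, 128): a message number relative to the start of the segment, and a byte position in the log file. The function names and sample index entries are made up for illustration. Two details are taken from how Kafka lays out segments: the paired files share a name that is the segment's first offset zero-padded to 20 digits, and the index is sparse, so a lookup finds the nearest earlier entry and scans forward from there.

```python
import bisect


def segment_file_names(base_offset: int):
    """Names of the paired index and log files for a segment."""
    stem = f"{base_offset:020d}"
    return stem + ".index", stem + ".log"


# Hypothetical sparse index: (relative offset, byte position in the .log file),
# sorted by relative offset. Not every message has an entry.
SAMPLE_INDEX = [(0, 0), (2, 128), (5, 412)]


def lookup(index, base_offset: int, target_offset: int) -> int:
    """Return the byte position from which to start scanning for target_offset."""
    relative = target_offset - base_offset
    keys = [rel for rel, _ in index]
    # Find the last index entry whose relative offset is <= the target.
    i = bisect.bisect_right(keys, relative) - 1
    if i < 0:
        raise KeyError(f"offset {target_offset} precedes this segment")
    return index[i][1]
```

With the sample index, a request for relative offset 4 has no entry of its own, so the scan starts at position 128, the entry for relative offset 2.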
We can use the tools provided by Kafka to inspect the data in a log file.