1. Overview of Kafka message persistence
Kafka relies on the file system to store and cache messages. Hard disks are traditionally regarded as slow, so can a file-system-based architecture deliver good performance? In practice, the speed of a disk depends heavily on how it is used. In addition, because Kafka runs on the JVM, keeping data in heap memory has the following disadvantages:
- Objects carry a high memory overhead, often twice the size of the data being stored or more
- As the amount of data in the heap grows, garbage collection becomes slower and slower
In fact, linear (sequential) disk writes perform far better than writes to random locations. The operating system optimizes linear reads and writes heavily (read-ahead, write-behind, and similar techniques), and in some cases sequential disk access can even be faster than random memory access. Therefore, unlike the common design of caching data in memory and then flushing it to disk, Kafka writes data directly to a log on the file system:
- Write operation: append the data sequentially to the end of the file
- Read operation: read from the file
The benefits of this approach are:
- Reads do not block writes or other operations, and performance does not degrade as the data size grows
- Disk space is far less constrained than memory
- Linear disk access is fast, and data can be retained longer and more reliably
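The append-and-read model described above can be illustrated with a minimal sketch. This is a toy format, not Kafka's actual on-disk layout: the class name AppendOnlyLog, the 4-byte length prefix, and the demo helper are all assumptions made for illustration.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Toy append-only log (hypothetical): 4-byte length prefix + payload per record."""

    def __init__(self, path):
        self.path = path
        # Create the file if it does not exist yet.
        open(path, "ab").close()

    def append(self, payload: bytes) -> int:
        """Write one record at the end of the file and return its byte position."""
        with open(self.path, "ab") as f:
            position = f.seek(0, os.SEEK_END)
            f.write(struct.pack(">I", len(payload)))
            f.write(payload)
        return position

    def read_from(self, position: int = 0):
        """Yield (position, payload) pairs in order, starting at a record boundary."""
        with open(self.path, "rb") as f:
            f.seek(position)
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break  # reached the end of the log
                (length,) = struct.unpack(">I", header)
                yield position, f.read(length)
                position += 4 + length


def demo():
    """Append two records to a temporary log, then read them back sequentially."""
    with tempfile.TemporaryDirectory() as directory:
        log = AppendOnlyLog(os.path.join(directory, "demo.log"))
        first = log.append(b"hello")
        second = log.append(b"kafka")
        return first, second, list(log.read_from(0)), list(log.read_from(second))
```

Writes only ever touch the end of the file, and a reader scans forward from a known position, so neither side needs random access.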
2. Analysis of Kafka’s persistence principle
A topic is divided into multiple partitions. At the storage level, each partition is an append-only log file. Messages belonging to a partition are appended directly to the tail of its log file. The position of each message in the file is called its offset.
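As a rough in-memory model of this relationship, the sketch below gives each partition its own independent offset sequence. The Topic and Partition classes are hypothetical names for illustration; only the directory naming pattern (topic name, a hyphen, then the partition number) mirrors what Kafka uses on disk.

```python
class Partition:
    """Hypothetical in-memory stand-in for one partition's append-only log."""

    def __init__(self, name):
        self.name = name
        self.messages = []

    def append(self, message):
        """Append to the tail and return the offset assigned to the message."""
        self.messages.append(message)
        return len(self.messages) - 1


class Topic:
    """Hypothetical topic that owns a fixed number of partitions."""

    def __init__(self, name, num_partitions):
        # Partition names follow the <topic>-<partition> directory pattern.
        self.partitions = [Partition(f"{name}-{i}") for i in range(num_partitions)]

    def send(self, partition, message):
        """Append a message to the chosen partition and return its offset."""
        return self.partitions[partition].append(message)


topic = Topic("mytopic1", 3)
offsets = [
    topic.send(0, "a"),  # first message in partition 0
    topic.send(0, "b"),  # second message in partition 0
    topic.send(2, "c"),  # first message in partition 2
]
partition_names = [p.name for p in topic.partitions]
```

Offsets are only meaningful within a single partition: partition 0 and partition 2 both have a message at offset 0.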
As shown in the figure below, we created mytopic1 with three partitions. We can look in the corresponding log directory to see them.
Each Kafka log consists of an index file and a log file (as shown in the figure above), which always appear in pairs: the index file stores metadata, and the log file stores the messages. The metadata in the index file points to the offset address of a message in the corresponding log file. For example, the entry 2,128 refers to the second message in the log file, whose offset address is 128. The physical address (specified in the index file) plus the offset address locates the message.
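The lookup described above can be sketched as follows: find the nearest index entry at or before the wanted message, jump to its byte position, then scan forward. This is a simplified model under stated assumptions, not Kafka's real file format: the function names, the length-prefixed record layout, the 1-based message numbering, and the choice to index every second message are all hypothetical.

```python
import bisect
import struct


def build_log_and_index(messages, index_every=2):
    """Return (log_bytes, index); index holds (message_number, byte_position) pairs."""
    log = bytearray()
    index = []
    for number, payload in enumerate(messages, start=1):
        # Sparse index: record only every index_every-th message.
        if (number - 1) % index_every == 0:
            index.append((number, len(log)))
        log += struct.pack(">I", len(payload)) + payload
    return bytes(log), index


def lookup(log, index, target):
    """Locate message number `target` using the index plus a short forward scan."""
    numbers = [number for number, _ in index]
    # Greatest index entry whose message number is <= target.
    i = bisect.bisect_right(numbers, target) - 1
    if i < 0:
        raise KeyError(target)
    number, position = index[i]
    while position < len(log):
        (length,) = struct.unpack_from(">I", log, position)
        if number == target:
            return log[position + 4 : position + 4 + length]
        position += 4 + length
        number += 1
    raise KeyError(target)


log_bytes, index_entries = build_log_and_index([b"a", b"bb", b"ccc", b"dddd", b"eeeee"])
fourth = lookup(log_bytes, index_entries, 4)
```

Keeping the index sparse keeps it small; the cost is a short sequential scan after each jump, which is cheap for the reasons given in section 1.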
We can use a tool that ships with Kafka to inspect the contents of a log file.