A detailed explanation of the JuiceFS data read and write process

Time: 2021-12-26

For a file system, read and write efficiency has a decisive impact on overall system performance. This article introduces how JuiceFS processes read and write requests, so that you can gain a deeper understanding of its characteristics.

Write process

JuiceFS splits large files at multiple levels (see How JuiceFS stores files) to improve read and write efficiency. When processing a write request, JuiceFS first writes the data into the client's memory buffer and manages it there in the form of chunks and slices. A chunk is a contiguous logical unit of 64 MiB, split by offset within the file, and different chunks are completely isolated from each other. Each chunk is further divided into slices according to the actual write pattern of the application: when a new write is contiguous with or overlaps an existing slice, the data is updated on that slice directly; otherwise a new slice is created.

A slice is the logical unit from which data persistence starts. When a slice is flushed, its data is first split into one or more consecutive blocks of (by default) 4 MiB and uploaded to the object storage, with each block corresponding to one object; the metadata is then updated to record the new slice information. Obviously, when the application writes sequentially, only one continuously growing slice is needed and it is flushed just once at the end, which makes the best use of the object storage's write performance.

Take a simple juicefs bench run as an example: in its first stage, a 1 GiB file is written sequentially with 1 MiB I/Os. The flow of data through each component is shown in the following figure:

Note: compression and encryption in the figure are disabled by default. To enable them, add the --compress value or --encrypt-rsa-key value option when formatting the file system.
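
For reference, here is a minimal sketch of enabling both features at format time. The metadata URL, volume name, and key path are placeholders, and lz4 is just one possible value for --compress:

```
# Enable compression and encryption when formatting a volume (hypothetical
# metadata URL, volume name, and RSA key path).
juicefs format \
    --compress lz4 \
    --encrypt-rsa-key /path/to/my-priv-key.pem \
    redis://localhost:6379/1 myjfs
```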

Here is the metrics chart recorded by the juicefs stats command during the test, which shows the relevant information more intuitively:

In phase 1 of the chart above:

  • The average I/O size of object storage writes is object.put / object.put_c = 4 MiB, equal to the default block size
  • The ratio of metadata transactions to object storage writes is about meta.txn : object.put_c ~= 1 : 16, meaning that one slice flush needs 1 metadata modification and 16 object storage uploads; it also shows that each flush writes 4 MiB * 16 = 64 MiB of data, i.e. the default chunk size
  • The average request size at the FUSE layer is about fuse.write / fuse.ops ~= 128 KiB, consistent with its default request size limit
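
As a rough sketch of how the numbers above can be reproduced, you can run the benchmark against the mount point in one terminal and watch the client metrics in another (the mount point /mnt/jfs is a placeholder):

```
# Terminal 1: the benchmark; with default options its first stage writes a
# 1 GiB file sequentially using 1 MiB I/Os.
juicefs bench /mnt/jfs

# Terminal 2: real-time client metrics (usage, fuse, meta, blockcache, object).
juicefs stats /mnt/jfs
```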

Compared with sequential writes, random writes inside large files are much more complicated: each chunk may contain multiple discontinuous slices, which on the one hand makes it hard for data objects to reach the full 4 MiB size, and on the other hand requires the metadata to be updated many times. In addition, when too many slices have been written into a chunk, compaction is triggered to try to merge and clean them up, which further increases the load on the system. In such scenarios, therefore, JuiceFS performs noticeably worse than with sequential writes.

Small files are usually uploaded to the object storage when they are closed, and the corresponding I/O size is generally the file size. This can also be seen in phase 3 of the metrics chart above (creating 128 KiB small files):

  • The PUT size to the object storage is 128 KiB
  • The number of metadata transactions is roughly twice the number of PUTs, corresponding to one create and one write for each file

It is worth mentioning that for objects smaller than one block, JuiceFS also tries to write them into the local cache (specified by --cache-dir, which can be memory or a hard disk) while uploading them, in order to speed up any subsequent read requests. This is visible in the metrics chart as well: when small files are created, the blockcache row shows the same write bandwidth, and since most reads hit the cache (phase 4), the read speed of small files looks very fast.
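
As an illustration of the cache directory option mentioned above (the paths and metadata URL are placeholders), the cache can be placed on a faster device when mounting:

```
# Keep the local block cache on a fast SSD; as noted above, memory can also
# be used as the cache medium.
juicefs mount --cache-dir /ssd/jfsCache redis://localhost:6379/1 /mnt/jfs
```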

Since a write request returns as soon as the data is written into the client's memory buffer, the write latency of JuiceFS is generally very low (tens of microseconds). The actual upload to the object storage is triggered automatically by internal conditions (a single slice getting too large, too many slices, data staying in the buffer for too long, etc.) or actively by the application (closing a file, calling fsync, etc.). Data in the buffer can only be released after it has been persisted, so when write concurrency is high or object storage performance is insufficient, the buffer may fill up and block writes.

Specifically, the buffer size is set by the mount option --buffer-size and defaults to 300 MiB; its real-time usage can be seen in the buf column of the usage section in the metrics chart. When usage exceeds the threshold, the JuiceFS client actively adds a wait of about 10 ms to each write to slow down the write speed; if usage exceeds twice the threshold, new writes are suspended entirely until the buffer is released. Therefore, when write latency rises and the buffer stays above the threshold for a long time, you usually need to set a larger --buffer-size. In addition, increasing --max-uploads (the maximum number of concurrent uploads to the object storage, 20 by default) may also raise the write bandwidth to the object storage and thus speed up the release of the buffer.
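
A hedged example of the two tuning knobs just mentioned; the values are only illustrative and should be validated against the actual workload and object storage bandwidth:

```
# Enlarge the read/write buffer and allow more concurrent uploads so the
# buffer drains faster (defaults are 300 MiB and 20 respectively).
juicefs mount \
    --buffer-size 1024 \
    --max-uploads 50 \
    redis://localhost:6379/1 /mnt/jfs
```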

Writeback mode

When the requirements on data consistency and reliability are not strict, you can also add --writeback at mount time to further improve performance. In writeback mode, a slice flush returns as soon as the data has been written to the local staging directory (shared with the cache), and the data is then uploaded to the object storage asynchronously by a background thread. Note that JuiceFS's writeback mode is different from the commonly understood "write to memory first": the data still has to be written to the local cache directory (the exact behavior depends on the hardware behind the cache directory and the local file system). In other words, the local directory serves as a cache layer in front of the object storage.

When writeback mode is enabled, the size check on uploaded objects is skipped by default, and all data is kept in the cache directory as aggressively as possible. This is especially useful in scenarios that generate a large number of intermediate files (such as software compilation).

In addition, JuiceFS v0.17 adds the --upload-delay parameter, which delays uploading data to the object storage and caches it locally in an even more aggressive way. If the data is deleted by the application within the waiting period, it never has to be uploaded to the object storage at all, which both improves performance and saves cost. Meanwhile, compared with a local hard disk, JuiceFS provides a back-end guarantee: when the cache directory runs out of space, data is still uploaded automatically, so the application never notices an error. This is very effective for workloads with temporary storage needs such as Spark shuffle.
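
Putting these options together, a sketch of a mount command for such a temporary-data workload might look as follows (the delay value, metadata URL, and mount point are placeholders):

```
# Stage writes in the local cache directory and delay uploads; objects deleted
# by the application within the delay window never reach the object storage.
juicefs mount \
    --writeback \
    --upload-delay 1h \
    redis://localhost:6379/1 /mnt/jfs
```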

Read process

When processing read requests, JuiceFS generally reads from the object storage with 4 MiB block-aligned requests, which provides a certain degree of read-ahead. The data that is read is also written into the local cache directory for later use (as shown in phase 2 of the metrics chart, where blockcache has a high write bandwidth). During sequential reads, the prefetched data is accessed by subsequent requests and the cache hit rate is very high, so the read performance of the object storage can be fully exploited. The flow of data through the components at this point is shown in the following figure:

Note: after an object arrives at the JuiceFS client during a read, it is first decrypted and then decompressed, the reverse of the write path. Of course, if these features are not enabled, the corresponding steps are simply skipped.

When doing small random reads inside large files, this strategy is not very efficient; instead, read amplification and frequent writes to and evictions from the local cache lower the actual utilization of system resources. Unfortunately, it is hard for any general-purpose caching strategy to deliver a sufficient benefit in such scenarios. One direction worth considering is to increase the total cache capacity as much as possible, in the hope of caching nearly all of the data that will be needed; the other is to disable the cache entirely (by setting --cache-size 0) and squeeze the best possible read performance out of the object storage itself.
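
For the second direction, a minimal sketch (metadata URL and mount point are placeholders):

```
# Disable the local cache entirely and rely on the object storage's own read
# performance for random reads in large files.
juicefs mount --cache-size 0 redis://localhost:6379/1 /mnt/jfs
```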

Reading small files is comparatively simple: the whole file is usually read in a single request. Since small files are cached locally when they are written, an access pattern that reads soon after writing (such as juicefs bench) basically always hits the local cache directory, so the performance is very impressive.

Summary

The above is a brief description of how JuiceFS handles read and write requests. Because large files and small files have different characteristics, JuiceFS applies different read and write strategies to files of different sizes, which greatly improves overall performance and usability and better serves the needs of different scenarios.

Recommended reading: How to use Fluid + JuiceFS in a Kubernetes cluster

If you find this article helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)