Hudi clustering: data aggregation (I)



The business scenarios of a data lake mainly involve analyzing databases, logs, and files. Two important aspects of managing a data lake are write throughput and query performance. Here we mainly discuss the following problems:

1. To obtain better write throughput, data is usually written directly to files, which generates many small data files. Although small files increase write parallelism and allow files to be read in parallel, queries often have to fetch small amounts of data scattered across many small files, which adds significant I/O overhead.

2. Data is written to files in the order it enters the data lake, so data locality within a file is not optimal: records in a file are correlated with their ingestion batch rather than with the records they are frequently queried alongside. Small file sizes combined with poor data locality degrade query performance.

3. In addition, the performance of many file systems (including HDFS) degrades when there are many small files.

Hudi clustering

Hudi supports clustering to improve query performance without sacrificing write throughput. Clustering can rewrite data in two ways:

1. Data is first written to small files. After certain conditions are met (such as elapsed time, number of small files, or number of commits), the small files are stitched together into large files.

2. The data layout on disk is changed by sorting the data on chosen columns; this improves the locality of related records and can improve query performance.


(Users can set the small-file limit to 0 to force new data into new file groups.)
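As a sketch, inline clustering can be enabled through write configurations such as the following. The key names below come from recent Hudi releases and should be checked against the version in use; the values shown are illustrative:

```properties
# Schedule and execute clustering inline, after every N commits
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4

# Files smaller than this limit are clustering candidates
# (setting it to 0 forces data into new file groups)
hoodie.clustering.plan.strategy.small.file.limit=314572800

# Target maximum size of files produced by clustering
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824

# Sort rewritten data on these columns to improve locality
hoodie.clustering.plan.strategy.sort.columns=column1,column2
```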

Timeline of a COW table

The following example shows the state of a partition over time (T5 to T9). The main steps are as follows:

  1. At T5, a partition in the table has five file groups F0, F1, F2, F3, and F4, created at T0, T1, T2, T3, and T4 respectively. Assume each file group is 100MB, so the partition holds 500MB of data in total.
  2. A clustering operation is requested at T6. Similar to compaction, we create a "t6.clustering.requested" file in the metadata containing the "clustering plan", which covers all file groups involved in the clustering operation across all partitions. For example: { partitionPath: "datestr", oldFileGroups: [ {fileId: "F0", time: "T0"}, {fileId: "F1", time: "T1"}, …], newFileGroups: ["C1", "C2"] }
  3. Assume the maximum file size after clustering is configured to be 250MB. Clustering redistributes all the data in the partition into two file groups, C1 and C2. These file groups exist on storage at this point but are not visible to queries until clustering completes at T8.
  4. Note that records in a file group can be split across multiple file groups. In this example, some records from file group F4 go to both of the new file groups C1 and C2.
  5. While clustering is in progress (T6 to T8), any upserts involving these file groups are rejected.
  6. After writing the new data files c1-t6.parquet and c2-t6.parquet, if a global index is configured, we add entries to the record-level index for all keys, pointing at their new locations. The new index entries are not visible to other writers because there is no associated commit yet.
  7. Finally, we create a commit metadata file "t6.commit", which contains the file groups modified by this commit (F0, F1, F2, F3, F4).
  8. Note that the file groups (F0 through F4) are not deleted from disk immediately. The cleaner will remove these files before "t6.commit" is archived. Clustering also keeps all views and the underlying data files consistent.
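The visibility rule in the steps above can be sketched with a toy model (the class and method names here are illustrative, not Hudi's actual API): a query's view of the partition only switches from the old file groups to the new ones when the replace commit completes.

```python
# Toy model of the T5..T8 timeline: new file groups stay invisible
# to queries until the clustering (replace) commit completes.

class Partition:
    def __init__(self, file_groups):
        self.file_groups = set(file_groups)   # committed, query-visible
        self.pending = {}                     # requested clustering plan

    def request_clustering(self, old_groups, new_groups):
        # Like t6.clustering.requested: the plan is recorded,
        # but nothing changes for queries yet.
        self.pending = {"old": set(old_groups), "new": set(new_groups)}

    def visible(self):
        return sorted(self.file_groups)

    def complete_clustering(self):
        # Like t6.commit: atomically swap replaced groups for new ones.
        self.file_groups -= self.pending["old"]
        self.file_groups |= self.pending["new"]
        self.pending = {}

p = Partition(["F0", "F1", "F2", "F3", "F4"])
p.request_clustering(["F0", "F1", "F2", "F3", "F4"], ["C1", "C2"])
assert p.visible() == ["F0", "F1", "F2", "F3", "F4"]  # T6..T8: old view
p.complete_clustering()
assert p.visible() == ["C1", "C2"]                    # after T8: new view
```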

Timeline of a MOR table

This approach also supports MOR tables, and the process is very similar to that of a COW table.

The clustering output is written in parquet format.

Clustering steps

Overall, two steps are required:

  1. Schedule clustering: create a clustering plan.
  2. Execute clustering: execute the plan, creating new files and replacing the old ones.

Clustering scheduling

  1. Identify the files that meet the clustering criteria:
    1. Filter specific partitions (prioritizing the newest or oldest partitions according to configuration).
    2. Any file with size > targetFileSize is not eligible.
    3. Any file with a pending compaction or clustering plan is not eligible.
    4. Any file group containing log files is not eligible for clustering (this restriction may be lifted later).
  2. Group the eligible files by specific criteria. The total data size of each group is expected to be a multiple of targetFileSize. Grouping is done as part of the strategy defined in the plan:
    1. Group files by record key ranges. Because key ranges are stored in the parquet footer, this can benefit some queries/updates.
    2. Group files by commit time.
    3. Group files whose values in custom columns overlap (specifying the columns to sort on).
    4. Group files randomly.
    5. The group size can be capped to improve parallelism.
  3. Filter the groups based on specific criteria (similar to orderAndFilter in the compaction strategy).
  4. Finally, save the clustering plan to the timeline.
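Step 2 above can be sketched as a simple greedy grouping: files already at the target size are excluded, and the remaining small files are packed into groups whose total size is capped to preserve parallelism. All names and the packing policy here are illustrative, not Hudi's actual implementation:

```python
def plan_clustering_groups(files, target_file_size, max_group_size):
    """Greedy sketch: drop files already >= target_file_size, then pack
    the remaining small files into groups capped at max_group_size."""
    candidates = [(fid, size) for fid, size in files if size < target_file_size]
    groups, current, current_size = [], [], 0
    # Pack largest-first so groups fill up close to the cap
    for fid, size in sorted(candidates, key=lambda f: f[1], reverse=True):
        if current and current_size + size > max_group_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(fid)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Five 100MB small files plus one file already above the target size
files = [("f0", 100), ("f1", 100), ("f2", 100),
         ("f3", 100), ("f4", 100), ("big", 300)]
groups = plan_clustering_groups(files, target_file_size=250, max_group_size=500)
# "big" is excluded; the five small files fit into a single group
```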

Performing clustering

  1. Read the clustering plan and look at the number of "clustering groups" (for parallelism).
  2. Create the inflight clustering file on the timeline.
  3. For each group:
    1. Instantiate the appropriate strategy class with its strategyParams (for example: sortColumns).
    2. The strategy class defines a partitioner, which we can use to create buckets and write the data.
  4. Create the replacecommit:
    1. The operationType is set to "clustering".
    2. Extend the metadata with additional fields to track important information (the strategy classes can return this extra metadata):
      1. The strategy used for merging files.
      2. Tracking of the replaced files.
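The per-group execution in step 3 can be sketched as follows. This is a simplified illustration of a sort-columns strategy under assumed names; Hudi's actual strategy classes and partitioners work on distributed data, not in-memory lists:

```python
# Sketch of executing one clustering group: re-sort the records on the
# configured columns (strategy), then bucket the sorted records into
# target-sized new file groups (partitioner).

def execute_group(records, sort_columns, target_records_per_file):
    # Strategy step: sort to improve data locality on disk
    ordered = sorted(records, key=lambda r: tuple(r[c] for c in sort_columns))
    # Partitioner step: split the sorted stream into new "files"
    return [ordered[i:i + target_records_per_file]
            for i in range(0, len(ordered), target_records_per_file)]

records = [{"key": k, "ts": k % 3} for k in range(10)]
new_files = execute_group(records, sort_columns=["ts", "key"],
                          target_records_per_file=5)
# 10 records are split into 2 new files, each a contiguous sort range
```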