Hudi clustering data aggregation (II)

Time: 2022-1-14

Small file merging: analysis

Execution code:

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val t1 = "t1"
val basePath = "file:///tmp/hudi_data/"
val dataGen = new DataGenerator(Array("2020/03/11"))
//Generate 100 random records to insert
val updates = convertToStringList(dataGen.generateInserts(100))
val df = spark.read.json(spark.sparkContext.parallelize(updates, 1));

df.write.format("org.apache.hudi").
    options(getQuickstartWriteConfigs).
    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
    option(TABLE_NAME, t1).
    //Force a new data file to be generated on every write
    option("hoodie.parquet.small.file.limit", "0").
    //Enable inline clustering, executed as part of the write
    option("hoodie.clustering.inline", "true").
    //Trigger clustering every 4 commits
    option("hoodie.clustering.inline.max.commits", "4").
    //Maximum size of the files produced by clustering
    option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824").
    //Files smaller than this limit are treated as small files and are candidates for clustering
    option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").
    mode(Append).
    save(basePath+t1);

//Create a temporary view to check the total row count of the table
spark.read.format("hudi").load(basePath+t1).createOrReplaceTempView("t1_table")
spark.sql("select count(*) from t1_table").show()

In the example above, the clustering trigger frequency is specified (clustering runs every 4 commits) along with the file-size related settings: the maximum size of the newly generated files and the size limit under which a file counts as a small file.
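The walkthrough below uses the Spark web UI to check how many files each query reads. As a rough alternative from the spark-shell, Spark's Dataset.inputFiles lists the data files backing a DataFrame; a minimal sketch, assuming the table written above:

//List the data files that a snapshot read of the table is backed by
val snapshotDF = spark.read.format("hudi").load(basePath + t1)
println("number of data files: " + snapshotDF.inputFiles.length)
snapshotDF.inputFiles.foreach(println)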

Execution steps:

1. Generate data and insert data.

View the files currently on disk:

View the row count of the table:

View the number of files read by the SQL query in the Spark web UI:

So at this point the table contains 100 rows, one data file has been generated on disk, and querying the table's data reads only that single file.
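Since basePath points at the local filesystem, the data files can also be listed directly from the spark-shell; a small sketch, assuming the partition path 2020/03/11 used by the DataGenerator above:

import java.io.File
//List the parquet data files in the table's single partition
val partitionDir = new File("/tmp/hudi_data/t1/2020/03/11")
Option(partitionDir.listFiles).getOrElse(Array.empty[File]).
  filter(_.getName.endsWith(".parquet")).
  foreach(f => println(f.getName + "  " + f.length + " bytes"))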

2. Repeat the above operation twice.

View the files currently on disk:

View the row count of the table:

View the number of files read by the SQL query in the Spark web UI:

So far we have committed three writes, each of which generated one data file, for a total of three data files. Querying all of the data therefore requires reading three files.

3. Insert data again:

View the files currently on disk:

View the row count of the table:

View the number of files read by the SQL query in the Spark web UI:

Conclusion:

1. With hoodie.parquet.small.file.limit configured (set to 0), a new data file is generated every time data is committed.

2. Before clustering, reading all of the data in the table requires reading every data file each time.

3. After the fourth commit, clustering is triggered and a larger file is generated. From then on, reading all of the data only requires reading the merged large file. Under the .hoodie folder you can also see the corresponding replacecommit:
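A quick way to check this from the spark-shell is to list the timeline files under the .hoodie directory; a minimal sketch, again assuming the local base path used above (the clustering operation shows up as a *.replacecommit instant):

import java.io.File
//Print the completed instants of the timeline; clustering appears as *.replacecommit
Option(new File("/tmp/hudi_data/t1/.hoodie").listFiles).getOrElse(Array.empty[File]).
  map(_.getName).
  filter(n => n.endsWith(".commit") || n.endsWith(".replacecommit")).
  sorted.
  foreach(println)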

Small file merging + sort columns: analysis

Execution code:

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val t1 = "t1"
val basePath = "file:///tmp/hudi_data/"
val dataGen = new DataGenerator(Array("2020/03/11"))

var a = 0;
for (a <- 1 to 8) {
    //Generate and insert 10000 random records per commit
    val updates = convertToStringList(dataGen.generateInserts(10000))
    val df = spark.read.json(spark.sparkContext.parallelize(updates, 1));
    df.write.format("org.apache.hudi").
        options(getQuickstartWriteConfigs).
        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        option(TABLE_NAME, t1).
        option("hoodie.parquet.small.file.limit", "0").
        option("hoodie.clustering.inline", "true").
        //Trigger clustering after 8 commits
        option("hoodie.clustering.inline.max.commits", "8").
        //Sort the data by the fare column while clustering
        option("hoodie.clustering.plan.strategy.sort.columns", "fare").
        //Two close values (example numbers) so that clustering is triggered but the
        //number of data files stays the same before and after it
        option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1048576").
        option("hoodie.clustering.plan.strategy.small.file.limit", "1048576").
        mode(Append).
        save(basePath+t1);
    spark.read.format("hudi").load(basePath+t1).createOrReplaceTempView("t1_table")
    //Filter on the fare column and observe how much data is scanned in the Spark web UI
    spark.sql("select count(*) from t1_table where fare > 50").show()
}

Code analysis:

Compared with the previous code, several changes have been made:

1. Added a for loop:

Since we already know that the small files will only be merged into a larger file after 8 commits, we use a for loop to perform all 8 commits at once and then look directly at the results.

2. Added the hoodie.clustering.plan.strategy.sort.columns configuration:

This is the main point being tested here: this option makes clustering sort the data by the specified columns.

That is, during clustering Hudi re-reads all the files and sorts the data by the specified columns, so that related data ends up together and queries can filter more effectively (as demonstrated later). The comparison we want to make is to query with a condition on fare and observe how much data Hudi reads before and after clustering.

The expected result is that before clustering, because the data has not been organized by fare, the rows matching the filter are spread across all the files, so a large amount of data has to be read and the filtering is ineffective. After clustering, the data is redistributed according to the fare column; the rows matching the filter are now much more concentrated, so less data needs to be read and the filtering works far better (see the sketch after this list for one way to inspect the per-file fare ranges).

3. Modified hoodie.clustering.plan.strategy.target.file.max.bytes and hoodie.clustering.plan.strategy.small.file.limit:

What we want to measure is the filtering effect before and after clustering, so the number of data files must not change (otherwise, once the small files were merged into a single file, only one file would be read when querying and we could not tell whether the sort had any effect). These two options are therefore set to close values, so that clustering is still triggered but the number of files stays the same before and after it.
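To see what the sort columns actually change, one can look at the range of fare values stored in each data file; a small sketch using Spark's input_file_name function (before clustering every file should cover roughly the full fare range, afterwards each file should cover a much narrower slice):

import org.apache.spark.sql.functions.{input_file_name, min, max}
//Show the min/max fare contained in each data file of the table
spark.read.format("hudi").load(basePath + t1).
  groupBy(input_file_name().as("file")).
  agg(min("fare").as("min_fare"), max("fare").as("max_fare")).
  show(100, false)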

Execution results:

View the files currently on disk:

View the result of the 5th SQL filter query:

View the result of the 6th SQL filter query:

View the result of the 7th SQL filter query:

View the result of the last (8th) SQL filter query:

Conclusion:

1. Before clustering, filtering on the fare column still reads all of the data.

For example, on the 5th query the table contains 50000 rows in total and Hudi scans all 50000 rows; on the 6th query the table contains 60000 rows and Hudi scans 60000 rows; on the 7th query the table contains 70000 rows and Hudi scans 70000 rows.

2. After clustering, with the number of data files unchanged (8 data files both before and after), the data rearranged by the sort columns takes effect on the 8th query: instead of the 80000 rows that would otherwise be scanned, only 50405 rows are scanned, a clear improvement in filtering!