Introduction: This article introduces the performance advantages of Databricks Enterprise Edition Delta Lake. With these features, the query performance of Spark SQL can be greatly improved and the query speed of Delta tables can be accelerated.
Li Jingui (Jinxi) Alibaba Cloud Open Source Big Data Platform Development Engineer
Wang Xiaolong (Xiaolong) Alibaba Cloud Open Source Big Data Platform Technical Expert
Databricks is the world’s leading Data+AI company, the founding company of Apache Spark, and the largest code contributor to Spark. Its core business is building enterprise-grade Lakehouse products around Spark, Delta Lake, MLflow, and other open source projects. In 2020, Databricks and Alibaba Cloud jointly launched Databricks DataInsight (DDI), a fully managed, cloud-based big data analytics & AI platform built on Apache Spark. It provides data analytics, data engineering, data science, and artificial intelligence services, forming an integrated Lakehouse architecture.
Delta Lake is a data lake product with transaction support that Databricks began developing internally in 2016 and officially open-sourced in 2019. In addition to the community-led open source version, Delta Lake OSS, Databricks’ commercial products also provide an enterprise edition of the Spark & Delta Lake engine. This article introduces how the features of the enterprise edition optimize performance and enable efficient access to the Lakehouse.
An optimized solution to the small file problem
Frequently performing MERGE, UPDATE, and INSERT operations on a Delta table, or continuously inserting data in a stream processing scenario, generates a large number of small files in the Delta table. The growing number of small files hurts performance in two ways: it reduces the amount of data Spark can read sequentially at a time, lowering read efficiency, and it inflates the Delta table’s metadata, slowing metadata retrieval and thus table reads.
To solve the small file problem, Databricks provides three features that tackle it along two dimensions: avoiding the creation of small files, and merging them automatically or manually.
Feature 1: Optimize Delta table writes to avoid generating small files
In open source Spark, each executor writing data to a partition creates its own table file, which eventually leaves many small files in that partition. Databricks optimizes the Delta table write path: for each partition, a dedicated executor merges the writes from the other executors, thereby avoiding the generation of small files.
This feature is controlled by the table property delta.autoOptimize.optimizeWrite:
It can be specified when creating the table:
CREATE TABLE student (id INT, name STRING) TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');
Or set by modifying the table properties:
ALTER TABLE table_name SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');
This feature has two advantages:
- Improve the throughput of writing data by reducing the number of table files to be written;
- Avoid the generation of small files and improve query performance.
Its drawbacks are also obvious: because a single executor merges the writes to each table file, write parallelism is reduced, and the extra executor layer requires shuffling the data being written, which adds overhead. Therefore, evaluate the scenario before enabling this feature:
- Scenarios where this feature is applicable: Scenarios where SQL statements such as MERGE, UPDATE, DELETE, INSERT INTO, and CREATE TABLE AS SELECT are frequently used;
- Scenarios where this feature is not applicable: writing terabyte-scale data or more.
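Conceptually, optimized write behaves like shuffling the output by partition value before writing, so that one writer produces each partition’s file instead of every task producing its own. A minimal Python sketch of the idea (plain lists and dictionaries stand in for executors and table files; all names are illustrative, not Databricks internals):

```python
from collections import defaultdict

def naive_write(tasks):
    """Open-source behavior: every write task emits its own file per partition."""
    files = []
    for task_rows in tasks:
        per_partition = defaultdict(list)
        for partition, row in task_rows:
            per_partition[partition].append(row)
        files.extend(per_partition.items())  # one file per (task, partition) pair
    return files

def optimized_write(tasks):
    """Optimized write: rows are shuffled so a single writer handles each partition."""
    per_partition = defaultdict(list)
    for task_rows in tasks:
        for partition, row in task_rows:
            per_partition[partition].append(row)
    return list(per_partition.items())  # one file per partition

tasks = [[("p1", 1), ("p2", 2)], [("p1", 3), ("p2", 4)]]
print(len(naive_write(tasks)))      # 4 files: two tasks x two partitions
print(len(optimized_write(tasks)))  # 2 files: one per partition
```

The extra shuffle in `optimized_write` is exactly the overhead the text mentions: fewer output files, at the cost of moving data between executors first.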
Feature 2: Automatically merge small files
In stream processing scenarios, such as streaming data into the lake, newly arriving data must be continuously inserted into the Delta table, and each insert creates a new table file to store it. If the stream triggers every 10 seconds, such a job produces 8,640 table files per day. Since stream processing jobs are usually long-running, a job that runs for 100 days generates close to a million table files. For such a Delta table, merely maintaining the metadata becomes a major challenge, and query performance deteriorates sharply.
To solve the above problem, Databricks provides automatic small file merging. After each write to the Delta table, it checks the number of table files in the table; if the number of small files (files smaller than 128 MB are considered small) reaches a threshold, a compaction is performed that merges the small files in the Delta table into a new, larger file.
This feature is controlled by the table property delta.autoOptimize.autoCompact and, like delta.autoOptimize.optimizeWrite, can be specified when creating a table or set on an existing one. The trigger threshold is controlled by spark.databricks.delta.autoCompact.minNumFiles, which defaults to 50: once the number of small files reaches 50, they are merged. The maximum size of a merged file is 128 MB; to adjust the target size of merged files, set spark.databricks.delta.autoCompact.maxFileSize.
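The trigger logic described above can be sketched in a few lines of Python. This is an illustrative model of the check, not Databricks’ actual implementation; the parameter names mirror the configurations mentioned in the text:

```python
def should_auto_compact(file_sizes_mb, small_file_mb=128, min_num_files=50):
    """Model of the auto-compaction trigger: count files below the small-file
    size threshold and fire once the count reaches minNumFiles."""
    small_files = [s for s in file_sizes_mb if s < small_file_mb]
    return len(small_files) >= min_num_files

# 49 small files: below the default threshold, no compaction yet
print(should_auto_compact([1] * 49))    # False
# 50 small files: threshold reached, compaction is triggered
print(should_auto_compact([1] * 50))    # True
# Many large files: none count as small, so no compaction
print(should_auto_compact([200] * 100)) # False
```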
Feature 3: Manually merge small files
Automatic small file merging is triggered when a write to the Delta table leaves the number of small files at or above the threshold. Besides automatic merging, Databricks also provides the Optimize command so users can manually merge small files, optimizing the table structure and making the table files more compact. In its implementation, Optimize uses the bin-packing algorithm, which not only merges the small files in the table but also produces more evenly sized table files. For example, to optimize the table files of the Delta table student, simply execute:
OPTIMIZE student
The Optimize command supports merging small files not only across the entire table but also within specific partitions. For example, we can merge small files only in partitions whose date is on or after 2017-01-01:
OPTIMIZE student WHERE date >= '2017-01-01'
Experimental data on Databricks DataInsight shows that Optimize can improve query performance by more than 8x.
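The bin-packing idea behind Optimize can be illustrated with a small Python sketch. This is a generic first-fit-decreasing packer over file sizes, shown purely to convey the concept; the 1024 MB target and the algorithm details are illustrative assumptions, not Databricks’ exact implementation:

```python
def bin_pack(file_sizes_mb, target_mb=1024):
    """First-fit-decreasing bin packing: greedily group files into bins of
    roughly target_mb, yielding fewer and more evenly sized output files."""
    bins = []  # each bin: [remaining_capacity_mb, [packed_file_sizes]]
    for size in sorted(file_sizes_mb, reverse=True):
        for b in bins:
            if b[0] >= size:          # file fits in an existing bin
                b[0] -= size
                b[1].append(size)
                break
        else:                          # no bin has room: open a new one
            bins.append([target_mb - size, [size]])
    return [b[1] for b in bins]

small_files = [700, 400, 300, 200, 200, 100, 100]  # sizes in MB
packed = bin_pack(small_files)
print(len(packed))  # 7 input files compacted into 2 balanced output files
```

Sorting largest-first before packing is what keeps the resulting files similar in size, matching the “more balanced table files” behavior the text describes.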
Query optimization technology comparable to enterprise-level databases
Databricks has also made many optimizations in data query, including:
Feature 1: Data Skipping
In data query systems, there are two classic query optimization techniques: one is to process data at a faster speed, and the other is to reduce the amount of data that needs to be scanned by skipping irrelevant data. Data Skipping belongs to the latter optimization technology, which skips irrelevant table files through the statistical information of table files, thereby improving query performance.
When a table file is added to a Delta table, Delta Lake stores statistics for the first 32 columns of its data in the Delta table’s metadata, including each column’s minimum and maximum values and its null count. At query time, Databricks uses these statistics to improve performance. For example, suppose that for column x of a Delta table, one table file records a minimum of 5 and a maximum of 10. If the query condition is where x < 3, the statistics tell us that the file cannot contain the data we need, so we can skip it entirely, reducing the amount of data scanned and improving query performance.
The principle of Data Skipping resembles that of a Bloom filter: the query condition is used to judge whether a table file might contain the requested data, reducing the amount of data that must be scanned. If the requested data cannot exist in a table file, the file is skipped outright; if it might exist, the file must be scanned.
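The min/max pruning decision can be sketched directly in Python. This is a simplified model of the idea (a real engine handles nulls, more operators, and expression trees), using the x column example from the text:

```python
def can_skip(file_min, file_max, op, value):
    """File-level pruning from min/max statistics: return True when the
    predicate `col <op> value` cannot match any row in the file."""
    if op == "<":
        return file_min >= value   # even the smallest value fails `col < value`
    if op == ">":
        return file_max <= value   # even the largest value fails `col > value`
    if op == "=":
        return value < file_min or value > file_max
    return False                   # unknown predicate: must scan the file

# The example from the text: column x spans [5, 10] in this table file
print(can_skip(5, 10, "<", 3))   # True  -> the file is skipped
print(can_skip(5, 10, "=", 7))   # False -> the file must be scanned
```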
To skip as many irrelevant table files as possible, we need to narrow the min-max range of each table file so that similar data is clustered within the same files. A simple example: suppose a table contains 10 table files and its column x takes values in [1, 10]. If every table file’s x column spans the full range [1, 10], then for the condition where x < 3 no file can be skipped, and no performance gain is possible. But if each table file’s min-max gap is 0, that is, file 1 holds only x = 1, file 2 holds only x = 2, and so on, then for where x < 3, 80% of the table files can be skipped. Inspired by this idea, Databricks supports Z-Ordering to cluster data, narrowing each table file’s min-max range and improving query performance. The next section introduces the principle and usage of Z-Ordering optimization.
Feature 2: Z-Ordering optimization
As explained in the previous section, to skip as many irrelevant table files as possible, the columns used in query conditions should be as compact as possible within each table file (that is, with the smallest possible min-max gap). Z-Ordering achieves exactly this: it stores associated information in the same set of files across multiple dimensions. Strictly speaking, Z-Ordering is a data layout optimization algorithm, but combined with Data Skipping it can significantly improve query performance.
The use of Z-Ordering is very simple. For the table events, if the columns eventType and generateTime are often used as query conditions, then execute the command:
OPTIMIZE events ZORDER BY (eventType, generateTime)
The Delta table will use the columns eventType and generateTime to adjust the data layout so that the eventType and generateTime in the table file are as compact as possible.
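The core of Z-Ordering is mapping multiple column values onto a one-dimensional Z-order (Morton) curve by interleaving their bits; sorting rows by this key clusters rows that are close in every dimension into the same files. A minimal Python sketch for two integer columns (illustrative only; the real feature also handles other data types and column counts):

```python
def z_value(x, y, bits=8):
    """Interleave the bits of two column values into one Z-order key.
    Bit i of x goes to position 2i, bit i of y to position 2i+1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

rows = [(0, 0), (1, 1), (0, 1), (7, 7), (6, 7), (7, 6)]
rows.sort(key=lambda r: z_value(*r))
print(rows)  # pairs that are close in BOTH coordinates end up adjacent
```

After sorting, small (x, y) pairs cluster at the front and large ones at the back, so files cut from this order get narrow min-max ranges on both columns at once, which is what makes Data Skipping effective.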
According to our experiments on Databricks DataInsight, Z-Ordering optimization can achieve up to a 40x performance improvement. For the specific test cases, refer to the official Databricks DataInsight documentation linked at the end of this article.
Feature 3: Bloom filter index
The Bloom filter index is another very useful data-skipping technique. It can quickly determine whether a table file contains the data being queried and skip the file when it does not, thereby reducing the amount of data scanned and improving query performance.
If a Bloom filter index is created on a column of a table and where col = 'something' appears as a query condition, then while scanning the table’s files we can use the index to reach one of two conclusions: the file definitely does not contain rows with col = 'something', or the file may contain such rows.
- When the conclusion is that the file does not contain rows with col = 'something', the file can be skipped, reducing the amount of data scanned and improving query performance.
- Only when the conclusion is that the file may contain rows with col = 'something' does the engine process the file. Note that this is only a judgment that the file may contain the target data. Bloom filters define a metric for this misjudgment, the False Positive Probability (FPP): the probability that the filter reports a file as possibly containing the queried data when the file in fact does not.
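The two possible conclusions fall directly out of how a Bloom filter works: k hash functions set k bits per inserted key, and a lookup answers “definitely absent” or “possibly present”. A toy Python implementation of the data structure itself (illustrative only, not the index format Databricks stores on disk):

```python
import hashlib

class BloomFilter:
    """A toy Bloom filter: lookups can return false positives,
    but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one big integer

    def _positions(self, key):
        # Derive num_hashes positions by salting a SHA-256 hash of the key
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False  -> the key is definitely not present (file can be skipped)
        # True   -> the key may be present (file must be scanned)
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("something")
print(bf.might_contain("something"))  # True: the file may contain the row
print(bf.might_contain("other"))      # almost certainly False: skip the file
```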
Databricks supports file-level Bloom filters. If a Bloom filter index is created on certain columns of a table, each table file is associated with a Bloom filter index file, stored in the _delta_index subdirectory of the directory that holds the table file. Before reading a table file, Databricks checks the index file and, following the steps above, determines whether the file might contain the queried data; if not, the file is skipped directly, otherwise it is processed.
The creation of a Bloom filter index is similar to the creation of a traditional database index, but requires specifying the false positive probability and the number of possible values for the column:
CREATE BLOOMFILTER INDEX ON TABLE table_name FOR COLUMNS(col_name OPTIONS (fpp=0.1, numItems=50000000))
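The fpp and numItems options determine how large the index must be. Using the standard Bloom filter sizing formulas (a textbook calculation, not necessarily Databricks’ exact internal sizing), the optimal number of bits is m = -n·ln(p)/(ln 2)² and the optimal number of hash functions is k = (m/n)·ln 2:

```python
import math

def bloom_size(num_items, fpp):
    """Standard Bloom filter sizing: optimal bit count m and hash-function
    count k for n items at false-positive probability p."""
    m = -num_items * math.log(fpp) / (math.log(2) ** 2)
    k = (m / num_items) * math.log(2)
    return math.ceil(m), round(k)

# Sizing for the example command: 50 million values at fpp = 0.1
bits, hashes = bloom_size(50_000_000, 0.1)
print(bits / 8 / 1024 / 1024)  # roughly 28.6 MB of index per table file
print(hashes)                  # 3 hash functions
```

This shows why both options matter: lowering fpp or raising numItems grows the index, trading storage and index-build time for fewer unnecessary file scans.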
According to our experiments on Databricks DataInsight, using the Bloom filter index can achieve a performance improvement of more than 3 times. For the test case, refer to the official documentation of Databricks Data Insight at the end of the article.
Feature 4: Dynamic file pruning
Dynamic File Pruning (DFP) is similar to Dynamic Partition Pruning. They both perform pruning during the join execution phase of dimension tables and fact tables, reducing the amount of scanned data and improving query efficiency.
Let’s take a simple query as an example to introduce the principle of DFP:
SELECT sum(ss_quantity) FROM store_sales JOIN item ON ss_item_sk = i_item_sk WHERE i_item_id = 'AAAAAAAAICAAAAAA'
In this query, item is a dimension table (with a small amount of data), store_sales is a fact table (with a very large amount of data), and the where query condition acts on the dimension table. If DFP is not enabled, the logical execution plan of the query is as follows:
As the figure above shows, the store_sales table is scanned first and then joined with the filtered rows of the item table; although the result contains only about 40,000 rows, more than 8 billion rows of store_sales are scanned. The intuitive optimization for this query is: first find the rows in item where i_item_id = 'AAAAAAAAICAAAAAA', then push the i_item_sk values of those rows down as a filter on ss_item_sk during the SCAN phase of store_sales. Combined with the Data Skipping technique introduced above, this can greatly reduce the number of table files scanned. This idea is the fundamental principle of DFP. The logical execution plan with DFP enabled is shown in the figure below:
It can be seen that with DFP enabled, the filter condition is pushed down into the SCAN operation, and only about 6 million rows of store_sales are scanned. As a result, this query’s performance improved by 10x. Databricks also benchmarked this feature on TPC-DS: with DFP enabled, query 15 achieved an 8x performance improvement, and 36 queries improved by 2x or more.
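The DFP flow for this query can be sketched in plain Python: evaluate the dimension-side filter first, then use the surviving join keys together with per-file min/max statistics to prune fact-table files before scanning them. All data below is made up for illustration; a real engine prunes with richer statistics than a single min/max range:

```python
def dynamic_file_pruning(dim_rows, dim_filter, fact_files):
    """DFP sketch: filter the dimension table first, then skip fact-table
    files whose join-key range cannot overlap the surviving keys."""
    join_keys = {row["i_item_sk"] for row in dim_rows if dim_filter(row)}
    lo, hi = min(join_keys), max(join_keys)
    # Scan only fact files whose ss_item_sk min/max range overlaps [lo, hi]
    return [f for f in fact_files if not (f["max_sk"] < lo or f["min_sk"] > hi)]

item = [{"i_item_sk": 42, "i_item_id": "AAAAAAAAICAAAAAA"},
        {"i_item_sk": 7,  "i_item_id": "BBBBBBBBBBBBBBBB"}]
store_sales_files = [{"name": "f1", "min_sk": 1,  "max_sk": 10},
                     {"name": "f2", "min_sk": 40, "max_sk": 50},
                     {"name": "f3", "min_sk": 90, "max_sk": 99}]

scanned = dynamic_file_pruning(
    item, lambda r: r["i_item_id"] == "AAAAAAAAICAAAAAA", store_sales_files)
print([f["name"] for f in scanned])  # ['f2']: only one fact file is scanned
```

Without DFP, all three fact files would be scanned; pushing the dimension filter down lets two of them be skipped before any fact data is read.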
This article has briefly introduced the performance advantages of Databricks Enterprise Edition Delta Lake. With these features, Spark SQL query performance can be greatly improved and queries against Delta tables significantly accelerated.
This article is the original content of Alibaba Cloud and may not be reproduced without permission.