Li Chengxiang: Bilibili's practice and exploration of query acceleration under the integrated lake and warehouse architecture



Introduction: This article introduces some of Bilibili's practices in exploring query acceleration and index enhancement under an integrated data lake and data warehouse architecture. The main contents include:

  • What the integrated lake and warehouse architecture is
  • Bilibili's current integrated lake and warehouse architecture
  • Data sorting and organization optimization under the integrated architecture
  • Practical exploration of index enhancement and optimization under the integrated architecture

01 What is the integration of lake and warehouse

When we talk about the integration of lake and warehouse, two concepts are involved: the data lake and the data warehouse.

What is a data lake? Generally speaking, it has the following characteristics:

  • There is a unified storage system; all data is placed in it, so there are no data silos.
  • It supports any data type, including structured, semi-structured and unstructured data, all unified in the same storage system.
  • It is open to multiple computing engines, covering real-time, offline analysis and more; the choice of engines is very rich.
  • It offers flexible data processing interfaces: SQL-based processing, lower-level Spark interfaces based on Dataset or even RDD, direct data parsing for machine learning scenarios, or reading files directly through the file system API.
  • Due to this flexibility, data quality is relatively low and hard to manage.

Therefore, thanks to its flexibility and ease of use, the data lake is well suited to exploration and innovation on unknown data.

What is a data warehouse? It has the following features:

  • Strong schema: data is modeled up front
  • Closed data format and storage, not open to other engines
  • High query efficiency: tightly integrated optimization of storage and compute, with rich support for indexing, pre-computation and so on
  • Data ingestion, whether real-time or offline, and the storage organization of the data are determined by the data warehouse rather than by the writing task
  • High data quality and easy operation and maintenance, but high construction cost

Therefore, the efficient and reliable characteristics make data warehouses very suitable for analysis and decision-making based on known data.

Like most Internet companies, Bilibili's big data platform was previously based on the open source Hadoop ecosystem: HDFS for storage, and Hive, Spark and Presto as computing engines. In my opinion, this is a typical data lake architecture. However, we also have many requirements for interactive analysis. To meet them, we introduced a dedicated distributed data warehouse, using ClickHouse as the engine.

This introduced new problems.

For example, for a data product or data service, to get better query efficiency and performance, the data first has to be moved from HDFS into ClickHouse, which lengthens the entire pipeline. Second, it causes data redundancy, because the data exists both on HDFS and in ClickHouse. Third, the data may be transformed during ingestion, making it hard to correlate with other data on HDFS and creating data silos. In the past two years, data lake table formats such as Iceberg, Hudi and Delta Lake have been gradually adopted by many companies to solve these problems. We mainly use Iceberg to bridge the gap between the data lake and the data warehouse.

In our opinion, there are three goals for the integration of lakes and warehouses:

First, we hope it remains as flexible as a data lake. We still use unified HDFS storage, seamlessly compatible with the previous SQL-on-Hadoop ecosystem, including data ingestion via Spark or Flink ETL, API access at various levels such as SQL/ML/DataSet, and support for computing engines such as Presto, Spark and Hive.

Second, we want it to be as efficient as a data warehouse. For tables written in the Iceberg format, we hope to reach or approach the query efficiency of a dedicated distributed data warehouse through overall enhancement and optimization of data distribution and organization, indexing, pre-computation, compute-storage integration/caching, and so on. Iceberg itself provides coarse-grained transactions, which additionally lets us support near-real-time data changes, reads and writes, and ingestion.

Third, we hope it can be as free as the wind. This is for the users: we want the system to be intelligent, with a lower barrier to entry and better ease of use. Previously, if users wanted efficient analysis and queries over ETL results, they had to attend to many details, such as whether the written data would produce too many small files, how to write the ETL's SQL logic so that the output data is sorted, and whether to do pre-computation. None of these details has anything to do with the user's business logic.

In the original solution, users pursuing higher query efficiency had to handle all of these concerns themselves. In effect, a user had to become a big data expert and develop additional ETL tasks, so the barrier to entry was high. In the second stage, things become more automated: users only need to specify the commonly used filter fields, aggregation dimensions and statistical metrics, and a background service automatically handles data organization and sorting. In the third stage, we hope to be intelligent: users only need to care about business information such as which fields the table has and their types, while data organization, index building and pre-computation are all automated. Users need not care about any of this; they just write SQL against the table to express business logic.


02 Bilibili's integrated lake and warehouse architecture

At Bilibili, the core of the integrated lake and warehouse architecture is Iceberg, our final choice among Hudi, Delta Lake and Iceberg. The overall data processing architecture is roughly as follows: real-time data is stored in Kafka and processed through Flink ETL, with data written to HDFS in the Iceberg format; offline data is written to HDFS through Spark. Once data is written to Iceberg, our self-developed background service, Magnus, continuously optimizes the organization of the Iceberg tables landed on HDFS, mainly using Spark tasks. How this optimization works is introduced in detail in the next section.

On the analysis side, we use Trino (the renamed PrestoSQL) as the query engine. We also use Alluxio: since Iceberg's metadata and index data are small compared with the raw data, we cache them in Alluxio for cache acceleration. Magnus is our Iceberg-based data management service and the core of the lake-warehouse integration. Its main job is to optimize and manage Iceberg data, including displaying basic information such as tables, partitions, files and snapshots. It also contains a scheduler for data optimization jobs. Whether from an offline batch task or a real-time task, when data is written to an Iceberg table the commit events are sent to Magnus and placed in a queue; the scheduler consumes the queue according to a specified policy, decides which Iceberg tables need work, and launches the corresponding Spark tasks to carry it out.


Previously, in lake-warehouse architectures based on Iceberg, Hudi or Delta Lake, the real-time data warehouse was what attracted the most attention and had the most application scenarios, because their support for coarse-grained transactions enables (near) real-time data ingestion. For us, apart from some important and relatively independent data products and services, which we place in a dedicated distributed data warehouse such as ClickHouse for querying, a large number of business scenarios in data warehouse construction are still based on the previous Hadoop data lake architecture. Our data warehouse developers build warehouses on this data lake architecture, with layered modeling from ODS to DWD and so on. For most data warehouse modeling work, data is still written to HDFS and then analyzed with Presto or Spark.

In this scenario, the goal of our integrated lake and warehouse architecture is to accelerate queries so that efficiency reaches or approaches that of a dedicated distributed data warehouse. We analyzed the performance gap between the open source lakehouse solution and distributed data warehouses, which involves comparisons of the runtime engine, storage, pre-computation and many other performance-related factors. This article mainly shares our exploration and practice of sort organization and indexing in the storage layer; we chose these two because our survey found them to be the largest contributors to the performance gap.


03 Sorting and Organization of Data

First, let's look at the sorted organization of the data.

As a typical data analysis scenario, this talk uses multi-dimensional analysis on a star schema as the benchmark. The data model is one fact table plus multiple dimension tables, and the query pattern is relatively fixed: first join, then filter, then aggregate, and finally sort the results. The filter conditions may be equality or range filters, and the filter fields may be of high or low cardinality.

Therefore, in this typical multi-dimensional analysis scenario, the problem we often face in practice is: given various field types and various filter conditions, how do we read only the required data when executing a query, rather than doing a full table scan? The two key tools here are data organization and indexing, which let us read only the data that the SQL logically requires.


As far as indexes are concerned, Iceberg already provides file-level MinMax indexes by default: its metadata files record the Min and Max values of each column.

For example, suppose we have four files, with each file's Max and Min values recorded in the metadata. For the query in the figure below, we can filter files on age=17 and skip data files based on the age column. If the data is sorted, the filtering effect at read time is very good: Iceberg's metadata tells us that only file 2 needs to be read and the other three files can be skipped, improving query efficiency.
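To make the idea concrete, here is a minimal sketch of MinMax file pruning; the file names and per-column statistics layout below are illustrative, not Iceberg's actual metadata structures:

```python
def prune_files(file_stats, column, value):
    # file_stats maps file name -> {column: (min, max)}, mimicking the
    # per-column statistics Iceberg keeps in its metadata files.
    # A file can only contain rows with column == value if the value
    # falls inside the file's recorded [min, max] range.
    return [
        name for name, stats in file_stats.items()
        if stats[column][0] <= value <= stats[column][1]
    ]

# With sorted data, the value ranges of files barely overlap, so an
# equality filter typically touches a single file:
stats = {
    "file1": {"age": (1, 9)},
    "file2": {"age": (10, 17)},
    "file3": {"age": (18, 32)},
    "file4": {"age": (33, 60)},
}
print(prune_files(stats, "age", 17))  # -> ['file2']
```

If the data were unsorted, the ranges would overlap heavily and most files would survive pruning, which is why sorting and MinMax indexes work best together.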


However, after sorting, a problem arises when another query filters on a different field. User queries usually filter on more than one field. When we sort globally by three fields a, b and c, the clustering of a is the best, and each later field is less and less clustered. If the cardinality of a is relatively high, then b, and especially c, may get no filtering benefit at all; for queries filtering on those fields, the pruning is essentially ineffective.

This is a common problem for MinMax indexes over sorted data. One solution is projection: storing a second copy of the data sorted by another field. At Bilibili, we instead adopted Z-Order sorting.

What exactly is Z-Order? Suppose we have three fields a, b and c. When sorting by them, we want to preserve clustering for all three at once, rather than prioritizing a, then b, then c: all three fields should retain some degree of clustering in the final sort order.

How is this done? Z-Order maps multi-dimensional data that has no natural total order into one-dimensional values for comparison, in a way that preserves the same degree of clustering on each original dimension. As a simple example, in the figure below, the bits of the two dimensions X and Y are interleaved (an Interleave Index) to obtain a new value called the Z-Value. As the figure shows, for the two fields X and Y, the generated z-values trace a Z-shaped, nested curve. When such data is split into files, this structure preserves the clustering of both X and Y at the same time.
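As a hedged illustration (the bit width and two-dimensional case are simplifying assumptions), interleaving the bits of two non-negative integers into a Z-Value can be sketched as:

```python
def z_value(x: int, y: int, bits: int = 8) -> int:
    # Interleave the bits of x and y: x occupies the even bit positions
    # of the result and y the odd ones, so neither dimension dominates
    # the sort order.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x -> bit 2i
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> bit 2i + 1
    return z

# Sorting grid points by z_value reproduces the Z-shaped nesting:
points = sorted(((x, y) for x in range(4) for y in range(4)),
                key=lambda p: z_value(*p))
```

The first few points come out as (0,0), (1,0), (0,1), (1,1), ..., i.e. each 2x2 block is visited before moving on, which is exactly the nesting shown in the figure.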


This raises a new problem: the supported data types are not just Int. How do we convert data of various types such as Int/Long/String/Date/Timestamp into positive integers, so that the Interleave Index calculation can produce the corresponding Z-Value?

Therefore, a prerequisite for implementing Z-Order is an order-preserving mapping of the data to positive integers. For Int data, flipping the sign bit suffices; other types need different approaches. For example, for the String type, a fixed number of leading bytes is used for sorting.
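As a hedged sketch of such order-preserving mappings (32-bit integers and a 4-byte string prefix are assumptions for illustration, not the exact production encoding):

```python
def int32_to_unsigned(v: int) -> int:
    # Flipping the sign bit maps signed 32-bit order onto unsigned
    # order: -2**31 -> 0 and 2**31 - 1 -> 2**32 - 1.
    return (v & 0xFFFFFFFF) ^ 0x80000000

def string_prefix_key(s: str, nbytes: int = 4) -> int:
    # Use only a fixed number of leading bytes (zero-padded,
    # interpreted big-endian). Everything past the prefix is lost,
    # which is the information-loss drawback of this mapping.
    b = s.encode("utf-8")[:nbytes].ljust(nbytes, b"\x00")
    return int.from_bytes(b, "big")
```

Both functions preserve the original comparison order, but the string mapping is lossy: two strings sharing the same prefix collapse to the same key.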

But order-preserving mapping also has problems:

First, information may be lost in going from the original value to the mapped value. For example, for String data, if only the leading bytes are taken, everything after them is lost.

Second, the distribution of mapped values is not guaranteed to start from 0 and be dense, so the resulting z-values may not follow the nested distribution of the Z-Order curve. For example, if X takes the values 0, 1, 2, 3, 4, 5, 6, 7 and Y takes 8, 16, 24, 32, the resulting z-value ordering is effectively the same as ordering by y then x: the sort brings no additional benefit and the clustering of X is not preserved.


Therefore, we introduced a Boundary-based Interleave Index calculation. It mainly adapts Spark's RangePartitioner to implement a new sorting method. Take the figure below as an example: to Z-Order sort the two fields city and age, we sample data from both fields, sort each field's sample, and pick boundaries from it. For each incoming row, Spark's shuffle partitioner compares the value with the boundaries and uses the boundary's index value to compute the z-value. Because the calculation uses boundary indexes rather than raw values, the z-values are guaranteed to be positive integers starting from zero.
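A simplified sketch of the boundary-index step (the sampling below is a toy stand-in for Spark's RangePartitioner sampling, and the column values are invented):

```python
import bisect

def compute_boundaries(sample, num_ranges):
    # Sort the sampled column values and pick num_ranges - 1 evenly
    # spaced boundaries, roughly as a RangePartitioner would.
    s = sorted(set(sample))
    step = max(1, len(s) // num_ranges)
    return s[step::step][:num_ranges - 1]

def boundary_index(value, boundaries):
    # Any comparable value (int, string, date, ...) maps to the index
    # of the range it falls into: a dense non-negative integer starting
    # at 0, which is exactly what the interleave calculation needs.
    return bisect.bisect_right(boundaries, value)

ages = compute_boundaries(range(100), 4)                  # [25, 50, 75]
cities = compute_boundaries(["bj", "gz", "sh", "sz"], 2)  # ["sh"]
```

The boundary indexes of the two columns can then be bit-interleaved to form the z-value, sidestepping both the type-conversion and the distribution problems of direct order-preserving mappings.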


The figure below shows the effect of Z-Order. We performed Z-Order sorting on the three fields shown and found that it achieves more than 80% data skipping; that is, 80% of the data does not need to be read at query time.


In addition, we also support sorting based on the Hilbert curve. Z-Order has a defect: its curve contains connecting segments with large spans. If a large-span segment falls inside a single file, that file's Min and Max range becomes very wide and the data-skipping effect is greatly reduced. The Hilbert curve has no such large-span segments, so its effect is better than Z-Order, as shown in the figure below.
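For reference, here is a minimal sketch of the standard two-dimensional Hilbert-curve mapping (the classic xy2d algorithm; n is the grid side length, assumed to be a power of two). Note that consecutive positions along the curve are always adjacent cells, which is the "no large-span segments" property described above:

```python
def hilbert_distance(n, x, y):
    # Map (x, y) on an n x n grid to its position along the Hilbert
    # curve. Each loop iteration consumes one bit of x and y, adds the
    # quadrant's contribution, then rotates/flips the coordinate frame.
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Sorting cells of a 4x4 grid by `hilbert_distance(4, x, y)` yields a walk in which every step moves to a neighboring cell, unlike the Z-order walk, which periodically jumps across the grid.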


After supporting Z-Order sort organization together with the MinMax index, the results on the Star Schema Benchmark (SSB) are shown in the figure below: both the number of files read and the query speed improve substantially. However, the more fields included in the Z-Order sort, the weaker the clustering for each field, so we recommend 2-4 fields. And if the data is not sorted and organized, the filtering effect of the MinMax index is very limited.


04 Index enhancements

For Iceberg, we introduced support for file-level BloomFilter indexes; a BloomFilter can be created for each field of each table. For SSB we added two extra tests, one with equality filters and the other with range filters. As the figure shows, with BloomFilters, equality queries read far fewer files and run much faster. For range queries, however, a BloomFilter index cannot filter data files based on range predicates.
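As a hedged, minimal illustration of why a Bloom filter helps equality filters but not ranges (the sizes and hashing below are arbitrary choices for the sketch, not Iceberg's actual implementation):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=3):
        self.size, self.k, self.bits = size_bits, num_hashes, 0

    def _positions(self, value):
        # Derive k bit positions from salted hashes of the value.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        # No false negatives, occasional false positives. Only exact
        # values can be probed, so a range predicate gains nothing.
        return all(self.bits >> p & 1 for p in self._positions(value))

bf = BloomFilter()
bf.add("United ST005")
```

A file-level filter is built from a file's column values; a query probes it with the literal of an equality predicate and skips the file on a miss.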


Therefore, we have introduced support for BitMap indexes.

As with the totalprice field in the figure below, we introduce a BitMap index. When the filter condition is totalprice < 19, we perform an OR over the bitmaps of the values 2 and 18 and check whether any bit in the result is set. If so, the file contains rows matching the query and must be read.

BitMaps bring an additional benefit with multiple filter conditions. Suppose we need rows where totalprice < 19 and city is "United ST005". Taken separately, each condition may match some rows in a given file; but when we AND the two conditions' bitmaps, we may find that no row in the file satisfies both at once, so the file need not be read at all. In other words, BitMap intersection and union operations can filter out many more data files under complex filter conditions.
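A toy sketch of both uses (the row ids, values and single-file layout are invented for illustration; real bitmap indexes use compressed structures such as roaring bitmaps rather than plain integers):

```python
def build_bitmap_index(column):
    # One bitmap (here just a Python int) per distinct value; bit i is
    # set when row i holds that value.
    index = {}
    for row, v in enumerate(column):
        index[v] = index.get(v, 0) | (1 << row)
    return index

# One data file with four rows:
price = build_bitmap_index([2, 18, 20, 33])
city = build_bitmap_index(["A", "B", "United ST005", "United ST005"])

# totalprice < 19: OR the bitmaps of the qualifying values (2 and 18).
lt19 = price[2] | price[18]
assert lt19 != 0          # some row matches -> the file must be read

# totalprice < 19 AND city = 'United ST005': each condition matches
# rows on its own, but their intersection is empty -> skip the file.
both = lt19 & city["United ST005"]
```

Here `both` comes out as 0: rows 0-1 satisfy the price condition and rows 2-3 the city condition, so no single row satisfies both and the whole file can be skipped.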


But two important problems must be solved to implement BitMap indexes.

First, for range filters, such as querying prices less than 51, the bitmaps of all values between 2 and 50 must be read, decoded and combined; this volume of reads and computation badly hurts query efficiency. Second, a bitmap must be stored for each distinct value, so the storage cost is high, especially for high-cardinality fields.

For the first problem, we introduced Range-Encoded BitMaps. For example, for the price value 18, we store not its own bitmap but the result of OR-ing it with the bitmap of the value 2. Put simply, once a row's bit is 1 in some value's bitmap, it is also 1 in the bitmaps of all larger values. With this encoding, any range predicate can be answered by retrieving at most two bitmaps.
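A minimal sketch of range encoding (the column values are illustrative, and plain Python ints stand in for real compressed bitmaps):

```python
import bisect

def range_encoded(column):
    # The bitmap stored for value v marks every row whose value <= v,
    # i.e. each plain per-value bitmap OR-ed with the bitmaps of all
    # smaller values, accumulated in ascending value order.
    plain = {}
    for row, v in enumerate(column):
        plain[v] = plain.get(v, 0) | (1 << row)
    encoded, acc = {}, 0
    for v in sorted(plain):
        acc |= plain[v]
        encoded[v] = acc
    return encoded

def rows_with_value_below(encoded, bound):
    # One lookup answers "value < bound": the encoded bitmap of the
    # largest value below the bound already covers every smaller value.
    # A closed range [lo, hi] needs at most two such lookups.
    keys = sorted(encoded)
    i = bisect.bisect_left(keys, bound) - 1
    return encoded[keys[i]] if i >= 0 else 0

prices = range_encoded([2, 18, 20, 33])
```

For `price < 51`, instead of OR-ing every bitmap from 2 to 50, a single encoded bitmap lookup suffices.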


To solve the second problem, we introduced the Bit-Slice Encoded BitMap index.

In the example in the figure below, the price column has specific values such as 2, 18, 20 and 33, each with its own bitmap, and these values are sliced digit by digit. "Comp 0" in the figure denotes the lowest digit. For example, the values whose lowest digit is 8 are 18 and 188; we OR these two values' bitmaps and store the result as the bitmap for digit 8 in Comp 0. The other digit positions are stored the same way. Where 256 distinct values previously required 256 bitmaps, after Bit-Slice encoding only 30 bitmaps are needed.
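A toy decimal bit-slice sketch (three digit positions cover values 0-999; the column values are illustrative):

```python
def bit_sliced_index(column, num_digits=3):
    # slices[pos][digit] marks the rows whose decimal digit at position
    # pos equals digit: 3 x 10 = 30 bitmaps replace one bitmap per
    # distinct value.
    slices = [[0] * 10 for _ in range(num_digits)]
    for row, v in enumerate(column):
        for pos in range(num_digits):
            slices[pos][(v // 10 ** pos) % 10] |= 1 << row
    return slices

def rows_equal(slices, target):
    # Probe an equality predicate by AND-ing one bitmap per digit.
    result = -1  # all ones
    for pos, digit_maps in enumerate(slices):
        result &= digit_maps[(target // 10 ** pos) % 10]
    return result

sliced = bit_sliced_index([2, 18, 20, 33, 188])
```

Probing for 18 intersects the bitmaps for "ones digit 8" (rows for 18 and 188), "tens digit 1" (18) and "hundreds digit 0", leaving only the row holding 18.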


We can also evolve Bit-Slice Encoded BitMap from decimal to binary digit representation, which will not be described in detail here.

In general, we can combine Bit-Slice Encoding with Range Encoding. With a binary Bit-Sliced Range-Encoded BitMap, an index over 256 distinct values needs only 9 bitmaps.


Based on the above two points, BitMaps solve high-cardinality range queries over unsorted data. The SSB results show query efficiency improved by 1-10x and the number of files read reduced by up to 400x. In other words, besides faster queries, the load on the computing engine drops substantially, freeing up hardware resources for other work.


The above work involved not only extending the Iceberg project and the related Spark code, but also extending the interface at the SQL level. For example, we support new syntax, including Spark 3 syntax extensions in Iceberg: DISTRIBUTED BY supports sorting between files, with hash, range, Z-Order and Hilbert-curve methods, while LOCALLY ORDERED BY supports sorting within files. In other words, users can customize both the between-file and within-file sort orders, and Magnus then runs the corresponding data optimization tasks through its optimizer. We support the file-level BitMap and BloomFilter indexes described above; through Iceberg Actions, the corresponding Spark tasks can be launched to write and delete indexes.


Through data sort organization and index support, we have also summarized configuration strategies for multi-dimensional analysis scenarios, as shown in the figure below. These strategies allow us to support any number of fields and any filter type while, in most multi-dimensional analysis scenarios, accessing as few files as possible to accelerate queries.


That's all for today's sharing, thank you all.
This article was first published on the WeChat public account "DataFunTalk".