Details of kylin’s cube building process


1 foreword

When using kylin, the most important step is to create the model definition of the cube, that is, to specify the measurement, dimension and some additional information, and then build the cube. Of course, we can also set the partition field according to a string field in the original table (the format of this field must be the day format, indicating the meaning of the date), so that a cube can be carried out many times Build, each build will generate a segment, each segment corresponds to a time interval cube, the time intervals of these segments are continuous and not coincident. For a cube with multiple segments, merge can be performed, which is equivalent to merging the segments within a time interval into one. Let’s start to analyze the build process of the cube.

2 cube example

Take the mobile phone sales as an example, table sale records the sales of each mobile phone brand in each country every year. Table phone is the mobile phone brand, table country is the country list, and the two tables are associated with the sale table through foreign keys. These three tables form the star model, where sale is the fact table, and phone and country are the dimension tables.

Details of kylin's cube building process

Now we need to know the total sales volume of each brand of mobile phones in China in 2010-2012. Then the query SQL is:

SELECT b.`name`, c.`NAME`, SUM(a.count)
LEFT JOIN PHONE AS b ON a.`pId`=b.`id` 
LEFT JOIN COUNTRY AS c ON a.`cId`=c.`id` 
Where A. 'time' > = 2010 and A. 'time' < = 2012 and C. 'name' = "China"

Among them, time,, country are dimensions, and a.count is measurement. The number of mobile brands can be used to indicate the base of mobile brand column. The sales volume of each mobile phone brand in each country in each year can be used as a cube, and all the cubes form a cube, as shown in the following figure:
Details of kylin's cube building process

The figure above shows a cube with three dimensions. Each small cube represents a cube ID, which stores the results after the aggregation of measurement columns. For example, Apple’s sales volume in China in 2010 is a cube.

3 Introduction to the entrance

Details of kylin's cube building process

After creating a cube on kylin’s web page, you can click the action drop-down box to perform the build or merge operation. Both operations will call the rebuild interface of the cube. The parameters to be called include:

  1. The cube name is used to uniquely identify a cube. In the current kylin version, the cube name is globally unique, not unique under each project;
  2. The two time intervals of starttime and Endtime of this build identify that the data source of the segment of this build only selects the data within this time range; for the build operation, starttime is not needed because it always selects the end time of the last segment as the start time of the current segment.
  3. BuildType indicates the type of operation, which can be “build”, “merge” and “Refresh”.

4 build cube process

The build process of cube in kylin is to calculate all dimension combinations in advance, store them in HBase, and trade space for time. The rowkey corresponding to htable is the combination of various dimensions. The indicators exist in the column. In this way, the query SQL of different dimension combinations is transformed into the range scan based on rowkey, and then the indicators are summarized and calculated to achieve rapid analysis and query. The whole process is as follows:
Details of kylin's cube building process

The main steps can be divided into several stages in sequence:

  1. A plurality of cuboid files are calculated according to the user’s cube information;
  2. Htable is generated according to cuboid file;
  3. Update cube information;
  4. Recycle temporary files.

The input of each stage operation depends on the output of the previous step, so these operations are all performed in sequence. The contents of these stages are divided into 11 steps as follows:

4.1 create intermediate flat hive table

In this step, a new hive external table will be created. Then, according to the star model defined in the cube, the values of dimensions and measures will be queried and inserted into the newly created table. This table is an external table, and the data file of the table (stored in HDFS) will be the input of the next subtask.

4.2 redistribute flat hive table

In the previous step, hive will generate data files in the HDFS folder. Some files are very large, some are small, even empty. Imbalanced file distribution can lead to imbalanced Mr Jobs: some mappers work quickly, but others are very slow. To balance jobs, kylin adds this step of “reallocating” data. First, kylin gets the number of rows in the intermediate table, and then, according to the number of rows, it reallocates the amount of data needed by the file. By default, kylin allocates one file per million lines.

4.3 extract fact table distinct columns

In this step, the distinct value of each dimension column in the fact table is calculated according to the hive intermediate table generated in the previous step, and written to the file. It starts an MR task to complete, and its associated table is the temporary table created in the previous step. If the distinct value of a dimension column is relatively large, it may lead to oom during the execution of the MR task.

4.4 create dimension dictionary

This step is to calculate the sub dictionary information of all dimensions according to the distinct column file and dimension table generated in the previous step, and compress the code in the way of dictionary tree to generate dimension dictionary. The sub dictionary is designed to save storage.
Each cuboid member is stored in HBase in the form of key value. Key is a combination of dimension members, but generally dimensions are values such as strings (such as trade names). Therefore, memory consumption can be reduced by converting each dimension value to a unique integer. After finding out the corresponding key from HBase, the real member value can be obtained according to the dictionary.

4.5 save cube statistics

Calculate and count all dimension combinations and save them. Each dimension combination is called a cuboid. In theory, an n-dimensional cube has two N-power dimension combinations. Referring to an example on the Internet, a cube contains four dimensions: time, item, location and supplier. Then there are 16 types of cubes:

Details of kylin's cube building process

4.6 create htable

There are several things to consider when creating an htable:

  1. Settings for the column cluster.
  2. The compression mode of each column cluster.
  3. Deploy coprocessor.
  4. The size of each region in htable.

In this step, the setting of the column cluster is based on the user’s creation of the cube. The data key stored in HBase is the combination of dimension members, value is the result of the corresponding aggregation function, and the column cluster is for value. Generally, only one column cluster will be set when the cube is created, which contains the results of all the aggregation functions;
LZO compression is used by default when creating htable. If LZO is not supported, compression will not be performed. More compression methods are supported in later kylin versions;
Kylin is strongly dependent on the coprocessor of HBase, so it is necessary to deploy coprocessor for the table by creating htable. This file will be uploaded to the HDFS where HBase is located first, and then associated in the table’s meta information. This step is prone to errors. For example, if coprocessor cannot be found, the entire regionserver will fail to start, so special care should be taken. The division of region is already in the previous step Step 2 determines that there is no dynamic extension, so kylin uses the following interface to create htable:
public void createTable(final HTableDescriptor desc , byte [][] splitKeys)

4.7 build cube with spark

In kylin’s cube model, each cube is made up of multiple cubes. In theory, a cube with n common dimensions can be made up of two cubes of n power. Then we can calculate the lowest level of the cube, that is, the cube with all dimensions (equivalent to executing a group Query by all dimension columns), and then calculate the top-level cuboid according to the bottom-level cuboid layer by layer, until the top-level cuboid is calculated (equivalent to executing a query without group by). In fact, kylin’s execution principle is like this at this stage, but it needs to abstract these into MapReduce model and submit spark job for execution.
Use spark to generate data of each dimension combination (cuboid).
Build Base Cuboid Data;
Build N-Dimension Cuboid Data : 7-Dimension;
Build N-Dimension Cuboid Data : 6-Dimension;
Build N-Dimension Cuboid Data : 2-Dimension;
Build Cube。

4.8 convert cuboid data to hfile

After creating htable, data will be inserted into the table through the insertion interface. However, because of the huge amount of data in cuboid, frequent insertion will have a great impact on the performance of HBase. So kylin first converts cuboid file into htable file, and then associates the file with htable through bulkload, which can greatly reduce Hb The load of ASE is completed by an MR task.

4.9 guide hfile into HBase table (load hfile to HBase table)

Load the hfile file into htable, which is completely dependent on HBase’s tools. After this step is completed, the data has been stored in HBase. The key format consists of cuboid number + the ID of each member in the dictionary tree. Value may be saved in multiple column groups, including the value of the measurement calculated by group by according to these members in the original data.

4.10 update cube info

Update the status of the cube, including whether the cube is available and the data statistics of this build, including the time when the build is completed, the number of records entered, the size of the input data, the size of the data saved in HBase, etc., and persist these information in the metadata database.

4.11 clear hive intermediate table (hive cleanup)

Whether this step is successful or not will not have any impact on correctness, because after the previous step, this segment can be found in this cube, but a lot of garbage files are generated in the whole execution process, including:

  1. Temporary hive table;
  2. Because the hive table is an external table, the files storing the table need to be deleted additionally;
  3. In the step of fact distinct, write the data to HDFS to prepare for the establishment of the child dictionary. At this time, it can also be deleted;
  4. A file will be generated during rowkey statistics and can be deleted at this time;
  5. When generating hfile, the path of file storage is different from that of HBase. Although load is a remove operation, the upper directory still exists and needs to be deleted.

This completes the entire build process.
Details of kylin's cube building process