Introduction to the Kylin 4.0 TopN Implementation Principle

Time: 2021-1-17

# Introduction

Apache Kylin is an open-source distributed analytics engine that provides a SQL query interface and multi-dimensional analysis (OLAP) on Hadoop for very large data sets. It can query huge data sets with sub-second latency.

The TopN measure was added in Kylin 1.5, and its implementation changed little through Kylin 3.x. To understand how TopN is implemented in Kylin 3.x and earlier, refer to the following article:

https://www.infoq.cn/article/2016/08/Apache-Kylin-Top-N/?utm_source=tuicool

In September 2020, the Apache Kylin community released Kylin 4.0.0-alpha. This article describes the implementation of TopN in Apache Kylin 4.0.0-alpha in detail.

Background

Let's start with a typical TopN scenario. When analyzing data on an e-commerce platform, we often need to find the top 100 sellers by sales. An example SQL query is as follows:

SELECT kylin_sales.part_dt, seller_id
FROM kylin_sales
GROUP BY kylin_sales.part_dt, kylin_sales.seller_id
ORDER BY SUM(kylin_sales.price) DESC
LIMIT 100;

With a large amount of data, answering a TopN query by first computing SUM(price) for every group produced by GROUP BY and then sorting all of those sums is very expensive.
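To make that cost concrete, here is a minimal Java sketch of the exact approach: it keeps a running sum for every distinct seller, then sorts all of the sums and keeps the first N. The class name `ExactTopN` and its inputs are hypothetical and only for illustration; this is not Kylin code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Kylin code: exact top-N by full aggregation plus sort.
class ExactTopN {
    // sellers[i] and prices[i] describe one sales row.
    static List<Map.Entry<String, Double>> topN(String[] sellers, double[] prices, int n) {
        // Step 1: SUM(price) GROUP BY seller. Memory grows with the number of distinct sellers.
        Map<String, Double> sums = new HashMap<>();
        for (int i = 0; i < sellers.length; i++) {
            sums.merge(sellers[i], prices[i], Double::sum);
        }
        // Step 2: sort every group by its sum, descending, then keep the first n.
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(sums.entrySet());
        sorted.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }
}
```

Both steps touch every group, so the work and memory scale with the number of distinct sellers rather than with N.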


Introduction to TopN

From the article on the Kylin 3.x implementation, we know that TopN in Kylin 3.x and earlier uses the space-saving algorithm, with optimizations on top of it. The code can be found in org.apache.kylin.measure.topn.TopNCounter.
Kylin 4.0 continues to use the space-saving algorithm and further optimizes the TopNCounter of Kylin 3.x. However, the current TopN can still produce inaccurate results, as described in detail later.
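For readers unfamiliar with the space-saving algorithm, the following is a minimal Java sketch of its textbook form, not Kylin's optimized TopNCounter. The class `SpaceSavingSketch` and its method names are hypothetical. It tracks at most `capacity` items; when a new item arrives and the table is full, the item with the smallest count is evicted and the newcomer inherits that count, which is why estimates can be too high.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical textbook sketch of the space-saving algorithm; not Kylin's TopNCounter.
class SpaceSavingSketch {
    private final int capacity;
    private final Map<String, Double> counters = new HashMap<>();

    SpaceSavingSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Record `weight` for `item` (for example, the price of one sale).
    void offer(String item, double weight) {
        Double current = counters.get(item);
        if (current != null) {
            counters.put(item, current + weight);      // already tracked
        } else if (counters.size() < capacity) {
            counters.put(item, weight);                // free slot available
        } else {
            // Table is full: evict the smallest counter and let the new item
            // inherit its count. A linear scan keeps the sketch short; real
            // implementations use an ordered structure instead.
            String minItem = null;
            double min = 0.0;
            for (Map.Entry<String, Double> entry : counters.entrySet()) {
                if (minItem == null || entry.getValue() < min) {
                    minItem = entry.getKey();
                    min = entry.getValue();
                }
            }
            counters.remove(minItem);
            counters.put(item, min + weight);
        }
    }

    // Estimated total for `item`; 0 if the item is not currently tracked.
    double estimate(String item) {
        return counters.getOrDefault(item, 0.0);
    }

    int size() {
        return counters.size();
    }
}
```

With a capacity of 2, offering a=5, b=3, a=2 and then c=1 evicts b and records c with an estimate of 4, although its true total is 1. This overestimate is the price of the bounded memory.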

TopN implementation

Currently, the TopN UDAF in Kylin 4 is registered in org.apache.kylin.engine.spark.job.CuboidAggregator#aggInternal. The code is as follows:

def aggInternal(ss: SparkSession,
                  dataSet: DataFrame,
                  dimensions: util.Set[Integer],
                  measures: util.Map[Integer, FunctionDesc],
                  isSparkSql: Boolean): DataFrame = {
      //Ellipsis
      measure.expression.toUpperCase(Locale.ROOT) match {
        //Ellipsis
        case "TOP_N" =>
          // Uses new TopN aggregate function
          // located in kylin-spark-project/kylin-spark-common/src/main/scala/org/apache/spark/sql/udaf/TopN.scala
          val schema = StructType(measure.pra.map { col =>
            val dateType = col.dataType
            if (col == measure) {
              StructField(s"MEASURE_${col.columnName}", dateType)
            } else {
              StructField(s"DIMENSION_${col.columnName}", dateType)
            }
          })

          if (reuseLayout) {
            new Column(ReuseTopN(measure.returnType.precision, schema, columns.head.expr)
              .toAggregateExpression()).as(id.toString)
          } else {
            new Column(EncodeTopN(measure.returnType.precision, schema, columns.head.expr, columns.drop(1).map(_.expr))
              .toAggregateExpression()).as(id.toString)
          }
       //Ellipsis
        case _ =>
          max(columns.head).as(id.toString)
      }
    }.toSeq
//Ellipsis
    if (reuseLayout) {
      val columns = NSparkCubingUtil.getColumns(dimensions) ++ measureColumns(dataSet.schema, measures)
      df.select(columns: _*)
    } else {
      df
    }
  }

In fact, the initial implementation of TopN was org.apache.kylin.engine.spark.job.TopNUDAF, but as you can see, the current implementation is in org.apache.spark.sql.udaf.BaseTopN.scala. The latest implementation mainly fixes the performance problems of the old one. For details, see KYLIN-4760.

TopN in Kylin 4.0 is implemented as a Spark UDAF. The class relationships are as follows: the final implementation is BaseTopN, which extends TypedImperativeAggregate, and BaseTopN has two subclasses, EncodeTopN and ReuseTopN. When building from the flat table, no TopN has been built yet, so EncodeTopN is called. When the next layer of the cube is built from an already built cuboid, ReuseTopN is called to avoid repeated calculation. The interface diagram is as follows:

[Figure: class diagram of BaseTopN and its subclasses EncodeTopN and ReuseTopN]

The main reason for extending TypedImperativeAggregate instead of UserDefinedAggregateFunction is that UserDefinedAggregateFunction converts Catalyst's internal InternalRow type to the Row type before handing it to the user's own update method. TypedImperativeAggregate, by contrast, leaves serialization and deserialization to the implementation, which removes one layer of conversion.
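As a rough illustration of what "doing its own serialization and deserialization" means, the sketch below turns an in-memory aggregation buffer into bytes and back using only `java.io`. The class `BufferCodec` and the buffer layout (a map from dimension value to running sum) are assumptions made for this example; this is neither Spark's API nor Kylin's actual encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical codec for an aggregation buffer; not Spark or Kylin code.
class BufferCodec {
    // Layout: entry count, then (UTF key, double value) pairs.
    static byte[] serialize(Map<String, Double> buffer) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(buffer.size());
            for (Map.Entry<String, Double> entry : buffer.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeDouble(entry.getValue());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Rebuild the buffer from the bytes produced by serialize().
    static Map<String, Double> deserialize(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int size = in.readInt();
            Map<String, Double> buffer = new LinkedHashMap<>();
            for (int i = 0; i < size; i++) {
                String key = in.readUTF();
                buffer.put(key, in.readDouble());
            }
            return buffer;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

The point of the design is that the buffer stays in its native object form during updates and is converted to bytes only when it has to be shipped or spilled.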

Introduction to TopNCounter

As mentioned above, the space-saving algorithm is implemented in TopNCounter. Here is a brief introduction to how TopNCounter is used. When a BaseTopN object is initialized, it creates a TopNCounter object, which stores the rows that satisfy the TopN condition during the calculation. In Spark UDAF terms, this is the aggregation buffer: update, merge, and eval all operate on a TopNCounter. A TopNCounter needs a capacity when it is initialized, and the recommended size is n * TopNCounter.EXTRA_SPACE_RATE, where n is the size defined by TopN and EXTRA_SPACE_RATE is the recommended parameter for adjusting the extra space. The default value is 10. That is, if TopN(10,4) is defined, the initial capacity of the TopNCounter is 10 * 10 = 100.

The processing flow of TopN is shown in the figure below:
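The sizing rule described above is simple arithmetic, sketched here in Java. The class `TopNCapacity` is hypothetical, and the constant is only an assumption that mirrors the default value of 10 stated in this article.

```java
// Hypothetical helper; the constant mirrors the default stated in this article.
class TopNCapacity {
    static final int EXTRA_SPACE_RATE = 10; // assumed default

    // Capacity with the assumed default rate.
    static int capacity(int n) {
        return capacity(n, EXTRA_SPACE_RATE);
    }

    // Capacity = n (size defined by TopN) * extra-space rate.
    static int capacity(int n, int extraSpaceRate) {
        return n * extraSpaceRate;
    }
}
```

For TopN(10,4) this gives `capacity(10)` = 100, and with the rate set to 1 it gives `capacity(10, 1)` = 10.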

[Figure: TopN processing flow]

update() inserts each incoming row into the TopNCounter object through TopNCounter.offer(). merge() combines two TopNCounter objects produced by update(). Finally, eval() calls TopNCounter.sortAndRetain() to sort the entries and trim the TopNCounter to size, which yields the aggregate result.
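The update, merge, and eval steps can be sketched with a small stand-in for the counter. The class `MiniTopNCounter` below is hypothetical and deliberately simplified: it only borrows the method names `offer` and `sortAndRetain` from the text, and, as an assumption of this sketch, it trims entries only in `sortAndRetain()` rather than reproducing Kylin's actual eviction logic.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the aggregation buffer; not Kylin's TopNCounter.
class MiniTopNCounter {
    private final int capacity;
    private final Map<String, Double> counts = new HashMap<>();

    MiniTopNCounter(int capacity) {
        this.capacity = capacity;
    }

    // update(): add one row's measure value to the running sum of its key.
    void offer(String key, double value) {
        counts.merge(key, value, Double::sum);
    }

    // merge(): fold another partial buffer into this one.
    void merge(MiniTopNCounter other) {
        other.counts.forEach((key, value) -> counts.merge(key, value, Double::sum));
    }

    // eval(): sort by sum, descending, and keep at most `capacity` entries.
    List<Map.Entry<String, Double>> sortAndRetain() {
        List<Map.Entry<String, Double>> sorted = new ArrayList<>();
        for (Map.Entry<String, Double> entry : counts.entrySet()) {
            sorted.add(new AbstractMap.SimpleImmutableEntry<>(entry.getKey(), entry.getValue()));
        }
        sorted.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<Map.Entry<String, Double>> kept =
                new ArrayList<>(sorted.subList(0, Math.min(capacity, sorted.size())));
        // Drop everything that did not make the cut.
        counts.clear();
        for (Map.Entry<String, Double> entry : kept) {
            counts.put(entry.getKey(), entry.getValue());
        }
        return kept;
    }
}
```

In use, each partition fills its own counter with `offer`, the partial counters are combined with `merge`, and `sortAndRetain` is called once on the combined counter to produce the final list.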

Storage

Kylin 4.0 currently uses Parquet for storage. In this example we define TopN(10,4) and set TopNCounter.EXTRA_SPACE_RATE to 1. The dimensions and measures in the cuboid are mapped to column IDs as follows:

0 -> seller_id

1 -> item_id

2 -> id

3 -> price

4 -> Count

5 -> TopN

The following shows the content of a cuboid that contains only the TopN measure and of one that contains only the SUM measure:

[Figure: cuboid content with the TopN measure]

It is worth noting that in the second row the count is 11, but the TopN column stores only 10 values. The capacity of the TopNCounter is only 10 * EXTRA_SPACE_RATE = 10, so anything beyond 10 entries is not stored, which is the cause of the current TopN error. You can also see that TopN puts the computed measure value and the GROUP BY dimension together and stores them as an array.
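The effect of this truncation can be reproduced with a small, self-contained Java sketch. The class `TruncationError` and the sample numbers are hypothetical; the sketch only assumes that each stored row keeps its `capacity` largest entries and that rows are later merged by summing matching keys.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of how capacity-limited storage loses information.
class TruncationError {
    // Keep only the `capacity` entries with the largest sums.
    static Map<String, Double> retain(Map<String, Double> sums, int capacity) {
        Map<String, Double> kept = new LinkedHashMap<>();
        sums.entrySet().stream()
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .limit(capacity)
                .forEach(entry -> kept.put(entry.getKey(), entry.getValue()));
        return kept;
    }

    // Merge two rows by summing the values of matching keys.
    static Map<String, Double> merge(Map<String, Double> left, Map<String, Double> right) {
        Map<String, Double> merged = new HashMap<>(left);
        right.forEach((key, value) -> merged.merge(key, value, Double::sum));
        return merged;
    }
}
```

Suppose one row holds a=10, b=9, x=8 and another holds c=10, d=9, x=8. Merged exactly, x has the largest total, 16. If each row is first cut to 2 entries, x is dropped from both, so it cannot appear in the merged result at all.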

[Figure: cuboid content with the SUM measure]

For the SUM measure, Kylin directly stores the aggregated sum.