What is Kylin? A fast way to query data

Time:2021-7-23

Preface

Search for 【Java3y】 on WeChat and follow this man with dreams. Your likes and follows are my greatest support!

This article is included in my GitHub repository, https://github.com/ZhongFuCheng3y/3y, which has more than 300 original articles. I have recently been publishing the interview and project series!

Today I'd like to talk with you about Kylin.

Because of work, I looked into Kylin some time ago, and now I'm writing up my notes. (My notes may help you get started with Kylin. At the very least, after reading this you should know what Kylin does.)

Enough chatter, let's go.


Introduction to Kylin

Kylin is an open source project of the Apache Software Foundation that was led and contributed by developers from China, so there is Chinese documentation to learn from:

http://kylin.apache.org/cn/

From the official site we can see how the project introduces itself: Apache Kylin™ is an open source, distributed analytical data warehouse that provides a SQL query interface and multidimensional analysis (OLAP) capability on top of Hadoop/Spark to support extremely large datasets. It was originally developed by eBay and contributed to the open source community, and it can query huge tables in sub-second time.

After reading this introduction, I can only sum Kylin up in two words: seriously impressive. What makes it so impressive? Let's talk about that next.

At first glance, some readers may not know what OLAP is, so let me explain briefly. (Hadoop, Spark, SQL, and big data are words you see every day. Even if you don't understand how they work, you know what they are for, right?)

To talk about OLAP, I have to mention its sibling, OLTP. Let's take a brief look at their full names and translations:

  • OLTP: On-Line Transaction Processing (online transaction processing)
  • OLAP: On-Line Analytical Processing (online analytical processing)

The translated names may not tell us much on their own, but we can spot the difference between them: one is about "transactions" and the other is about "analysis".

At the application level, we can simply think of OLTP as mainly used in business systems, for example placing orders and making transactions (bank transfers and so on). OLAP is mainly used in data warehouse systems. It supports complex analytical operations, focuses on decision support, and provides intuitive query results.

I've also drawn a mind map. After looking at it, you'll basically understand:

[Figure: mind map comparing OLTP and OLAP]

By now you should have a basic understanding of OLAP. Let's go back to the sentence above: multidimensional analysis (OLAP) capability to support extremely large datasets. What is the first thing that comes to mind?

Sanwai's first thought was Hive (Hive sits on top of HDFS, so it supports large-scale data).

Once Hive comes up, you'll notice that Hive can support almost everything in the first half of Kylin's introduction, except for the last sentence: "it can query huge tables in sub-second time".

That's right. From this you can see what Kylin is for: querying huge tables in sub-second time to support data analysis and decision-making.

Every Hive job we run may take several minutes (my SQL, for example, is poorly written, so running for half an hour is common too). From a business perspective, we want the data for analysis to come back faster. To meet this demand, Kylin took off.


I used Hive as the example to lead into Kylin. Is there no other choice besides Kylin? Obviously not.

When I first joined the company, I complained that Hive ran too slowly, and the colleague sitting next to me said: just use Presto, our big data platform supports it.


There are many OLAP tools and frameworks. Let's take a brief look:

[Figure: overview of OLAP frameworks]

As we all know, executing a Hive query actually runs MapReduce jobs that fetch data from HDFS. The execution involves both computation and storage.

Some people think Hive is too slow at computation because it runs MapReduce, so they replace MapReduce with another compute engine, such as one built on an MPP architecture, but leave the storage unchanged.

Some people think that storing data in HDFS makes retrieval too slow, so they change where the data is stored and no longer fetch it from HDFS.

Some people think: forget it, I'll change both computation and storage and solve your problem with my one-stop framework.

Some people think the Hadoop ecosystem is fine: I'll aggregate the data first, and when you query, you read the aggregated data directly, which is also very fast.

Because each company has different business scenarios and backgrounds, and each OLAP framework has different strengths, many OLAP technologies thrive side by side.

Getting started with Kylin

From the above, we already know why there are so many OLAP frameworks: in essence, we want the data we analyze to be found faster, and Kylin is one of these technologies.

As the picture above shows, Kylin depends entirely on the Hadoop ecosystem. So how does Kylin speed up queries? The answer is: pre-aggregation.

Suppose we retrieve from MySQL all rows whose date is later than 2020-10-20. As long as the date column is indexed, we can find the relevant data quickly.

But suppose we retrieve from MySQL all rows whose date is later than 2020-10-20 and also want to know how much each user spent in that period. Once the data volume is large, the query speed is unsatisfactory no matter how you build the indexes.

So what if I first aggregate the statistics for each user by day and write them into a table? When users query by date, won't it be fast? (Because I have already aggregated the data by day, this table has far fewer rows than the original table.)

Kylin uses this pre-aggregation idea to speed up queries, which is how it achieves sub-second query response.
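The pre-aggregation idea above can be sketched in a few lines of Python. This is only an illustration of the principle, not how Kylin is implemented: the order rows, the user names, and the helper functions pre_aggregate and spend_per_user are all made up for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw order rows: (order_date, user_id, amount)
ORDERS = [
    (date(2020, 10, 19), "alice", 30),
    (date(2020, 10, 21), "alice", 20),
    (date(2020, 10, 21), "alice", 5),
    (date(2020, 10, 21), "bob", 10),
    (date(2020, 10, 22), "bob", 40),
]


def pre_aggregate(orders):
    """Build the (day, user) -> total amount table once, ahead of query time."""
    daily = defaultdict(int)
    for day, user, amount in orders:
        daily[(day, user)] += amount
    return dict(daily)


def spend_per_user(daily, after):
    """Answer 'how much did each user spend after this date' from the small table."""
    totals = defaultdict(int)
    for (day, user), amount in daily.items():
        if day > after:
            totals[user] += amount
    return dict(totals)


DAILY = pre_aggregate(ORDERS)
RESULT = spend_per_user(DAILY, date(2020, 10, 20))
print(len(ORDERS), "raw rows ->", len(DAILY), "aggregated rows")
print(RESULT)
```

The query only scans the aggregated table, so the more orders each user places per day, the bigger the saving compared with scanning the raw rows.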

So what are the steps for using Kylin? The official site has already answered this for us:

  1. Define a star or snowflake model on a dataset
  2. Build a cube on the defined data tables
  3. Query with standard SQL through ODBC, JDBC, or a RESTful API, and get results with sub-second response time

In the steps above, you may not know these terms: star model, snowflake model, and cube. Let me explain briefly:

In the data warehouse field, the main table is called the fact table, and the tables that the fact table's foreign keys reference are called dimension tables.

[Figure: star model]

Star model: all dimension tables are linked directly to the fact table (figure above).

Snowflake model: one or more dimension tables are not connected directly to the fact table and instead connect to it through other dimension tables (figure below).

[Figure: snowflake model]

In Kylin, an angle from which data is analyzed is called a dimension, and the metric being analyzed is called a measure.
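To make fact tables, dimension tables, dimensions, and measures concrete, here is a minimal star model sketched with Python's built-in sqlite3 module. The table and column names (fact_order, dim_user, dim_product, and so on) are invented for this example, and SQLite only stands in for a real data warehouse. The columns city and category act as dimensions, and SUM(amount) acts as the measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes
    CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);

    -- Fact table: its foreign keys point directly at the dimension tables
    CREATE TABLE fact_order (
        order_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES dim_user(user_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        amount INTEGER
    );

    INSERT INTO dim_user VALUES (1, 'Beijing'), (2, 'Shanghai');
    INSERT INTO dim_product VALUES (10, 'book'), (11, 'phone');
    INSERT INTO fact_order VALUES
        (100, 1, 10, 30), (101, 1, 11, 500), (102, 2, 10, 20), (103, 2, 10, 25);
""")

# Dimensions: city, category. Measure: SUM(amount).
ROWS = conn.execute("""
    SELECT u.city, p.category, SUM(f.amount) AS total_amount
    FROM fact_order AS f
    JOIN dim_user AS u ON u.user_id = f.user_id
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY u.city, p.category
    ORDER BY u.city, p.category
""").fetchall()
print(ROWS)
```

In a snowflake model, city could itself move into a separate table that dim_user references, so the query would need one more join to reach it.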


All right, let's see what a cube is:

[Figure: an OLAP cube]

"Cube" here means an OLAP cube. From the two-dimensional tables above we can form a data cube, and this data cube is the Cube.

A Cube can be viewed from different angles, and all of these angles look at the same complete Cube. For example:

[Figure: the same Cube viewed from different angles]

To sum up: a Cube is a cube constructed from a dataset along different dimensions (the pictures are all three-dimensional, but a Cube you build can have far more than three dimensions).
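As a rough sketch of what "constructing a cube along different dimensions" means, the Python below pre-computes one aggregated table per combination of dimensions. The sample rows, the dimension names, and the build_cube helper are assumptions made for illustration; Kylin's real build process and storage layout are far more involved.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical flattened rows: three dimensions and one measure (amount)
DIMENSIONS = ("city", "category", "day")
ROWS = [
    {"city": "Beijing", "category": "book", "day": "2020-10-21", "amount": 30},
    {"city": "Beijing", "category": "phone", "day": "2020-10-21", "amount": 500},
    {"city": "Shanghai", "category": "book", "day": "2020-10-22", "amount": 20},
    {"city": "Shanghai", "category": "book", "day": "2020-10-21", "amount": 25},
]


def build_cube(rows, dimensions, measure):
    """Pre-compute SUM(measure) for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


CUBE = build_cube(ROWS, DIMENSIONS, "amount")
print(len(CUBE), "dimension combinations")
print(CUBE[("city",)])
print(CUBE[("city", "category")][("Shanghai", "book")])
```

Each entry of CUBE is one "angle" on the same data: a query grouped by city reads CUBE[("city",)] directly instead of scanning ROWS again.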

Kylin fetches data from this Cube, which the official description also makes clear. You can get the data through JDBC or the RESTful API.

So where does Kylin store the aggregated data (it has to be stored somewhere)? In HBase. If you haven't studied HBase, you can read my earlier article, Introduction to HBase.


Steps for using Kylin:

  • First, you need data (in Hive/Kafka), and you define the corresponding data model (structure) in Kylin
  • In Kylin's configuration, specify the fields to aggregate and count (these are the dimensions and measures mentioned above), then build the Cube (this is Kylin's pre-aggregation: define the dimensions to be counted and calculate them in advance)
  • Kylin stores the data in HBase, and you can query it through JDBC or the RESTful API
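For the last step, a query over the RESTful API might look like the Python sketch below, which only builds the HTTP request and does not send it. The endpoint path /kylin/api/query, the payload fields, port 7070, and the ADMIN/KYLIN account are what I remember from the Kylin REST documentation and should be treated as assumptions to check against your version. The project name, table name, and the build_query_request helper are hypothetical.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, user, password, project, sql, limit=100):
    """Build (but do not send) a SQL query request for Kylin's REST API."""
    # Assumed payload fields; verify them against the docs for your Kylin version.
    payload = {"sql": sql, "project": project, "offset": 0, "limit": limit}
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


REQUEST = build_query_request(
    "http://localhost:7070",  # assumed default host and port
    "ADMIN",                  # assumed default account
    "KYLIN",
    "my_project",             # hypothetical project name
    "SELECT city, SUM(amount) FROM fact_order GROUP BY city",
)
print(REQUEST.get_method(), REQUEST.full_url)

# To run the query against a live server:
# with urllib.request.urlopen(REQUEST) as response:
#     print(json.load(response))
```

The SQL is ordinary SQL. The caller does not need to know which pre-computed combination inside the Cube ends up answering it.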

Using Kylin

Common Q&A are also listed on the official website: http://kylin.apache.org/cn/docs/gettingstarted/faq.html

Although Kylin supports multi-dimensional aggregation, when building a Cube we usually prune it (that is, reduce the number of cuboids generated).

Say we have 10 dimensions. Without any optimization, the Cube will have 2 to the 10th power = 1,024 cuboids.

The maximum number of physical dimensions (excluding derived dimensions) in a cube is 63, but using a cube with more than 30 dimensions is not recommended, because it leads to the curse of dimensionality.

The commonly used pruning method is the aggregation group configuration, and within an aggregation group, mandatory dimensions are used most often.

For example, say I have three dimensions A, B, and C. Without optimization there are 7 non-empty combinations: (A)(B)(C)(AB)(ABC)(AC)(BC). If I mark dimension A as mandatory, the final combinations are (A)(AB)(ABC)(AC). A mandatory dimension means: the specified field must be included in the query conditions.
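The A, B, C example can be reproduced with a small Python sketch. It only enumerates combinations and is not Kylin's aggregation group logic; the helper names all_cuboids and prune_with_mandatory are made up here. Note that 2 to the power n counts every subset, including the empty one, while the list of 7 above leaves the empty combination out.

```python
from itertools import combinations


def all_cuboids(dimensions):
    """Every non-empty combination of the given dimensions."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


def prune_with_mandatory(cuboids, mandatory):
    """Keep only the combinations that contain every mandatory dimension."""
    required = set(mandatory)
    return [combo for combo in cuboids if required.issubset(combo)]


CUBOIDS = all_cuboids(("A", "B", "C"))
PRUNED = prune_with_mandatory(CUBOIDS, ("A",))
print(len(CUBOIDS), ["".join(c) for c in CUBOIDS])
print(len(PRUNED), ["".join(c) for c in PRUNED])
```

Marking one dimension as mandatory roughly halves the number of combinations, which is why it is such a common first step when pruning.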

Besides mandatory dimensions, there are also hierarchy dimensions and joint dimensions to help us prune. In general, mandatory and joint dimensions are used most.


We have seen that Kylin's data is already aggregated and stored in HBase, so queries are quite fast, but building the Cube is actually quite slow (ten minutes to half an hour is normal).

This causes latency (building a Cube takes time, and you can't request a Cube build every few seconds). Is that tolerable? It means the latest data can only be queried after the Cube build task has been scheduled and the Cube build has finished.

Aside: the cube is usually built by a scheduled task that calls Kylin's API.

Kylin has no built-in scheduler. You can trigger regular cube builds from an external scheduling service through the REST API, for example the Linux crontab command, Apache Airflow, and so on.
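A script that an external scheduler calls once a day might look roughly like the Python sketch below, which builds the request for one day's segment without sending it. The endpoint path /kylin/api/cubes/{cube_name}/build, the PUT method, and the body fields startTime, endTime, and buildType are what I recall from the Kylin REST documentation, so treat them as assumptions and confirm them for your version. The cube name, host, credentials, and helper names are hypothetical.

```python
import base64
import json
import urllib.request
from datetime import date, datetime, timedelta, timezone


def day_range_millis(day):
    """Return [start, end) of a UTC day as epoch milliseconds."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


def build_cube_request(base_url, user, password, cube_name, day):
    """Build (but do not send) a request asking Kylin to build one day's segment."""
    start_ms, end_ms = day_range_millis(day)
    # Assumed body fields; verify them against the docs for your Kylin version.
    payload = {"startTime": start_ms, "endTime": end_ms, "buildType": "BUILD"}
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/cubes/{cube_name}/build",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )


REQUEST = build_cube_request(
    "http://localhost:7070", "ADMIN", "KYLIN",  # assumed host and account
    "my_cube",                                  # hypothetical cube name
    date(2020, 10, 20),
)
print(REQUEST.get_method(), REQUEST.full_url)
print(json.loads(REQUEST.data))

# To submit the build job to a live server:
# with urllib.request.urlopen(REQUEST) as response:
#     print(json.load(response))
```

A crontab entry or an Airflow task would then run such a script on a schedule, passing the previous day as the segment to build.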

However, newer versions of Kylin already support realtime_olap: Kylin stores the real-time data, merges it with the data in HBase, and then returns the result, which makes queries real-time.


Finally

This article gives a simple introduction to Kylin. For the details you still need to read the official website (it's in Chinese and easy to read, and the documentation is very good). If needed, I'll add more details later.

