Search WeChat for 【Java3y】 to follow a man with dreams. Your likes and follows are my greatest support!
This article is included in my GitHub repository: https://github.com/ZhongFuCheng3y/3y, which has more than 300 original articles and has recently been running an interview series and a project series.
Today I'd like to talk with you about Kylin.
Due to work needs, I studied Kylin some time ago, and now I'm writing up my notes. (My words may help you get started with Kylin. At the very least, after reading this you should know what Kylin does.)
Without further ado, let's go.
Kylin is an Apache Foundation open source project led and contributed by Chinese developers, so there is Chinese documentation to study.
On the official website we can see the introduction of the Kylin project:
Apache Kylin™ is an open source, distributed analytical data warehouse that provides a SQL query interface and multidimensional analysis (OLAP) capability on top of Hadoop/Spark to support extremely large datasets. First developed by eBay and contributed to the open source community, it can query huge tables in sub-second time.
After reading this introduction, there is only one way to describe Kylin: it's impressive. What makes it impressive? Let's talk about that next.
At first glance, some readers may not know what OLAP is, so let me explain it briefly. (You see words like Hadoop / Spark / SQL / big data every day. Even if you don't understand how they work internally, you know what they are for, right?)
To talk about OLAP, I have to mention its sibling OLTP. Let's take a brief look at their full names:
- OLTP: On-Line Transaction Processing
- OLAP: On-Line Analytical Processing
The full names alone may not tell us much, but they do show the difference between the two: one is about "transactions" and the other is about "analysis".
From the application level, we can simply think of OLTP as mainly used for business systems, for example placing orders or making transactions (bank transfers and so on). OLAP is mainly used for data warehouse systems. It supports complex analysis operations, focuses on decision support, and provides intuitive query results.
I'll draw a mind map for you as well, and once you've seen it you'll basically understand:
By now you should have a basic understanding of OLAP. Let's go back to the sentence above: multidimensional analysis (OLAP) capability to support extremely large datasets. What is the first thing that comes to mind?
Sanwai's first thought was Hive (its underlying storage is HDFS, which supports large-scale data).
Once Hive comes up, you'll notice that Hive seems to support almost everything in the first half of the Kylin introduction, except for the last sentence: "it can query huge tables in sub-second time".
That's right. From this you can see what Kylin is for: it can query huge tables in sub-second time to complete data analysis and decision-making.
With Hive, a query may have to run for a few minutes (for example, when my SQL is poorly written, running for half an hour is also common). From the business side, we want the data used for analysis to come back faster, and it is to support this demand that Kylin became popular.
To meet this demand, is Kylin the only choice? Obviously not.
When I first joined the company, I complained that Hive ran too slowly, and the colleague sitting next to me told me: use Presto, our big data platform supports it.
There are many OLAP tools and frameworks available. Let's get a brief understanding of them.
As we all know, executing a Hive query actually runs Map-Reduce jobs that fetch data from HDFS. The execution process therefore involves both computation (Map-Reduce) and storage (HDFS).
- Some people think Map-Reduce computes too slowly, so they stop using Map-Reduce and run on another computing engine, such as an MPP architecture, while leaving the storage unchanged.
- Some people think fetching data from HDFS is too slow, so they change where the data is stored and no longer read it from there.
- Some people go further: they replace the storage as well and offer their own framework as a one-stop solution.
- Some people think the Hadoop ecosystem is fine as it is: aggregate the data first, and at query time read the aggregated data directly, which is also very fast.
Because every company has different business scenarios and backgrounds, and every OLAP framework has different strengths, many OLAP technologies continue to thrive side by side.
Getting started with Kylin
From the above, we already know why there are so many OLAP technologies: in essence, we want the data we analyze to be queried faster, and Kylin is one of these technologies.
As can be seen from the picture above, Kylin depends entirely on the Hadoop ecosystem. So how does Kylin speed things up? The answer is: pre-aggregation.
Suppose we retrieve from MySQL all rows whose date is later than 2020-10-20. As long as the date column is indexed, we can find the relevant data quickly.
But suppose we retrieve from MySQL all rows whose date is later than 2020-10-20 and also want the total amount each user spent during that period. Once the data volume is large, no matter how you build the index, the query speed will not be satisfactory.
So what if I first compute the statistics for each user by day and write them into a table? When a user then searches by date, won't it be fast? (Because I have already aggregated the data by day once, this table has far fewer rows than the original table.)
Kylin uses exactly this pre-aggregation idea to speed up queries, which is how it achieves sub-second query response.
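The pre-aggregation idea above can be illustrated with a minimal sketch in plain Python. This is not how Kylin is implemented; the `orders` rows and both function names are hypothetical and exist only to show the "aggregate by day once, query the small table later" pattern.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw order rows: (order_date, user_id, amount).
orders = [
    (date(2020, 10, 19), "u1", 30),
    (date(2020, 10, 21), "u1", 10),
    (date(2020, 10, 21), "u1", 5),
    (date(2020, 10, 21), "u2", 7),
    (date(2020, 10, 22), "u2", 8),
]


def pre_aggregate_by_day(rows):
    """Build the (day, user) -> total table once, ahead of query time."""
    daily = defaultdict(int)
    for day, user, amount in rows:
        daily[(day, user)] += amount
    return dict(daily)


def spend_per_user_after(daily, start):
    """Answer the query by scanning only the small pre-aggregated table."""
    totals = defaultdict(int)
    for (day, user), amount in daily.items():
        if day > start:
            totals[user] += amount
    return dict(totals)


daily_table = pre_aggregate_by_day(orders)
result = spend_per_user_after(daily_table, date(2020, 10, 20))
print(result)
```

With real data the raw table may hold many rows per user per day, while the daily table holds at most one, which is where the speedup comes from.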
So what are the steps for using Kylin? The official website has already answered this for us:
- Define a star or snowflake model on a dataset
- Build a cube on the defined tables
- Query with standard SQL through ODBC, JDBC, or the RESTful API, and get query results with sub-second response time
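As a rough sketch of the query step, the snippet below builds (but does not send) a request to Kylin's REST query endpoint using only the Python standard library. The host and port, the `ADMIN`/`KYLIN` credentials, the `learn_kylin` project, and the `kylin_sales` table are assumptions taken from a default sample installation; adjust them for your own environment and check the REST API documentation for your Kylin version.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, user, password, project, sql, limit=100):
    """Build (but do not send) a POST request to Kylin's query endpoint."""
    body = json.dumps({
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }).encode("utf-8")
    # Kylin's REST API uses HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Assumed values from a local sample installation.
request = build_query_request(
    "http://localhost:7070", "ADMIN", "KYLIN",
    project="learn_kylin",
    sql="SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
)

# To run the query against a live server, uncomment:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing it at a real cluster.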
In the steps above, you may not know the following terms: star model, snowflake model, cube. Let me briefly explain:
In the data warehouse field, the main table is called the fact table, and the tables that the fact table's foreign keys refer to are called dimension tables.
Star schema: all dimension tables are linked directly to the fact table (figure above).
Snowflake schema: one or more dimension tables are not connected directly to the fact table and instead connect to it through other dimension tables (figure below).
In Kylin, the angle from which data is analyzed is called a dimension, and the metric to be analyzed is called a measure.
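To make these terms concrete, here is a small sketch of a star schema in plain Python. All table names, columns, and values are hypothetical: `fact_orders` is the fact table, `dim_user` and `dim_product` are dimension tables linked directly to it, `city` and `category` are dimensions, and `amount` is the measure.

```python
from collections import defaultdict

# Dimension tables, keyed by the id that the fact table refers to.
dim_user = {
    1: {"name": "alice", "city": "Beijing"},
    2: {"name": "bob", "city": "Shanghai"},
}
dim_product = {
    10: {"category": "book"},
    11: {"category": "phone"},
}

# Fact table: foreign keys into the dimension tables plus a numeric measure.
fact_orders = [
    {"user_id": 1, "product_id": 10, "amount": 20},
    {"user_id": 1, "product_id": 11, "amount": 300},
    {"user_id": 2, "product_id": 10, "amount": 25},
]


def total_amount_by(dimension):
    """Sum the 'amount' measure grouped by one dimension attribute."""
    totals = defaultdict(int)
    for row in fact_orders:
        # Follow the foreign keys to collect the row's dimension attributes.
        attrs = {**dim_user[row["user_id"]], **dim_product[row["product_id"]]}
        totals[attrs[dimension]] += row["amount"]
    return dict(totals)


by_city = total_amount_by("city")
by_category = total_amount_by("category")
print(by_city, by_category)
```

In a snowflake schema, `city` might live in a separate table that `dim_user` points to, so reaching it from the fact table would take one more lookup.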
All right, let's see what a cube means:
A cube here means an OLAP cube. From the two-dimensional tables above we can form a data cube. This data cube, the Cube, can be viewed from different angles (dimensions), and together these angles make up a complete view of the data.
Putting the above together: a Cube is in fact a cube built from a dataset along different dimensions (although the pictures are all three-dimensional, the Cube you build can have many more than three dimensions).
The Cube is what queries read data from. The official description is also clear on this: the data can be obtained through the RESTful API.
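To make the idea of a Cube concrete, here is a minimal sketch that pre-aggregates one measure for every combination of dimensions. It is an illustration only, not Kylin's build algorithm; the rows, dimension names, and `build_cube` function are hypothetical. Each combination of dimensions corresponds to what Kylin calls a cuboid, and this sketch also keeps the empty combination, which holds the grand total.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("city", "category", "day")

rows = [
    {"city": "Beijing", "category": "book", "day": "2020-10-21", "amount": 20},
    {"city": "Beijing", "category": "phone", "day": "2020-10-21", "amount": 300},
    {"city": "Shanghai", "category": "book", "day": "2020-10-22", "amount": 25},
]


def build_cube(rows, dimensions, measure):
    """Pre-aggregate the measure for every combination of dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


cube = build_cube(rows, DIMENSIONS, "amount")

# A query becomes a lookup in the matching cuboid instead of a table scan.
beijing_total = cube[("city",)][("Beijing",)]
grand_total = cube[()][()]
print(len(cube), beijing_total, grand_total)
```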
Where does Kylin store the aggregated data (it has to be stored somewhere)? In HBase. If you haven't studied HBase, you can read my earlier article "Introduction to HBase".
- First of all, you need to have data, and you define the corresponding data model (structure) in Kylin
- In the Kylin system, configure the fields to aggregate and count (these are the dimensions and measures mentioned above), then build the Cube (this is Kylin's pre-aggregation: define the dimensions to be counted and calculate them in advance)
- Kylin stores the data in HBase, and you can query the data through the RESTful API
Common Q&A are also listed on the official website: http://kylin.apache.org/cn/docs/gettingstarted/faq.html
Kylin supports multi-dimensional aggregation, but when building a Cube we usually prune it (i.e. reduce the number of cuboids generated).
Say we have 10 dimensions. Without any optimization, the Cube will have 2 to the 10th power = 1024 cuboids.
The maximum number of physical dimensions (excluding derived dimensions) in a cube is 63, but using a cube with more than 30 dimensions is not recommended, because it leads to the curse of dimensionality.
The commonly used pruning method is the aggregation group configuration, and within an aggregation group, mandatory dimensions are used most often.
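The growth behind this warning is simple arithmetic: with n dimensions and no pruning there are 2 to the power n dimension combinations. A quick sketch (the function name is made up for illustration):

```python
def cuboid_count(num_dimensions):
    """Number of dimension combinations (cuboids) without any pruning."""
    return 2 ** num_dimensions


counts = {n: cuboid_count(n) for n in (3, 10, 20, 30)}
for n, count in counts.items():
    print(f"{n} dimensions -> {count} cuboids")
```

Every extra dimension doubles the number of cuboids, which is why pruning matters long before the hard limit is reached.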
For example, say I have three dimensions A, B, and C. Without optimization there are 7 combinations: (A) (B) (C) (AB) (ABC) (AC) (BC). If I specify A as a mandatory dimension, the final combinations are (A) (AB) (ABC) (AC). A mandatory dimension means: the specified field must be included in the query conditions.
Besides mandatory dimensions, there are also hierarchy dimensions and joint dimensions to help us prune. Generally, mandatory dimensions and joint dimensions are used more.
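The A, B, C example can be reproduced with a short sketch. It only models the mandatory rule as described here (keep the combinations that contain every mandatory dimension); the function names are hypothetical, and hierarchy and joint dimensions are not modeled.

```python
from itertools import combinations


def all_cuboids(dimensions):
    """Every non-empty combination of the given dimensions."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


def prune_with_mandatory(cuboids, mandatory):
    """Keep only the cuboids that contain every mandatory dimension."""
    return [c for c in cuboids if set(mandatory) <= set(c)]


full = all_cuboids(("A", "B", "C"))
pruned = prune_with_mandatory(full, mandatory=("A",))
print(len(full), pruned)
```

The cost of the pruning is visible here too: a query that filters or groups only by B or C no longer has a matching cuboid, which is why the mandatory field must appear in the query conditions.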
As we have seen, Kylin's data has already been aggregated and stored in HBase, so queries are quite fast, but building the Cube is actually quite slow (ten minutes to half an hour is normal).
This leads to latency (building a Cube takes time, and it's not possible to request a Cube build every few seconds). Is that tolerable? It means the latest data can only be queried after the Cube build task has been scheduled and the Cube build has finished.
Aside: a cube is usually built by a scheduled task that calls Kylin's API.
Kylin has no built-in scheduler. You can trigger regular cube builds from an external scheduling service through the REST API, such as the Linux crontab command, Apache Airflow, etc.
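As a sketch of what such an external trigger might look like, the snippet below builds (but does not send) a cube build request. The endpoint path and the `startTime` / `endTime` / `buildType` body fields follow my reading of the Kylin REST API documentation and should be verified against your Kylin version; the host, credentials, and the cube name `my_cube` are assumptions.

```python
import base64
import json
import urllib.request
from datetime import datetime, timezone


def to_epoch_millis(day):
    """Convert a 'YYYY-MM-DD' string (taken as UTC midnight) to epoch millis."""
    moment = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(moment.timestamp() * 1000)


def build_cube_request(base_url, user, password, cube_name, start_day, end_day):
    """Build (but do not send) a PUT request that asks Kylin to build a segment."""
    body = json.dumps({
        "startTime": to_epoch_millis(start_day),
        "endTime": to_epoch_millis(end_day),
        "buildType": "BUILD",
    }).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/cubes/{cube_name}/build",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Assumed values: local server, default credentials, hypothetical cube name.
build_request = build_cube_request(
    "http://localhost:7070", "ADMIN", "KYLIN",
    cube_name="my_cube", start_day="2020-10-20", end_day="2020-10-21",
)

# A scheduler (for example a daily crontab entry running this script) would
# then send it:
# with urllib.request.urlopen(build_request) as resp:
#     print(json.load(resp))
```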
However, newer versions of Kylin already support realtime_olap. At query time, Kylin merges the real-time data with the data in HBase and then returns the result.
This article gave a brief introduction to Kylin. For the details you still need to read the official website (it is available in Chinese, easy to read, and the documentation is very good). If needed, I will add more details later.