CDH + Kylin Trilogy, Part 3: Kylin's official demo

Time: 2021-07-05

Welcome to my GitHub

https://github.com/zq2599/blog_demos

Content: all my original articles and supporting source code, covering Java, Docker, Kubernetes, DevOps, etc.

This article is the final installment of the CDH + Kylin trilogy:

  1. CDH + Kylin Trilogy, Part 1: Preparation: prepare the machines, scripts, and installation packages;
  2. CDH + Kylin Trilogy, Part 2: Deployment and Setup: deploy CDH and Kylin, and complete the relevant settings on the management pages;

Now that Hadoop and Kylin are ready, let's walk through Kylin's official demo.

YARN parameter settings

After changing YARN's memory parameters, you must restart YARN for them to take effect; otherwise the jobs Kylin submits will not run due to resource constraints.
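For reference, these are the standard YARN memory properties involved, shown here in yarn-site.xml form. On CDH they are edited through Cloudera Manager rather than by hand, and the values below are only illustrative:

```xml
<!-- Illustrative values only; tune to your cluster. On CDH, set these via Cloudera Manager. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- total memory a NodeManager may hand out to containers -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- largest single container the scheduler will grant -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value> <!-- smallest container allocation increment -->
</property>
```

If `yarn.scheduler.maximum-allocation-mb` is smaller than what a Kylin build step requests, the job stays pending forever, which is the symptom the restart requirement above is meant to avoid.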

About Kylin's official demo

  1. The figure below shows part of the official demo script (create_sample_tables.sql), which creates Hive tables from data on HDFS:
    (screenshot)
  2. The script shows that KYLIN_SALES is the fact table and the rest are dimension tables; since KYLIN_ACCOUNT and KYLIN_COUNTRY are joined to each other, the dimension model follows a snowflake schema.

Import sample data

  1. SSH into the CDH server
  2. Switch to the hdfs account: su - hdfs
  3. Run the import script: ${KYLIN_HOME}/bin/sample.sh
  4. When the import succeeds, the console output looks like this:
    (screenshot)

Check the data

  1. To check the data, run beeline to enter interactive mode (Hive officially recommends beeline as a replacement for the Hive CLI):
    (screenshot)
  2. In the beeline session, enter the connection URL !connect jdbc:hive2://localhost:10000, type the account hdfs when prompted, and press Enter for an empty password:
    (screenshot)
  3. Run show tables to confirm that the Hive tables have been created:
    (screenshot)
  4. Find the earliest and latest order dates, which will be needed later when building the cube. Run select min(PART_DT), max(PART_DT) from kylin_sales; — the earliest is 2012-01-01, the latest is 2014-01-01, and the whole query takes 87 seconds:
    (screenshot)

Build the cube

With the data ready, you can now build the Kylin cube:

  1. Open the Kylin web UI: http://192.168.50.134:7070/kylin
  2. Load the sample metadata, as shown below:
    (screenshot)
  3. As shown in the red box below, the metadata is loaded successfully:
    (screenshot)
  4. On the Model page you can see the fact table and dimension tables. As shown below, you can launch a MapReduce job that calculates the cardinality of each column of the dimension table KYLIN_ACCOUNT:
    (screenshot)
  5. Go to the YARN page (port 8088 on the CDH server); as shown below, a MapReduce job is running:
    (screenshot)
  6. The job finishes quickly (ten-odd seconds). Refresh the Kylin page and the cardinality of KYLIN_ACCOUNT has been computed. A Hive query shows there are 10000 distinct ACCOUNT_IDs, yet the figure below reports a cardinality of 10420: Kylin estimates cardinality with the approximate HyperLogLog algorithm, so the result deviates slightly from the exact value. The cardinality of the other four columns matches the Hive results:
    (screenshot)
  7. Next, start building the cube:
    (screenshot)
  8. Set the date range: the Hive query showed data from 2012-01-01 to 2014-01-01. Note that the end date must be later than 2014-01-01 (the end of the range is exclusive):
    (screenshot)
  9. On the Monitor page you can watch the progress:
    (screenshot)
  10. Go to the YARN page (port 8088 on the CDH server) to see the corresponding jobs and their resource usage:
    (screenshot)
  11. After the build completes, the cube's status changes to Ready:
    (screenshot)
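Step 6 above attributes the 10420-vs-10000 gap to HyperLogLog. A minimal Python sketch of the algorithm (my own illustration, not Kylin's actual implementation) shows why the estimate carries a small relative error while using only constant memory:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                # 2**p registers; relative error ~ 1.04 / sqrt(2**p)
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash of the item
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # first p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:             # small-range (linear counting) correction
            return int(self.m * math.log(self.m / zeros))
        return int(raw)

hll = HyperLogLog()
for account_id in range(10000):   # 10000 distinct "account ids", like KYLIN_ACCOUNT
    hll.add(account_id)
print(hll.estimate())             # close to 10000, but generally not exact
```

The estimator only keeps 2**p small registers instead of the full set of values, which is why Kylin can compute column cardinalities cheaply in a single MapReduce pass, at the cost of a few percent of error.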

Query

  1. First, rerun the query for the earliest and latest transaction times, which took 87 seconds on Hive. As shown below, Kylin returns the same results in 0.14 seconds:
    (screenshot)
  2. The following SQL is Kylin's official example for comparing response times: it aggregates orders by date, sorts by date, and is run on both Kylin and Hive:
select part_dt, sum(price) as total_sold, count(distinct seller_id) as sellers from kylin_sales group by part_dt order by part_dt;
  3. The Kylin query takes 0.13 seconds:
    (screenshot)
  4. The same query on Hive returns identical results in 40.196 seconds:
    (screenshot)
  5. Finally, look at resource usage: the cube build consumed 18 GB of memory:
    (screenshot)

So far, CDH + Kylin has been taken all the way from deployment to hands-on use, which concludes the CDH + Kylin trilogy. If you are learning Kylin, I hope this series gives you a useful reference.
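Taking the timings above at face value, a quick back-of-the-envelope calculation shows the magnitude of the speedup. The numbers come from my environment and will vary with hardware and data volume:

```python
# Timings observed in the walkthrough above (seconds); your numbers will differ.
min_max_hive, min_max_kylin = 87.0, 0.14      # min/max(PART_DT) query
agg_hive, agg_kylin = 40.196, 0.13            # group-by aggregation query

print(f"min/max query:   {min_max_hive / min_max_kylin:.0f}x faster on Kylin")
print(f"aggregate query: {agg_hive / agg_kylin:.0f}x faster on Kylin")
```

The gap is expected: Hive scans the raw data for every query, while Kylin answers from the precomputed cube, so the cost was paid once at build time.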

Welcome to my official account: programmer Xinchen

Search WeChat for "programmer Xinchen". I'm Xinchen, looking forward to exploring the Java world with you