Flink x Zeppelin in practice: an analysis of Hive streaming

Time: 2021-1-28

It has been three weeks since Flink 1.11 was officially released, and its most attractive feature is Hive streaming. As it happens, zeppelin-0.9-preview2 was also released not long ago, so I wrote this hands-on analysis of Flink Hive streaming on Zeppelin. The article covers the following parts:

  • The meaning of hive streaming
  • Checkpoint & Dependency
  • Write Kafka
  • Hive Streaming Sink
  • Hive Streaming Source
  • Hive Temporal Table

The meaning of hive streaming

Many students may wonder why hive streaming has such a high status in Flink 1.11. What can it bring us? In the big data field there have long been two classic architectures, Lambda and Kappa:

  • Lambda architecture – stream and batch are separated: static data is synchronized to the Hive data warehouse by scheduled jobs, while real-time data is consumed by the real-time computing engine. This leads to several problems:
    • inconsistent data caliber between the two pipelines
    • the output latency of offline computing is too large
    • redundant storage of the same data
  • Kappa architecture – all data is produced by real-time computing, and historical data is recomputed by replaying the message queue from earlier offsets. It has its own problems; after all, no architecture solves everything once and for all:
    • the message middleware cannot keep the full history, and the same data stored row by row takes up too much space
    • throughput is insufficient when recomputing historical data in the real-time engine
    • ad-hoc analysis is not possible

To solve these problems, the industry introduced the real-time data warehouse, which addresses most of the pain points, but some gaps remain. For example, how do we compute over historical data? How do we do ad-hoc analysis? So in practice the real-time warehouse and the offline warehouse coexist, which in turn brings more problems: multiple models, inconsistent outputs, recomputation of historical data, and so on.

Hive streaming can solve these problems! We no longer need multiple models, and we no longer have to write the same metric twice, once as real-time SQL and once as offline SQL, just because it involves historical data; ad-hoc analysis works as well. How? Just read the table produced by hive streaming!

Next, let's start with parameter configuration, then write to Hive with streaming, then read the Hive table with streaming, and finally join a Hive dimension table. After going through the whole process, you will have a deeper understanding of hive streaming and what it can do.

Checkpoint & Dependency

A file only changes from the in-progress state to the finished state after a checkpoint completes, so we need to configure checkpointing properly. Configuring checkpoints in Zeppelin is very simple.

%flink.conf

#Checkpoint configuration

pipeline.time-characteristic EventTime
execution.checkpointing.interval 120000
execution.checkpointing.min-pause 60000
execution.checkpointing.timeout 60000
execution.checkpointing.externalized-checkpoint-retention RETAIN_ON_CANCELLATION

# Dependency (jar) configuration

flink.execution.packages org.apache.flink:flink-connector-kafka_2.11:1.11.0,org.apache.flink:flink-connector-kafka-base_2.11:1.11.0

Since we need to read data from Kafka, we also add the Kafka connector dependency.

Write Kafka

Our data comes from a Tianchi dataset and sits on the local disk in CSV format, so we need to write it to Kafka first.

First, create the CSV source table and the Kafka sink table:

%flink.ssql
SET table.sql-dialect=default;
DROP TABLE IF EXISTS source_csv;
CREATE TABLE source_csv (
  user_id string,
  theme_id string,
  item_id string,
  leaf_cate_id string,
  cate_level1_id string,
  clk_cnt int,
  reach_time string
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///Users/dijie/Downloads/Cloud_Theme_Click/theme_click_log.csv',
  'format' = 'csv'
);
%flink.ssql
SET table.sql-dialect=default;
DROP TABLE IF EXISTS kafka_table;
CREATE TABLE kafka_table (
  user_id string,
  theme_id string,
  item_id string,
  leaf_cate_id string,
  cate_level1_id string,
  clk_cnt int,
  reach_time string,
  ts AS localtimestamp,
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'theme_click_log',
  'properties.bootstrap.servers' = '10.70.98.1:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);

Since the registered table will be both written to and read from, I added the watermark when creating it. Because the timestamps in the source data are very old, I use the current time (localtimestamp) minus 5 seconds as the watermark.

As you can see, I set the SQL dialect to default at the beginning of the statement. Why? Is there any other dialect? Bear with me.

In fact, in earlier versions Flink could already integrate with Hive, including creating tables in Hive. However, much of the syntax was incompatible with Hive, and the tables created could not even be viewed from Hive; the main reason was that the dialects were not compatible. Therefore, in Flink 1.11, to reduce the learning cost caused by syntax incompatibility, you can use DDL to create Hive tables that are queryable from Hive. Flink now supports dialects: the default is default, which behaves as before; if you want to create Hive tables that Hive itself can query, use the hive dialect. For details, please refer to the link below.

Hive Dialect:
https://ci.apache.org/project…

Then read the data from the CSV and write it to Kafka.

%flink.ssql(type=update)

insert into kafka_table select * from source_csv ;

Take another look at Kafka to see whether the data has arrived.
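If you prefer to check from inside Zeppelin rather than with a Kafka console consumer, a minimal sketch like the following works; note that kafka_table starts from the latest offset, so run it while the CSV-to-Kafka insert is still in progress (the column selection is just for illustration):

%flink.ssql(type=update)

-- Peek at the rows flowing through the topic via the kafka_table defined above.
-- Cancel the paragraph once a few rows have shown up.
select user_id, theme_id, clk_cnt, ts from kafka_table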

No problem, so let’s write hive.

Hive Streaming Sink

Create the Hive sink table, and remember to switch the dialect to hive; otherwise there will be problems.

%flink.ssql
SET table.sql-dialect=hive;
DROP TABLE IF EXISTS hive_table;
CREATE TABLE hive_table (
  user_id string,
  theme_id string,
  item_id string,
  leaf_cate_id string,
  cate_level1_id string,
  clk_cnt int,
  reach_time string
) PARTITIONED BY (dt string, hr string, mi string) STORED AS parquet TBLPROPERTIES (
  'partition.time-extractor.timestamp-pattern'='$dt $hr:$mi:00',
  'sink.partition-commit.trigger'='partition-time',
  'sink.partition-commit.delay'='1 min',
  'sink.partition-commit.policy.kind'='metastore,success-file'
);

Let's go over the parameters briefly:

  • partition.time-extractor.timestamp-pattern: the partition time extractor pattern, which must be consistent with the partition fields in the DDL;
  • sink.partition-commit.trigger: the partition commit trigger, either process-time or partition-time. With process-time, the pattern above and the watermark are not required, and a partition is committed once the current time exceeds the partition creation time plus sink.partition-commit.delay. With partition-time, the source table must define a watermark, and a partition is committed once the watermark exceeds the partition time extracted by the pattern plus sink.partition-commit.delay (a sketch of the process-time variant follows this list);
  • sink.partition-commit.delay: the commit delay described above;
  • sink.partition-commit.policy.kind: how to commit the partition. Generally, after a successful commit you need to notify the metastore so that Hive can read your latest partition data; if you need to merge small files, you can also plug in a custom class by implementing the PartitionCommitPolicy interface.
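For comparison, here is a minimal sketch of a table that uses the process-time trigger instead (a hypothetical variant for illustration, not part of this demo); in this mode neither the timestamp pattern nor a watermark is required:

%flink.ssql
SET table.sql-dialect=hive;
-- Hypothetical variant for illustration: partitions are committed once the current
-- time passes the partition creation time plus the configured delay.
DROP TABLE IF EXISTS hive_table_proctime;
CREATE TABLE hive_table_proctime (
  user_id string,
  clk_cnt int
) PARTITIONED BY (dt string, hr string) STORED AS parquet TBLPROPERTIES (
  'sink.partition-commit.trigger'='process-time',
  'sink.partition-commit.delay'='1 min',
  'sink.partition-commit.policy.kind'='metastore,success-file'
);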

Next, let's insert the data into the hive_table we created above:

%flink.ssql

insert into hive_table
select user_id, theme_id, item_id, leaf_cate_id, cate_level1_id, clk_cnt, reach_time,
       DATE_FORMAT(ts, 'yyyy-MM-dd'), DATE_FORMAT(ts, 'HH'), DATE_FORMAT(ts, 'mm')
from kafka_table

Let the program run for a while ~ let's pour ourselves a cup of Java (vintage 1995) ☕️.

Then take a look at our HDFS and see what’s under the path.

You can also query it with Hive yourself; I'll keep you in suspense here and read the data with hive streaming later.
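If you do want to peek from the Hive side, a minimal sketch run in the Hive CLI or beeline might look like this (the partition values below are placeholders; pick one from the SHOW PARTITIONS output):

-- Run in the Hive CLI / beeline, not in a Zeppelin %flink.ssql paragraph.
SHOW PARTITIONS hive_table;
-- The partition values are placeholders; use one returned by the statement above.
SELECT * FROM hive_table WHERE dt='2021-01-28' AND hr='10' AND mi='30' LIMIT 10;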

Hive Streaming Source

Because the hive table has already been created, we can use it directly when reading data here; the difference is that we need table hints to override its parameters.

The biggest limitation of the hive streaming source is that it cannot read new files added to a partition that has already been read; in short, a consumed partition will not be read again. It may look like a pitfall, but if you think about it carefully, it is consistent with streaming semantics.

As usual, let's go over the meaning of the parameters:

  • streaming-source.enable: obviously, whether to enable streaming mode;
  • streaming-source.monitor-interval: the interval for monitoring new files / partitions;
  • streaming-source.consume-order: either create-time or partition-time. create-time refers not to the partition creation time but to the file/folder creation time in HDFS; partition-time refers to the partition time; for non-partitioned tables only create-time can be used. The official documentation here is a little vague and makes it sound as if new files under already-read partitions can be picked up; after testing and reading the source code, I found that they cannot;
  • streaming-source.consume-start-offset: which partition to start reading from.

Let’s take a look at the data~

Remember to include the SET statement that enables dynamic table options first, otherwise table hints cannot be used; the sketch below includes it.
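Putting it together, a streaming read with table hints might look like the sketch below (the SET key shown is the Flink 1.11 switch for dynamic table options, and the start offset is a placeholder; adjust both to your environment):

%flink.ssql(type=update)

-- Table hints are disabled by default; enable dynamic table options first.
SET table.dynamic-table-options.enabled=true;

-- The consume-start-offset value is a placeholder; use a partition time that exists.
SELECT * FROM hive_table /*+ OPTIONS(
  'streaming-source.enable' = 'true',
  'streaming-source.monitor-interval' = '1 min',
  'streaming-source.consume-order' = 'partition-time',
  'streaming-source.consume-start-offset' = '2021-01-28 00:00:00'
) */;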

Hive Temporal Table

After looking at the streaming source and streaming sink, let's finally try Hive as a dimension table.

In fact, using a Hive dimension table is very simple: as long as the table exists in Hive, it can be used as a dimension table, and its parameters can be completely overridden with table hints.

  • lookup.join.cache.ttl: the cache TTL. It is worth noting that the Hive dimension table caches all of its data in the TaskManager's memory, so if the table is large it is easy to hit OOM; if the TTL is too short, data is reloaded frequently, which hurts performance (a sketch follows below).
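A minimal sketch of such a lookup join is below. Note the assumptions: kafka_table would additionally need a processing-time attribute (e.g. a column declared as p AS PROCTIME(), which is not in the DDL above), and the join key and cache TTL here are just for illustration:

%flink.ssql(type=update)

-- Hypothetical lookup join for illustration; assumes kafka_table declares a
-- processing-time column p (e.g. `p AS PROCTIME()`).
SELECT
  a.user_id,
  a.item_id,
  a.clk_cnt,
  b.theme_id
FROM kafka_table a
LEFT JOIN hive_table /*+ OPTIONS('lookup.join.cache.ttl'='5 min') */
  FOR SYSTEM_TIME AS OF a.p AS b
ON a.user_id = b.user_id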

Because it is a left join, rows that do not exist in the dimension table are filled with null. Take another look at the DAG:

In the boxed part of the diagram you can see that the dimension table is joined with a LookupJoin operator.

If you write the SQL incorrectly and drop the FOR SYSTEM_TIME AS OF a.p clause, the DAG will look like this:

That is no longer a dimension table join; in fact, it is more like a regular join between a stream and a bounded (batch) table.

Closing thoughts

The improvement of hive streaming means breaking through the last barrier to unifying streaming and batch. It allows not only OLAP analysis over historical data, but also results to be emitted in real time. This is undoubtedly good news for ETL developers, and I believe that in the near future more enterprises will complete the construction of their real-time data warehouses this way.


Note download:

https://github.com/lonelyGhos…

Finally, I'd like to introduce the DingTalk group for "Flink on Zeppelin". If you have any questions, you can discuss them there; Apache Zeppelin PMC member Chien Feng is also in the group and you can ask him questions directly.

Author's introduction:

Dijie, senior data expert at Mogujie, is responsible for Mogujie's real-time computing platform. He currently focuses on Flink on Zeppelin and is an Apache Zeppelin contributor.
