Guide: Flink has provided Hive integration since 1.9.0. After several version iterations, the Hive integration has been further deepened in Flink 1.11, which also begins to bring streaming scenarios to Hive.
This article shares the new Hive-related features in Flink 1.11 and shows how to use Flink to make a Hive data warehouse real-time, achieving the goal of unified batch and stream processing. The main contents include:
·Background of the Flink and Hive integration
·New features in Flink 1.11
·Building a unified batch-stream data warehouse with Hive
1、 Background of the Flink and Hive integration
Why integrate Flink with Hive? The original motivation was to exploit Flink's capability in batch processing. As we all know, Flink has been a successful engine in stream computing, used by many. In Flink's design philosophy, batch computing is a special case of stream processing, which means that if Flink is good at stream computing, its architecture can also support batch computing scenarios. In batch computing, SQL is a very important entry point, because data analysts are more used to developing with SQL than writing programs against the DataStream or DataSet APIs.
Hive, the SQL engine of the Hadoop ecosystem, is a de facto standard. Most user environments build their data warehouses with Hive. Relatively newer SQL engines, such as Spark SQL and Impala, also provide Hive integration. To connect with existing user scenarios, we consider Hive integration an indispensable function for Flink as well.
Therefore, Flink 1.9 began to provide Hive integration, released as a beta in that version. In Flink 1.10, the Hive integration became production-ready. When Flink 1.10 was released, we compared Flink with Hive on MapReduce using a 10 TB TPC-DS test set.
The blue bars show the time taken by Flink, and the orange bars the time taken by Hive on MapReduce. The final result is that Flink is roughly 7 times faster than Hive on MapReduce, which verifies that Flink SQL can support batch computing scenarios well.
Next, we introduce the architecture of Flink's Hive integration, which involves several layers:
·Access to Hive metadata;
·Reading and writing Hive table data;
·Production readiness.
1. Accessing Hive metadata
Those who have used Hive know that Hive metadata is managed by the Hive Metastore, so Flink needs to communicate with the Hive Metastore. To better access Hive metadata, a brand-new Catalog API was introduced in Flink.
This new interface has a generic design: it is not limited to Hive metadata, but can also connect to the metadata of other external systems.
Moreover, multiple catalogs can be created in one Flink session, each corresponding to an external system. Users define catalogs through the Flink Table API or, when using the SQL Client, in its yaml configuration file. When the SQL Client creates the TableEnvironment, these catalogs are loaded, and the TableEnvironment manages the catalog instances through a CatalogManager. The SQL Client can then use these catalogs to access external-system metadata when it subsequently submits SQL statements.
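As a minimal sketch of such a yaml definition (the catalog name `myhive` and the configuration path are placeholders; the keys follow the Flink 1.11 SQL Client configuration):

```yaml
# sql-client-defaults.yaml: register a HiveCatalog named "myhive"
catalogs:
  - name: myhive
    type: hive
    hive-conf-dir: /opt/hive-conf   # directory containing hive-site.xml
```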
The above figure lists two Catalog implementations. One is GenericInMemoryCatalog, which keeps all metadata in the memory of the Flink client. Its behavior is similar to Flink before the Catalog interface existed: the lifecycle of all metadata matches the SQL Client session, and the metadata created in a session is automatically lost when the session ends.
The other is HiveCatalog, the one relevant to Hive. Behind a HiveCatalog is an instance of the Hive Metastore, with which it communicates to read and write metadata. To support multiple Hive versions, whose Metastore APIs may be incompatible, a HiveShim layer sits between HiveCatalog and the Hive Metastore; different Hive versions are supported through the HiveShim.
On the one hand, HiveCatalog allows Flink to access Hive's own metadata; on the other hand, it gives Flink the ability to persist metadata. In other words, HiveCatalog can store both Hive metadata and Flink metadata. For example, a Kafka table created in Flink can also be saved in the HiveCatalog. Before HiveCatalog, Flink had no metadata persistence capability at all.
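A sketch of this persistence capability (the catalog name, table name, topic, and broker address below are made up for illustration): a Kafka table created while a HiveCatalog is the current catalog is stored in the Hive Metastore and survives across sessions.

```sql
-- Switch to a previously registered HiveCatalog (name is an assumption)
USE CATALOG myhive;

-- This Kafka table definition is persisted in the Hive Metastore,
-- so it remains visible in later sessions
CREATE TABLE user_log (
  user_id STRING,
  action  STRING,
  ts      TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'user_log',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);
```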
2. Reading and writing Hive table data
With access to Hive metadata in place, the other important aspect is reading and writing Hive table data. Hive tables are stored in a Hadoop file system, i.e. HDFS or another file system. In theory, any storage that implements the Hadoop FileSystem interface can hold Hive tables.
·Reading data is implemented by HiveTableSource;
·Writing data is implemented by HiveTableSink.
One design principle is to reuse Hive's original Input/Output Formats and SerDes as much as possible to read and write Hive data. This has two main advantages: reuse reduces the development workload, and it keeps the written data compatible with Hive as far as possible. The goal is that data written by Flink must be readable by Hive, and conversely, data written by Hive must be readable by Flink.
3. Production Ready
In Flink 1.10, the Hive integration became production-ready, which mainly means functional completeness. The specific functions are as follows:
2、 New features in Flink 1.11
Next, we introduce some of the new Hive-related features in Flink 1.11.
1. Simplified dependency management
The first improvement is simplified dependency management for the Hive connector. One pain point of the Hive connector is that several jar dependencies must be added, and different Hive versions require different jars. For example:
The first figure shows the jars to add for Hive 1.0.0; the second shows those for Hive 2.2.0. Different Hive versions need different jars, both in number and in version, so users who do not read the documentation carefully can easily get the dependencies wrong. A missing jar or a wrong version leads to strange, hard-to-understand errors. This was one of the problems users reported most often with the Hive connector.
So we want to simplify dependency management and give users a better experience. Specifically, starting from Flink 1.11, pre-bundled Hive dependency packages are provided:
Users can select the dependency package corresponding to their Hive version.
If you are not using an open-source version of Hive, you can still add the individual jars yourself as in 1.10.
2. Enhancements to the Hive dialect
The Hive dialect was introduced in Flink 1.10, but it was rarely used because its functionality in that version was quite weak. Its only function was a switch controlling whether partitioned tables may be created: with the Hive dialect set, partitioned tables can be created in Flink SQL; without it, creation is not allowed.
Another key limitation is that it provided no Hive syntax compatibility: even with the Hive dialect set and partitioned tables allowed, the DDL for creating a partitioned table was not Hive syntax.
In Flink 1.11, the Hive dialect is enhanced. The goal of the enhancement is that users of the Flink SQL Client get an experience similar to Hive CLI or Beeline; that is, the Flink SQL Client accepts Hive-specific syntax. In other words, when users migrate to Flink, their Hive scripts need no modification at all.
To achieve this, Flink 1.11 makes the following improvements:
·The dialect is parameterized. Currently the parameter supports default and hive: default is Flink's own dialect, and hive is the Hive dialect.
·It can be used from both the SQL Client and the Table API.
·It can be switched dynamically at statement granularity. For example, after the session is created, set the dialect to default to write a statement in Flink's dialect; after executing a few statements, set it to hive to write in the Hive dialect. No session restart is needed when switching.
·It is compatible with common Hive DDL and basic DML.
·It provides an experience similar to Hive CLI or Beeline.
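The statement-level switching described above can be sketched in the SQL Client as follows (the table definitions are illustrative only):

```sql
-- write this statement in Flink's own dialect
SET table.sql-dialect=default;
CREATE TABLE flink_table (id INT) WITH ('connector' = 'datagen');

-- switch to the Hive dialect; no session restart needed
SET table.sql-dialect=hive;
CREATE TABLE hive_tbl (id INT) PARTITIONED BY (dt STRING) STORED AS orc;
```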
3. Enabling the Hive dialect
The figure above shows how to enable the Hive dialect in the SQL Client. The initial dialect can be set in the yaml file, or switched dynamically after the SQL Client is up.
The Hive dialect can also be enabled through the Flink Table API:
As shown, you get the config from the TableEnvironment and set the dialect on it.
4. Syntax supported by the Hive dialect
The Hive dialect syntax is mainly enhanced in DDL, because operating on Hive metadata with DDL through Flink SQL was hardly usable in 1.10, so DDL was the focus for solving this problem.
The currently supported DDL statements are as follows:
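As a rough sketch of what Hive-style DDL looks like under the Hive dialect (the table and partition values are made up; the full list of supported statements is in the figure above and the Flink documentation):

```sql
SET table.sql-dialect=hive;

-- Hive-style DDL accepted by the Hive dialect
CREATE TABLE IF NOT EXISTS logs (
  user_id STRING,
  msg     STRING
) PARTITIONED BY (dt STRING)
STORED AS parquet;

ALTER TABLE logs ADD PARTITION (dt='2020-07-01');
DESCRIBE logs;
```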
5. Streaming data into Hive
In Flink 1.11, we also combine streaming scenarios with Hive. By combining Flink with Hive, we can help make Hive data warehouses real-time.
Streaming writes to Hive are implemented on top of the Streaming File Sink. They are purely SQL-based and require no user code. Streaming writes to Hive support both partitioned and non-partitioned tables. Hive data warehouses generally hold offline data and users have high consistency requirements, so exactly-once semantics are supported. Streaming writes to Hive have a latency of about 5-10 minutes. Pushing the latency lower means more small files are generated, and small files are unfriendly to HDFS: too many of them hurt HDFS performance. In that case, some small-file compaction can be performed.
There are several configurations for streaming writes to Hive:
For partitioned tables, the partition-commit delay parameter must be set. It indicates how long a partition is expected to keep receiving data, e.g. a day or an hour.
The partition-commit trigger indicates when a partition commit is triggered. In 1.11, process-time and partition-time trigger mechanisms are supported.
The partition-commit policy indicates how partitions are committed. For Hive, a partition must be committed to the Metastore before it becomes visible; the metastore policy only supports Hive tables. There is also a success-file policy, which writes a success file into the partition to tell downstream jobs that the data is ready. Users can also implement a custom commit policy. In addition, multiple policies can be specified, e.g. metastore and success-file at the same time.
Let's look at how streaming writes to Hive are implemented.
There are two main parts. One is the StreamingFileWriter, which writes the data. It distinguishes buckets; a bucket here is similar to Hive's partition concept, and each subtask writes data to different buckets. A bucket written by a subtask may maintain three kinds of files at the same time: in-progress files are being written, pending files have been written but not yet committed, and finished files have been written and committed.
The other is the StreamingFileCommitter, which runs after the StreamingFileWriter. It commits partitions, so it is not needed for non-partitioned tables. When a partition's data is ready, the StreamingFileWriter sends a commit message to the StreamingFileCommitter, telling it that the data is ready; the commit trigger and commit policy then decide when and how to commit.
Here is a specific example:
In the example, we create a table called hive_table with two partition columns, dt and hour: dt is a date string and hour is an hour string. The commit trigger is set to partition-time, the commit delay to 1 hour, and the commit policies to metastore and success-file.
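The example, which appeared as a figure, can be sketched roughly as follows; the non-partition columns are made up for illustration, and the property keys follow the Flink 1.11 Hive/filesystem sink options:

```sql
SET table.sql-dialect=hive;
CREATE TABLE hive_table (
  user_id STRING,
  order_amount DOUBLE
) PARTITIONED BY (dt STRING, hour STRING) STORED AS parquet
TBLPROPERTIES (
  -- map partition values to a timestamp for the partition-time trigger
  'partition.time-extractor.timestamp-pattern' = '$dt $hour:00:00',
  'sink.partition-commit.trigger' = 'partition-time',
  'sink.partition-commit.delay' = '1 h',
  'sink.partition-commit.policy.kind' = 'metastore,success-file'
);
```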
6. Streaming consumption of Hive
In Flink 1.10, Hive data could only be read in batch mode. From 1.11 onward, Hive data can also be read in a streaming fashion.
This works by continuously monitoring the Hive table for new data and incrementally consuming it when it appears.
To enable streaming consumption for a Hive table, you can set it in the table properties, or use the dynamic table options added in 1.11 to specify at query time whether streaming reading should be enabled for that table.
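Both ways can be sketched as follows (the table name is a placeholder; the option keys follow the Flink 1.11 Hive connector):

```sql
-- option 1: enable streaming reading in the table's properties
ALTER TABLE hive_table SET TBLPROPERTIES ('streaming-source.enable' = 'true');

-- option 2: enable it only for one query via a dynamic table option hint
-- (dynamic options must first be switched on:
--  SET table.dynamic-table-options.enabled=true;)
SELECT * FROM hive_table /*+ OPTIONS('streaming-source.enable'='true') */;
```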
Both partitioned and non-partitioned tables are supported. For non-partitioned tables, new files in the table directory are monitored and read incrementally. For partitioned tables, new partitions are detected by monitoring the partition directories and the Metastore; when a new partition appears, its data is read out. Note that reading a new partition is one-shot: after a new partition is added, its data is read once, and the partition is no longer monitored afterwards. So if you want to stream a Hive partitioned table with Flink, you should ensure that a partition's data is complete when the partition is added.
Streaming consumption of Hive data also requires additional parameters. First, the consume order must be specified: since data is read incrementally, you need to specify the order in which to consume it. Currently two consume orders are supported: create-time and partition-time.
Users can also specify a consume start offset, similar to specifying an offset in Kafka: the point in time from which to start consuming. When Flink consumes the data, it checks this and only reads data after that point in time.
Finally, a monitor interval can be specified. New data is currently detected by scanning the file system, and scanning too frequently puts heavy pressure on the file system, so you can control the interval.
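The three parameters above can be sketched together in one query (the table name and values are illustrative; the option keys follow the Flink 1.11 Hive connector):

```sql
-- stream-read a Hive table with an explicit consume order,
-- a start offset, and a monitor interval
SELECT * FROM hive_table
/*+ OPTIONS(
    'streaming-source.enable' = 'true',
    'streaming-source.consume-order' = 'partition-time',
    'streaming-source.consume-start-offset' = '2020-07-01',
    'streaming-source.monitor-interval' = '10 min'
) */;
```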
Finally, let's look at how streaming consumption works, starting with non-partitioned tables.
The ContinuousFileMonitoringFunction in the figure continuously monitors the files under the non-partitioned table's directory and interacts with the file system. Once new files are added, splits are generated for them and sent to the ContinuousFileReaderOperator. After the reader operator receives the splits, it actually reads the data from the file system and sends it downstream for processing.
For streaming consumption, there is not much difference between partitioned and non-partitioned tables. The HiveContinuousMonitoringFunction also continuously scans the file system, but it scans for new partition directories. When it finds a new partition directory, it further checks the Metastore to see whether the partition has been committed there. If it has, the data in the partition can be consumed: splits are generated for the partition data and sent to the ContinuousFileReaderOperator, and the data is consumed from there.
7. Joining Hive dimension tables
Another scenario combining Hive with streaming data is joining Hive dimension tables, for example joining streaming data with an offline Hive dimension table.
Flink's temporal table syntax is used to join Hive dimension tables: the Hive dimension table is treated as a temporal table and joined with the stream table. To learn more about temporal tables, check Flink's website.
The join is implemented by each subtask loading the Hive table into memory, i.e. caching the entire Hive table. If the size of the Hive dimension table exceeds the memory available to a subtask, the job fails.
Since the Hive dimension table may be updated, users can set a timeout for the Hive table cache; after it expires, the subtask reloads the dimension table. Note that this scenario is not suitable for dimension tables that are updated frequently, as frequent reloads put heavy pressure on the HDFS file system. It suits slowly changing dimension tables, and the cache timeout is generally set fairly long, usually at the hour level.
This diagram shows the principle of joining a Hive dimension table. Streaming Data represents the streaming data, and LookupJoinRunner is the join operator. It extracts the join key from the streaming records and passes it to the FileSystemLookupFunction.
The FileSystemLookupFunction is a table function. It interacts with the underlying file system, loads the Hive table, and then looks up the join key in the Hive table to determine which rows can be joined.
Here is an example of joining a Hive dimension table:
This is an example from Flink's website: the stream table is Orders, and LatestRates is the Hive dimension table.
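A sketch of such a join, following Flink's temporal table join syntax (the column names are assumptions: Orders is assumed to have a processing-time attribute proctime and a currency column, and LatestRates a currency and a rate column; the cache TTL property key follows the Flink 1.11 Hive connector):

```sql
-- cache TTL controls how often each subtask reloads the Hive table
ALTER TABLE LatestRates SET TBLPROPERTIES ('lookup.join.cache.ttl' = '12 h');

-- join the stream table with the Hive dimension table as a temporal table
SELECT o.order_id, o.amount, o.currency, r.rate, o.amount * r.rate
FROM Orders AS o
JOIN LatestRates FOR SYSTEM_TIME AS OF o.proctime AS r
ON o.currency = r.currency;
```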
3、 A unified batch-stream data warehouse with Hive
As the above shows, the development in Flink 1.11 centers on the Hive data warehouse and unified batch-stream processing. Since Flink is a stream processing engine, we hope to help users better combine batch and streaming, make Hive data warehouses real-time, and make it more convenient for users to mine the value of their data.
Before Flink 1.11, Flink's Hive integration did batch computation and only supported offline scenarios. One problem with offline scenarios is high latency, and batch jobs are usually scheduled through scheduling frameworks, so the delays accumulate: the second job can only run after the first finishes, and so on. The end-to-end latency is therefore the sum of all the jobs' latencies.
After 1.11, with Hive's streaming support, we can make the Hive data warehouse real-time.
For example, online data can be written to Hive in real time using Flink as the ETL. Once the data is in Hive, a new Flink job can run real-time or near-real-time queries on it and return results quickly. Meanwhile, other Flink jobs can use the data written into the Hive warehouse as dimension tables, joining it with other online data to obtain analysis results.
About the author: Li Rui (Alibaba alias "Tianli") is a technical expert at Alibaba and a member of the Apache Hive PMC. Before joining Alibaba, he worked at Intel, IBM and other companies, mainly contributing to the Hive, HDFS and Spark open-source projects.