One set of SQL for the whole data warehouse? Flink makes a new attempt

Date: 2020-10-06

A data warehouse is a basic service that every company needs once its data reaches a certain scale, and it is also a fundamental building block of "data intelligence". Fast access to data feedback not only helps improve products and user experience, but also supports scientific decision-making, so low-latency access to data is particularly important.

At present, most enterprises build their data warehouses as two separate systems: real-time and offline. The real-time warehouse serves business that requires low latency, while the offline warehouse handles complex batch workloads. The overall architecture is very complex and involves many systems and computing frameworks, which forces companies to maintain expertise in all of them, drives up talent costs, and makes problems hard to diagnose. End users also need to be familiar with multiple syntaxes. This article analyzes the current data warehouse architecture and explores whether offline and real-time warehouses can be considered together, and whether Flink's unified architecture can solve most of these problems.

An e-book download is available at the end of the article.

Data warehouse architecture

[Figure: data warehouse architecture, with data flows numbered ①–⑨]

A data warehouse can be divided into three layers: ODS (operational data store), DW (data warehouse layer), and ADS (application data store).

1. ODS (operational data store) layer

This layer holds the raw data coming from logs or business databases; in a traditional offline data warehouse it is synchronized by CDC (change data capture) tools. Using a single unified Kafka cluster for this role lets data enter the warehouse with lower latency and also unifies real-time and offline ingestion at this layer.
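As an illustration (the table, topic, and field names here are hypothetical, and the exact connector properties depend on the Flink version), such an ODS table on Kafka can be declared directly in Flink SQL:

-- ODS layer: raw order events from logs / CDC, landed in a Kafka topic (illustrative sketch)
create table ods_order_log (
    id bigint,
    amount double,
    user_id bigint,
    status string,
    ts timestamp
) with (
    'connector.type' = 'kafka',
    …  -- topic, broker addresses and format (e.g. JSON) configuration
);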

2. DW (data warehouse) layer

The DW layer is generally divided into a DWD layer and a DWS layer.

  • DWD (data warehouse detail) layer: the detailed data layer. Data in this layer should be cleaned, consistent, and accurate; it carries the same information as the ODS layer but follows the warehouse's standard schema definitions.
  • DWS (data warehouse service) layer: the summary data layer. Data here may be lightly aggregated and may be organized as a star or snowflake model. Some business-level calculations are already done in this layer, and users can compute the data needed for data services on top of it (a minimal sketch of both steps follows this list).
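A minimal sketch of both steps, reusing the illustrative ODS table above (all table and field names are hypothetical):

-- DWD: clean the raw ODS data (drop invalid records, normalize fields)
insert into dwd_order
select id, amount, user_id, lower(status) as status, ts
from ods_order_log
where id is not null and amount >= 0;

-- DWS: lightly aggregated summary data on top of the DWD layer;
-- in streaming mode this result is continuously updated (see the discussion of dynamic tables below)
insert into dws_user_order_stats
select user_id, count(*) as order_cnt, sum(amount) as total_amount
from dwd_order
group by user_id;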

3. ADS (application data store) layer

Unlike DWS, this layer serves users directly; the data needs no further computation and is the final data required by applications.

The overall data flow is mainly divided into two pipelines:

  1. Business DB and logs > Kafka > real-time data warehouse (Kafka + dimension tables) > BI DB > data services
  2. Business DB and logs > Kafka > Hive Metastore + HDFS > BI DB > data services

The mainstream data warehouse architecture is still the Lambda architecture. Although the Lambda architecture is complex, it covers the scenarios the business needs and is the most flexible approach for the business.

The Lambda architecture consists of two links:

  • The traditional offline data warehouse is stable and handles complex, flexible computation; batch jobs guarantee T+1 report generation and flexible ad-hoc queries.
  • The real-time data warehouse provides data services with low latency. A traditional offline warehouse usually has a T+1 delay, so analysts cannot make real-time decisions, while the end-to-end latency of the real-time warehouse is much lower, even down to seconds. This not only speeds up analysis and decision-making but also enables new services such as real-time monitoring and alerting. Flink's strength is real-time and streaming computation, and Kafka is the core storage of the real-time data warehouse.

There are nine edges (①–⑨) in the figure above. Each edge represents a data transformation, i.e. a big-data computation. The rest of this article analyzes these edges and explores what role Flink can play in them.

Flink one-stack computation

Metadata

First, metadata management. In the offline data warehouse the Hive Metastore manages metadata, but Kafka by itself has no metadata management capability. Two approaches are recommended here:

1. Confluent Schema Registry

Once the Schema Registry service is set up, the schema of a table can be obtained through the registry URL. For tables with hundreds of fields, this saves a lot of work when writing Flink jobs. Flink is also working on combining its schema inference with the Confluent Schema Registry. However, this still does not remove the need to create the table, and users must fill in the corresponding registry URL.

2. Catalog

Flink now provides a HiveCatalog, so Kafka tables can be registered directly in the Hive Metastore and used in SQL right away. Related work is still in progress in the community, see [1] and [2].

Both the offline and the real-time data warehouse use the HiveCatalog in Flink:

use catalog my_hive;
-- build streaming database and tables;
create database stream_db;
use stream_db;
create table order_table (
    id bigint,
    amount double,
    user_id bigint,
    status string,
    ts timestamp,
    …  -- there may be dozens more fields
    ts_day string,
    ts_hour string
) with (
    'connector.type' = 'kafka',
    …  -- Kafka table related configuration
);
-- build batch database and tables;
create database batch_db;
use batch_db;
create table order_table
partitioned by (ts_day, ts_hour)
with (
    'connector.type' = 'hive',
    …  -- Hive table related configuration
)
like stream_db.order_table (excluding options);

With the catalog in place, subsequent computations can be fully reused between batch and streaming, giving users the same experience for both.

Data warehouse import

Computations ① and ⑤ are the imports into the real-time and the offline data warehouse respectively. Near-real-time import into the offline warehouse has recently become common practice, and importing with Flink makes the offline warehouse data fresher.

In the past this was mainly done with DataStream + StreamingFileSink, which, however, does not support ORC and cannot update the Hive Metastore.

With Flink's streaming integration with Hive, a Hive streaming sink [3] is provided, so the import can be expressed in SQL. SQL built-in functions and UDFs can be used conveniently and flexibly, and the same statement can be reused for both the streaming and the batch job.

insert into [stream_db.|batch_db.]order_table select … from log_table;

Data processing


Computations ② and ⑥ are the intermediate data processing of the real-time and the offline data warehouse respectively. There are mainly three kinds:

  1. ETL: as with data import, there is no difference between batch and streaming.
  2. Dimension table join: enriching records with fields from a dimension table is a very common warehouse operation. In the offline warehouse you can simply join the Hive table, but streaming jobs are somewhat different, as described in detail below.
  3. Aggregation: in streaming jobs these stateful computations produce not a fixed value but a continuously changing one.

Dimension table join

Unlike offline computation, which only cares about the dimension table at one point in time, a streaming job runs continuously, so it must handle not only static dimension data but also a dynamically changing dimension table.

In addition, for join efficiency, streaming jobs usually join against a database table rather than a Hive table.

Example:

-- Streaming dimension table
use stream_db;
create table user_info (
    user_id bigint,
    age int,
    address string,
    primary key(user_id)
) with (
    'connector.type' = 'jdbc',
    ...
);

-- The dimension table of the offline data warehouse is imported into the real-time data warehouse
insert into user_info select * from batch_db.user_info;

-- Dimension table join, the SQL is reused by batch and streaming
insert into order_with_user_age select * from order_table join user_info for system_time as of order_table.proctime on order_table.user_id = user_info.user_id;

One troublesome point here is that the dimension table must be periodically synchronized into the real-time dimension table database. Could we instead join the Hive dimension table of the offline warehouse directly? The community is currently working on Hive dimension tables; the challenges are:

  1. The Hive dimension table may be too large to fit into the cache:
  • Consider shuffling by key, i.e. a distributed dimension table join, to reduce the amount of data cached per parallel instance
  • Consider putting the dimension table data into state
  2. The dimension table update problem (a minimal sketch of the TTL idea follows this list):
  • The simple solution is TTL-based expiration
  • The more complex solution is to implement a Hive streaming source and combine it with Flink's watermark mechanism
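The TTL idea is essentially what the JDBC lookup connector already offers for the dimension table defined earlier: lookup results are cached and expire after a TTL, so updates eventually become visible at the cost of some staleness. A minimal sketch, assuming the legacy Flink 1.10/1.11 JDBC connector options (the exact property keys depend on the connector version):

create table user_info (
    user_id bigint,
    age int,
    address string,
    primary key(user_id)
) with (
    'connector.type' = 'jdbc',
    …  -- JDBC URL, table name and credentials
    -- cache lookup results and expire them after a TTL, so dimension table updates
    -- become visible without restarting the job
    'connector.lookup.cache.max-rows' = '100000',
    'connector.lookup.cache.ttl' = '10min'
);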

Stateful computing and data export

Example:

select age, avg(amount) from order_with_user_age group by age;

This simple aggregation SQL executes in completely different ways in batch and in streaming.

The biggest difference between streaming aggregation and offline aggregation is that the former produces a dynamic table [4]: its output changes continuously. The concept of a dynamic table is simple: for a streaming count, the output is driven by the input rather than emitted once after all input has been consumed, as in batch. Therefore its result is dynamic:

  • If the result stays inside SQL, Flink's retract mechanism ensures that the SQL result is the same as in batch (see the sketch after this list).
  • If the result goes to external storage, this poses challenges for the sink.
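As a rough illustration of how the retract mechanism keeps the streaming result consistent with batch (the input rows are hypothetical):

-- Streaming execution of: select age, avg(amount) from order_with_user_age group by age;
-- input (age=20, amount=10)  ->  emit insert  (20, 10.0)
-- input (age=20, amount=30)  ->  emit retract (20, 10.0), then insert (20, 20.0)
-- input (age=30, amount=5)   ->  emit insert  (30, 5.0)
--
-- A batch job over the same finite input would directly output (20, 20.0) and (30, 5.0);
-- downstream SQL operators consume the retractions and therefore stay consistent with that result.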

Output after stateful computation:

  • If the sink is an updatable database such as HBase, Redis, or JDBC, this does not seem to be a problem: we just keep updating it.
  • But if the storage is not updatable, we cannot modify previously emitted data. For this, Flink has proposed changelog support [5]: it aims to support this kind of sink natively and output data with a specific schema, so that downstream consumers can work with it as well.

Example:

-- Batch: output to MySQL once after the computation finishes; each key has exactly one row
-- Streaming: the rows in MySQL are continuously updated
insert into mysql_table select age, avg(amount) from order_with_user_age group by age;
-- Batch: each key has only one row, so a plain append is enough
insert into hive_table select age, avg(amount) from order_with_user_age group by age;
-- Streaming: records are continuously appended to Kafka, with an extra column marking them as upsert messages;
-- a downstream Flink consumer will automatically handle these upserts
insert into kafka_table select age, avg(amount) from order_with_user_age group by age;

Ad-hoc and OLAP

In the offline data warehouse, both detailed and summary data can be queried ad hoc, giving analysts a lot of flexibility.
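For example, with the batch table defined earlier, an analyst can directly run an ad-hoc query such as (the partition value is hypothetical):

select ts_day, count(*) as order_cnt, sum(amount) as total_amount
from batch_db.order_table
where ts_day = '2020-10-01'
group by ts_day;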

At present, a big drawback of the real-time data warehouse is that it cannot serve ad-hoc queries, because it does not keep historical data: Kafka could retain data for more than three days, but storage cost would be high and query efficiency poor.

One idea is to provide batch-stream unified sink components for OLAP databases, for example (a hypothetical sketch follows this list):

  • Druid sink
  • Doris sink
  • ClickHouse sink
  • HBase/Phoenix sink
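Assuming such a unified sink existed, usage could look like the following sketch (the connector name and properties are purely hypothetical, since these sinks are still only an idea):

-- Hypothetical: a batch/stream unified OLAP sink
create table olap_age_stats (
    age int,
    avg_amount double
) with (
    'connector.type' = 'clickhouse',   -- hypothetical connector name
    …  -- connection and table configuration
);

-- the same SQL then serves both the streaming job and the batch backfill,
-- and analysts run ad-hoc queries against the OLAP system
insert into olap_age_stats select age, avg(amount) from order_with_user_age group by age;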

Summary

Starting from today's Lambda architecture, this article has analyzed the capabilities of Flink's one-stack data warehouse computation solution. Some of the new Flink features mentioned here are still evolving rapidly. With continued exploration and practice, we hope to move step by step toward unified computation, so that the future data warehouse architecture can truly unify users' offline and real-time pipelines and provide a unified experience:

  • Unified metadata
  • Unified SQL development
  • Unified data import and export
  • Consider unified storage in the future

References

[1] https://cwiki.apache.org/conf…

[2] https://cwiki.apache.org/conf…

[3] https://cwiki.apache.org/conf…

[4] https://ci.apache.org/project…

[5] https://cwiki.apache.org/conf…

Bonus: free e-book download

Handle production issues with confidence: the e-book "Ten technical challenges of Apache Flink" is free to download!

Click to download the Apache Flink e-book collection for free.

  • In-depth interpretation: 1.02 million lines of code, 1,270 issues, what did Flink 1.10 release?
  • From development to production: how to determine the cluster size?
  • Demo: building a streaming application based on Flink SQL
  • A practical guide to Flink checkpoint troubleshooting
  • How to analyze and handle Flink backpressure?
  • Flink on YARN (part 1): mastering the infrastructure and startup process
  • Flink on YARN (part 2): common problems and troubleshooting ideas
  • The integration of Apache Flink and Apache Hive
  • Flink batch SQL 1.10 in practice
  • How to write a custom Python UDF in PyFlink 1.10?
  • Principles and practice of Flink 1.10 native Kubernetes


The book is compiled and shared by Apache Flink core contributors and production users from major companies. It covers principle analysis, application practice, demos, troubleshooting of common production problems, and the principles and practice of the Flink 1.10 ecosystem, helping big-data developers solve real Flink production and application problems.