Real-time data warehouse | Exploration and practice of building a real-time data warehouse with Flink 1.11 SQL

Time: 2020-12-30

A real-time data warehouse mainly addresses the poor timeliness of a traditional data warehouse. It is typically used for real-time OLAP analysis, real-time dashboards, real-time monitoring of business metrics, and similar scenarios. Although the architecture and technology choices of a real-time data warehouse differ from those of a traditional offline warehouse, the basic methodology of data warehouse construction is the same. This article walks through a demo of building a real-time data warehouse from 0 to 1 with Flink SQL, covering the whole process of data collection, storage, computation, and visualization. From this article you can learn:

  • The basic architecture of a real-time data warehouse
  • The data processing flow of a real-time data warehouse
  • New SQL features in Flink 1.11
  • Bugs in Flink 1.11
  • A complete, runnable example

The ancients spared no effort in their learning; what is begun in youth is only perfected with age.

What is learned from books always feels shallow; to truly understand something, you must practice it yourself.

Case introduction

This article uses an e-commerce business as an example to show the data processing flow of a real-time data warehouse. Since the goal is to explain the construction process, the data computations are deliberately kept simple. To keep the case reproducible and complete, detailed operation steps are given, and for ease of demonstration all operations are performed in the Flink SQL CLI.

Architecture design

The architecture is shown in the figure below: first, Canal parses the MySQL binlog and writes the change data to Kafka. Flink SQL then cleans and joins the raw data and writes the resulting detail wide table back to Kafka. Dimension-table data is stored in MySQL, and Flink SQL joins the detail wide table with the dimension tables; the aggregated results are written to MySQL and finally visualized with FineBI.

[Figure: overall architecture of the real-time data warehouse]

Business data preparation

  • Order table (order_info)
CREATE TABLE `order_info` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'number',
  `consignee` varchar(100) DEFAULT NULL COMMENT 'consignee',
  `consignee_tel` varchar(20) DEFAULT NULL COMMENT 'consignee phone',
  `total_amount` decimal(10,2) DEFAULT NULL COMMENT 'total amount',
  `order_status` varchar(20) DEFAULT NULL COMMENT 'order status',
  `user_id` bigint(20) DEFAULT NULL COMMENT 'user id',
  `payment_way` varchar(20) DEFAULT NULL COMMENT 'payment method',
  `delivery_address` varchar(1000) DEFAULT NULL COMMENT 'shipping address',
  `order_comment` varchar(200) DEFAULT NULL COMMENT 'order comment',
  `out_trade_no` varchar(50) DEFAULT NULL COMMENT 'order transaction number (for third-party payment)',
  `trade_body` varchar(200) DEFAULT NULL COMMENT 'order description (for third-party payment)',
  `create_time` datetime DEFAULT NULL COMMENT 'creation time',
  `operate_time` datetime DEFAULT NULL COMMENT 'operation time',
  `expire_time` datetime DEFAULT NULL COMMENT 'expiration time',
  `tracking_no` varchar(100) DEFAULT NULL COMMENT 'logistics tracking number',
  `parent_order_id` bigint(20) DEFAULT NULL COMMENT 'parent order id',
  `img_url` varchar(200) DEFAULT NULL COMMENT 'image path',
  `province_id` int(20) DEFAULT NULL COMMENT 'region',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='order table';
  • Order detail table (order_detail)
CREATE TABLE `order_detail` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'number',
  `order_id` bigint(20) DEFAULT NULL COMMENT 'order id',
  `sku_id` bigint(20) DEFAULT NULL COMMENT 'sku_id',
  `sku_name` varchar(200) DEFAULT NULL COMMENT 'sku name (redundant)',
  `img_url` varchar(200) DEFAULT NULL COMMENT 'image name (redundant)',
  `order_price` decimal(10,2) DEFAULT NULL COMMENT 'purchase price (sku price when the order was placed)',
  `sku_num` varchar(200) DEFAULT NULL COMMENT 'number of items purchased',
  `create_time` datetime DEFAULT NULL COMMENT 'creation time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='order detail table';
  • Commodity table (sku_info)
CREATE TABLE `sku_info` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'skuid (item id)',
  `spu_id` bigint(20) DEFAULT NULL COMMENT 'spuid',
  `price` decimal(10,0) DEFAULT NULL COMMENT 'price',
  `sku_name` varchar(200) DEFAULT NULL COMMENT 'sku name',
  `sku_desc` varchar(2000) DEFAULT NULL COMMENT 'commodity specification description',
  `weight` decimal(10,2) DEFAULT NULL COMMENT 'weight',
  `tm_id` bigint(20) DEFAULT NULL COMMENT 'brand (redundant)',
  `category3_id` bigint(20) DEFAULT NULL COMMENT 'level-3 category id (redundant)',
  `sku_default_img` varchar(200) DEFAULT NULL COMMENT 'default display image (redundant)',
  `create_time` datetime DEFAULT NULL COMMENT 'creation time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='commodity table';
  • Commodity level-1 category table (base_category1)
CREATE TABLE `base_category1` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'number',
  `name` varchar(10) NOT NULL COMMENT 'category name',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='level-1 category table';
  • Commodity level-2 category table (base_category2)
CREATE TABLE `base_category2` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'number',
  `name` varchar(200) NOT NULL COMMENT 'level-2 category name',
  `category1_id` bigint(20) DEFAULT NULL COMMENT 'level-1 category id',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='level-2 category table';
  • Commodity level-3 category table (base_category3)
CREATE TABLE `base_category3` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'number',
  `name` varchar(200) NOT NULL COMMENT 'level-3 category name',
  `category2_id` bigint(20) DEFAULT NULL COMMENT 'level-2 category id',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='level-3 category table';
  • Province table (base_province)
CREATE TABLE `base_province` (
  `id` int(20) DEFAULT NULL COMMENT 'id',
  `name` varchar(20) DEFAULT NULL COMMENT 'province name',
  `region_id` int(20) DEFAULT NULL COMMENT 'region id',
  `area_code` varchar(20) DEFAULT NULL COMMENT 'administrative area code'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
  • Region table (base_region)
CREATE TABLE `base_region` (
  `id` int(20) NOT NULL COMMENT 'region id',
  `region_name` varchar(20) DEFAULT NULL COMMENT 'region name',
   PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Note: the above table-creation statements are executed in MySQL. For the complete table-creation and mock-data generation scripts, see:

Link: https://pan.baidu.com/s/1fcMg… Extraction code: zuqw

Data processing flow

Data synchronization in ODS layer

For the data synchronization of the ODS layer, please refer to my other article, "Real-time incremental data synchronization based on Canal and Flink (1)". Canal is used to parse the MySQL binlog and write the change records to the corresponding Kafka topics. Due to space limitations, the details are not repeated here. The result after synchronization is shown in the following figure:

[Figure: Kafka topics after the binlog data has been synchronized]
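
For readers unfamiliar with the canal-json format used throughout this article, each record in these topics is a JSON message produced by Canal. The abridged sample below is purely illustrative (the field values are made up, and the source database name mydw is only assumed from the topic prefix); Flink's canal-json format mainly reads the data, old and type fields:

{
  "data": [
    { "id": "1", "name": "Beijing", "region_id": "1", "area_code": "110000" }
  ],
  "old": null,
  "type": "INSERT",
  "database": "mydw",
  "table": "base_province",
  "es": 1607581536000,
  "ts": 1607581536271,
  "isDdl": false
}

INSERT, UPDATE and DELETE events all flow through the same topic, which is what allows the Flink tables defined below to behave as changelog sources.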

Dim layer dimension table data preparation

In this case the dimension tables are stored in MySQL; in actual production, HBase is commonly used to store dimension-table data instead (a sketch of what that could look like follows below). We mainly use two dimension tables: the region dimension table and the commodity dimension table. Their processing is described in the following subsections.
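
For reference only: if the region dimension data lived in HBase, the Flink source definition might look roughly like the sketch below. The HBase table name dim:dim_province, the column family info and the ZooKeeper address are assumptions for illustration, not part of this demo.

-- Sketch: a Flink 1.11 HBase source for a region dimension table (assumed names).
CREATE TABLE dim_province_hbase (
  rowkey STRING,                          -- province_id stored as the HBase rowkey
  info ROW<province_name STRING,
           area_code STRING,
           region_id INT,
           region_name STRING>,           -- one column family holding the attributes
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector' = 'hbase-1.4',
  'table-name' = 'dim:dim_province',
  'zookeeper.quorum' = 'kms-1:2181'
);

A temporal join against such a table is written the same way as the JDBC dimension joins shown later (JOIN ... FOR SYSTEM_TIME AS OF ...).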

  • Region dimension table

First, the data from the mydw.base_province and mydw.base_region topics is extracted into MySQL, using Flink SQL's Kafka data source with the canal-json format. Note: before loading, the corresponding target tables must be created in MySQL. The MySQL database used in this article is named dim and is used to store the dimension-table data, and its target tables have to exist before loading.
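
A minimal sketch of the MySQL-side DDL for those two target tables (executed in MySQL, not in Flink; the column types simply mirror the business tables above and can be adjusted as needed):

-- Executed in MySQL: target tables for the dimension data.
CREATE DATABASE IF NOT EXISTS dim;

CREATE TABLE dim.base_province (
  `id`        INT PRIMARY KEY,
  `name`      VARCHAR(20),
  `region_id` INT,
  `area_code` VARCHAR(20)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE dim.base_region (
  `id`          INT PRIMARY KEY,
  `region_name` VARCHAR(20)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

With these tables in place, the Flink SQL is as follows: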

-- -------------------------
--Province
--   kafka Source
-- ------------------------- 
DROP TABLE IF EXISTS `ods_base_province`;
CREATE TABLE `ods_base_province` (
  `id` INT,
  `name` STRING,
  `region_id` INT ,
  `area_code` STRING
) WITH(
'connector' = 'kafka',
 'topic' = 'mydw.base_province',
 'properties.bootstrap.servers' = 'kms-3:9092',
 'properties.group.id' = 'testGroup',
 'format' = 'canal-json' ,
 'scan.startup.mode' = 'earliest-offset' 
) ; 

-- -------------------------
--Province
--   MySQL Sink
-- ------------------------- 
DROP TABLE IF EXISTS `base_province`;
CREATE TABLE `base_province` (
    `id` INT,
    `name` STRING,
    `region_id` INT ,
    `area_code` STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://kms-1:3306/dim',
    'table-name' = 'base_province',  -- the table in MySQL to load data into
    'driver' = 'com.mysql.jdbc.Driver',
    'username' = 'root',
    'password' = '123qwe',
    'sink.buffer-flush.interval' = '1s'
);

-- -------------------------
--Province
--   MySQL Sink Load Data
-- ------------------------- 
INSERT INTO base_province
SELECT *
FROM ods_base_province;

-- -------------------------
--Area
--   kafka Source
-- ------------------------- 
DROP TABLE IF EXISTS `ods_base_region`;
CREATE TABLE `ods_base_region` (
  `id` INT,
  `region_name` STRING
) WITH(
'connector' = 'kafka',
 'topic' = 'mydw.base_region',
 'properties.bootstrap.servers' = 'kms-3:9092',
 'properties.group.id' = 'testGroup',
 'format' = 'canal-json' ,
 'scan.startup.mode' = 'earliest-offset' 
) ; 

-- -------------------------
--Area
--   MySQL Sink
-- ------------------------- 
DROP TABLE IF EXISTS `base_region`;
CREATE TABLE `base_region` (
    `id` INT,
    `region_name` STRING,
     PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://kms-1:3306/dim',
    'table-name' = 'base_region',  -- the table in MySQL to load data into
    'driver' = 'com.mysql.jdbc.Driver',
    'username' = 'root',
    'password' = '123qwe',
    'sink.buffer-flush.interval' = '1s'
);

-- -------------------------
--Area
--   MySQL Sink Load Data
-- ------------------------- 
INSERT INTO base_region
SELECT *
FROM ods_base_region;

After the above steps, the raw data needed for the dimension table has been stored in MySQL. Next, we create the dimension table itself in MySQL: using the two tables above, we create a view, dim_province, as the dimension table:

-- ---------------------------------
--Dim layer, area dimension table,
--Creating views in MySQL
-- ---------------------------------
DROP VIEW IF EXISTS dim_province;
CREATE VIEW dim_province AS
SELECT
  bp.id AS province_id,
  bp.name AS province_name,
  br.id AS region_id,
  br.region_name AS region_name,
  bp.area_code AS area_code
FROM base_region br 
     JOIN base_province bp ON br.id = bp.region_id
;

This gives us the dimension table we need: dim_province. When performing a dimension-table join, we create a JDBC data source in Flink SQL to read it. In the same way, we create the commodity dimension table, as follows:

-- -------------------------
--First class list
--   kafka Source
-- ------------------------- 
DROP TABLE IF EXISTS `ods_base_category1`;
CREATE TABLE `ods_base_category1` (
  `id` BIGINT,
  `name` STRING
)WITH(
 'connector' = 'kafka',
 'topic' = 'mydw.base_category1',
 'properties.bootstrap.servers' = 'kms-3:9092',
 'properties.group.id' = 'testGroup',
 'format' = 'canal-json' ,
 'scan.startup.mode' = 'earliest-offset' 
) ;

-- -------------------------
--First class list
--   MySQL Sink
-- ------------------------- 
DROP TABLE IF EXISTS `base_category1`;
CREATE TABLE `base_category1` (
    `id` BIGINT,
    `name` STRING,
     PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://kms-1:3306/dim',
    'table-name' = 'base_category1',  -- the table in MySQL to load data into
    'driver' = 'com.mysql.jdbc.Driver',
    'username' = 'root',
    'password' = '123qwe',
    'sink.buffer-flush.interval' = '1s'
);

-- -------------------------
--First class list
--   MySQL Sink Load Data
-- ------------------------- 

INSERT INTO base_category1
SELECT *
FROM ods_base_category1;

-- -------------------------
--Secondary category list
--   kafka Source
-- ------------------------- 
DROP TABLE IF EXISTS `ods_base_category2`;
CREATE TABLE `ods_base_category2` (
  `id` BIGINT,
  `name` STRING,
  `category1_id` BIGINT
)WITH(
'connector' = 'kafka',
 'topic' = 'mydw.base_category2',
 'properties.bootstrap.servers' = 'kms-3:9092',
 'properties.group.id' = 'testGroup',
 'format' = 'canal-json' ,
 'scan.startup.mode' = 'earliest-offset' 
) ;

-- -------------------------
--Secondary category list
--   MySQL Sink
-- ------------------------- 
DROP TABLE IF EXISTS `base_category2`;
CREATE TABLE `base_category2` (
    `id` BIGINT,
    `name` STRING,
    `category1_id` BIGINT,
     PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://kms-1:3306/dim',
    'table-name' = 'base_category2',  -- the table in MySQL to load data into
    'driver' = 'com.mysql.jdbc.Driver',
    'username' = 'root',
    'password' = '123qwe',
    'sink.buffer-flush.interval' = '1s'
);

-- -------------------------
--Secondary category list
--   MySQL Sink Load Data
-- ------------------------- 
INSERT INTO base_category2
SELECT *
FROM ods_base_category2;

-- -------------------------
--Three level category list
--   kafka Source
-- ------------------------- 
DROP TABLE IF EXISTS `ods_base_category3`;
CREATE TABLE `ods_base_category3` (
  `id` BIGINT,
  `name` STRING,
  `category2_id` BIGINT
)WITH(
'connector' = 'kafka',
 'topic' = 'mydw.base_category3',
 'properties.bootstrap.servers' = 'kms-3:9092',
 'properties.group.id' = 'testGroup',
 'format' = 'canal-json' ,
 'scan.startup.mode' = 'earliest-offset' 
) ; 

-- -------------------------
--Three level category list
--   MySQL Sink
-- ------------------------- 
DROP TABLE IF EXISTS `base_category3`;
CREATE TABLE `base_category3` (
    `id` BIGINT,
    `name` STRING,
    `category2_id` BIGINT,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://kms-1:3306/dim',
    'table-name' = 'base_category3',  -- the table in MySQL to load data into
    'driver' = 'com.mysql.jdbc.Driver',
    'username' = 'root',
    'password' = '123qwe',
    'sink.buffer-flush.interval' = '1s'
);

-- -------------------------
--Three level category list
--   MySQL Sink Load Data
-- ------------------------- 
INSERT INTO base_category3
SELECT *
FROM ods_base_category3;

-- -------------------------
--Commodity list
--   Kafka Source
-- ------------------------- 

DROP TABLE IF EXISTS `ods_sku_info`;
CREATE TABLE `ods_sku_info` (
  `id` BIGINT,
  `spu_id` BIGINT,
  `price` DECIMAL(10,0),
  `sku_name` STRING,
  `sku_desc` STRING,
  `weight` DECIMAL(10,2),
  `tm_id` BIGINT,
  `category3_id` BIGINT,
  `sku_default_img` STRING,
  `create_time` TIMESTAMP(0)
) WITH(
 'connector' = 'kafka',
 'topic' = 'mydw.sku_info',
 'properties.bootstrap.servers' = 'kms-3:9092',
 'properties.group.id' = 'testGroup',
 'format' = 'canal-json' ,
 'scan.startup.mode' = 'earliest-offset' 
) ; 

-- -------------------------
--Commodity list
--   MySQL Sink
-- ------------------------- 
DROP TABLE IF EXISTS `sku_info`;
CREATE TABLE `sku_info` (
  `id` BIGINT,
  `spu_id` BIGINT,
  `price` DECIMAL(10,0),
  `sku_name` STRING,
  `sku_desc` STRING,
  `weight` DECIMAL(10,2),
  `tm_id` BIGINT,
  `category3_id` BIGINT,
  `sku_default_img` STRING,
  `create_time` TIMESTAMP(0),
   PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://kms-1:3306/dim',
    'table-name' = 'sku_info',  -- the table in MySQL to load data into
    'driver' = 'com.mysql.jdbc.Driver',
    'username' = 'root',
    'password' = '123qwe',
    'sink.buffer-flush.interval' = '1s'
);

-- -------------------------
--Merchandise
--   MySQL Sink Load Data
-- ------------------------- 
INSERT INTO sku_info
SELECT *
FROM ods_sku_info;

After the above steps, the base tables needed to build the commodity dimension table have been synchronized to MySQL; as before, the corresponding target tables have to be created in MySQL in advance. Next, we use these base tables to create a view, dim_sku_info, in MySQL's dim database for later use.

-- ---------------------------------
--Dim layer, commodity dimension table,
--Creating views in MySQL
-- ---------------------------------
CREATE VIEW dim_sku_info AS
SELECT
  si.id AS id,
  si.sku_name AS sku_name,
  si.category3_id AS c3_id,
  si.weight AS weight,
  si.tm_id AS tm_id,
  si.price AS price,
  si.spu_id AS spu_id,
  c3.name AS c3_name,
  c2.id AS c2_id,
  c2.name AS c2_name,
  c1.id AS c1_id,
  c1.name AS c1_name
FROM
(
  sku_info si 
  JOIN base_category3 c3 ON si.category3_id = c3.id
  JOIN base_category2 c2 ON c3.category2_id =c2.id
  JOIN base_category1 c1 ON c2.category1_id = c1.id
);

At this point, the dimension-table data we need is ready. Next, we begin to process the data of the DWD layer.

DWD layer data processing

After the above steps, the dimension tables are prepared. Next, we process the raw ODS data into the DWD-layer detail wide table. The specific process is as follows:

-- -------------------------
--Order details
--   Kafka Source
-- ------------------------- 

DROP TABLE IF EXISTS `ods_order_detail`;
CREATE TABLE `ods_order_detail`(
  `id` BIGINT,
  `order_id` BIGINT,
  `sku_id` BIGINT,
  `sku_name` STRING,
  `img_url` STRING,
  `order_price` DECIMAL(10,2),
  `sku_num` INT,
  `create_time` TIMESTAMP(0)
) WITH(
 'connector' = 'kafka',
 'topic' = 'mydw.order_detail',
 'properties.bootstrap.servers' = 'kms-3:9092',
 'properties.group.id' = 'testGroup',
 'format' = 'canal-json' ,
 'scan.startup.mode' = 'earliest-offset' 
) ; 

-- -------------------------
--Order information
--   Kafka Source
-- -------------------------
DROP TABLE IF EXISTS `ods_order_info`;
CREATE TABLE `ods_order_info` (
  `id` BIGINT,
  `consignee` STRING,
  `consignee_tel` STRING,
  `total_amount` DECIMAL(10,2),
  `order_status` STRING,
  `user_id` BIGINT,
  `payment_way` STRING,
  `delivery_address` STRING,
  `order_comment` STRING,
  `out_trade_no` STRING,
  `trade_body` STRING,
  `create_time` TIMESTAMP(0) ,
  `operate_time` TIMESTAMP(0) ,
  `expire_time` TIMESTAMP(0) ,
  `tracking_no` STRING,
  `parent_order_id` BIGINT,
  `img_url` STRING,
  `province_id` INT
) WITH(
'connector' = 'kafka',
 'topic' = 'mydw.order_info',
 'properties.bootstrap.servers' = 'kms-3:9092',
 'properties.group.id' = 'testGroup',
 'format' = 'canal-json' ,
 'scan.startup.mode' = 'earliest-offset' 
) ; 

-- ---------------------------------
--DWD layer, paid order detail table dwd_paid_order_detail
-- ---------------------------------
DROP TABLE IF EXISTS dwd_paid_order_detail;
CREATE TABLE dwd_paid_order_detail
(
  detail_id BIGINT,
  order_id BIGINT,
  user_id BIGINT,
  province_id INT,
  sku_id BIGINT,
  sku_name STRING,
  sku_num INT,
  order_price DECIMAL(10,2),
  create_time TIMESTAMP(0),
  pay_time TIMESTAMP(0)
 ) WITH (
    'connector' = 'kafka',
    'topic' = 'dwd_paid_order_detail',
    'scan.startup.mode' = 'earliest-offset',
    'properties.bootstrap.servers' = 'kms-3:9092',
    'format' = 'changelog-json'
);
-- ---------------------------------
--DWD layer, paid order detail table
--Load data into dwd_paid_order_detail
-- ---------------------------------
INSERT INTO dwd_paid_order_detail
SELECT
  od.id,
  oi.id order_id,
  oi.user_id,
  oi.province_id,
  od.sku_id,
  od.sku_name,
  od.sku_num,
  od.order_price,
  oi.create_time,
  oi.operate_time
FROM
    (
    SELECT * 
    FROM ods_order_info
    WHERE order_status = '2'  -- paid
    ) oi JOIN
    (
    SELECT *
    FROM ods_order_detail
    ) od 
    ON oi.id = od.order_id;

[Figure: result of the dwd_paid_order_detail job]

Ads layer data

After the above steps, we have created the dwd_paid_order_detail detail wide table, which is stored in Kafka. Next, we join this detail wide table with the dimension tables to produce our ADS application-layer data.

  • ads_province_index

First, create the corresponding ADS target table, ads_province_index, in MySQL:

CREATE TABLE ads.ads_province_index(
  province_id INT(10),
  area_code VARCHAR(100),
  province_name VARCHAR(100),
  region_id INT(10),
  region_name VARCHAR(100),
  order_amount DECIMAL(10,2),
  order_count BIGINT(10),
  dt VARCHAR(100),
  PRIMARY KEY (province_id, dt) 
) ;

Load the data into the ADS-layer target table in MySQL:

--Flink SQL cli operation
-- ---------------------------------
--Using DDL to create ads layer table in MySQL
--Indicators: 1. Number of orders per province per day
--2. Order amount per province per day
-- ---------------------------------
CREATE TABLE ads_province_index(
  province_id INT,
  area_code STRING,
  province_name STRING,
  region_id INT,
  region_name STRING,
  order_amount DECIMAL(10,2),
  order_count BIGINT,
  dt STRING,
  PRIMARY KEY (province_id, dt) NOT ENFORCED  
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://kms-1:3306/ads',
    'table-name' = 'ads_province_index', 
    'driver' = 'com.mysql.jdbc.Driver',
    'username' = 'root',
    'password' = '123qwe'
);
-- ---------------------------------
-- dwd_paid_order_detail: paid order detail table
-- ---------------------------------
CREATE TABLE dwd_paid_order_detail
(
  detail_id BIGINT,
  order_id BIGINT,
  user_id BIGINT,
  province_id INT,
  sku_id BIGINT,
  sku_name STRING,
  sku_num INT,
  order_price DECIMAL(10,2),
  create_time STRING,
  pay_time STRING
 ) WITH (
    'connector' = 'kafka',
    'topic' = 'dwd_paid_order_detail',
    'scan.startup.mode' = 'earliest-offset',
    'properties.bootstrap.servers' = 'kms-3:9092',
    'format' = 'changelog-json'
);

-- ---------------------------------
-- tmp_province_index
--Temporary order summary table
-- ---------------------------------
CREATE TABLE tmp_province_index(
    province_id INT,
    order_count BIGINT,          -- number of orders
    order_amount DECIMAL(10,2),  -- order amount
    pay_date DATE
)WITH (
    'connector' = 'kafka',
    'topic' = 'tmp_province_index',
    'scan.startup.mode' = 'earliest-offset',
    'properties.bootstrap.servers' = 'kms-3:9092',
    'format' = 'changelog-json'
);
-- ---------------------------------
-- tmp_province_index
--Load data into the temporary order summary table
-- ---------------------------------
INSERT INTO tmp_province_index
SELECT
      province_id,
      count(distinct order_id) order_count,     -- number of orders
      sum(order_price * sku_num) order_amount,  -- order amount
      TO_DATE(pay_time,'yyyy-MM-dd') pay_date
FROM dwd_paid_order_detail
GROUP BY province_id,TO_DATE(pay_time,'yyyy-MM-dd')
;
-- ---------------------------------
-- tmp_province_index_source
--Use the temporary summary table as the data source
-- ---------------------------------
CREATE TABLE tmp_province_index_source(
    province_id INT,
    order_count BIGINT,          -- number of orders
    order_amount DECIMAL(10,2),  -- order amount
    pay_date DATE,
    proctime AS PROCTIME()       -- processing-time attribute, used in the temporal join below
 ) WITH (
    'connector' = 'kafka',
    'topic' = 'tmp_province_index',
    'scan.startup.mode' = 'earliest-offset',
    'properties.bootstrap.servers' = 'kms-3:9092',
    'format' = 'changelog-json'
);

-- ---------------------------------
--Dim layer, area dimension table,
--Create regional dimension table data source
-- ---------------------------------
DROP TABLE IF EXISTS `dim_province`;
CREATE TABLE dim_province (
  province_id INT,
  province_name STRING,
  area_code STRING,
  region_id INT,
  region_name STRING ,
  PRIMARY KEY (province_id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://kms-1:3306/dim',
    'table-name' = 'dim_province', 
    'driver' = 'com.mysql.jdbc.Driver',
    'username' = 'root',
    'password' = '123qwe',
    'scan.fetch-size' = '100'
);
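
One practical note on the JDBC dimension-table source above: by default every probe performs a point query against MySQL. In Flink 1.11 the JDBC connector also supports an optional lookup cache; the sketch below (illustrative cache values, otherwise identical to dim_province above) shows how it could be enabled. The same idea applies to the dim_sku_info source defined later.

-- Optional variant of the dimension source with a lookup cache (illustrative values).
CREATE TABLE dim_province_cached (
  province_id INT,
  province_name STRING,
  area_code STRING,
  region_id INT,
  region_name STRING,
  PRIMARY KEY (province_id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://kms-1:3306/dim',
    'table-name' = 'dim_province',
    'driver' = 'com.mysql.jdbc.Driver',
    'username' = 'root',
    'password' = '123qwe',
    'lookup.cache.max-rows' = '5000',   -- cache at most 5000 dimension rows
    'lookup.cache.ttl' = '10min'        -- cached rows expire after 10 minutes
);

The trade-off is freshness: cached dimension rows may lag behind MySQL by up to the configured TTL.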

-- ---------------------------------
--Load data into ads_province_index
--Dimension table join
-- ---------------------------------

INSERT INTO ads_province_index
SELECT
  pc.province_id,
  dp.area_code,
  dp.province_name,
  dp.region_id,
  dp.region_name,
  pc.order_amount,
  pc.order_count,
  cast(pc.pay_date as VARCHAR)
FROM
tmp_province_index_source pc
  JOIN dim_province FOR SYSTEM_TIME AS OF pc.proctime as dp 
  ON dp.province_id = pc.province_id;

After submitting the task, observe the Flink Web UI:

[Figure: Flink Web UI showing the running job]

View the data of the ads_province_index table in the ADS layer:
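
For example, a quick sanity check in the MySQL client could look like this (an illustrative query, not part of the pipeline):

-- Illustrative check against the ADS result table in MySQL.
SELECT dt, province_name, order_count, order_amount
FROM ads.ads_province_index
ORDER BY dt DESC, order_amount DESC
LIMIT 10;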

[Figure: sample rows of ads_province_index]

  • ads_sku_index

First, create the corresponding ADS target table, ads_sku_index, in MySQL:

CREATE TABLE ads_sku_index
(
  sku_id BIGINT(10),
  sku_name VARCHAR(100),
  weight DOUBLE,
  tm_id BIGINT(10),
  price DOUBLE,
  spu_id BIGINT(10),
  c3_id BIGINT(10),
  c3_name VARCHAR(100) ,
  c2_id BIGINT(10),
  c2_name VARCHAR(100),
  c1_id BIGINT(10),
  c1_name VARCHAR(100),
  order_amount DOUBLE,
  order_count BIGINT(10),
  sku_count BIGINT(10),
  dt varchar(100),
  PRIMARY KEY (sku_id,dt)
);

Load the data into the ADS-layer target table in MySQL:

-- ---------------------------------
--Using DDL to create ads layer table in MySQL
--Indicators: 1. Number of orders per commodity per day
--2. Order amount per commodity per day
--3. Number of items sold per commodity per day
-- ---------------------------------
CREATE TABLE ads_sku_index
(
  sku_id BIGINT,
  sku_name VARCHAR,
  weight DOUBLE,
  tm_id BIGINT,
  price DOUBLE,
  spu_id BIGINT,
  c3_id BIGINT,
  c3_name VARCHAR ,
  c2_id BIGINT,
  c2_name VARCHAR,
  c1_id BIGINT,
  c1_name VARCHAR,
  order_amount DOUBLE,
  order_count BIGINT,
  sku_count BIGINT,
  dt varchar,
  PRIMARY KEY (sku_id,dt) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://kms-1:3306/ads',
    'table-name' = 'ads_sku_index', 
    'driver' = 'com.mysql.jdbc.Driver',
    'username' = 'root',
    'password' = '123qwe'
);

-- ---------------------------------
-- dwd_paid_order_detail: paid order detail table
-- ---------------------------------
CREATE TABLE dwd_paid_order_detail
(
  detail_id BIGINT,
  order_id BIGINT,
  user_id BIGINT,
  province_id INT,
  sku_id BIGINT,
  sku_name STRING,
  sku_num INT,
  order_price DECIMAL(10,2),
  create_time STRING,
  pay_time STRING
 ) WITH (
    'connector' = 'kafka',
    'topic' = 'dwd_paid_order_detail',
    'scan.startup.mode' = 'earliest-offset',
    'properties.bootstrap.servers' = 'kms-3:9092',
    'format' = 'changelog-json'
);

-- ---------------------------------
-- tmp_sku_index
--Commodity index statistics
-- ---------------------------------
CREATE TABLE tmp_sku_index(
    sku_id BIGINT,
    order_count BIGINT,          -- number of orders
    order_amount DECIMAL(10,2),  -- order amount
    order_sku_num BIGINT,
    pay_date DATE
)WITH (
    'connector' = 'kafka',
    'topic' = 'tmp_sku_index',
    'scan.startup.mode' = 'earliest-offset',
    'properties.bootstrap.servers' = 'kms-3:9092',
    'format' = 'changelog-json'
);
-- ---------------------------------
-- tmp_sku_index
--Data loading
-- ---------------------------------
INSERT INTO tmp_sku_index
SELECT
      sku_id,
      count(distinct order_id) order_count,     -- number of orders
      sum(order_price * sku_num) order_amount,  -- order amount
      sum(sku_num) order_sku_num,
      TO_DATE(pay_time,'yyyy-MM-dd') pay_date
FROM dwd_paid_order_detail
GROUP BY sku_id,TO_DATE(pay_time,'yyyy-MM-dd')
;

-- ---------------------------------
-- tmp_sku_index_source
--Use the temporary summary table as the data source
-- ---------------------------------
CREATE TABLE tmp_sku_index_source(
    sku_id BIGINT,
    order_count BIGINT,          -- number of orders
    order_amount DECIMAL(10,2),  -- order amount
    order_sku_num BIGINT,
    pay_date DATE,
    proctime AS PROCTIME()       -- processing-time attribute, used in the temporal join below
 ) WITH (
    'connector' = 'kafka',
    'topic' = 'tmp_sku_index',
    'scan.startup.mode' = 'earliest-offset',
    'properties.bootstrap.servers' = 'kms-3:9092',
    'format' = 'changelog-json'
);
-- ---------------------------------
--Dim layer, commodity dimension table,
--Create commodity dimension table data source
-- ---------------------------------
DROP TABLE IF EXISTS `dim_sku_info`;
CREATE TABLE dim_sku_info (
  id BIGINT,
  sku_name STRING,
  c3_id BIGINT,
  weight DECIMAL(10,2),
  tm_id BIGINT,
  price DECIMAL(10,2),
  spu_id BIGINT,
  c3_name STRING,
  c2_id BIGINT,
  c2_name STRING,
  c1_id BIGINT,
  c1_name STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://kms-1:3306/dim',
    'table-name' = 'dim_sku_info', 
    'driver' = 'com.mysql.jdbc.Driver',
    'username' = 'root',
    'password' = '123qwe',
    'scan.fetch-size' = '100'
);
-- ---------------------------------
--Load data into ads_sku_index
--Dimension table join
-- ---------------------------------
INSERT INTO ads_sku_index
SELECT
  sku_id ,
  sku_name ,
  weight ,
  tm_id ,
  price ,
  spu_id ,
  c3_id ,
  c3_name,
  c2_id ,
  c2_name ,
  c1_id ,
  c1_name ,
  sc.order_amount,
  sc.order_count ,
  sc.order_sku_num ,
  cast(sc.pay_date as VARCHAR)
FROM
tmp_sku_index_source sc 
  JOIN dim_sku_info FOR SYSTEM_TIME AS OF sc.proctime as ds
  ON ds.id = sc.sku_id
  ;

After submitting the task, observe the Flink Web UI:

[Figure: Flink Web UI showing the running job]

View the data of the ads_sku_index table in the ADS layer:

[Figure: sample rows of ads_sku_index]

FineBI results display

[Figure: FineBI dashboard built on the ADS-layer data]

Other points for attention

Bugs in Flink 1.11.0

When Flink 1.11.0 is used, inserting a changelog data source into an upsert sink raises the following exception:

[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.TableException: Provided trait [BEFORE_AND_AFTER] can't satisfy required trait [ONLY_UPDATE_AFTER]. This is a bug in planner, please file an issue. 
Current node is TableSourceScan(table=[[default_catalog, default_database, t_pick_order]], fields=[order_no, status])

This bug has been fixed in Flink 1.11.1, so use Flink 1.11.1 or later.
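
For context, the statement pattern that triggers the exception looks roughly like the sketch below: a changelog-json Kafka source (the t_pick_order table from the error message) inserted into a table with a primary key, which Flink treats as an upsert sink. The sink definition here is hypothetical and only illustrates the pattern.

-- Sketch of the pattern that fails on Flink 1.11.0 and works from 1.11.1 on.
-- Source: a changelog stream read from Kafka.
CREATE TABLE t_pick_order (
  order_no STRING,
  status   STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 't_pick_order',
  'properties.bootstrap.servers' = 'kms-3:9092',
  'format' = 'changelog-json',
  'scan.startup.mode' = 'earliest-offset'
);

-- Sink: a JDBC table with a primary key, i.e. an upsert sink (hypothetical).
CREATE TABLE t_pick_order_mysql (
  order_no STRING,
  status   STRING,
  PRIMARY KEY (order_no) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://kms-1:3306/ads',
  'table-name' = 't_pick_order',
  'username' = 'root',
  'password' = '123qwe'
);

-- On 1.11.0 this INSERT fails with the planner exception shown above.
INSERT INTO t_pick_order_mysql
SELECT order_no, status FROM t_pick_order;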

Summary

This article has walked through a demo case of building a real-time data warehouse. From it you can understand the data processing flow of a real-time warehouse and, on that basis, gain a deeper understanding of Flink SQL's CDC support. The case is described in enough detail that you can run it directly and explore the construction of a real-time data warehouse in practice.

Follow the official WeChat account "Big Data Technology and Data Warehouse" and reply "information" to receive the big data resource package.