Dimension Table Join in Flink SQL

Time: 2021-01-08

A dimension table is a concept from data warehousing. The dimension attributes in a dimension table are the angles from which data is observed. When building an offline data warehouse, dimension tables are usually joined with fact tables to build a star schema. A real-time data warehouse also has the concepts of dimension table and fact table: the fact table is usually stored in Kafka, while the dimension table is usually stored in an external system (such as MySQL or HBase). For each record in the stream, an external dimension table can be looked up to enrich the data during real-time computation. Since a dimension table may change over time, a dimension table join must specify the point in time at which the record is associated with a snapshot of the dimension table. Note that at present the dimension table join of Flink SQL only supports joining against the snapshot of the dimension table at the current processing time (processing-time semantics); it does not support joining against the snapshot corresponding to the fact table's rowtime (event-time semantics). From this article you will learn:

  • How to create a table with Flink SQL
  • How to define Kafka data source table
  • How to define MySQL data source table
  • What a temporal table join is
  • A complete dimension table join case

Creating tables with Flink SQL

Note: all operations in this article are performed in the Flink SQL CLI.

Syntax for creating tables

CREATE TABLE [catalog_name.][db_name.]table_name
  (
    { <column_definition> | <computed_column_definition> }[ , ...n]
    [ <watermark_definition> ]
  )
  [COMMENT table_comment]
  [PARTITIONED BY (partition_column_name1, partition_column_name2, ...)]
  WITH (key1=val1, key2=val2, ...)
--Define table fields
<column_definition>:
  column_name column_type [COMMENT column_comment]
--Define computed columns
<computed_column_definition>:
  column_name AS computed_column_expression [COMMENT column_comment]
--Define the watermark
<watermark_definition>:
  WATERMARK FOR rowtime_column_name AS watermark_strategy_expression

Explanation

Computed column

A computed column is a virtual column generated with column_name AS computed_column_expression; it is not physically stored in the data source table. A computed column can be derived from fields of the original source table, operators, and built-in functions. For example, a computed column cost for the consumption amount can be calculated as price * quantity from the fields of the same table.

Computed columns are often used to define time attributes (see the companion article, Time attribute of Flink Table API & SQL Programming Guide (3)). The processing time attribute can be defined with the PROCTIME() function, using the syntax proc AS PROCTIME(). In addition, computed columns can be used to extract the event time column, since the original event time may not be of type TIMESTAMP(3) or may be buried inside a JSON string.
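For illustration, here is a minimal sketch of a source table that defines all three kinds of computed columns; the table name orders and its fields are assumptions for this sketch, not taken from the article:

CREATE TABLE orders (
    order_id BIGINT,
    price DECIMAL(10, 2),
    quantity INT,
    ts BIGINT,                 -- raw event time as epoch seconds
    cost AS price * quantity,  -- computed column: consumption amount
    proctime AS PROCTIME(),    -- computed column: processing time attribute
    eventTime AS TO_TIMESTAMP(FROM_UNIXTIME(ts, 'yyyy-MM-dd HH:mm:ss'))  -- computed column: extracted event time
) WITH (
    ...
);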

Tip

1. Computed columns are defined on the source table and are evaluated after the data source is read, as if they were expressions in a SELECT query that follows the source read;

2. A computed column cannot be written to by an INSERT statement. An INSERT statement may only cover the physical schema of the target table, not its computed columns (see the sketch below).
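As a sketch of tip 2, assuming the hypothetical orders table above (where cost is a computed column) is also used as a sink, the INSERT supplies only the physical columns and never the computed column; some_other_table is a placeholder source:

-- cost is a computed column of orders, so it must not appear in the INSERT
INSERT INTO orders
SELECT order_id, price, quantity, ts
FROM some_other_table;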

Watermark

A watermark defines the event time attribute of a table.

WATERMARK FOR rowtime_column_name AS watermark_strategy_expression

Here, rowtime_column_name refers to an event time column that already exists in the table. Note that the event time column must be of type TIMESTAMP(3), in the form yyyy-MM-dd HH:mm:ss; if the original data is not of this type, it must be converted by defining a computed column.

watermark_strategy_expression defines the strategy for generating watermarks. The return type of the expression must be TIMESTAMP(3).

Flink provides several common watermark generation strategies:

  • Strictly ascending watermark:

    WATERMARK FOR rowtime_column AS rowtime_column

That is, the timestamp itself is used directly as the watermark.

  • Ascending watermark. The syntax is:

    WATERMARK FOR rowtime_column AS rowtime_column - INTERVAL '0.001' SECOND
  • Bounded out-of-orderness watermark:

    WATERMARK FOR rowtime_column AS rowtime_column - INTERVAL 'string' timeUnit
    --For example, allow records to be up to five seconds out of order
    WATERMARK FOR rowtime_column AS rowtime_column - INTERVAL '5' SECOND

Partitioning

A partitioned table can be created on specific columns; each partition corresponds to a file path, as in the sketch below.
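A minimal sketch of a partitioned table; the table and column names are assumptions, and the WITH options are omitted:

CREATE TABLE user_log (
    user_id BIGINT,
    action STRING,
    dt STRING                  -- partition column, e.g. '2021-01-08'
) PARTITIONED BY (dt) WITH (
    ...
);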

WITH options

When creating a table source or table sink, the table's connector properties must be specified. Properties are configured as key/value pairs; refer to the documentation of the corresponding connector for details.

Tip

Note: the table name specified when creating a table can take three forms:

(1)catalog_name.db_name.table_name

(2)db_name.table_name

(3)table_name

For the first form, the table is registered in the metadata of the database named db_name under the catalog named catalog_name;

For the second form, the table is registered in the metadata of the database named db_name under the catalog of the current execution environment;

For the third form, the table is registered in the metadata of the catalog and database of the current execution environment.
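For example, with placeholder names for the catalog, database and table, the three forms look like this (column definitions omitted):

CREATE TABLE my_catalog.my_db.my_table (...)   -- form (1): fully qualified
CREATE TABLE my_db.my_table (...)              -- form (2): catalog of the current environment
CREATE TABLE my_table (...)                    -- form (3): catalog and database of the current environment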

Define Kafka data table

Kafka is a common storage system when building a real-time data warehouse. The syntax for creating a Kafka data source table with Flink SQL is as follows:

CREATE TABLE MyKafkaTable (
  ...
) WITH (
  'connector.type' = 'kafka',                        -- connector type
  'connector.version' = '0.11',                      -- required: Kafka version, one of 0.8 / 0.9 / 0.10 / 0.11 / universal
  'connector.topic' = 'topic_name',                  -- required: topic name
  'connector.properties.zookeeper.connect' = 'localhost:2181',   -- required: ZooKeeper connection address
  'connector.properties.bootstrap.servers' = 'localhost:9092',   -- required: Kafka broker address
  'connector.properties.group.id' = 'testGroup',     -- optional: consumer group
  -- optional: startup mode, earliest-offset / latest-offset / group-offsets / specific-offsets
  'connector.startup-mode' = 'earliest-offset',
  -- optional: when the startup mode is specific-offsets, specify the offset for each partition
  'connector.specific-offsets' = 'partition:0,offset:42;partition:1,offset:300',
  'connector.sink-partitioner' = '...',              -- optional: sink partitioner, fixed / round-robin / custom
  -- optional: the partitioner class name when using a custom partitioner
  'connector.sink-partitioner-class' = 'org.mycompany.MyPartitioner',
  'format.type' = '...',                             -- required: format, supports CSV / JSON / Avro
  -- update mode, supports append / retract / upsert
  'update-mode' = 'append'
)

Tip

  • Specific offsets: by default, consumption starts from the offsets committed by the current consumer group
  • Sink partitioner: by default data is written to as many partitions as possible (each sink parallelism instance writes only to a single partition); a custom partitioning strategy can also be used. The round-robin partitioner avoids uneven partitions, but causes a large number of network connections between the Flink instances and the Kafka brokers
  • Consistency guarantee: the default sink semantics is at-least-once
  • Kafka 0.10+ timestamps: starting with Kafka 0.10, Kafka messages carry a timestamp as message metadata, indicating when the record was written to the Kafka topic; this timestamp can be used as the event-time rowtime attribute
  • Kafka 0.11+ versions: since Flink 1.7, the universal Kafka connector is supported and is compatible with Kafka 0.11 and later versions

Define MySQL data table

CREATE TABLE MySQLTable (
  ...
) WITH (
  'connector.type' = 'jdbc',                          -- required: use the JDBC connector
  'connector.url' = 'jdbc:mysql://localhost:3306/flink-test',    -- required: JDBC URL
  'connector.table' = 'jdbc_table_name',              -- required: table name
   --Optional: JDBC driver; if not configured, it is derived automatically from the URL
  'connector.driver' = 'com.mysql.jdbc.Driver',
  'connector.username' = 'name',                      -- optional: database user name
  'connector.password' = 'password',                  -- optional: database password
    --Optional, enter the field name to partition
  'connector.read.partition.column' = 'column_name',
    --Optional, number of partitions
  'connector.read.partition.num' = '50', 
    --Optional, the minimum value of the first partition
  'connector.read.partition.lower-bound' = '500',
    --Optional, the maximum value of the last partition
  'connector.read.partition.upper-bound' = '1000', 
    --Optional. The number of rows to extract data at one time. The default value is 0, which means this configuration is ignored
  'connector.read.fetch-size' = '100', 
   --Optional. The maximum number of rows of lookup cache data. If it exceeds the configuration, the old data will be cleared
  'connector.lookup.cache.max-rows' = '5000', 
   --Optional. The maximum time-to-live of lookup cache entries; entries older than this expire. cache.max-rows and cache.ttl must be configured together
  'connector.lookup.cache.ttl' = '10s', 
   --Optional, the maximum number of retries to query data
  'connector.lookup.max-retries' = '3', 
   --Optional. The maximum number of rows to buffer before flushing when writing, 5000 by default; exceeding it triggers a flush
  'connector.write.flush.max-rows' = '5000', 
   --Optional. The interval time of flush data. After this time, the data will be flushed through an asynchronous thread. The default is 0s 
  'connector.write.flush.interval' = '2s', 
  --Optional, the maximum number of retries for failed data writing
  'connector.write.max-retries' = '3' 
)

Temporal Table Join

Syntax

SELECT column-names
FROM table1  [AS <alias1>]
[LEFT] JOIN table2 FOR SYSTEM_TIME AS OF table1.proctime [AS <alias2>]
ON table1.column-name1 = table2.key-name1

Note: currently only INNER JOIN and LEFT JOIN are supported, and the FOR SYSTEM_TIME AS OF clause must be used, where table1.proctime is the processing time attribute (a computed column) of table1. FOR SYSTEM_TIME AS OF table1.proctime means that when a record of the left table is joined with the dimension table on the right, only the snapshot of the dimension table at the current processing time is matched.

Examples

SELECT
  o.amount, o.currency, r.rate, o.amount * r.rate
FROM
  Orders AS o
  JOIN LatestRates FOR SYSTEM_TIME AS OF o.proctime AS r
  ON r.currency = o.currency

Notes

  • Only the Blink planner is supported
  • Only SQL is supported; the Table API is not supported yet
  • Temporal table joins based on event time are not supported yet
  • The dimension table may change continuously; changes to the dimension table (insert, update or delete) that happen after a join has occurred are not reflected in the already joined records
  • A dimension table cannot be joined with another dimension table
  • The dimension table must define a primary key, and the ON condition of the join must include equality conditions on all primary key columns

Dimension table join case

Background

Kafka contains user behavior data including pv, buy, cart and fav actions; MySQL contains a dimension table of provinces and regions. The goal is to join the two tables and count the number of purchase actions in each region.

Steps

The dimension table is stored in MySQL. Create the dimension table data source as follows:

CREATE TABLE dim_province (
    province_id BIGINT,    -- province ID
    province_name VARCHAR, -- province name
    region_name VARCHAR    -- region name
) WITH (
    'connector.type' = 'jdbc',
    'connector.url' = 'jdbc:mysql://192.168.10.203:3306/mydw',
    'connector.table' = 'dim_province',
    'connector.driver' = 'com.mysql.jdbc.Driver',
    'connector.username' = 'root',
    'connector.password' = '123qwe',
    'connector.lookup.cache.max-rows' = '5000',
    'connector.lookup.cache.ttl' = '10min'
);
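For reference, the dimension table on the MySQL side might be defined roughly as follows; this is a sketch under assumptions, since the original article does not show the MySQL-side DDL:

-- assumed MySQL-side definition of dim_province (not shown in the original article)
CREATE TABLE dim_province (
    province_id   BIGINT PRIMARY KEY,
    province_name VARCHAR(64),
    region_name   VARCHAR(64)
);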

The fact table is stored in Kafka, and the data is user click behavior in JSON format. The specific data examples are as follows:

{"user_id":63401,"item_id":6244,"cat_id":143,"action":"pv","province":3,"ts":1573445919}
{"user_id":9164,"item_id":2817,"cat_id":611,"action":"fav","province":28,"ts":1573420486}

Create the Kafka data source table as follows:

CREATE TABLE user_behavior (
    user_id BIGINT,              -- user ID
    item_id BIGINT,              -- item ID
    cat_id BIGINT,               -- category ID
    action STRING,               -- user behavior
    province INT,                -- the user's province
    ts BIGINT,                   -- timestamp of the user behavior, in seconds
    proctime AS PROCTIME(),      -- processing time attribute generated via a computed column
    eventTime AS TO_TIMESTAMP(FROM_UNIXTIME(ts, 'yyyy-MM-dd HH:mm:ss')),  -- event time
    WATERMARK FOR eventTime AS eventTime - INTERVAL '5' SECOND
) WITH (
    'connector.type' = 'kafka',            -- use the Kafka connector
    'connector.version' = 'universal',     -- Kafka version; universal supports 0.11 and above
    'connector.topic' = 'user_behavior',   -- Kafka topic
    'connector.startup-mode' = 'earliest-offset',   -- read from the earliest offset
    'connector.properties.group.id' = 'group1',     -- consumer group
    'connector.properties.zookeeper.connect' = 'kms-2:2181,kms-3:2181,kms-4:2181',   -- ZooKeeper addresses
    'connector.properties.bootstrap.servers' = 'kms-2:9092,kms-3:9092,kms-4:9092',   -- Kafka broker addresses
    'format.type' = 'json'                 -- the data source format is JSON
);

Create a MySQL result table to store the sales volume per region:

CREATE TABLE region_sales_sink (
    region_name STRING,  -- region name
    buy_cnt BIGINT       -- sales volume
) WITH (
  
    'connector.type' = 'jdbc',
    'connector.url' = 'jdbc:mysql://192.168.10.203:3306/mydw',
    'connector.table' = 'top_region',    -- the MySQL table to insert data into
    'connector.driver' = 'com.mysql.jdbc.Driver',
    'connector.username' = 'root',
    'connector.password' = '123qwe',
    'connector.write.flush.interval' = '1s'
);

Join the user behavior data with the province dimension table:

CREATE VIEW user_behavior_detail AS
SELECT
  u.user_id, 
  u.item_id,
  u.cat_id,
  u.action,  
  p.province_name,
  p.region_name
FROM user_behavior AS u LEFT JOIN dim_province FOR SYSTEM_TIME AS OF u.proctime AS p
ON u.province = p.province_id;

Calculate the sales volume of each region and write the result to MySQL:

INSERT INTO region_sales_sink
SELECT 
  region_name,
  COUNT(*) buy_cnt
FROM user_behavior_detail
WHERE action = 'buy'
GROUP BY region_name;

View the results:

Flink SQL> select * from region_sales_sink;   -- view in the Flink SQL CLI


mysql> select * from top_region;   -- view the data in MySQL


Summary

This article introduced the dimension table join of Flink SQL, which is implemented as a temporal table join. It first covered the basic syntax for creating tables in Flink SQL and explained its details, then showed how to create Kafka and MySQL data source tables, then introduced the basic concept and syntax of the temporal table join, and finally walked through a complete dimension table join case.
