stayFlink table API & SQL Programming Guide (1)This paper introduces some basic concepts and general APIs of Flink table API SQL. In this paper, we will explain the basic concepts of flow processing of Flink table API SQL more deeply. Flink table API & SQL realizes the unification of batch and stream processing, which also means that both bounded batch processing input and unbounded stream processing input have the same semantics when using Flink table API & SQL for query operation. In addition, since SQL was originally designed for batch processing, the use of relational queries on unbounded streams is different from that on bounded streams. This article will focus on dynamic tables.
The original design of SQL and relational algebra is not for stream processing, so there are some differences between SQL and stream processing. Flink implements SQL operation on unbounded data stream.
Relational query on data stream
Traditional relational algebra (SQL) is different from stream processing in data input, execution and result output
|difference||Relational algebra / SQL||Stream processing|
|data input||Table, bounded set of Yuanzu||Unbounded data flow|
|implement||Batch processing, query and other operations on the entire input data||The query cannot be executed on all the data. You need to wait for the data stream to arrive|
|Result output||After the end of query processing, output fixed size results||The previous results need to be continuously updated and will never end|
Despite these differences, it does not mean that SQL and stream processing cannot be integrated. Some advanced relational databases provide the function of materialized view. A materialized view is defined by a query statement. Compared with ordinary views, materialized view caches the query results, so when accessing materialized view, there is no need to repeat SQL query operation. A common challenge of caching is how to prevent providing outdated results. When the query base table that defines materialized views changes, the results of materialized views become obsolete.Eager View Maintenance As long as the query base table of materialized view is updated, the materialized view will be updated.
If you consider the following factors, thenEager View Maintenance The connection between SQL query and convection will become obvious
- Database tables are executed on streams
DELETEThe result of a DML operation statement is usually called a changelog stream.
- A materialized view is defined by a query statement. To update the atomized view, the query continuously processes the change log flow of the atomized view.
- Atomization view is the result of streaming SQL query.
Dynamic table and continuous query
Dynamic table is the core concept of flow processing supported by Flink table API & SQL. Compared with batch static table, dynamic table will change with time. Dynamic table queries produce aContinuous Query, continuous query does not terminate and produces the results of a dynamic table.
The relationship between flow, dynamic table and continuous query is shown in the following figure
- 1. The stream is converted to a dynamic table
- 2. Continuous query is executed continuously on the dynamic table to generate a new dynamic table
- 3. Dynamic table is converted to stream
Scream tip:Dynamic table is a logical concept, which will not be atomized during query execution.
Define a table on a stream
To use SQL query on data stream, it is necessary to convert the stream into a table. The records in the stream will be parsed and inserted into the table (for a stream with only insert operation), as shown in the following figure:
- Group aggregation
A continuous query on a dynamic table produces a new dynamic table result. Compared with batch queries, persistent queries never end and update previous results based on the input data. In the following example, click event flow is shown and calculated by grouping aggregation, as shown in the following figure:
The figure above shows the data of a user’s click behavior. The calculation operation uses group aggregation. When the first data [Mary,. / home] comes in, the calculation operation will be carried out immediately, and the calculation result will be output: [Mary, 1]. When [Bob,. / cart] comes in, it will calculate immediately and output the calculation results: [Mary, 1], [Bob, 1]. When [Mary,. / prod? Id = 1] comes in, it will calculate immediately and output the results: [Mary, 2], [Bob, 1]. From this we can see: grouping aggregation will act on all data, and will update the previous output results.
- Window aggregation
The above demonstration is a case of group aggregation. Next, let’s look at a case of window aggregation. According to the tumble window of an hour, calculate the user’s click situation within the hour, as shown in the following figure:
As shown in the figure above: CTime refers to the time of the event. It can be seen that there are four lines of data in one hour of [12:00:00,12:59:59], three lines of data in one hour of [13:00:00,13:59:59], and four lines of data in one hour of [14:00:00,14:59:59].
It can be seen that: in the [12:00:00,12:59:59] time period, the calculation results are [Mary, 13:00:00,3], [Bob, 13:00:00,1], which will be appended to the result table. In the [13:00:00,13:59:59] time period, the calculation results are [Bob, 14:00:00,1], [Liz, 14:00:00,2], which are also appended to the result table, and the data in the previous window will not be updated. Therefore, the feature of window aggregation is to only calculate the data belonging to the window, and insert the result into the result table by appending.
- Similarities and differences between group aggregation and window aggregation
|compare||Group aggregation||Window aggregation|
|Output mode||Output in advance, and calculate each piece of data once||Output by window trigger time|
|Output||A window outputs the result once||Each key outputs n results|
|Output stream||Append stream||Update stream|
|State cleanup||Clean up outdated data in time||The state will grow infinitely|
|The output is not required to support the update operation||Support update operation (kudu, HBase, mysql, etc.)|
Update and add query
The above two examples demonstrate the updated query and the appended query respectively. The output result of the first group aggregation case will update the previous result, that is, the result table containsINSERTAndUPDATEOperation.
The second case of window aggregation is only to add the result table of calculation results, that is, the result table only containsINSERTOperation.
When a query generates an append only table or an updated table, the differences are as follows:
- When the query produces an update table (that is, it will update the previous output), a larger state needs to be maintained
- The method of converting an append only table to a stream is different from that of an updated table
Conversion from table to stream
Dynamic tables are continuously changed by insert, update, and delete operations. When a dynamic table is transformed into a stream or written to an external storage system, these changed values need to be encoded. Flink table API and SQL support three ways to encode these changed data
- Append-only stream
The dynamic table will only be modified by the insert operation, and the changed data (new data) will be inserted into the row of the dynamic table
- Retract stream
A retract stream contains two types of messages: add messages and retract messages. When a dynamic table is converted to a retract stream, the change data of insert operation is encoded as add messages, the change data caused by delete operation is encoded as retract message, and the change data caused by update operation is encoded as retract first Message to encode the new updated data as add messages, as shown in the following figure:
- Upsert stream
There are two types of messages in an upsert stream:upsert messagesAnddelete messages。 When a dynamic table is converted to an upsert stream, it needs a unique primary key (possibly a composite key). When a dynamic table with a unique primary key is converted to a stream, it encodes the change data caused by insert and update operations as upsert messages, and the change data caused by delete operations as delete message. Compared with the retract stream, the upert stream encodes the change data caused by upadte operation with a single message, that is, the upert message. For retract stream, we need to encode the old data as retract message first, and then encode the new data as add message. That is to say, we need to encode delete and insert messages, so it is more efficient to use upsert stream. The details are shown in the figure below:
Scream tip:When a dynamic table is converted to a datastream, only append stream and retract stream are supported. When exporting dynamic table to external system, it supports append, retract and upsert modes.
This paper mainly introduces the concept of dynamic table in Flink table API & SQL. This paper first introduces the basic concept of dynamic table, then introduces the way of defining table on stream, and points out the similarities and differences between grouping aggregation and window aggregation. Finally, it introduces three modes of table to stream conversion and output to external system.
Official account “big data technology and multi warehouse”, reply to “information” to receive big data package.