How to use Flink 1.11 SQL

Time:2021-2-1

On July 6, Apache Flink 1.11 was officially released. From functional planning in early March to the official release in early July, the nearly four-month 1.11 cycle focused on improving Flink's usability and the overall production experience.

As one of Flink's core modules, SQL plays a key role in advancing Flink's unified stream and batch processing. In 1.11, Flink SQL was significantly enhanced and refined: more than ten major features were developed, expanding the applicable scenarios while simplifying day-to-day use. Notable changes include:

  • The default planner has been switched to the Blink planner.
  • Support for CDC (change data capture) is introduced; users can connect Debezium and Canal data sources with just a few lines of SQL.
  • The offline data warehouse becomes near real time: users can easily write streaming data from Kafka into Hive with SQL.

The evolution of Flink SQL

With the development of streaming computing, the challenge is no longer limited to data volume and computation; business logic is becoming more and more complex. Developers may be experienced big data practitioners, Java beginners, or data analysts who do not write code at all. Improving developer efficiency and lowering the barrier to stream processing is therefore essential for promoting real-time computing. SQL is the most widely used language in data processing and lets users express business logic concisely. As a unified stream and batch engine, Flink is committed to providing a single SQL layer that supports all application scenarios, and Flink SQL fully complies with the ANSI SQL standard. Logic that previously required hundreds of lines of application code can often be expressed in just a few lines of SQL. The development of Flink SQL has gone through the following stages:

  • Flink 1.1.0: the SQL module and the Table API were introduced for the first time; functionality at this stage was very limited.
  • Flink 1.3.0: introduced retractions on streaming SQL, significantly improving usability and enabling complex unbounded aggregations and joins.
  • Flink 1.5.0: the SQL Client was introduced, marking the point where Flink SQL could be used as pure SQL text.
  • Flink 1.9.0: abstracted the Table planner interface and introduced the separate Blink table module, Alibaba's internal version of the SQL layer, which brought both significant architectural changes and more powerful, complete functionality.
  • Flink 1.10.0: the first version in which the Blink merge was essentially complete; it fixed a large number of remaining issues, brought watermark syntax to DDL, and gave batch SQL full TPC-DS coverage with strong performance.

After several iterations, the SQL module has become increasingly central to Flink, and its user base keeps growing; the Python API and machine learning interfaces built on top of it are also developing rapidly. There is no doubt that SQL, as one of the most commonly used APIs and a key point of ecosystem integration, will only grow in importance.

SQL 1.11 important changes

Flink SQL extends its support to new scenarios on top of the existing foundation:

  • Flink SQL introduces support for CDC (change data capture), making it easy for Flink to interpret and consume database changelogs produced by tools such as Debezium.
  • Flink SQL extends the filesystem connector with more formats and real-time capabilities, supporting scenarios such as writing streaming data from Kafka into Hive.

In addition, Flink SQL improves usability in many ways, systematically addressing long-standing bugs and refining the user-facing APIs.

CDC support

CDC is a common pattern for databases. A typical setup exports CDC data from the database to Kafka in a specific format through tools such as Debezium or Canal. In the past, applications had to implement special logic to parse CDC data and convert it into insert-only records, and all downstream processing had to account for this peculiarity. This workaround adds unnecessary complexity to the business logic. If the Flink SQL engine natively supports CDC input and maps CDC data onto Flink SQL's changelog-stream concept, user pipelines become much simpler.


The essence of stream computing is to continuously update and revise results. Consider a simple aggregation query: in streaming execution, every emitted aggregate value is only a partial result, so the query naturally produces a changelog stream. In the past it was almost impossible to output such aggregated data to Kafka, because Kafka can only accept insert-only data; the main blocker was that the old source & sink interfaces could not handle CDC data. Flink SQL 1.11 refactored these interfaces extensively: the new source & sink interfaces support reading and writing CDC data and support the Debezium and Canal formats (FLIP-105). With this change, a dynamic table source is no longer limited to append-only operations; it can also ingest external changelogs, interpret them into the corresponding change operations (insert, update, delete), and emit those operations, together with their change type, downstream.


Conceptually, the data synchronized from CDC into Kafka is an append-only stream, but the format carries changelog markers. There are two ways to handle them:

  • One option is to treat the changelog flag as an ordinary field, which is still a common practice today.
  • Since Flink 1.11, the flag can instead be declared as part of a changelog format. Flink can then natively interpret the changelog, recognize this special stream, convert it into Flink's own changelog stream, and process it according to SQL semantics. Likewise, Flink SQL can emit a changelog stream (as of 1.11), which means any kind of SQL result can be written to Kafka as long as a changelog-capable format is available.

To consume CDC data, users only need to specify 'format' = 'debezium-json' or 'format' = 'canal-json' when creating the table with SQL DDL:

CREATE TABLE my_table (
    ...
) WITH (
    'connector'='...', -- e.g. 'kafka'
    'format'='debezium-json'
);

The interfaces are ready in Flink 1.11, but on the implementation side:

  • Only reading Debezium JSON and Canal JSON from Kafka is supported so far.
  • Contributions that extend this with new formats and connectors are welcome.
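For example, a more complete table definition could look roughly like the sketch below; the table name, topic, and fields are illustrative placeholders rather than anything shipped with the release.

-- Hypothetical Kafka topic carrying Debezium change events for a 'products' table
CREATE TABLE products (
    id BIGINT,
    name STRING,
    price DECIMAL(10, 2)
) WITH (
    'connector' = 'kafka',
    'topic' = 'products-binlog',
    'properties.bootstrap.servers' = 'localhost:9092',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'debezium-json'
);

-- Downstream SQL sees a changelog stream, so this aggregate stays correct
-- under inserts, updates, and deletes in the source database.
SELECT name, COUNT(*) AS cnt FROM products GROUP BY name;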

Source & sink refactoring

An important goal of the source & sink refactoring was to support the changelog semantics described in the previous section, but it also fixes many long-standing problems. The new sources and sinks use a standard set of properties (see the official documentation for details):

CREATE TABLE kafka_table (
    ...
) WITH (
    'connector' = 'kafka-0.10',
    'topic' = 'test-topic',
    'scan.startup.mode' = 'earliest-offset',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format' = 'json',
    'json.fail-on-missing-field' = 'false'
);

For backward compatibility, Flink 1.11 still retains the old source & sink implementations; using the 'connector.type' key falls back to the old stack.
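As a rough sketch, a definition that falls back to the old stack might look like the following; the property keys follow the pre-1.11 descriptor style and should be double-checked against the documentation of the connector version in use.

-- Legacy-style definition: the presence of 'connector.type' selects the old source/sink stack.
CREATE TABLE kafka_table_legacy (
    id BIGINT,
    name STRING
) WITH (
    'connector.type' = 'kafka',
    'connector.version' = 'universal',
    'connector.topic' = 'test-topic',
    'connector.properties.bootstrap.servers' = 'localhost:9092',
    'connector.startup-mode' = 'earliest-offset',
    'format.type' = 'json'
);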

■ Factory discovery mechanism

Before Flink 1.11, users often encountered a NoMatchingTableFactoryException.


It means that a DDL was defined but, given its properties, no matching TableFactory implementation could be found. The possible reasons are:

  • The implementation class is not on the classpath (Flink SQL discovers factories through the Java SPI mechanism);
  • A parameter is wrong.

However, the reported exception was confusing: from its message it was hard to tell what was wrong, let alone which key was at fault.

public interface Factory {
    // Unique identifier used to look up this factory, e.g. the value of 'connector'.
    String factoryIdentifier();
    ……
}

So in Flink 1.11 the community reworked the TableFactory interface and introduced a new Factory interface with a factoryIdentifier() method. All factory lookups now go through this identifier. If a factory cannot be found, it is clearly because no factory class with that identifier is on the classpath; if it is found, validation can proceed deterministically inside that factory implementation.

Type and data structure

The previous source & sink interfaces allowed user-defined data structures: the connector told the framework how to convert its own data structure into the internal structures recognized by Flink SQL. For example:

public interface TableSource<T> {
    TypeInformation<T> getReturnType();
    ...
}

Users could choose the generic type T and tell the framework how to convert it via getReturnType().

But what happens when getReturnType() is inconsistent with the type declared in the DDL? This is especially tricky with two type systems in play, the runtime's TypeInformation and the SQL layer's DataType; differences in precision and similar details could lead to type-mismatch exceptions.

Flink 1.11 solves this systematically: connector developers can no longer use custom data structures and must use Flink SQL's internal data structure, RowData. This guarantees that the produced type matches the DDL, and there is no longer any need to return a type for the framework to reconcile.

The RowData structure was designed for SQL:

  • It is an abstract interface, allowing high-performance implementations for different scenarios.
  • It carries a RowKind, which matches CDC-style change data in stream processing.
  • It follows SQL specifications, for example by carrying precision information.
  • It defines a concrete data structure for each SQL type.


■ Upsert and primary key

A typical stream computing scenario is writing aggregated data to an upsert sink such as JDBC or HBase.


UpsertStreamTableSink requires the upstream query to expose complete primary key information, otherwise it throws an exception. This comes from Flink's UpsertStreamTableSink mechanism: as the name suggests, it is an updating sink that updates rows by key, so it must know the key.

How is the primary key determined? One approach is to let the optimizer infer it from the query.

For simple queries this works well and feels semantically natural. But for complex queries, such as an aggregation followed by a join and another grouping, it simply produces an error. We cannot expect the optimizer to be that clever: in many cases it cannot infer a primary key, and sometimes the business SQL itself does not imply one, which leads to this exception.


How does 1.11 solve this? Flink 1.11 abandons that mechanism entirely: instead of inferring the primary key from the query, it relies solely on the CREATE TABLE syntax. For example, when creating a jdbc_table you must explicitly declare the primary key in the definition. Declaring the primary key tells the framework that the JDBC sink should update rows by that key, independently of the query. The behavior is thus clearly defined, and updates happen exactly as declared in the table definition.

CREATE TABLE jdbc_table (
    id BIGINT,
    ...
    PRIMARY KEY (id) NOT ENFORCED
)
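A fuller sketch, using the new JDBC connector in 1.11, might look like the following; the URL, the credentials, and the clicks source table are placeholders.

-- JDBC upsert sink: rows are inserted or updated according to the declared primary key.
CREATE TABLE jdbc_table (
    id BIGINT,
    cnt BIGINT,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://localhost:3306/mydb',  -- placeholder URL
    'table-name' = 'jdbc_table',
    'username' = 'user',                         -- placeholder credentials
    'password' = 'password'
);

-- Because the sink declares a primary key, this aggregation is written in upsert mode.
INSERT INTO jdbc_table
SELECT user_id, COUNT(*) FROM clicks GROUP BY user_id;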

Hive stream-batch integration

First, consider a traditional Hive data warehouse. Typically, ETL jobs are run periodically by a scheduling tool, for example once a day or once an hour. Each scheduling step adds latency: the schedule produces table1, then table2, then table3, and the computation delays stack up.


The problem is that this is slow and the latency is high; ad hoc analysis is also delayed, because ingestion and the scheduled ETL already introduce large delays. Even if an ad hoc query returns quickly, it only sees stale historical data.

So it has become popular to build a real-time data warehouse: data flows from Kafka to Kafka and is finally written to a BI database, which serves real-time queries. The Kafka-based ETL is a continuously running job whose latency can even reach the millisecond level. Such a real-time warehouse relies on queues: all storage is either a queue or a real-time database, which gives good real-time behavior and low latency. However:

  • First, being queue-based generally means row-oriented storage on a queue, so storage efficiency is not high.
  • Second, it is based on precomputation: what lands in the BI database is already aggregated, and there is no history. Kafka also typically retains data for only around 15 days, so ad hoc analysis is impossible; every analysis must be predefined and its real-time job started and written to the database before the results can be used. By comparison, the advantage of a Hive warehouse is precisely that it supports ad hoc analysis and can answer arbitrary questions at any time.


Can we combine the advantages of the offline and real-time warehouses by building a Lambda architecture?

The core problem is cost: maintenance, compute, and storage costs are all very high, and the data on both sides must be kept consistent. After the offline side writes its Hive tables and SQL, the real-time side has to write the corresponding SQL again, causing a great deal of duplicated development. Teams may even be split into an offline team and a real-time team, and the communication, migration, and data sharing between them add significant human cost. As real-time analytics keeps growing and workloads keep migrating, the cost of this duplication keeps rising. It may be acceptable for a handful of important jobs, but with a large number of jobs the maintenance burden becomes enormous.

How can we keep the benefits of ad hoc analysis while gaining real-time capabilities? One idea is to make Hive's offline warehouse real time, or at least near real time. To this end, Flink 1.11 explores Hive stream-batch integration: data can be exported in a streaming fashion and written to a BI database in real time, while the same system can still serve ad hoc analysis through an analytical compute engine. The most important piece of this architecture is streaming ingestion with Flink.


■ Streaming Sink

Earlier versions of Flink already provided a powerful StreamingFileSink in the DataStream API for writing streaming data to the filesystem. It is a near-real-time, exactly-once sink that works for both small and large data volumes.


It is based on a two-phase commit:

  • Phase 1: during the per-task snapshot, close the files that need to be committed, or record the offset of the file currently being written.
  • Phase 2: on notifyCheckpointComplete, rename the files that need to be committed. Because rename is atomic and idempotent, at-least-once execution of the rename is enough to guarantee exactly-once data.

Such a filesystem writer looks close to perfect. But in a Hive data warehouse, data visibility depends on the Hive Metastore, so who notifies the Metastore in this process?


The SQL layer extends the streaming file sink with a partition committer.

In effect there is now not only a file commit but also a partition commit. The FileWriter corresponds to the previous StreamingFileSink and provides exactly-once file writing; it is followed by a PartitionCommitter node. The supported commit policies are:

  • Built-in support for adding the partition to the Hive metastore;
  • Support for writing a success file to the filesystem;
  • Custom commit policies, for example analyzing a partition or merging its small files.

With the committer attached to the writer, a commit trigger decides when to commit:

  • By default, a commit happens as soon as a file appears; since all commits are reentrant, this is safe.
  • Alternatively, the decision can be based on partition time plus watermarks: for example, once the watermark passes 11 o'clock, the 10 o'clock partition can be committed. The watermark reflects how far the job has correctly processed its input. A configuration sketch follows below.
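As a sketch of how these knobs fit together (option names should be verified against the 1.11 documentation; the table and the timestamp pattern are illustrative), a Hive sink table created under the Hive dialect could carry the commit settings in its table properties:

-- Executed with the Hive dialect enabled; partitions are committed once the
-- watermark passes the partition time plus the delay, then registered in the
-- metastore and marked with a success file.
CREATE TABLE hive_sink (
    user_id STRING,
    order_amount DOUBLE
) PARTITIONED BY (dt STRING, hr STRING) STORED AS parquet TBLPROPERTIES (
    'partition.time-extractor.timestamp-pattern' = '$dt $hr:00:00',
    'sink.partition-commit.trigger' = 'partition-time',
    'sink.partition-commit.delay' = '1 h',
    'sink.partition-commit.policy.kind' = 'metastore,success-file'
);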

■ Streaming Source

Hive data warehouses contain a large number of ETL jobs, usually run periodically by scheduling tools. This has two main problems:

  • Latency is high; the scheduling granularity is usually hourly at best.
  • The pipeline is complex, involves many components, and is error-prone.

For these offline ETL jobs, Flink 1.11 introduces real-time Hive streaming reads that support:

  • Partitioned tables: monitor the creation of partitions and read new partitions incrementally.
  • Non-partitioned tables: monitor the appearance of new files in the directory and read them incrementally.

Even a partition strategy at the 10-minute level is feasible. Together, Flink's Hive streaming source and Hive streaming sink can raise the freshness of a Hive warehouse to near-real-time, minute-level latency, while still allowing full ad hoc queries over the tables for flexibility.

SELECT * FROM hive_table /*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.consume-start-offset'='2020-05-20') */;

In addition to scan reads, Flink 1.11 also supports a temporal join mode, commonly known as a streaming dimension-table join.

SELECT
  o.amount, o.currency, r.rate, o.amount * r.rate
FROM
  Orders AS o
  JOIN LatestRates FOR SYSTEM_TIME AS OF o.proctime AS r
  ON r.currency = o.currency

Currently only a cache-all strategy is supported and it is not partition-aware, so it is best suited to small dimension tables.

■ Hive Dialect

Flink SQL follows the ANSI SQL standard, whereas Hive has its own HQL syntax; the two differ slightly in both syntax and semantics.

How can Hive users migrate to the Flink ecosystem without a steep learning curve? Flink SQL 1.11 provides a Hive dialect that lets users run HQL within the Flink ecosystem. For now it only covers DDL; subsequent versions will gradually cover queries as well.
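For instance, in the SQL Client the dialect can be switched per session, assuming the table.sql-dialect configuration key introduced for this feature; the table shown is a placeholder:

-- Switch the current session to the Hive dialect, run Hive-style DDL,
-- then switch back to the default (ANSI) dialect.
SET table.sql-dialect=hive;
CREATE TABLE hive_users (name STRING, age INT) STORED AS parquet;
SET table.sql-dialect=default;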

■ Filesystem Connector

Hive integration is heavyweight: rich in features, but complex to set up. What if you just want lightweight filesystem reads and writes?

For a long time, the Flink Table API supported only a CSV filesystem table, without partitioning, and in some respects its behavior did not match the intuitions of big data computing.

Flink 1.11 rewrites the implementation of the entire filesystem connector:

  • Full partition support: the filesystem connector now supports all partition semantics in SQL, including partition DDL, partition pruning, static/dynamic partition insert, and overwrite insert.
  • Support for multiple formats:

■ CSV
■ JSON
■ Apache Avro
■ Apache Parquet
■ Apache ORC

  • Support for batch reads and writes.
  • Support for the streaming sink, partition commit, and writing success files.

With a few simple SQL statements, and without setting up a Hive integration environment, you can:

  • Start a streaming job that writes to the filesystem and then query that data from the Hive side; compared with writing a DataStream job, a few lines of SQL are enough to land offline data.
  • Query data in the Hive warehouse through the filesystem connector; the functionality is less complete than the full Hive integration, but the table definition is much simpler. A sketch follows below.
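A minimal sketch of such a partitioned filesystem table; the path, format, fields, and the orders_source table are placeholders:

-- Partitioned filesystem table: supports batch read/write, the streaming sink,
-- partition pruning, and static/dynamic partition insert.
CREATE TABLE fs_table (
    user_id BIGINT,
    order_amount DOUBLE,
    dt STRING
) PARTITIONED BY (dt) WITH (
    'connector' = 'filesystem',
    'path' = 'file:///tmp/fs_table',  -- placeholder path
    'format' = 'csv'
);

-- Static partition insert; 'orders_source' is a hypothetical source table.
INSERT INTO fs_table PARTITION (dt = '2020-07-06')
SELECT user_id, order_amount FROM orders_source;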

Table ease of use

■ DDL hints and LIKE


Since Flink 1.10, the Hive Metastore has gradually become a store for table metadata in Flink streaming SQL. For example, Kafka tables can be persisted through the Hive catalog, so they are immediately available when a job starts.

With DDL, once the SQL is submitted to the cluster it can write to Kafka, MySQL, or a DFS. Does using the Hive catalog mean that the DDL needs to be written only once, and all subsequent streaming jobs can reuse the Kafka table directly?

Not quite, because some gaps remain. A typical Kafka table has execution-related parameters: since Kafka usually retains data for only about 15 days, each job must specify the offset from which to consume, and that offset keeps changing. Every submission therefore needs slightly different Kafka table parameters, yet those parameters live in the catalog. Previously the only option was to create yet another table, rewriting all the fields and parameters, which was very cumbersome. Flink 1.11 adds two remedies:

  • Flink 1.11 introduces table hints, which in 1.11 focus on a single feature: dynamic table options. They are easy to use: in a SELECT, write a table hint after the table name to supply dynamic options, so different scenarios can use different parameters.
  • Flink 1.11 introduces the LIKE clause, a standard SQL construct that essentially clones a table to reuse its schema. LIKE supports several option strategies, letting you either inherit or completely override the original definitions.

Table Hints:

SELECT id, name FROM kafka_table1 /*+ OPTIONS('scan.startup.mode'='earliest-offset') */;

LIKE:

CREATE TABLE kafka_table2 WITH ( 'scan.startup.mode'='earliest-offset') LIKE kafka_table1;

On top of the Hive catalog integration, these two features are excellent complements that avoid rewriting full schemas every time.

■ Built-in connectors

Flink SQL 1.11 introduces three new built-in connectors to make debugging, testing, load testing, and online observation easier:

  • DataGen source: a source that generates data out of thin air. The generation strategy is configurable, for example sequences or random values; it is handy for offline functional testing and can also be used for performance testing.
  • Print sink: prints records directly at the task node at runtime. For example, if an online job's sink seems to be missing data and it is unclear whether the upstream data or the sink logic is at fault, an extra print sink can be attached to inspect the upstream data.
  • BlackHole sink: silently discards the data, which is convenient for functional debugging.

The purpose of these three connectors is to remove connector effects from debugging and testing. External connectors are generally not under our control in stream computing, and many problems become harder to diagnose when connector behavior is mixed in. A usage sketch follows below.
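A quick sketch of how they can be combined for a self-contained smoke test; the field names and rate are illustrative:

-- DataGen source: generates random rows at a configurable rate, no external system needed.
CREATE TABLE datagen_source (
    id BIGINT,
    score INT
) WITH (
    'connector' = 'datagen',
    'rows-per-second' = '10'
);

-- Print sink: writes every incoming row to the task managers' standard output.
CREATE TABLE print_sink (
    id BIGINT,
    score INT
) WITH (
    'connector' = 'print'
);

-- Route the generated data to the print sink to verify the pipeline end to end;
-- swap 'print' for 'blackhole' to measure pure query throughput instead.
INSERT INTO print_sink SELECT id, score FROM datagen_source;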

SQL-API

■ TableEnvironment

TableEnvironment, as the programming entry point of the SQL layer, is undoubtedly very important:

  • Table sqlQuery(String): takes a SELECT query and translates the user's SQL into a Flink Table.
  • void sqlUpdate(String): in essence executes a DDL or DML statement, but its behavior differs: DDL is executed immediately, while DML is silently buffered in the TableEnvironment and only executed by a later execute() call.
  • execute(): actually runs the buffered statements and submits the job to the cluster.

Because TableEnvironment silently buffers execution plans, many of these APIs felt confusing. In 1.11 the community therefore refactored the programming interface to provide a cleaner, less error-prone API:

  • Single statement execution: TableResult executeSql(String sql)
  • Multiple statement execution: StatementSet createStatementSet()
  • TableResult: supports collect, print, and getJobClient

executeSql is now a unified entry point: whether the statement is a query, DDL, or DML, it can be executed directly. In addition, the boundary with the DataStream API is now clear:

  • If the job converts Tables to DataStreams: always use StreamExecutionEnvironment.execute().
  • If it never converts to a DataStream: always use TableEnvironment.executeSql().

■ SQL-Client

In 1.11 the SQL Client catches up with many DDL statements already supported elsewhere in Flink. It is also worth noting that the community has added a tableau result display mode, which shows results more naturally: instead of switching to a separate screen, results are printed directly in the command line.
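For example, assuming the execution.result-mode property name used by the 1.11 SQL Client:

-- Render query results directly in the terminal instead of the full-screen view.
SET execution.result-mode=tableau;
SELECT id, name FROM kafka_table1;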


Summary and outlook

The discussion above focuses mainly on the user-facing interfaces. The community already provides rich documentation; detailed docs for these features are available on the official website for deeper understanding and practical use.

Flink SQL 1.11 is only a first step for CDC, but it lays a solid foundation in the internal mechanisms and APIs. Future releases will add more built-in CDC support, connect directly to database binlogs, and support more Flink SQL syntax. Subsequent versions will also provide deeper stream-batch unification at the lower layers, opening up more stream-batch integration possibilities for the SQL layer.