This article is the first in a series on Flink SQL that interprets the major features of the new Apache Flink release. The series is written by core contributors and covers fundamentals, practice, tuning, and internals, aiming to give you a thorough understanding of Flink SQL, from the basics to the advanced topics.
1. Development history
On August 22, 2019, Apache Flink released version 1.9.0 (hereinafter 1.9). In Flink 1.9, the Table module received a major upgrade of its core architecture and gained many features contributed by Alibaba's Blink team. This article walks through the architecture of the Table module and explains how to use the Blink planner.
Flink's Table module includes the Table API and SQL. The Table API is a SQL-like API that lets users operate on data the way they would operate on tables, which is intuitive and convenient. SQL is a declarative language with standard syntax and specifications, so users can process data without caring about the underlying implementation. The Table API and SQL share about 80% of their code. As a unified stream and batch computing engine, Flink has a unified runtime layer. Before Flink 1.9, however, the API layer was split into the DataStream API and the DataSet API, with the Table API & SQL sitting on top of both.
[Figure: Flink 1.8 Table architecture]
In the Flink 1.8 architecture, users who need both stream computing and batch processing have to maintain two sets of business code, and developers have to maintain two technology stacks, which is very inconvenient. The Flink community has long held the view that batch data can be regarded as a bounded stream and batch processing as a special case of stream computing, which makes it possible to unify stream and batch processing. Alibaba's Blink team did a lot of work in this area and achieved stream and batch unification at the Table API & SQL layer. Alibaba has since contributed Blink to the Flink community as open source. To unify stream and batch processing across the whole Flink stack, and drawing on the Blink team's experience, the community's developers settled on the main outline of Flink's future architecture after several rounds of discussion.
[Figure: Flink's future architecture]
In Flink's future architecture, the DataSet API will be retired, leaving the DataStream API and the Table API & SQL as the only user-facing APIs. At the implementation layer, the two APIs share the same technology stack: a unified DAG data structure describes jobs, a unified StreamOperator is used to write operator logic, and a unified streaming distributed execution engine runs them, which achieves thorough stream and batch unification. Both APIs support stream computing and batch processing. The DataStream API offers a lower-level, more flexible programming interface: users describe and arrange operators themselves, and the engine does little intervention or optimization. The Table API & SQL offers an intuitive Table API and standard SQL support, and the engine optimizes the job according to the user's intent and chooses the best execution plan.
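To make the idea that a batch is just a bounded stream more concrete, here is a minimal plain-Java sketch. It does not use any Flink API, and `RunningSum` and `BoundedStreamSketch` are hypothetical names chosen for illustration. The same per-record operator logic is driven once in a streaming style (an updated result after every record) and once in a batch style (only the final result, because the input is known to end).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy model only: none of these types are Flink classes.
// One operator implementation is reused for both execution styles.
final class RunningSum {
    private long sum = 0;

    // Per-record logic shared by streaming and batch execution.
    long process(int value) {
        sum += value;
        return sum;
    }
}

final class BoundedStreamSketch {

    // Streaming style: emit an updated result after every record.
    static List<Long> runStreaming(Iterator<Integer> source) {
        RunningSum op = new RunningSum();
        List<Long> out = new ArrayList<>();
        while (source.hasNext()) {
            out.add(op.process(source.next()));
        }
        return out;
    }

    // Batch style: the input is bounded, so only the final result is emitted.
    static long runBatch(Iterator<Integer> boundedSource) {
        RunningSum op = new RunningSum();
        long last = 0;
        while (boundedSource.hasNext()) {
            last = op.process(boundedSource.next());
        }
        return last;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4);
        System.out.println(runStreaming(data.iterator())); // prints [1, 3, 6, 10]
        System.out.println(runBatch(data.iterator()));     // prints 10
    }
}
```

The batch result equals the last value the streaming run emits, which is the sense in which batch processing can be treated as a special case of stream computing.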
2. Flink 1.9 table architecture
Blink's Table module had already achieved stream and batch unification when it was open-sourced, so it had taken the first step towards Flink's future architecture ahead of the Flink community. Therefore, when the Blink Table code was merged in Flink 1.9, the community's developers restructured and optimized the Flink Table module around FLIP-32 (a FLIP, or Flink Improvement Proposal, records a proposal for a major change to Flink; FLIP-32 is "Restructure flink-table for future contributions"). The goal was to let the existing Flink Table architecture and the Blink Table architecture coexist and evolve towards Flink's future architecture. As a result, the new Flink Table architecture supports unified stream and batch processing, and Flink 1.9 can be seen as the first step towards a fully unified stream and batch architecture.
[Figure: Flink 1.9 Table architecture]
The new Flink Table architecture contains two query processors, the Flink query processor and the Blink query processor, which correspond to two planners: the old planner and the Blink planner. A query processor is the concrete implementation of a planner. It runs a Table API & SQL job through the parser, the optimizer, and code generation (codegen), turning it into a Transformation DAG (a directed acyclic graph of Transformations that represents the job's transformation logic) that the Flink runtime understands. The Flink runtime then schedules and executes the job.
The Flink query processor has separate branches for stream computing jobs and batch jobs: stream computing jobs are built on the DataStream API, and batch jobs on the DataSet API. The Blink query processor uses a single interface for both kinds of job, and its underlying API is Transformation.
3. Flink planner and Blink planner
The new Flink Table architecture makes the query processor pluggable. The community kept the original Flink planner (the old planner) intact and introduced the new Blink planner alongside it, so users can choose which of the two to use.
In terms of model, the old planner does not unify stream computing jobs and batch jobs. The two are implemented differently and are translated to the DataStream API and the DataSet API respectively. The Blink planner treats a batch data set as a bounded DataStream, so both stream computing jobs and batch jobs are ultimately translated to the Transformation API. In terms of architecture, the Blink planner implements a BatchPlanner for batch processing and a StreamPlanner for stream computing, and the two share most of their code and much of their optimization logic. The old planner implements batch processing and stream computing as two completely independent code bases, with essentially no reuse of code or optimization logic.
Beyond these advantages in model and architecture, the Blink planner has accumulated many practical features from large-scale business scenarios inside Alibaba Group. They fall into three areas:
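The structural difference between the two planners can be illustrated with a small plain-Java sketch. All names here (`ToyPlanner`, `ToyStreamPlanner`, `ToyBatchPlanner`, and the string-based "plan") are hypothetical and are not Flink's actual classes. The sketch only mirrors the shape described above: a shared base class holds the common logic, the two subclasses differ only in their mode-specific rules, and both produce the same kind of output.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these are not Flink's planner classes.
abstract class ToyPlanner {

    // Shared by both modes: the common part of the translation.
    final List<String> translate(String query) {
        List<String> plan = new ArrayList<>();
        plan.add("parse(" + query + ")");
        plan.add("commonOptimize");        // optimization logic reused by both planners
        plan.add(modeSpecificOptimize());  // the only step that differs per mode
        plan.add("toTransformations");     // both modes target the same representation
        return plan;
    }

    abstract String modeSpecificOptimize();
}

final class ToyStreamPlanner extends ToyPlanner {
    @Override
    String modeSpecificOptimize() {
        return "streamRules";
    }
}

final class ToyBatchPlanner extends ToyPlanner {
    @Override
    String modeSpecificOptimize() {
        return "batchRules";
    }
}

final class PlannerSketch {

    // Pick the planner from a mode flag, similar in spirit to choosing
    // streaming or batch mode in the environment settings.
    static ToyPlanner create(boolean streaming) {
        return streaming ? new ToyStreamPlanner() : new ToyBatchPlanner();
    }

    public static void main(String[] args) {
        System.out.println(create(true).translate("SELECT 1"));
        System.out.println(create(false).translate("SELECT 1"));
    }
}
```

By contrast, a design in the style of the old planner would correspond to two unrelated classes that each reimplement every step and emit different output types.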
- The Blink planner improves the code generation mechanism and optimizes some operators. It also provides rich and practical new features, such as dimension table join, Top N, mini-batch, streaming deduplication, and data skew optimization for aggregations.
- The Blink planner's optimization strategy is based on optimization over common subgraphs and includes both cost-based optimization (CBO) and rule-based optimization (RBO), which makes optimization more comprehensive. The Blink planner can also obtain data source statistics from the catalog, which is very important for CBO.
- The Blink planner provides more built-in functions and more complete standard SQL support. Flink 1.9 fully supports TPC-H, and support for the more demanding TPC-DS is planned for the next version.
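To illustrate why table statistics matter for cost-based optimization, here is a minimal plain-Java sketch. It is not the Blink planner's cost model: the `TableStats` type, the row counts, and the cost formula (building a hash table is assumed to cost twice as much per row as probing it) are all assumptions made for illustration. Given two candidate plans for a hash join, the sketch picks the cheaper one, which is only possible when row counts are known.

```java
// Toy model only: not the Blink planner's cost model or statistics API.
final class TableStats {
    final String name;
    final long rowCount;

    TableStats(String name, long rowCount) {
        this.name = name;
        this.rowCount = rowCount;
    }
}

final class CboSketch {

    // Assumed cost formula: building the hash table costs 2 units per row,
    // probing it costs 1 unit per row.
    static long hashJoinCost(TableStats build, TableStats probe) {
        return 2 * build.rowCount + probe.rowCount;
    }

    // Enumerate both candidate plans and keep the cheaper build side.
    static String chooseBuildSide(TableStats left, TableStats right) {
        long leftAsBuild = hashJoinCost(left, right);
        long rightAsBuild = hashJoinCost(right, left);
        return leftAsBuild <= rightAsBuild ? left.name : right.name;
    }

    public static void main(String[] args) {
        TableStats orders = new TableStats("orders", 1_000_000L);
        TableStats currency = new TableStats("currency", 200L);
        // orders as build side:   2 * 1_000_000 + 200 = 2_000_200
        // currency as build side: 2 * 200 + 1_000_000 = 1_000_400
        System.out.println(chooseBuildSide(orders, currency)); // prints currency
    }
}
```

Without row counts the two candidate plans would look equally good to the optimizer, which is why statistics obtained from the catalog are valuable for CBO.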
Overall, the Blink query processor is more advanced in both architecture and functionality. For stability reasons, Flink 1.9 still uses the Flink planner by default. To use the Blink planner, specify it explicitly in the job.
4. How to enable the Blink planner
In an IDE, you only need to add two dependencies to enable the Blink planner. The bridge artifact shown here is the Scala one; for a job written in Java, the corresponding artifact is flink-table-api-java-bridge_2.11.
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-table-api-scala-bridge_2.11</artifactId>
      <version>1.9.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-table-planner-blink_2.11</artifactId>
      <version>1.9.0</version>
    </dependency>
Stream computing jobs and batch jobs are configured in very similar ways: you only need to choose streaming mode or batch mode by calling inStreamingMode() or inBatchMode() when building the environment settings. A stream computing job is set up as follows:
    // **********************
    // BLINK STREAMING QUERY
    // **********************
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.java.StreamTableEnvironment;

    StreamExecutionEnvironment bsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
    // Select the Blink planner explicitly and run in streaming mode.
    EnvironmentSettings bsSettings = EnvironmentSettings.newInstance()
        .useBlinkPlanner()
        .inStreamingMode()
        .build();
    StreamTableEnvironment bsTableEnv = StreamTableEnvironment.create(bsEnv, bsSettings);
    // or TableEnvironment bsTableEnv = TableEnvironment.create(bsSettings);

    bsTableEnv.sqlUpdate(…);
    // execute takes a job name; the name used here is only a placeholder.
    bsTableEnv.execute("blink-streaming-job");
A batch job is set up as follows:
    // ******************
    // BLINK BATCH QUERY
    // ******************
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    // Select the Blink planner explicitly and run in batch mode.
    EnvironmentSettings bbSettings = EnvironmentSettings.newInstance()
        .useBlinkPlanner()
        .inBatchMode()
        .build();
    TableEnvironment bbTableEnv = TableEnvironment.create(bbSettings);

    bbTableEnv.sqlUpdate(…);
    // execute takes a job name; the name used here is only a placeholder.
    bbTableEnv.execute("blink-batch-job");
If the job needs to run in a cluster environment, set the scope of the Blink planner related dependencies to provided when packaging, which indicates that the cluster environment supplies them. Flink already bundles the Blink planner related dependencies when it is compiled and packaged, so they should not be included again, which avoids conflicts.
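A sketch of what this could look like in a Maven pom.xml, reusing the planner artifact ID and version from the dependency list earlier in this section. Which dependencies your cluster distribution actually ships is an assumption to verify against your own deployment.

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-planner-blink_2.11</artifactId>
  <version>1.9.0</version>
  <!-- provided: available at compile time, but not packaged into the job JAR -->
  <scope>provided</scope>
</dependency>
```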
5. Long-term community plan
The Table API & SQL has become a first-class citizen among Flink's APIs, and the community will invest more effort in this module. Once the Blink planner is stable, it will become the default planner, and the old planner will be retired in due course. The community is also working on giving the DataStream API batch processing capabilities so as to unify the stream and batch technology stack. At that point, the DataSet API will be retired as well.