Practice and Exploration of Flink's Stream-Batch Unification

Date: 2020-09-04

Since Google's Dataflow Model was proposed, stream-batch unification has become the dominant trend in distributed computing engines. A unified engine combines the low latency of stream processing with the high throughput and stability of batch processing, offers a single programming interface for developing applications in both scenarios, and guarantees that their underlying execution semantics are consistent. For users, unification greatly reduces the cost of development and maintenance; for the computing engine, however, it is a major challenge.

As one of the earliest adopters of the Dataflow Model, Apache Flink is among the most advanced open source projects with respect to stream-batch unification. Drawing on community material and the author's own experience, this article surveys the current state of stream-batch unification in Flink (1.10) and the plans for its future development.

Overview

Many readers will know that Flink follows the Dataflow Model's core idea: batch processing is a special case of stream processing. Nevertheless, out of concern for execution efficiency, resource requirements, and implementation complexity in batch scenarios, Flink was designed with separate programming APIs for streaming and batch applications, even though both ultimately run on a streaming runtime. This lets Flink keep applying batch-specific optimizations at the execution level, simplify the architecture, and drop features that batch jobs do not need, such as watermarks and checkpoints.

Figure 1. Flink's classic architecture

In this architecture, the Runtime layer responsible for physical execution is a unified stream processor, but above it sit two independent APIs, DataStream and DataSet, built on different task types (StreamTask / BatchTask) and different UDF interfaces (Transformation / Operator). Meanwhile, although the relational Table API and SQL API look unified on the surface, their programming entry points (environments) are in fact separate, and the logic that translates streaming jobs and batch jobs down to the DataStream and DataSet APIs is inconsistent.
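A minimal illustration of this split (Flink 1.10, Java): the same trivial pipeline has to be written twice, against two separate entry points and two APIs. The data values are made up for the example.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TwoApis {
    public static void main(String[] args) throws Exception {
        // Batch: DataSet API with its own environment and BatchTask execution.
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        batchEnv.fromElements("flink", "batch")
                .map(s -> s.toUpperCase()).returns(Types.STRING)
                .print(); // print() triggers execution for DataSet jobs

        // Streaming: DataStream API with a separate environment and StreamTask execution.
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamEnv.fromElements("flink", "stream")
                .map(s -> s.toUpperCase()).returns(Types.STRING)
                .print();
        streamEnv.execute("datastream-job"); // explicit execute() for DataStream jobs
    }
}
```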

To achieve true stream-batch unification, therefore, Flink needs to rework both the Table/SQL API and the DataStream/DataSet API, port batch processing entirely onto stream processing, and still preserve the efficiency and stability that batch workloads depend on. Stream-batch unification is a central item on Flink's long-term roadmap, and its completion will mark Flink's entry into the new 2.x era.

After unification, the ideal architecture looks like this:

Figure 2. Flink's future architecture

The planner is extracted from the Table/SQL API layer and becomes a pluggable module, while the original DataStream/DataSet layer is reduced to DataStream only (the StreamTransformation and StreamOperator in Figure 2 are the main building blocks of the stream DAG, representing the UDF and the operator that executes the UDF, respectively). The DataSet API will be dropped.

Improvements to the Table/SQL API

Work on the Table/SQL API started relatively early, and as of the 1.10 release it has reached the goal of stream-batch unification. Back in version 1.7, however, the Table API was merely a library layered on top of the DataStream/DataSet APIs and received little attention from the community.

At that time, Alibaba's Blink had already done a great deal of optimization on Table/SQL. To merge Blink's advanced features into Flink, Alibaba's engineers pushed the community to restructure the architecture of the Table module [5] and to promote the Table/SQL API to a first-class programming API.

Since then, the code in the Table layer responsible for translating SQL/Table API programs into DataStream/DataSet programs has been abstracted into a pluggable Table Planner module, and Blink contributed its main features to the community in the form of the Blink planner. As a result, the two planners currently coexist.

Figure 3. Flink's current transitional architecture

Flink's default legacy planner translates SQL/Table programs into DataStream or DataSet programs, while the new Blink planner translates both into DataStream programs. In other words, with the Blink planner, the Flink Table API already provides unified stream-batch computation. To understand how the Blink planner achieves this, we first need a basic understanding of how a planner works.
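A minimal sketch of how this choice is made in 1.10: the planner, and whether the job runs in streaming or batch mode, are selected through EnvironmentSettings, so one Table API program can target either mode of the Blink planner.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PlannerSelection {
    public static void main(String[] args) {
        // Blink planner in streaming mode.
        TableEnvironment streamTableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance()
                        .useBlinkPlanner()
                        .inStreamingMode()
                        .build());

        // Blink planner in batch mode: batch SQL on the streaming runtime.
        TableEnvironment batchTableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance()
                        .useBlinkPlanner()
                        .inBatchMode()
                        .build());
    }
}
```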

The legacy planner represents user logic through the following stages of the Flink architecture:

Figure 4. Legacy planner architecture

  1. The Calcite-based SQL parser parses the SQL submitted by the user. Different kinds of statements are parsed into different Operations (for example, DDL maps to a CreateTableOperation and a query maps to a QueryOperation), and the AST is expressed as relational algebra in the form of Calcite RelNodes.
  2. Depending on which TableEnvironment the user specified, different translation paths turn the logical relational algebra nodes (RelNodes) into a stream Transformation tree or a batch Operator tree.
  3. The Transformation or Operator tree is translated into the job representation that carries the execution environment configuration, i.e. a StreamGraph or a Plan.
  4. The StreamGraph or Plan is optimized and wrapped into a serializable JobGraph (a sketch of these stages follows below).
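A hedged walk-through of these four stages on the streaming path (Flink 1.10, legacy planner): the `clicks` table is an invented name and is assumed to be registered in the catalog already.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class SqlTranslationStages {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Stage 1: the Calcite-based parser turns the SQL string into a
        // QueryOperation whose AST consists of Calcite RelNodes.
        Table result = tEnv.sqlQuery("SELECT user_id, COUNT(*) FROM clicks GROUP BY user_id");

        // Stages 2-3: converting the Table to a DataStream makes the planner
        // translate the RelNodes into Transformations under the hood.
        tEnv.toRetractStream(result, Row.class).print();

        // Stage 4: execute() builds the StreamGraph and submits a JobGraph.
        env.execute("sql-translation-demo");
    }
}
```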

Because batch SQL and streaming SQL share most of their syntax and semantics (streaming SQL merely extends the syntax to support stream-specific features such as watermarks and time characteristics), the SQL parser is shared between batch and stream. The crucial difference lies in how the relational algebra RelNodes are translated.

Figure 5. Legacy planner RelNodes

Flink extends Calcite's RelNode with its own FlinkRelNode, which has three subclasses: FlinkLogicalRel, DataSetRel, and DataStreamRel. FlinkLogicalRel represents a logical relational algebra node; for example, the FlinkLogicalRel corresponding to the common map function is FlinkLogicalCalc. DataSetRel and DataStreamRel represent the physical execution of a FlinkLogicalRel under batch and stream processing, respectively.

During SQL optimization, each FlinkLogicalRel is converted into a DataSetRel or a DataStreamRel depending on the programming entry point: BatchTableEnvironment optimizes with a Calcite-rule-based BatchOptimizer, while StreamTableEnvironment uses a StreamOptimizer. For example, a RelNode such as TableScan is translated into a BatchTableSourceScan in a batch environment and into a StreamTableSourceScan in a stream environment. These two kinds of physical relational algebra nodes map directly onto DataSet Operators and DataStream Transformations.
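To make this concrete, the following is an illustrative, pseudocode-level sketch of the mechanism in the style of a Calcite ConverterRule. StreamPhysicalFilter and the DATASTREAM convention are hypothetical stand-ins for the planner's real (Scala) DataStreamRel classes; the point is that a parallel, near-identical rule would be needed on the DataSetRel side.

```java
import org.apache.calcite.plan.Convention;
import org.apache.calcite.plan.RelOptCluster;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.convert.ConverterRule;
import org.apache.calcite.rel.core.Filter;
import org.apache.calcite.rel.logical.LogicalFilter;
import org.apache.calcite.rex.RexNode;

/** Hypothetical physical node standing in for a DataStreamRel implementation. */
class StreamPhysicalFilter extends Filter {
    StreamPhysicalFilter(RelOptCluster cluster, RelTraitSet traits,
                         RelNode input, RexNode condition) {
        super(cluster, traits, input, condition);
    }

    @Override
    public Filter copy(RelTraitSet traitSet, RelNode input, RexNode condition) {
        return new StreamPhysicalFilter(getCluster(), traitSet, input, condition);
    }
}

/** Converts a logical filter into the hypothetical stream-physical filter. */
class StreamFilterRule extends ConverterRule {
    // Hypothetical physical convention for the DataStream backend.
    static final Convention DATASTREAM = new Convention.Impl("DATASTREAM", RelNode.class);

    StreamFilterRule() {
        super(LogicalFilter.class, Convention.NONE, DATASTREAM, "StreamFilterRule");
    }

    @Override
    public RelNode convert(RelNode rel) {
        LogicalFilter filter = (LogicalFilter) rel;
        // Rebuild the node under the physical convention; a stream/batch split
        // means a second rule set must exist for the DataSetRel convention.
        return new StreamPhysicalFilter(filter.getCluster(),
                filter.getTraitSet().replace(DATASTREAM),
                convert(filter.getInput(), DATASTREAM),
                filter.getCondition());
    }
}
```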

The biggest problem with this approach is that the Calcite optimization rules cannot be reused. For the filter push-down optimization on data sources, for instance, a rule has to be implemented once for DataSetRel and once for DataStreamRel, and the operators in both the DataSet and DataStream layers have to be modified accordingly. The development and maintenance cost is very high, and this was also the main motivation for the Blink planner to pursue stream-batch unification.

As mentioned above, the key move of the Blink planner is to abandon the DataSet translation path and port the DataSetRel side onto DataStream. The prerequisite, of course, is that DataStream can express DataSet semantics. Readers familiar with batch processing may wonder: DataStream has no batch-specific operators such as sort, so how can these be expressed?

In fact, the Table planner makes heavy use of dynamic code generation, which can bypass the DataStream API and translate directly to the underlying Transformations and StreamOperators, so DataStream does not need ready-made operators for everything. The Table API of the Blink planner is therefore better thought of as parallel to the DataStream API, which is exactly the meaning of decoupling the Table API from the DataStream/DataSet API described in FLIP-32 [5]:

Decouple table programs from DataStream/DataSet API
Allow table programs to be self-contained. No need for a Stream/ExecutionEnvironment entrypoint anymore. A table program definition is just API that reads and writes to catalog tables.
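Returning to code generation: below is a toy sketch of the idea using Janino, the in-memory compiler embedded in Flink. The generated class and its source string are invented for illustration; real generated code implements Flink's operator interfaces rather than a bare mapper method.

```java
import org.codehaus.janino.SimpleCompiler;

public class CodegenDemo {
    public static void main(String[] args) throws Exception {
        // Source code assembled at planning time, as a string.
        String source =
            "public class GeneratedMapper {\n" +
            "  public String map(String in) { return in.toUpperCase(); }\n" +
            "}\n";

        SimpleCompiler compiler = new SimpleCompiler();
        compiler.cook(source); // compile the generated source in memory
        Class<?> cls = compiler.getClassLoader().loadClass("GeneratedMapper");
        Object mapper = cls.getDeclaredConstructor().newInstance();
        Object out = cls.getMethod("map", String.class).invoke(mapper, "flink");
        System.out.println(out); // prints FLINK
    }
}
```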

With the Table layer reworked, the overall API architecture is shown below; this is also the architecture implemented in version 1.10.

Figure 6. Blink planner architecture
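Concretely, under the 1.10 API a Blink-planner program can already be written in the self-contained style that FLIP-32 describes, with no Stream/ExecutionEnvironment entry point at all. The `src`/`snk` DDL below is invented and uses placeholder connector properties; any connector available on the classpath would do.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SelfContainedTableProgram {
    public static void main(String[] args) throws Exception {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build());

        // Invented catalog tables with placeholder connector properties.
        tEnv.sqlUpdate("CREATE TABLE src (word STRING) WITH ('connector.type' = '...')");
        tEnv.sqlUpdate("CREATE TABLE snk (word STRING) WITH ('connector.type' = '...')");

        // Reads from and writes to catalog tables; no DataStream/DataSet involved.
        tEnv.sqlUpdate("INSERT INTO snk SELECT UPPER(word) FROM src");
        tEnv.execute("self-contained-table-job");
    }
}
```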

In fact, earlier versions of DataStream did not support batch jobs well, and DataStream itself received many optimizations in order to support the Blink planner's batch-on-stream execution. Since these optimizations are prerequisites for the Table API, the preparatory work for merging the Blink planner into Flink master is analyzed in the next section together with the improvements to DataStream that are still outstanding.

In addition, although the Blink planner unifies stream and batch at the computation level, the TableSource and TableSink interfaces of the Flink Table API are still split between stream and batch. This means that tables based on BatchTableSource/BatchTableSink, which cover most batch scenarios, cannot yet cooperate well with the unified computation; this will be addressed in FLIP-95 [9].

Improvements to the DataStream API

On the DataStream side, the current DataStream API can handle bounded data streams, but this support is incomplete and its efficiency still lags behind the DataSet API. To complete stream-batch unification, the Flink community intends to introduce the concept of a BoundedStream into DataStream to represent bounded data streams, replacing DataSet in every sense.

BoundedStream will be a special case of DataStream. It will likewise be built on Transformations and StreamOperators, and it must also inherit DataSet's batch optimizations. These optimizations fall into the areas of task thread model, scheduling strategy and fault tolerance, and computation model and algorithms.

Task thread model

Batch workloads usually emphasize high throughput, so batch tasks are pull-based, which makes it convenient for a task to pull data in batches. Once started, a BatchTask actively reads from the external data source through the DataSet source API, InputFormat, and each task reads and processes only one split at a time.

In contrast, typical stream workloads care more about latency, so stream tasks are push-based.

The DataStream source API, SourceFunction, is executed by a dedicated source thread that keeps reading external data and continuously pushes it into the StreamTask. Each source thread can read one or more splits/partitions/shards concurrently.
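A minimal push-based source against the real SourceFunction API makes the contrast concrete; the counter logic is invented for the example, while the interface and its methods are Flink's.

```java
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class CounterSource implements SourceFunction<Long> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long n = 0;
        while (running) {                 // push loop driven by the source thread
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(n++);         // push a record downstream
            }
        }
    }

    @Override
    public void cancel() {
        running = false;                  // lets the push loop exit
    }
}
```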

Figure 7. Stream/batch task thread models (source: Flink Forward)

To resolve this difference in task thread models, the Flink community plans to rebuild the source API so that the thread model is unified across different external storage systems and business scenarios. The overall idea is to introduce a new source API that supports multiple thread models and covers both streaming and batch requirements; see FLIP-27 [6] or an earlier blog post [7] for details. At the time of writing, FLIP-27 is still in the early stages of development.
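The shape proposed in FLIP-27 can be sketched roughly as follows. Since the FLIP was still under development at the time of writing, these interfaces are a paraphrase of the proposal, not the final API: one Source abstraction, with boundedness as a property rather than a separate API, and split discovery separated from split reading.

```java
// Paraphrased sketch of the FLIP-27 proposal; names and signatures may differ
// from whatever the community eventually ships.
public interface Source<T, SplitT> {
    enum Boundedness { BOUNDED, CONTINUOUS_UNBOUNDED }

    Boundedness getBoundedness();                 // same source, stream or batch

    SplitEnumerator<SplitT> createEnumerator();   // discovers and assigns splits
    SourceReader<T, SplitT> createReader();       // reads the assigned splits

    interface SplitEnumerator<SplitT> {
        void addReader(int subtaskId);            // reader registration hook
    }

    interface SourceReader<T, SplitT> {
        void addSplits(java.util.List<SplitT> splits);
        T pollNext();                             // pull-style reading inside a push runtime
    }
}
```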

Scheduling strategy and fault tolerance

As is well known, batch jobs and streaming jobs differ greatly in task scheduling. The tasks of a batch job do not all need to be online at the same time: one batch of tasks can be scheduled according to the dependency graph, and the next batch runs after the previous one finishes.

A streaming job, by contrast, must have all of its tasks scheduled when the job starts before it can begin processing data. The former approach is usually called lazy scheduling, the latter eager scheduling. To unify stream and batch, Flink needs to support both scheduling modes on the StreamGraph, which means adding lazy scheduling.
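Flink's runtime already names these two modes in its internal ScheduleMode enum (an internal API normally chosen by the planner rather than by users); the snippet below only prints the two values to name the concepts.

```java
import org.apache.flink.runtime.jobgraph.ScheduleMode;

public class ScheduleModes {
    public static void main(String[] args) {
        System.out.println(ScheduleMode.EAGER);             // streaming: all tasks at once
        System.out.println(ScheduleMode.LAZY_FROM_SOURCES); // batch: stage by stage, from sources
    }
}
```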

Hand in hand with scheduling comes fault tolerance, which is not hard to understand: after a failure, tasks need to be rescheduled to recover. A consequence of lazy scheduling is that the intermediate results computed by a task must be saved in highly available storage so that the downstream task can fetch them after it starts.

Before version 1.9, Flink did not persist intermediate results. As a result, if a TaskManager crashed, the intermediate results were lost and the whole job had to re-read the data from the beginning or recover from a checkpoint. This is acceptable for real-time stream processing, but batch processing has no concept of checkpoints; it usually relies on persisted intermediate results to narrow the range of tasks that must be recomputed. The Flink community therefore introduced a pluggable Shuffle Service that persists shuffle data to support fine-grained fault-tolerant recovery; see FLIP-31 [8] for details.

Calculation model and algorithm

As with the Table API, the same computation may use different algorithms in stream and in batch processing. Join is a typical example: in stream processing it appears as the continuous association of the elements of two streams, where any new input on either side must be joined against all elements of the other side, i.e. the most basic nested-loop join. In batch processing, Flink can optimize this into a hash join: it first reads one side completely to build a hash table, then reads the other side and probes the hash table (see Figure 8).

Figure 8. Batch optimization of join
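A language-level sketch of the two strategies, using plain Java collections rather than Flink operators; keys are assumed unique on the build side for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinStrategies {
    /** Streaming-style nested-loop join: every element on one side must be
     *  checked against everything seen on the other side. */
    static List<String> nestedLoopJoin(List<Integer> left, List<Integer> right) {
        List<String> out = new ArrayList<>();
        for (int l : left)
            for (int r : right)
                if (l == r) out.add(l + "|" + r);
        return out;
    }

    /** Batch-style hash join: only possible because one side is bounded and
     *  can be fully read to build the hash table before probing begins. */
    static List<String> hashJoin(List<Integer> build, List<Integer> probe) {
        Map<Integer, Integer> table = new HashMap<>();
        for (int b : build) table.put(b, b);           // build phase
        List<String> out = new ArrayList<>();
        for (int p : probe) {                          // probe phase
            Integer match = table.get(p);
            if (match != null) out.add(match + "|" + p);
        }
        return out;
    }
}
```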

This difference is, in essence, an operator optimization that becomes possible when the data set is bounded. Viewed more generally, boundedness is simply a parameter Flink can use when deciding how to execute an operator, which again confirms the idea that batch processing is a special case of stream processing. From the programming-interface perspective, BoundedStream, as a subclass of DataStream, can therefore offer the following optimizations based on the boundedness of its input:

  • Provide operators that can only be applied to bounded data streams, such as sort.
  • Optimize the algorithms of certain operators, such as join.

In addition, batch processing has the property that intermediate results need not be emitted during the computation; only the final result is emitted at the end, which avoids much of the complexity of handling multiple intermediate results. BoundedStream will therefore also support a non-incremental execution mode, which mainly affects operators related to the time characteristic (see the sketch after this list):

  • Processing-time timers are masked.
  • The watermark extraction algorithm no longer takes effect: the watermark jumps directly from -∞ at the beginning to +∞ at the end.
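Flink's real BoundedOneInput operator interface, introduced alongside the Blink planner work, already supports exactly this end-of-input pattern; the SumUntilEnd operator below is a made-up example, while the interface and base class are Flink's.

```java
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.BoundedOneInput;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

public class SumUntilEnd extends AbstractStreamOperator<Long>
        implements OneInputStreamOperator<Long, Long>, BoundedOneInput {

    private long sum = 0;

    @Override
    public void processElement(StreamRecord<Long> element) {
        sum += element.getValue();   // accumulate; no intermediate output
    }

    @Override
    public void endInput() {
        // Called exactly once when the bounded input is exhausted:
        // the final result is emitted only here.
        output.collect(new StreamRecord<>(sum));
    }
}
```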

Summary

Starting from the idea that batch processing is a special case of stream processing, expressing batch jobs with stream processing is entirely feasible semantically. The real difficulty of stream-batch unification lies in optimizing for the batch case as a special scenario; for Flink, this difficulty shows up mainly in the task thread model, the scheduling strategy, and the computation model and algorithms. At present, Flink has achieved stream-batch unification at the declarative Table/SQL API level, and the lower-level, more programmatic DataStream API is expected to reach unification in Flink 2.0.

Tips: For the original article and detailed references, please refer to the link below.

Link to the original text:

http://www.whitewood.me/2020/…

About the author:

Lin Xiaopu is a senior development engineer at NetEase Games, responsible for the development, operation, and maintenance of the game data center's real-time platform. He currently focuses on the development and application of Apache Flink and enjoys digging into problems.
