Apache Flink infrastructure and concepts



Apache Flink is an open source computing platform for distributed data stream processing and batch data processing. It supports both streaming and batch applications on the same Flink runtime. Existing open source computing solutions treat stream processing and batch processing as two different application types, because the two provide very different SLAs: stream processing generally needs low latency and exactly-once guarantees, while batch processing needs high throughput and efficient processing. As a result, either two separate implementations are provided, or a separate open source framework is adopted for each. For example, open source solutions for batch processing include MapReduce, Tez, Crunch, and Spark, and open source solutions for stream processing include Samza and Storm.

Flink differs fundamentally from these traditional schemes in how it implements stream processing and batch processing. It looks at the two from another perspective and unifies them: Flink fully supports stream processing, where the input data stream is unbounded, and it treats batch processing as a special case of stream processing in which the input data stream is bounded. On top of the same Flink runtime, it provides a stream processing API and a batch processing API, and these two APIs are in turn the basis for the higher-level stream-oriented and batch-oriented application frameworks.

Basic characteristics

Here I simply group the features that Flink supports into categories. Specific concepts and their principles are explained in detail in the following sections.

Stream processing characteristics

  • High-throughput, low-latency, high-performance stream processing
  • Window operations based on event time
  • Exactly-once semantics for stateful computation
  • Highly flexible window operations: time-based, count-based, session-based, and data-driven
  • A continuous streaming model with backpressure
  • Fault tolerance based on lightweight distributed snapshots
  • A single runtime that supports both batch-on-streaming processing and stream processing
  • Flink implements its own memory management within the JVM
  • Iterative computing
  • Automatic program optimization: avoids expensive operations such as shuffles and sorts in certain cases, and caches intermediate results when necessary

API support

  • For streaming applications, the DataStream API
  • For batch applications, the DataSet API (Java / Scala)

Libraries support

  • Machine learning (FlinkML)
  • Graph analysis (Gelly)
  • Relational data processing (Table)
  • Complex event processing (CEP)

Integration support

  • Flink on YARN
  • HDFS
  • Input data from Kafka
  • Apache HBase
  • Hadoop programs
  • Tachyon
  • Elasticsearch
  • RabbitMQ
  • Apache Storm
  • S3
  • XtreemFS

Basic concepts

Stream & Transformation & Operator

A user's Flink program is composed of two basic building blocks: streams and transformations. A stream is an intermediate result, while a transformation is an operation that takes one or more input streams, processes them, and produces one or more result streams. When a Flink program is executed, it is mapped to a streaming dataflow. A streaming dataflow consists of a set of streams and transformation operators. It resembles a DAG that starts from one or more source operators and ends with one or more sink operators.

The following figure is a schematic of the mapping from a Flink program to a streaming dataflow:

[Figure: mapping from a Flink program to a streaming dataflow]

In the above figure, FlinkKafkaConsumer is a source operator; map, keyBy, timeWindow, and apply are transformation operators; and RollingSink is a sink operator.
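To make the source, transformation, and sink roles concrete, here is a minimal sketch in plain Python. It does not use the Flink API; the function names `source`, `map_op`, and `sink` are hypothetical and only mirror the three operator roles described above.

```python
from typing import Callable, Iterable, Iterator, List


def source(records: Iterable[str]) -> Iterator[str]:
    """Source operator: emits raw records into the dataflow."""
    yield from records


def map_op(fn: Callable[[str], int], stream: Iterator[str]) -> Iterator[int]:
    """Transformation operator: consumes one stream and produces another."""
    for record in stream:
        yield fn(record)


def sink(stream: Iterator[int], out: List[int]) -> None:
    """Sink operator: ends the dataflow by writing the results out."""
    out.extend(stream)


results: List[int] = []
sink(map_op(len, source(["a", "bb", "ccc"])), results)
print(results)  # [1, 2, 3]
```

Each stage only sees the stream produced by the stage before it, which is the DAG shape described above reduced to a single path.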

Parallel Dataflow

In Flink, programs are inherently parallel and distributed: a stream can be divided into multiple stream partitions, and an operator can be divided into multiple operator subtasks. Each operator subtask is executed independently in different threads. The parallelism of an operator is equal to the number of operator subtasks. The parallelism of a stream is always equal to the parallelism of the operator that generates it.

An example of parallel dataflow is shown in the following figure:

[Figure: parallel view of a streaming dataflow]

The parallel view of the streaming dataflow in the above figure shows two modes in which a stream can connect two operators:

  • One-to-one mode

For example, from source [1] to map() [1]. This mode preserves the partitioning of the source and the order in which elements are processed within a partition. That is, the map() [1] subtask sees the records of the data stream in the same order in which source [1] produced them.

  • Redistribution mode

This mode changes the partitioning of the input data stream, for example from map() [1] and map() [2] to keyBy() / window() / apply() [1] and keyBy() / window() / apply() [2]. Each upstream subtask sends data to multiple downstream subtasks, which changes the partitioning of the data stream. How the data is redistributed depends on the operator chosen in the application.

In addition, the source operator has two subtasks, so its parallelism is 2, while the sink operator has only one subtask, so its parallelism is 1.
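The difference between the two modes can be sketched in plain Python. This is an illustration under stated assumptions, not Flink's implementation: the helper names are hypothetical, and the byte-sum hash is only a deterministic stand-in for whatever key hashing a real system uses.

```python
from typing import List, Tuple

Record = Tuple[str, int]  # (key, value)


def forward(partitions: List[List[Record]]) -> List[List[Record]]:
    """One-to-one: subtask i feeds subtask i, so partitioning and order are kept."""
    return [list(partition) for partition in partitions]


def redistribute_by_key(partitions: List[List[Record]], parallelism: int) -> List[List[Record]]:
    """Redistribution: every upstream subtask may send to every downstream subtask."""
    out: List[List[Record]] = [[] for _ in range(parallelism)]
    for partition in partitions:
        for key, value in partition:
            target = sum(key.encode()) % parallelism  # stand-in hash, an assumption
            out[target].append((key, value))
    return out


upstream = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4)]]
print(forward(upstream))                 # unchanged partitions
print(redistribute_by_key(upstream, 2))  # all "a" records together, all "b" records together
```

After redistribution, all records with the same key end up in the same downstream partition, which is what a keyed operator needs.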

Task & Operator Chain

In Flink's distributed execution environment, multiple operator subtasks are strung together into an operator chain, which is in effect an execution chain. Each execution chain runs in its own thread on a TaskManager, as shown in the following figure:

[Figure: operator chains and their parallel subtasks]

The upper part of the figure shows the operator chains: multiple operators are connected by streams, and each operator chain corresponds to a task at runtime. The lower part is the parallel version of the upper part, in which each task is parallelized into multiple subtasks.
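The idea of chaining can be approximated by fusing several per-record functions into one callable that a single thread then runs. The sketch below is plain Python, and `chain` is a hypothetical helper rather than anything from the Flink API.

```python
from functools import reduce
from typing import Any, Callable


def chain(*operators: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Fuse operators so that one record passes through all of them in one call."""
    def task(record: Any) -> Any:
        return reduce(lambda value, op: op(value), operators, record)
    return task


# Three logical operators executed as one task.
parse_and_double = chain(str.strip, int, lambda x: x * 2)
print([parse_and_double(r) for r in [" 1", "2 ", " 3 "]])  # [2, 4, 6]
```

Because the record is handed from operator to operator inside one call, no hand-over between threads is needed within the chain.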

Time & Window

Flink supports window operations based on time and data, as shown in the following figure:

[Figure: time-based and data-driven window operations]

In the above figure, a time-based window operation processes the records of the stream in equal time intervals. In general, the number of records that falls into each interval is not fixed. A data-driven window operation instead takes a fixed number of records from the stream as one window and processes the records in that window.

Window operations come in the following types: tumbling windows, sliding windows, and session windows. For details, please refer to the relevant material.
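As a rough sketch of how records map to windows, the following plain-Python helpers assign a timestamp to its tumbling window, to all sliding windows that contain it, and split a list into count-based windows. The function names are hypothetical, and windows are written as half-open `[start, end)` intervals.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # [start, end)


def tumbling_window(timestamp: int, size: int) -> Window:
    """Each timestamp belongs to exactly one fixed-size, non-overlapping window."""
    start = timestamp - timestamp % size
    return (start, start + size)


def sliding_windows(timestamp: int, size: int, slide: int) -> List[Window]:
    """A timestamp belongs to every window of length `size` that starts on a slide boundary."""
    windows = []
    start = timestamp - timestamp % slide
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def count_windows(records: List[int], n: int) -> List[List[int]]:
    """Data-driven windows: every n records form one window; an incomplete tail is not emitted."""
    complete = len(records) - len(records) % n
    return [records[i:i + n] for i in range(0, complete, n)]


print(tumbling_window(7, 5))              # (5, 10)
print(sliding_windows(7, 10, 5))          # [(0, 10), (5, 15)]
print(count_windows([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4]]
```

Session windows are left out here because they depend on gaps between records rather than on a fixed size.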

Records in a stream usually carry several kinds of time fields. Flink supports multiple notions of time, as shown in the following figure:

[Figure: the different notions of time in a Flink dataflow]

The figure above shows where the different notions of time arise in a Flink-based stream processing system and what they mean. Event time is the time at which the event was created, ingestion time is the time at which the event enters the Flink dataflow, and processing time is the local system time (on the TaskManager node) at which an operator handles the event. Consider the problem of processing based on event time. Processing by event time generally delays the whole streaming application, because the events entering the system may arrive out of order with respect to their event time. To increase throughput, a stream is naturally split into multiple stream partitions, and each partition is ordered internally. To guarantee a global order, however, all stream partitions must be considered together, and a time window must be set aside for temporarily buffering data. Processing can proceed only after the stream partitions have been aligned on event time. Consequently, the longer the time window used for buffering records, the worse the processing performance, to the point of seriously affecting the real-time behavior of stream processing.
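The trade-off between ordering and delay can be illustrated with a small buffer that holds records until they are older than a fixed bound. This is a simplified plain-Python sketch, not Flink's watermark mechanism; `reorder` and `max_delay` are hypothetical names.

```python
import heapq
from typing import List, Tuple

Event = Tuple[int, str]  # (event_time, payload)


def reorder(events: List[Event], max_delay: int) -> List[Event]:
    """Emit events in event-time order, assuming none arrives more than max_delay late."""
    buffer: List[Event] = []
    out: List[Event] = []
    max_seen = float("-inf")
    for event in events:
        heapq.heappush(buffer, event)
        max_seen = max(max_seen, event[0])
        # Anything older than (max_seen - max_delay) can no longer be overtaken.
        while buffer and buffer[0][0] <= max_seen - max_delay:
            out.append(heapq.heappop(buffer))
    while buffer:  # end of input: flush what is left
        out.append(heapq.heappop(buffer))
    return out


arrivals = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d")]
print(reorder(arrivals, max_delay=2))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
```

A larger `max_delay` tolerates more disorder but keeps every record in the buffer longer, which is the latency cost described above.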

For time-based stream processing, refer to the official documentation. Flink draws on Google's watermark approach, which is covered in the relevant material.

Basic structure

The architecture of a Flink system is similar to Spark's. It follows the master-slave style, as shown in the following figure:

[Figure: Flink master-slave architecture]

When a Flink cluster starts, one JobManager process and at least one TaskManager process are started. In local mode, a JobManager and a TaskManager are started within the same JVM. When a Flink program is submitted, a client is created to preprocess it and convert it into a parallel dataflow. This dataflow corresponds to a Flink job, which the JobManager and the TaskManagers then execute. In terms of implementation, Flink builds the JobManager and the TaskManager on actors, so the two exchange information through events (messages).

As shown in the above figure, a Flink system consists mainly of the following three processes:

JobManager

The JobManager is the coordinator of a Flink system. It receives Flink jobs and schedules the execution of the tasks that make up each job. It also collects job status information and manages the worker nodes (TaskManagers) in the Flink cluster. The events it receives and processes mainly include:

  • RegisterTaskManager: when the Flink cluster starts, each TaskManager registers with the JobManager. If registration succeeds, the JobManager replies to the TaskManager with an AcknowledgeRegistration message.
  • SubmitJob: the Flink program submits a Flink job to the JobManager through the client. The SubmitJob message describes the job in the form of a JobGraph.
  • CancelJob: a request to cancel the execution of a Flink job. If the cancellation succeeds, a success message is returned.
  • UpdateTaskExecutionState: a TaskManager asks the JobManager to update the state of an ExecutionVertex in the ExecutionGraph. If the update succeeds, true is returned.
  • RequestNextInputSplit: a task running on a TaskManager requests the next input split to process. On success, a NextInputSplit message is returned.
  • JobStatusChanged: the ExecutionGraph sends this message to the JobManager to signal a change in the status of a Flink job, such as RUNNING, CANCELING, or FINISHED.


TaskManager

The TaskManager is also an actor. It is the worker that actually executes the set of tasks that make up a Flink job. Each TaskManager manages the resources on its node, such as memory, disk, and network, and reports the state of these resources to the JobManager when it starts. The TaskManager side can be divided into two phases:

Registration phase

The TaskManager registers with the JobManager by sending a RegisterTaskManager message and waits for the JobManager to reply with AcknowledgeRegistration. The TaskManager can then start its initialization.

Operational phase

In this phase, the TaskManager receives and processes task-related messages, such as SubmitTask, CancelTask, and FailTask. If the TaskManager can no longer connect to the JobManager, it has lost contact with the JobManager and automatically returns to the registration phase. Only after registration completes again can it continue to process task-related messages.
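The two phases behave like a small state machine. The sketch below models that behavior in plain Python; it is not Flink's actor code. The class name and the `ConnectionLost` event are hypothetical, while the other message names are taken from the text above.

```python
class TaskManagerSketch:
    """Toy model of the registration and operational phases of a TaskManager."""

    def __init__(self) -> None:
        self.phase = "REGISTRATION"
        self.tasks = set()

    def handle(self, message: str, task_id: str = "") -> str:
        if message == "AcknowledgeRegistration":
            self.phase = "OPERATIONAL"
            return "registered"
        if message == "ConnectionLost":  # hypothetical event name, an assumption
            self.phase = "REGISTRATION"
            return "re-registering"
        if self.phase != "OPERATIONAL":
            return "ignored"             # task messages wait until registration completes
        if message == "SubmitTask":
            self.tasks.add(task_id)
            return "accepted"
        if message in ("CancelTask", "FailTask"):
            self.tasks.discard(task_id)
            return "removed"
        return "unknown"


tm = TaskManagerSketch()
print(tm.handle("SubmitTask", "t1"))       # ignored
print(tm.handle("AcknowledgeRegistration"))  # registered
print(tm.handle("SubmitTask", "t1"))       # accepted
print(tm.handle("ConnectionLost"))         # re-registering
print(tm.handle("CancelTask", "t1"))       # ignored
```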


Client

When a user submits a Flink program, a client is created first. The client preprocesses the submitted program and submits it to the Flink cluster for processing. To do so, the client reads the address of the JobManager from the configuration of the submitted program, connects to the JobManager, and submits the Flink job to it. The client assembles the user's Flink program into a JobGraph and submits it in that form. A JobGraph is a Flink dataflow: a DAG composed of multiple JobVertex objects. A JobGraph contains the following information about a Flink program: the JobID, the job name, configuration information, a set of JobVertex objects, and so on.

Component stack

Flink has a layered architecture. Each layer contains components that provide a specific abstraction to serve the components above it. The component stack is shown in the following figure:

[Figure: Flink component stack]

Next, we explain each layer from bottom to top.

Deployment layer

This layer covers Flink's deployment modes. Flink supports several: local, cluster (standalone / YARN), and cloud (GCE / EC2). The standalone deployment mode is similar to Spark's. Here we look at the Flink on YARN deployment mode, as shown in the following figure:

[Figure: Flink on YARN deployment]

If you know YARN, the principle shown in the figure will look familiar. Flink implements the components that run on a YARN cluster: the Flink YARN client communicates with the YARN ResourceManager (RM) to negotiate resources, and the Flink JobManager and Flink TaskManagers each obtain containers in which to run their processes. As the figure shows, the YARN ApplicationMaster (AM) and the Flink JobManager run in the same container, so the AM knows the address of the Flink JobManager and can request containers in which to start Flink TaskManagers. Once Flink is running on the YARN cluster, the Flink YARN client can submit Flink jobs to the Flink JobManager for subsequent mapping, scheduling, and computation.

Runtime layer

The runtime layer provides all the core implementations that support Flink computing, such as distributed stream processing, the mapping from JobGraph to ExecutionGraph, and scheduling. It provides the basic services for the API layer above it.

API layer

The API layer implements the stream processing API for unbounded streams and the batch processing API for bounded data. Stream processing corresponds to the DataStream API, and batch processing corresponds to the DataSet API.

Libraries layer

This layer can also be called the Flink application framework layer. Following the split in the API layer, the computing frameworks built on top of the API layer for specific applications are likewise either stream-oriented or batch-oriented. Stream-oriented support: CEP (complex event processing) and SQL-like operations (Table-based relational operations). Batch-oriented support: FlinkML (machine learning library) and Gelly (graph processing).

Internal principle

Fault tolerance

Flink implements fault tolerance through a checkpoint mechanism that continuously takes snapshots of the distributed streaming dataflow. When stream processing fails, processing can be resumed from these snapshots. To understand Flink's fault tolerance mechanism, we first need to understand the concept of a barrier.

The stream barrier is the core element of Flink's distributed snapshots. It is treated as a record of the data stream: it is inserted into the stream, divides the records of the stream into groups, and moves forward with the stream. Each barrier carries a snapshot ID, and the records that belong to that snapshot flow ahead of the barrier. Because a barrier is very lightweight, it does not interrupt the data flow. A data stream with barriers is shown in the following figure:

[Figure: data stream with barriers]

Based on the above figure, note the following points:

  • When a barrier appears, the records before the barrier belong to the snapshot of that barrier, and the records after the barrier belong to the next snapshot
  • Barriers from different snapshots may be present in the data stream at the same time, which means that multiple snapshots may be in progress concurrently
  • When an intermediate operator receives a barrier, it forwards the barrier to its outgoing data streams. When a sink operator receives the barrier, it acknowledges the snapshot to the checkpoint coordinator. The snapshot is considered complete only once all sink operators have acknowledged it
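The way barriers divide a stream into per-snapshot groups can be sketched in a few lines of plain Python. The tuple `("barrier", snapshot_id)` and the function name are assumptions made for this illustration; they are not Flink types.

```python
from typing import Dict, List, Tuple, Union

Barrier = Tuple[str, int]  # ("barrier", snapshot_id), a hypothetical representation
Item = Union[str, Barrier]


def records_per_snapshot(stream: List[Item]) -> Tuple[Dict[int, List[str]], List[str]]:
    """Group records by the barrier that follows them; return (groups, records after last barrier)."""
    snapshots: Dict[int, List[str]] = {}
    current: List[str] = []
    for item in stream:
        if isinstance(item, tuple):    # a barrier closes the current group
            snapshots[item[1]] = current
            current = []
        else:
            current.append(item)
    return snapshots, current


stream = ["a", "b", ("barrier", 1), "c", ("barrier", 2), "d"]
print(records_per_snapshot(stream))
# ({1: ['a', 'b'], 2: ['c']}, ['d'])
```

Record `d` arrives after the last barrier, so it would belong to a later snapshot that has not been triggered yet.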

It should also be emphasized that a snapshot covers not only the state of the data stream but also the state held by the operators, so that data stream processing can be recovered correctly when the system fails. That is, if an operator holds any form of state, that state must be part of the snapshot.

An operator has two kinds of state. The first is system state. When an operator computes, it needs to buffer data, so the state of the data buffer is associated with the operator. Taking the buffer of a window operation as an example, Flink collects or aggregates records into the buffer until the data in the buffer is processed. The second is user-defined state, which can be created and modified by a transformation function. It can be a simple variable, such as a Java object in a function, or key/value state associated with the function.

For streaming applications with little state, snapshots are very lightweight and can be taken very frequently without affecting the performance of data stream processing. The state of a streaming application is stored in a configurable storage system, such as HDFS. The state stored during a checkpoint and the interaction involved are shown in the following figure:

[Figure: state storage and interaction during a checkpoint]

The checkpoint process involves another important operation: stream alignment. When an operator receives multiple input data streams, the streams must be aligned on the snapshot barriers, as shown in the following figure:

[Figure: stream alignment on snapshot barriers]

The alignment proceeds as follows:

  • As soon as the operator receives snapshot barrier n from one incoming stream, it cannot process further records from that stream until barrier n has also arrived from the other incoming streams (otherwise records belonging to two different snapshots would be mixed together)
  • Streams that have delivered barrier n are temporarily set aside: records from these streams are not processed but are put into a buffer
  • Once barrier n has arrived from the last stream, the operator emits all pending outgoing records, takes snapshot n of its state, and then emits barrier n itself
  • The operator then resumes processing records from all input streams, handling the buffered records first

Stream alignment makes exactly-once semantics possible, but it also adds latency to the stream processing application, because some records are held in the buffer while the barriers are aligned. The effect may be more noticeable when the data stream has high parallelism. In general, the buffered records can be processed only once the barrier from the slowest input stream has arrived. Flink provides a switch to choose whether stream alignment is used. If it is turned off, exactly-once becomes at-least-once.
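The alignment steps can be simulated with a toy operator whose state is a running sum. This is a plain-Python sketch under simplifying assumptions: only one checkpoint is in flight at a time, barriers are modeled as `("barrier", id)` tuples, and the class name is hypothetical. It is not Flink's implementation.

```python
from typing import Dict, List, Set, Tuple, Union

Barrier = Tuple[str, int]  # ("barrier", checkpoint_id), a hypothetical representation
Item = Union[int, Barrier]


class AligningSum:
    """Multi-input operator that sums its inputs and snapshots the sum on aligned barriers."""

    def __init__(self, num_inputs: int) -> None:
        self.num_inputs = num_inputs
        self.total = 0                       # operator state
        self.blocked: Set[int] = set()       # inputs that already delivered the barrier
        self.buffer: List[int] = []          # records held back during alignment
        self.snapshots: Dict[int, int] = {}  # checkpoint id -> state at that checkpoint

    def on_item(self, channel: int, item: Item) -> None:
        if isinstance(item, tuple):          # a barrier arrived on this input
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                # All barriers are in: the state reflects pre-barrier records only.
                self.snapshots[item[1]] = self.total
                self.blocked.clear()
                pending, self.buffer = self.buffer, []
                for value in pending:        # buffered records count toward the next snapshot
                    self.total += value
        elif channel in self.blocked:
            self.buffer.append(item)         # post-barrier record: hold it back
        else:
            self.total += item


op = AligningSum(num_inputs=2)
for channel, item in [(0, 1), (0, ("barrier", 1)), (0, 10), (1, 2), (1, ("barrier", 1))]:
    op.on_item(channel, item)
print(op.snapshots)  # {1: 3}
print(op.total)      # 13
```

The record 10 arrived on input 0 after its barrier, so it is excluded from snapshot 1 even though it reached the operator before input 1 had caught up. Processing it immediately instead would correspond to the at-least-once behavior mentioned above.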

Scheduling mechanism

The JobManager receives the Flink job that the client submits in the form of a JobGraph and transforms the JobGraph into an ExecutionGraph, as shown in the following figure:

[Figure: mapping from JobGraph to ExecutionGraph]

It can be seen from the above figure that:

The JobGraph is the user's logical view of a job. It represents the user's processing of a data stream as a single DAG, composed of vertices (JobVertex) and intermediate result sets (IntermediateDataSet). A JobVertex represents a transformation on the data stream, such as map, flatMap, filter, or keyBy. An IntermediateDataSet is produced by an upstream JobVertex and serves as the input of a downstream JobVertex.

The ExecutionGraph is the parallel representation of the JobGraph, that is, the logical view the JobManager actually uses to schedule a job onto the TaskManagers. It is also a DAG, composed of ExecutionJobVertex and IntermediateResult (or IntermediateResultPartition) objects. An ExecutionJobVertex corresponds to a JobVertex in the JobGraph, but it is a parallel representation made up of multiple parallel ExecutionVertex objects. Another important concept here is the Execution, which is one attempt to run an ExecutionVertex. In other words, one ExecutionVertex may correspond to several Executions in different states. For example, if running an ExecutionVertex produces a failed Execution, a new Execution is created to run it again, and the ExecutionVertex then corresponds to two execution attempts. Each Execution is uniquely identified by an ExecutionAttemptID, which the TaskManager and the JobManager use when they exchange task status.
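The relationship between the two graphs can be sketched with a few data classes. This is a plain-Python model that borrows the names from the text above; it omits intermediate results and edges, the parallelism values are chosen for illustration, and the integer counter is only a stand-in for ExecutionAttemptID.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List

_attempt_ids = count(1)  # stand-in for ExecutionAttemptID generation, an assumption


@dataclass
class JobVertex:
    name: str
    parallelism: int


@dataclass
class Execution:
    attempt_id: int
    state: str = "RUNNING"


@dataclass
class ExecutionVertex:
    name: str
    subtask_index: int
    attempts: List[Execution] = field(default_factory=list)

    def start_attempt(self) -> Execution:
        """Each run of this vertex is a new Execution with its own attempt id."""
        execution = Execution(next(_attempt_ids))
        self.attempts.append(execution)
        return execution


def to_execution_graph(job_graph: List[JobVertex]) -> Dict[str, List[ExecutionVertex]]:
    """Expand every JobVertex into one ExecutionVertex per parallel subtask."""
    return {v.name: [ExecutionVertex(v.name, i) for i in range(v.parallelism)]
            for v in job_graph}


job_graph = [JobVertex("source", 4), JobVertex("map", 4), JobVertex("reduce", 3)]
graph = to_execution_graph(job_graph)
print(sum(len(vertices) for vertices in graph.values()))  # 11

vertex = graph["map"][0]
vertex.start_attempt().state = "FAILED"  # first attempt fails
vertex.start_attempt()                   # a new Execution is created for the retry
print(len(vertex.attempts))              # 2
```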

Let’s look at an example of physical scheduling that shows how resources are allocated and used. It comes from the official website and is shown in the following figure:

[Figure: physical scheduling example with task slots]

The explanation is as follows:

  • Top-left subgraph: there are two TaskManagers, and each TaskManager has three task slots
  • Left subgraph: a Flink job logically contains 1 data source, 1 MapFunction, and 1 ReduceFunction, corresponding to a JobGraph
  • Left subgraph: the submitted Flink job configures the parallelism of each operator: 4 for the data source, 4 for the MapFunction, and 3 for the ReduceFunction. This corresponds to the ExecutionGraph in the JobManager
  • Top-right subgraph: on TaskManager 1, there are two DAGs composed of parallel ExecutionVertex objects, each of which occupies one task slot
  • Right subgraph: on TaskManager 2, there are also two DAGs composed of parallel ExecutionVertex objects, each of which occupies one task slot
  • The four Executions on the two TaskManagers run in parallel

Iterative mechanism

Machine learning and graph computing applications rely on iterative computation. Flink implements iterative algorithms by defining a step function inside an iteration operator. There are two such operators: Iterate and Delta Iterate. Both call the step function repeatedly on the current iteration state until a given condition is met. The principles of the two are explained below.


Iterate

The Iterate operator is the simple form of iteration: in each iteration, the input of the step function is either the entire input data set or the result of the previous iteration. Each iteration computes the input required for the next round (also known as the next partial solution). Once the termination condition is met, the final iteration result is output. The specific implementation process is shown in the following figure:

[Figure: execution flow of the Iterate operator]

The step function is executed in each iteration. It can be a dataflow composed of operators such as map, reduce, and join. The following simple and intuitive example from the official website illustrates the Iterate operator:

[Figure: Iterate operator example]

In the above iteration, the input data are the numbers 1 to 5. The step function is a simple map function that adds 1 to each input number, and the next partial solution is the result produced by the map function. For example, in the first iteration, the input 1 becomes 2, the input 2 becomes 3, and so on until the input 5 becomes 6. The newly generated numbers 2 to 6 are the input of the second iteration. If the termination condition is 10 iterations, the final result is 11 to 15.
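The example above can be reproduced with a short loop. This is a plain-Python sketch of the idea, not the DataSet API; `iterate` is a hypothetical helper in which a fixed iteration count serves as the termination condition.

```python
from typing import Callable, Iterable, List


def iterate(initial: Iterable[int],
            step: Callable[[List[int]], List[int]],
            max_iterations: int) -> List[int]:
    """Feed the result of each step back in as the next partial solution."""
    partial_solution = list(initial)
    for _ in range(max_iterations):
        partial_solution = step(partial_solution)
    return partial_solution


result = iterate(range(1, 6), lambda numbers: [n + 1 for n in numbers], 10)
print(result)  # [11, 12, 13, 14, 15]
```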

Delta Iterate

The Delta Iterate operator implements incremental iteration. Its principle is shown in the following figure:

[Figure: Delta Iterate operator]

Incremental iteration is built on the Delta Iterate operator. It has two inputs: the initial workset, which holds the incremental data still to be processed, and the initial solution set, which holds the results computed so far. In the first iteration, the step function is applied to the initial workset. The resulting workset becomes the input of the next iteration, and the solution set is updated incrementally. The iteration repeats until the termination condition is met, and the final result is then output from the solution set.
For example, suppose the solution set stores the best-selling products in each existing product category, and the workset receives the purchase counts of newly bought products from live online transactions. The computation yields the best-selling products for the new data. If purchases in some categories suddenly surge, the results in the solution set must be updated (a product that used to sell best may no longer do so after the incremental computation), and finally the set of best-selling products per category is output. For a more detailed example, see “propagate minimum in graph” on the official website, which is not covered here.
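A workset and a solution set interact as in the sketch below, which propagates the minimum vertex ID through each connected component of a small graph. It is a plain-Python illustration of the delta iteration idea, with a hypothetical function name and a made-up graph, and it does not use the DataSet API.

```python
from typing import Dict, Iterable, List, Set, Tuple


def propagate_minimum(vertices: Iterable[int], edges: List[Tuple[int, int]]) -> Dict[int, int]:
    """Label every vertex with the smallest vertex id in its connected component."""
    neighbours: Dict[int, Set[int]] = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    solution = {v: v for v in neighbours}  # initial solution set: own id
    workset = dict(solution)               # initial workset: every vertex changed
    while workset:                         # terminate when nothing changed
        next_workset: Dict[int, int] = {}
        for vertex, value in workset.items():
            for n in neighbours[vertex]:
                if value < solution[n]:    # only improvements update the solution set
                    solution[n] = value
                    next_workset[n] = value
        workset = next_workset
    return solution


edges = [(1, 2), (2, 3), (2, 4), (3, 4), (5, 6), (6, 7)]
print(propagate_minimum(range(1, 8), edges))
# {1: 1, 2: 1, 3: 1, 4: 1, 5: 5, 6: 5, 7: 5}
```

Each round only touches the vertices whose value changed in the previous round, so the workset shrinks as the solution converges.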

Backpressure monitoring

Backpressure receives a lot of attention in stream processing systems. The operators on a stream may differ greatly in processing speed and behavior. If an upstream operator processes too quickly, stream records can pile up at a downstream operator, which leads to processing delays or to the downstream operator collapsing under heavy load (some systems may even lose data). If a downstream operator that cannot keep up can propagate its processing state back to the upstream operator so that the upstream operator slows down, these problems are alleviated. Some existing stream processing systems, for example, surface such problems through alerts.

The Flink web interface lets you monitor the backpressure behavior of running jobs. It works by having a sampling thread sample the stack traces of running tasks, as shown in the following figure:

[Figure: stack trace sampling for backpressure monitoring]

The JobManager repeatedly calls Thread.getStackTrace() on the threads that run the tasks of a job. By default, it takes 100 stack trace samples for each task, 50 ms apart, and determines the backpressure from the results. Flink expresses the backpressure state of a running job as a ratio, which you can see in the web interface. The ratio is the fraction of stack trace samples that were blocked in an internal method call. For example, a ratio of 0.01 means that only 1 of 100 samples was blocked. Flink currently defines the following backpressure states:

  • OK: 0 <= Ratio <= 0.10
  • LOW: 0.10 < Ratio <= 0.5
  • HIGH: 0.5 < Ratio <= 1
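The thresholds above translate directly into a small classifier. The sketch below is plain Python with hypothetical function names; each sample is modeled as a boolean that says whether the stack trace was blocked.

```python
from typing import List


def blocked_ratio(samples: List[bool]) -> float:
    """Fraction of stack trace samples that were blocked."""
    return sum(samples) / len(samples)


def classify(ratio: float) -> str:
    """Map a ratio to the backpressure states listed above."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    if ratio <= 0.10:
        return "OK"
    if ratio <= 0.5:
        return "LOW"
    return "HIGH"


samples = [True] + [False] * 99          # 1 blocked sample out of 100
print(classify(blocked_ratio(samples)))  # OK
```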

In addition, Flink also provides three parameters to configure the backpressure monitoring behavior:

[Figure: the three backpressure monitoring parameters]

Using the backpressure states defined above and tuning the corresponding parameters, you can judge whether a running job is healthy while making sure that the monitoring does not affect the service provided by the JobManager.
