Catalogue of articles:
- Exactly-Once semantics in Apache Flink applications
- End-to-end Exactly-Once semantics of Flink applications
- The sample Flink application starts the pre-commit phase
- Implementing the two-phase commit Operator in Flink
Apache Flink 1.4.0, released in December 2017, introduced a milestone feature for stream processing: TwoPhaseCommitSinkFunction (see the related Jira). It extracts the common logic of the two-phase commit protocol, which makes it possible to build end-to-end Exactly-Once programs with Flink and a range of data sources and sinks, including Apache Kafka 0.11 and later. It provides an abstraction layer, and users only need to implement a few methods to achieve end-to-end Exactly-Once semantics.
For more information on using TwoPhaseCommitSinkFunction, see the documentation: TwoPhaseCommitSinkFunction, or read the source of the Kafka 0.11 sink directly: kafka.
Next, we will analyze this new feature and Flink's implementation logic in detail, covering the following points:
- Describe how Flink's checkpoint mechanism guarantees Exactly-Once results within a Flink application
- Show how Flink interacts with data sources and data sinks through the two-phase commit protocol to provide end-to-end Exactly-Once guarantees
- Use a simple example to show how TwoPhaseCommitSinkFunction can implement an Exactly-Once file sink
1. Exactly-Once semantics in Apache Flink applications
When we say “Exactly-Once”, we mean that each input event affects the final result only once. Even if the machine or software fails, there is neither duplicate data nor lost data.
Flink has provided Exactly-Once semantics for a long time. Over the past few years, we have described Flink's checkpoint mechanism in depth; it is the core of Flink's ability to provide Exactly-Once semantics. The Flink documentation also gives a comprehensive overview of this feature.
Before continuing, let’s look at a brief introduction to the checkpoint mechanism, which is crucial to understanding the following topics.
A checkpoint in Flink is a consistent snapshot of the following:
- The current state of the application
- The position in the input stream
Flink can be configured to generate checkpoints periodically at a fixed interval and write the checkpoint data to a persistent storage system, such as S3 or HDFS. Writing checkpoint data to persistent storage happens asynchronously, which means a Flink application can continue processing data while a checkpoint is in progress.
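As a point of reference, periodic checkpointing is enabled on the execution environment through Flink's DataStream API. A minimal configuration sketch is shown below; the 10-second interval is an illustrative value, not a recommendation:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Trigger a checkpoint every 10 seconds (illustrative value) with
        // Exactly-Once guarantees for the application's internal state.
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);
        // The snapshot is written to the configured state backend asynchronously,
        // so record processing continues while the checkpoint is persisted.
    }
}
```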
If a machine or software failure occurs, the Flink application is restored from the most recent checkpoint after restart: Flink restores the application state, rolls the input stream back to the position saved in the last checkpoint, and then starts running again. This means Flink computes as if the failure had never happened.
Before Flink 1.4.0, Exactly-Once semantics were limited to the inside of a Flink application and did not extend to the external systems that Flink sends data to after processing. Flink applications interact with a variety of data sinks, and developers had to maintain the context of those components themselves to guarantee Exactly-Once semantics.
To provide end-to-end Exactly-Once semantics – that is, in addition to the inside of the Flink application, the external systems that Flink writes to must also satisfy Exactly-Once semantics – these external systems must provide a means of committing or rolling back, which is then coordinated through Flink's checkpoint mechanism.
In distributed systems, the two-phase commit protocol is a common way to coordinate commits and rollbacks. In the next section, we discuss how Flink's TwoPhaseCommitSinkFunction leverages the two-phase commit protocol to provide end-to-end Exactly-Once semantics.
2. End-to-end Exactly-Once semantics of Flink applications
We will introduce the two-phase commit protocol and how it implements end-to-end Exactly-Once semantics in a Flink program that reads from and writes to Kafka. Kafka is a popular messaging system, often used together with Flink. Kafka added transaction support in version 0.11, which means that reading and writing Kafka through Flink now has the support necessary to provide end-to-end Exactly-Once semantics.
Flink's support for end-to-end Exactly-Once semantics is not limited to Kafka; you can use it with any source/sink that provides the necessary coordination mechanism. For example, Pravega, an open source streaming storage system from DELL/EMC, also supports end-to-end Exactly-Once semantics through Flink's TwoPhaseCommitSinkFunction.
In the sample program discussed today, we have:
- A data source that reads from Kafka (Flink's built-in Kafka Consumer)
- A window aggregation
- A data sink that writes the data back to Kafka (Flink's built-in Kafka Producer)
To provide an Exactly-Once guarantee for the data sink, all data must be written to Kafka within a transaction. A commit bundles all the data to be written between two checkpoints, which ensures that written data can be rolled back in case of a failure. However, in a distributed system there are usually multiple write tasks running concurrently, and a simple commit or rollback is not enough, because all components must "agree" to commit or roll back together to guarantee a consistent result. Flink uses the two-phase commit protocol with a pre-commit phase to solve this problem.
The start of a checkpoint marks the "pre-commit" phase of the two-phase commit protocol. When a checkpoint starts, Flink's JobManager injects a checkpoint barrier (which divides the records in the data stream between the current checkpoint and the next one) into the data stream.
The barrier is passed from operator to operator. For each operator, it triggers a snapshot of that operator's state to be written to the state backend.
The data source saves the offsets at which it is consuming Kafka and then passes the checkpoint barrier on to the next operator.
This approach applies only to the "internal" state of an operator. Internal state refers to values saved and managed by Flink's state backend – for example, the sums computed by the window aggregation in the second operator. When a process has only internal state, it does not need to do anything else in the pre-commit phase beyond writing its state changes to the state backend before the checkpoint. Flink is responsible for correctly committing these writes when the checkpoint succeeds, or aborting them in case of failure.
3. The sample Flink application starts the pre-commit phase
However, when the process has "external" state, some additional processing is required. External state is usually written to an external system such as Kafka. In this case, to provide Exactly-Once guarantees, the external system must support transactions so that it can be integrated with the two-phase commit protocol.
The data in this example needs to be written to Kafka, so the data sink has external state. In this case, during the pre-commit phase, in addition to writing its state to the state backend, the data sink must also pre-commit its external transaction.
The pre-commit phase ends when the checkpoint barrier has been passed through all operators and the triggered snapshot callbacks have completed successfully. All of the triggered state snapshots are considered part of that checkpoint. The checkpoint is a snapshot of the entire application state, including the pre-committed external state. If something goes wrong, we can roll back to the point in time when the last snapshot completed successfully.
The next step is to notify all operators that the checkpoint has succeeded. This is the commit phase of the two-phase commit protocol: the JobManager issues a checkpoint-completed callback to every operator in the application.
The data source and the window operator have no external state, so they do not need to do anything during the commit phase. The data sink, however, does have external state and commits its external transaction at this point.
To summarize the points above:
- Once all operators complete their pre-commit, a commit is issued.
- If at least one pre-commit fails, all the others are aborted and we roll back to the last successfully completed checkpoint.
- After a successful pre-commit, the commit must be guaranteed to eventually succeed – both the operators and the external system need to guarantee this. If a commit fails (for example, due to an intermittent network problem), the entire Flink application fails and restarts according to the user's restart strategy, and the commit is attempted again. This process is critical, because if the commit does not eventually succeed, data is lost.
This ensures that all operators agree on the final result of the checkpoint: either all operators agree that the data has been committed, or the commit is aborted and rolled back.
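The commit rules above can be sketched in a few lines of plain Java. This is a stand-alone simulation, not Flink code: the `Participant`, `runCheckpoint`, and `FileLikeParticipant` names are ours, invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseCommitSketch {

    // A participant in the protocol, e.g. an operator with external state.
    interface Participant {
        boolean preCommit(); // returns false to simulate a pre-commit failure
        void commit();
        void abort();
    }

    static class FileLikeParticipant implements Participant {
        final boolean healthy;
        String state = "open";
        FileLikeParticipant(boolean healthy) { this.healthy = healthy; }
        public boolean preCommit() {
            if (!healthy) return false;
            state = "pre-committed";
            return true;
        }
        public void commit() { state = "committed"; }
        public void abort()  { state = "aborted"; }
    }

    // Phase 1: ask every participant to pre-commit.
    // Phase 2: commit all only if every pre-commit succeeded; otherwise abort all.
    static boolean runCheckpoint(List<? extends Participant> participants) {
        for (Participant p : participants) {
            if (!p.preCommit()) {
                participants.forEach(Participant::abort);
                return false; // roll back to the last successful checkpoint
            }
        }
        participants.forEach(Participant::commit);
        return true;
    }

    public static void main(String[] args) {
        List<Participant> allHealthy = new ArrayList<>();
        allHealthy.add(new FileLikeParticipant(true));
        allHealthy.add(new FileLikeParticipant(true));
        System.out.println(runCheckpoint(allHealthy)); // true: everyone committed

        List<Participant> oneBroken = new ArrayList<>();
        oneBroken.add(new FileLikeParticipant(true));
        oneBroken.add(new FileLikeParticipant(false));
        System.out.println(runCheckpoint(oneBroken)); // false: everyone aborted
    }
}
```

The key property the sketch demonstrates is that no participant commits unless every participant has successfully pre-committed first.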
4. Implementing the two-phase commit Operator in Flink
A complete implementation of the two-phase commit protocol can be a little complicated, which is why Flink extracts its common logic into the abstract class TwoPhaseCommitSinkFunction.
Next, using a simple example that writes output to a file, we show how to use TwoPhaseCommitSinkFunction. Users only need to implement four methods to get Exactly-Once semantics for a data sink:
- beginTransaction – before the transaction starts, we create a temporary file in a temporary directory of the target file system. As we process data, we write it to this file.
- preCommit – in the pre-commit phase, we flush the file to storage, close it, and never write to it again. We also start a new transaction for any subsequent writes belonging to the next checkpoint.
- commit – in the commit phase, we atomically move the pre-committed file to the real target directory. Note that this increases the latency before output data becomes visible.
- abort – in the abort phase, we delete the temporary file.
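The four methods above can be sketched as a stand-alone class using only java.nio – a simplified illustration mirroring the shape of TwoPhaseCommitSinkFunction, not the real Flink class (the `TransactionalFileSink` name and method signatures are ours):

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TransactionalFileSink {
    private final Path tmpDir;
    private final Path targetDir;
    private Path currentTmpFile;
    private Writer writer;

    public TransactionalFileSink(Path tmpDir, Path targetDir) throws IOException {
        this.tmpDir = Files.createDirectories(tmpDir);
        this.targetDir = Files.createDirectories(targetDir);
    }

    // beginTransaction: create a temporary file to collect this checkpoint's data.
    public void beginTransaction() throws IOException {
        currentTmpFile = Files.createTempFile(tmpDir, "txn-", ".tmp");
        writer = Files.newBufferedWriter(currentTmpFile);
    }

    public void write(String record) throws IOException {
        writer.write(record);
        writer.write('\n');
    }

    // preCommit: flush and close the file; it will never be written again.
    // The returned path would be saved in checkpoint state for later commit/abort.
    public Path preCommit() throws IOException {
        writer.flush();
        writer.close();
        Path preCommitted = currentTmpFile;
        currentTmpFile = null;
        return preCommitted;
    }

    // commit: atomically move the pre-committed file into the target directory.
    public Path commit(Path preCommitted) throws IOException {
        Path target = targetDir.resolve(preCommitted.getFileName());
        return Files.move(preCommitted, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // abort: delete the temporary file.
    public void abort(Path preCommitted) throws IOException {
        Files.deleteIfExists(preCommitted);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("sink-demo");
        TransactionalFileSink sink =
            new TransactionalFileSink(base.resolve("tmp"), base.resolve("out"));
        sink.beginTransaction();
        sink.write("hello");
        Path pre = sink.preCommit();
        Path committed = sink.commit(pre);
        System.out.println(Files.readAllLines(committed)); // [hello]
    }
}
```

Until commit moves the file, nothing is visible in the target directory, which is exactly why the commit step adds latency to output visibility.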
We know that if anything goes wrong, Flink restores the application to its most recent checkpoint. In one extreme case, the pre-commit succeeds but a failure occurs before the commit notification reaches the operator. In that case, Flink restores the operator to a state that has already been pre-committed but not yet actually committed.
We need to save enough information in the checkpoint state during the pre-commit phase to correctly abort or commit the transaction after a restart. In this example, that information is the path of the temporary file and the target directory.
TwoPhaseCommitSinkFunction takes this into account and issues a commit first when recovering from a checkpoint. We need to implement the commit idempotently, which is generally not difficult. In this example, we can check whether the temporary file is no longer in the temporary directory but has already been moved to the target directory.
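A hedged sketch of what such an idempotent commit could look like in plain Java (the `commitIdempotently` helper is ours, not a Flink API): if the temporary file is gone but the target already exists, the move happened on an earlier attempt, so the replayed commit is a no-op.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class IdempotentCommit {

    // Safe to call repeatedly, e.g. when the commit is replayed after
    // recovering from a checkpoint that had already pre-committed.
    static Path commitIdempotently(Path preCommitted, Path targetDir) throws IOException {
        Path target = targetDir.resolve(preCommitted.getFileName());
        if (Files.notExists(preCommitted) && Files.exists(target)) {
            return target; // already moved by an earlier attempt: nothing to do
        }
        return Files.move(preCommitted, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path tmpDir = Files.createTempDirectory("pre");
        Path targetDir = Files.createTempDirectory("out");
        Path pre = Files.createTempFile(tmpDir, "txn-", ".tmp");
        commitIdempotently(pre, targetDir);              // first attempt moves the file
        Path again = commitIdempotently(pre, targetDir); // replayed attempt is a no-op
        System.out.println(Files.exists(again)); // true
    }
}
```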
TwoPhaseCommitSinkFunction also takes other corner cases into account. Refer to the Flink documentation for more information.
To summarize the main points of this article:
- Flink's checkpoint mechanism is the basis for supporting the two-phase commit protocol and providing end-to-end Exactly-Once semantics.
- An advantage of this scheme is that Flink does not transfer and store data over the network as some other systems do – it does not need to write each stage of the computation to disk as most batch processors do.
- Flink's TwoPhaseCommitSinkFunction extracts the common logic of the two-phase commit protocol. Based on it, it is possible to build end-to-end Exactly-Once applications by combining Flink with external systems that support transactions.
- Since Flink 1.4.0, both Pravega and the Kafka 0.11 producer provide Exactly-Once semantics; Kafka introduced transactions for the first time in version 0.11, which makes it possible to use the Kafka producer to provide Exactly-Once semantics in Flink programs.
- The transaction support of the Kafka 0.11 producer is implemented on top of TwoPhaseCommitSinkFunction and adds very little overhead compared with an at-least-once producer.
This is an exciting feature, and we expect Flink's TwoPhaseCommitSinkFunction to support more data sinks in the future.
Author: Piotr Nowojski
Translation | Zhou Kaibo
Zhou Kaibo, Alibaba technologist, holds a master's degree from Sichuan University. He joined Alibaba's Search Division after graduating in 2010, worked on the research and development of the search offline platform, and took part in migrating the search back-end data processing architecture from MapReduce to Flink. He now works in Alibaba's Computing Platform Division, focusing on building a one-stop computing platform based on Flink.
This article is original content of the Yunqi Community and may not be reproduced without permission.