Flink’s time and watermarks

Time: 2021-01-18


When we use Flink, we cannot avoid dealing with time and watermarks. Understanding these concepts is the basis for developing distributed stream processing applications. What time semantics does Flink support? How does Flink handle out-of-order events? What is a watermark? How are watermarks generated? How are watermarks propagated? Let's start with these questions.


Temporal semantics

Basic concepts

Time is one of the most important concepts in stream processing frameworks such as Flink. In Flink, time can be divided into three types: event time, ingestion time and processing time, as shown in the figure below.

Flink's time and watermarks

  • Event Time

Event time is the time at which the event itself occurred, that is, the actual time of the event in the data stream, usually described by a timestamp carried by the event. These timestamps usually exist before the events enter the stream processing application, so event time reflects the real time at which the event happened. Therefore, the result of a computation based on event time is deterministic: no matter how fast the data stream is processed and whether events arrive at an operator out of order, the final result is the same.

  • Ingestion Time

Ingestion time is the time at which an event enters Flink, that is, the processing time at the data source operator is used as the event-time timestamp of the record, and watermarks are generated automatically.

Conceptually, ingestion time sits between event time and processing time. Compared with processing time it costs a little more, but the results are more predictable. Because ingestion time uses a stable timestamp (assigned once at the data source), different window operations on a record will refer to the same timestamp, whereas with processing time each window operator may assign the record to a different window.

Compared with event time, ingestion time cannot handle out-of-order events or late data, so it cannot provide fully deterministic results, but the program does not have to specify how to generate watermarks. Internally, ingestion time is treated very much like event time, except that timestamps are assigned and watermarks are generated automatically.

  • Processing Time

Processing time is determined by the system clock of the machine processing the data, that is, the time of the current system when the event is processed. Taking the window operator as an example (windows are analyzed in detail below), a window operation based on processing time is triggered by machine time. Because the rate at which data arrives at the window varies, using processing time in a window operator leads to non-deterministic results. On the other hand, when processing time is used there is no need to wait for a watermark to trigger the window, so it can provide lower latency.

Comparison

After the above analysis, we should have a general understanding of Flink's time semantics. You may be wondering: since event time can solve all of these problems, why use processing time at all? In fact, processing time has its own use cases. Because processing time does not account for the delay and disorder of events, the latency of processing data is low. Therefore, if an application cares more about processing speed than about accuracy, processing time can be used, for example in a real-time monitoring dashboard. In short, processing time has low latency but non-deterministic results, while event time adds latency but guarantees accurate results and can handle delayed or even out-of-order data.

Usage

The previous section described the basic concepts of the three time semantics; this section explains how to configure them in a program at the code level. Let's start with a piece of code:

/** The time characteristic that is used if none other is set. */
 private static final TimeCharacteristic DEFAULT_TIME_CHARACTERISTIC = TimeCharacteristic.ProcessingTime;
//Omitted code
/** The time characteristic used by the data streams. */
 private TimeCharacteristic timeCharacteristic = DEFAULT_TIME_CHARACTERISTIC;

The two lines above are extracted from the StreamExecutionEnvironment class. As can be seen, Flink's default time semantics in the stream processor is processing time. How do we change the default time semantics? It is very simple; let's look at more code. The following fragment also comes from the StreamExecutionEnvironment class:

/**
 * If event time or ingestion time is used, the default watermark interval is 200 ms.
 * It can be changed via ExecutionConfig#setAutoWatermarkInterval(long).
 * @param characteristic The time characteristic.
 */
 @PublicEvolving
 public void setStreamTimeCharacteristic(TimeCharacteristic characteristic) {
 this.timeCharacteristic = Preconditions.checkNotNull(characteristic);
 if (characteristic == TimeCharacteristic.ProcessingTime) {
 getConfig().setAutoWatermarkInterval(0);
 } else {
 getConfig().setAutoWatermarkInterval(200);
 }
 }

The method above can be used to configure different time semantics. Its parameter is of type TimeCharacteristic, an enum with three values: ProcessingTime, IngestionTime and EventTime. The specific usage is as follows:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// or: env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);

Watermarks

Before explaining watermarks, let's look at a familiar real-life case: the college entrance examination. If we treat the exam schedule as a stream processing application, then the period from the start time to the end time of each subject is a window, each examinee is a record, the time an examinee arrives at the examination room is the timestamp of the record, and the examination itself is some operator. As we all know, entry is not allowed 15 minutes after the exam starts; this rule can be understood as a watermark. For example, the Chinese exam starts at 9:30 in the morning and candidates may enter before 9:45, so 9:45 can be understood as the watermark. Some students like to arrive early, while others like to arrive just in time. Suppose a student named Kao Bisheng ("sure to pass the exam") plans to arrive right at the deadline, but ate something bad in the morning, suddenly feels sick, and is stuck in the toilet for 16 minutes. According to the rules he is no longer allowed to enter the examination room: by that point it is assumed that all candidates are already inside, and the exam has been triggered. Kao Bisheng can therefore be understood as a late event. This gives an intuitive sense of windows, event time and watermarks; now let's explain what a watermark really is.

Basic concepts

In the previous section we explained Flink's three time semantics in detail, and while doing so we mentioned a term: watermark. So what is a watermark? Let's start with an example. Suppose we want to compute, every five minutes, the top N popular products over the past hour, which is a typical sliding window operation. When should an event-time window start its computation? In other words, how long do we have to wait before we can be sure that we have received all events up to a specific point in time? On the other hand, network delays and other factors produce out-of-order data, and a window operation cannot wait indefinitely. We need a mechanism that tells the window to trigger its computation at a specific time, i.e. to declare that all data with timestamps less than or equal to that time has arrived. This mechanism is the watermark, and it is what allows Flink to deal with out-of-order events.

A watermark is a global progress metric: it marks a point in time at which we are confident that no more delayed events will arrive. In essence, the watermark provides a logical clock that informs the system of the current event time. For example, when an operator receives a watermark W(t), it can safely assume that it will not receive any more events with a timestamp less than or equal to t. Watermarks are essential for event-time windows and for handling out-of-order data: once an operator receives a watermark, it is as if it received a signal that all data for a particular time interval has been collected and the window can trigger its computation.

We have said that events can arrive out of order, but it is hard to know in advance how severe the disorder will be; there will always be some late events trickling in. The watermark is therefore a trade-off between accuracy and latency. If the watermark is set very conservatively, so that no data is left behind, accuracy improves but the processing latency increases. Conversely, if the watermark is set very aggressively, late data is more likely to appear, so the processing latency decreases but accuracy suffers.

Setting watermarks is therefore a matter of finding the right balance: neither too strict nor too loose. In many real applications the system cannot obtain enough information to determine a perfect watermark, so what should be done? Flink provides mechanisms to deal with records that arrive later than the watermark. Depending on the needs of the application, users can discard these late records, write them to a log, or use them to correct earlier results.

Saying that there is no perfect watermark may sound abstract, so let's look at another picture. From it we can intuitively observe the relationship between a real watermark and the ideal, perfect watermark, as shown below:

Flink's time and watermarks

In the figure above, the light gray straight dotted line represents the ideal watermark, the dark gray curved dotted line represents the real watermark, and the black segments represent the skew between the two. In the ideal state this skew is 0, because every event would be processed immediately when it occurs, that is, the real time of the event coincides with the time at which it is processed: an event generated at 12:01 is processed at 12:01, and an event generated at 12:02 is processed at 12:02. In reality, however, late data always appears, for example because of network delays, so the real situation looks like the dark gray curved dotted line: data generated at 12:01 may be processed some time after 12:01, data generated at 12:02 some time after 12:02, and data generated at 12:03 some time after 12:03. This kind of dynamic skew is very common in distributed processing systems.

Watermark diagram

The previous section explained the concept of watermarks in words. In this section we analyze their meaning through a diagram, which should deepen the understanding. See the figure below:

Flink's time and watermarks

In the figure above, a rectangle represents a record, a triangle represents the record's timestamp (the time it actually occurred), and a circle represents a watermark. As can be seen, the data is out of order. When the operator receives a watermark of 2, it can assume that all data with a timestamp less than or equal to 2 has arrived, and the computation can be triggered. Similarly, when it receives a watermark of 5, it can assume that all data with a timestamp less than or equal to 5 has arrived, and the computation can be triggered again.

Watermarks are monotonically increasing and are tied to the timestamps of the records: a watermark with timestamp t asserts that all subsequent records will have timestamps greater than t.

Propagation of watermarks

By now you should have a preliminary understanding of what a watermark is. Next, let's look at how watermarks are propagated inside Flink. The propagation strategy can be summarized in three points:

  • Watermarks are propagated between operators in broadcast form
  • A watermark of Long.MAX_VALUE indicates the end of event time, i.e. no more data will arrive in the future
  • With a single input partition, a task keeps the maximum watermark seen so far; with multiple input partitions, it takes the minimum across partitions

Long.MAX_VALUE is explained by the following code:

/**
 * When a source is closed it emits a watermark with Long.MAX_VALUE.
 * When an operator receives this watermark, it is equivalent to receiving
 * a signal that no more data will arrive in the future.
 */
@PublicEvolving
public final class Watermark extends StreamElement {

 //Indicates the end of event time
 public static final Watermark MAX_WATERMARK = new Watermark(Long.MAX_VALUE);
 //Omitted code
}

An explanation of the other two strategies can be obtained from the following figure:

Flink's time and watermarks

As shown in the figure above, a task maintains a partition watermark for each input partition. When it receives a watermark from a partition, the task first compares it with the watermark currently stored for that partition; if the newly received value is larger, the partition watermark is updated to the larger value (step 2 in the figure). The task then sets its event-time clock to the minimum of all current partition watermarks. In step 2 of the figure, the minimum partition watermark is 3, so the event-time clock is updated to 3 and a watermark of 3 is broadcast to the downstream tasks. The processing logic of steps 3 and 4 is the same.
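To make the min-over-partitions rule concrete, here is a minimal sketch of the bookkeeping described above. It is not Flink's internal implementation; the class and field names are invented purely for illustration.

import java.util.Arrays;

// Hypothetical sketch of per-partition watermark tracking; not Flink's actual code.
public class PartitionWatermarkTracker {
    private final long[] partitionWatermarks; // one watermark per input partition
    private long eventTimeClock = Long.MIN_VALUE;

    public PartitionWatermarkTracker(int numPartitions) {
        partitionWatermarks = new long[numPartitions];
        Arrays.fill(partitionWatermarks, Long.MIN_VALUE);
    }

    /** Returns the new event-time clock if it advanced, otherwise -1. */
    public long onWatermark(int partition, long watermark) {
        // Only keep the new watermark if it is larger than the one stored for this partition
        if (watermark > partitionWatermarks[partition]) {
            partitionWatermarks[partition] = watermark;
        }
        // The event-time clock is the minimum over all partition watermarks
        long minWatermark = Arrays.stream(partitionWatermarks).min().getAsLong();
        if (minWatermark > eventTimeClock) {
            eventTimeClock = minWatermark;
            return eventTimeClock; // this value would be broadcast downstream
        }
        return -1L; // the clock did not advance, so nothing is emitted
    }
}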

At the same time, note that this design has a limitation: it does not distinguish whether the partitions come from different streams. For example, for a union or connect of two or more streams, the event-time clock is still updated according to the minimum of all partition watermarks, which forces all inputs onto the same event-time clock. This one-size-fits-all approach is reasonable for different partitions of the same stream, but for multiple streams, forcing a single synchronized clock can bring considerable overhead: when the watermarks of two streams differ greatly, one of them has to wait for the slowest stream, and the faster stream has to buffer its events in state until the event-time clock reaches the point at which they may be processed.

Generation of watermarks

In general, watermarks should be generated as soon as possible after the data is received, i.e. as close to the data source as possible. Flink provides two ways to generate watermarks. One is to do it in the data source, i.e. assign timestamps and watermarks inside the SourceFunction when the application reads the data stream. The other is to implement a user-defined timestamp assigner, which comes in two flavors: generating watermarks periodically by implementing the AssignerWithPeriodicWatermarks interface, or generating them at specific points by implementing the AssignerWithPunctuatedWatermarks interface. The details are shown in the figure below:

Flink's time and watermarks

Data source approach

This approach implements a user-defined data source and assigns timestamps and watermarks through the internal SourceContext object. First, let's look at the source code of SourceFunction:

public interface SourceFunction<T> extends Function, Serializable {

 void cancel();

 interface SourceContext<T> {

 void collect(T element);
 /**
  * Emits a record and attaches the given timestamp to it
  */
 @PublicEvolving
 void collectWithTimestamp(T element, long timestamp);
 /**
  * Emits the given watermark into the stream
  */
 @PublicEvolving
 void emitWatermark(Watermark mark);
 /**
  * Marks the source as temporarily idle.
  * If a partition produces no data, it holds back the global watermark,
  * because receiving no new records means no new watermarks are emitted,
  * and according to the propagation strategy the whole application would stall.
  * Flink therefore provides a mechanism to temporarily mark a source function as idle;
  * while idle, the watermark propagation mechanism ignores that data stream partition.
  */
 @PublicEvolving
 void markAsTemporarilyIdle();

 Object getCheckpointLock();

 void close();
 }
}

As can be seen from the code above, timestamps and watermarks can be assigned through the methods of the SourceContext object.
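As a hedged illustration of this approach, a custom source could assign timestamps and emit watermarks roughly as shown below. The UserBehavior type is the one used later in this article; the fetchNextEvent() helper and the 1-minute tolerance are assumptions made only for the sketch.

// Minimal sketch of a SourceFunction that assigns timestamps and emits watermarks itself.
// fetchNextEvent() is a hypothetical helper; this is not code from the article.
public class MyTimestampedSource implements SourceFunction<UserBehavior> {

    private volatile boolean running = true;
    // Tolerate 1 minute of out-of-orderness, mirroring the assigner examples below
    private final long maxOutOfOrderness = 60 * 1000;

    @Override
    public void run(SourceContext<UserBehavior> ctx) throws Exception {
        while (running) {
            UserBehavior event = fetchNextEvent(); // hypothetical: read the next record
            // Attach the event-time timestamp to the record
            ctx.collectWithTimestamp(event, event.timestamp);
            // Emit a watermark that lags the event timestamp by the tolerated delay
            ctx.emitWatermark(new Watermark(event.timestamp - maxOutOfOrderness));
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    private UserBehavior fetchNextEvent() {
        // Placeholder: fetch the next record from the external system (e.g. MySQL)
        throw new UnsupportedOperationException("illustrative sketch only");
    }
}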

User-defined function approach

To assign timestamps with a user-defined function, simply call the assignTimestampsAndWatermarks() method and pass in an assigner that implements the AssignerWithPeriodicWatermarks or AssignerWithPunctuatedWatermarks interface, as shown in the following code:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
 env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
 SingleOutputStreamOperator<UserBehavior> userBehavior = env
 .addSource(new MysqlSource())
 .assignTimestampsAndWatermarks(new MyTimestampsAndWatermarks());

  • AssignerWithPeriodicWatermarks

The assigner is a user-defined function implementing AssignerWithPeriodicWatermarks. It extracts the timestamp by overriding the extractTimestamp() method; the extracted timestamp is attached to each record, and the watermark obtained from getCurrentWatermark() is injected into the data stream.

Generating watermarks periodically means emitting a watermark at a fixed time interval to advance event time. The default interval was mentioned above and depends on the selected time semantics: if event time or ingestion time is used, the default watermark interval is 200 ms. Of course, the user can set the interval. To see how, let's look at a piece of code from the ExecutionConfig class:

/**
 *Sets the time interval for generating watermarks
 *Note: the time interval of automatic generation of watermarks cannot be negative
 */
 @PublicEvolving
 public ExecutionConfig setAutoWatermarkInterval(long interval) {
 Preconditions.checkArgument(interval >= 0, "Auto watermark interval must not be negative.");
 this.autoWatermarkInterval = interval;
 return this;
 }

Therefore, if you want to adjust the default interval of 200 milliseconds, you can call the setAutoWatermarkInterval() method, as follows:

//Generate watermarks every 3 seconds
env.getConfig().setAutoWatermarkInterval(3000);

This specifies that a watermark is generated every 3 seconds, that is, a watermark is automatically injected into the stream every 3 seconds. At the code level, Flink calls the getCurrentWatermark() method of AssignerWithPeriodicWatermarks every 3 seconds; each time, if the returned value is not null and is greater than the timestamp of the previous watermark, a new watermark is injected into the stream. This check ensures that event time keeps increasing; if the check fails, no watermark is generated. The following is an example of a periodic watermark assigner:
public class MyTimestampsAndWatermarks implements AssignerWithPeriodicWatermarks<UserBehavior> {
//A tolerance of 1 minute, i.e. the maximum out-of-orderness allowed for the data
private long maxOutofOrderness = 60 * 1000;
//Maximum timestamp observed so far
private long currentMaxTs = Long.MIN_VALUE;

@Nullable
@Override
public Watermark getCurrentWatermark() {
//Generate the watermark with a 1-minute tolerance
return new Watermark(currentMaxTs - maxOutofOrderness);
}

@Override
public long extractTimestamp(UserBehavior element, long previousElementTimestamp) {
//Gets the timestamp of the current record
long currentTs = element.timestamp;
//Update the maximum timestamp
currentMaxTs = Math.max(currentMaxTs, currentTs);
//Returns the timestamp of the record
return currentTs;
}
}

By looking at the inheritance relationships of TimestampAssigner (shown in the figure below), we can see that Flink also provides two built-in watermark assigners: AscendingTimestampExtractor and BoundedOutOfOrdernessTimestampExtractor.

![](https://upload-images.jianshu.io/upload_images/22116987-83e4971769f856e5.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)


**AscendingTimestampExtractor** is generally used when the timestamps in the dataset are monotonically increasing and there is no disorder. It uses the current timestamp to generate the watermark, and is used as follows:

SingleOutputStreamOperator<UserBehavior> userBehavior = env
.addSource(new MysqlSource())
.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<UserBehavior>() {
@Override
public long extractAscendingTimestamp(UserBehavior element) {
return element.timestamp*1000;
}
});

**BoundedOutOfOrdernessTimestampExtractor** is used when the dataset contains out-of-order data, i.e. the data has a bounded delay (the time difference between any newly arriving element and the element with the largest timestamp that has already arrived). It takes a parameter indicating the maximum expected delay, as follows:

SingleOutputStreamOperator<UserBehavior> userBehavior = env
.addSource(new MysqlSource())
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<UserBehavior>(Time.seconds(10)) {
@Override
public long extractTimestamp(UserBehavior element) {
return element.timestamp*1000;
}
} );

The code above passes a delay parameter of 10 seconds, which means that if the difference between the event time of the current element and the maximum timestamp of the elements that have already arrived is within 10 seconds, the element is processed normally. If the difference exceeds 10 seconds, the computation it should have participated in has already been completed; Flink calls such records late data and provides different strategies for handling them.

  • AssignerWithPunctuatedWatermarks

This approach triggers the generation and emission of watermarks based on certain events (special elements or markers indicating system progress): a watermark is injected into the stream in response to specific events. Every element in the stream gets a chance to decide whether a watermark should be generated; if the resulting watermark is not null and is larger than the previous one, it is generated and injected into the stream.

Implement the AssignerWithPunctuatedWatermarks interface and override the checkAndGetNextWatermark() method, which is called for every event immediately after the extractTimestamp() method in order to decide whether to generate a new watermark. If the method returns a non-null watermark larger than the previous one, the new watermark is emitted.

The following implements a simple punctuated watermark assigner:

public class MyPunctuatedAssigner implements AssignerWithPunctuatedWatermarks<UserBehavior> {
//A tolerance of 1 minute, i.e. the maximum out-of-orderness allowed for the data
private long maxOutofOrderness = 60 * 1000;
@Nullable
@Override
public Watermark checkAndGetNextWatermark(UserBehavior element, long extractedTimestamp) {
//If the user behavior in the record is a purchase, a watermark is generated
if(element.action.equals("buy")){
return new Watermark(extractedTimestamp - maxOutofOrderness);
}else{
//No watermark
return null;
}
}
@Override
public long extractTimestamp(UserBehavior element, long previousElementTimestamp) {
return element.timestamp;
}
}

Late data

As mentioned above, it is difficult to generate a perfect watermark in reality; it is a trade-off between latency and accuracy. If the watermark advances too aggressively, it may be larger than the timestamps of records that arrive afterwards, which means those records are late. Flink provides several mechanisms for handling late data, as follows:

  • Discard the late data directly
  • Output the late data to a separate stream, i.e. use sideOutputLateData(new OutputTag<>()) to produce a side output (a minimal sketch follows the list)
  • Update and re-emit results based on the late events
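As a brief, hedged sketch of the side-output option: the window size, the allowed lateness, the keying and the placeholder aggregation below are illustrative assumptions, not code from this article. It reuses the userBehavior stream defined earlier.

// Sketch: route records that arrive after the watermark into a side output.
OutputTag<UserBehavior> lateTag = new OutputTag<UserBehavior>("late-user-behavior") {};

SingleOutputStreamOperator<UserBehavior> windowed = userBehavior
    .keyBy(behavior -> behavior.action)
    .timeWindow(Time.minutes(60), Time.minutes(5))
    .allowedLateness(Time.minutes(1))       // keep window state a bit longer for late records
    .sideOutputLateData(lateTag)            // records later than that go to the side output
    .reduce((a, b) -> b);                   // placeholder aggregation

// The late records can then be logged, stored, or used to correct earlier results
DataStream<UserBehavior> lateStream = windowed.getSideOutput(lateTag);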

Due to space limitations, the specific handling of late data is not covered in depth here; it will be explained in detail in a later article.

Summary

Starting from Flink's time semantics, this article introduced the concept, characteristics and usage of the three time semantics in detail, and then described the mechanism Flink uses to deal with out-of-order data: the watermark. It covered the basic concept, the propagation strategy and the ways of generating watermarks, with examples and diagrams to deepen the understanding. Finally, it briefly explained how Flink handles late data.

Hello everyone, I’m a piece of cake, a piece of cake eager to be Cai Bucai in the Internet industry. Soft or hard, praise is soft, white whoring is just!Ghost ~ remember to give me a third company after watching it! This article mainly introducesJenkins If necessary, please refer to If it helps, don’t forgetgive […]