Flink learning — case practice of eventtime and watermarks

Time:2021-3-6

The article is reproduced from:https://blog.csdn.net/xu47043…
Author: bigdata 1024
Will be removed in case of infringement.

There are two ways to generate watermarks:
1: With Periodic Watermarks: watermarks are generated and emitted periodically.
2: With Punctuated Watermarks: watermarks are generated and emitted when certain events occur.

The first method is the more common one, so we use it for the analysis here.
Refer to the usage of AssignerWithPeriodicWatermarks on the official website:
[Image: AssignerWithPeriodicWatermarks example from the official website]
The extractTimestamp method in the code extracts the event time from the data itself.

The getCurrentWatermark method returns the current watermark, computed as currentMaxTimestamp - maxOutOfOrderness, where maxOutOfOrderness is the maximum out-of-order time allowed for the data.

So here we also implement the AssignerWithPeriodicWatermarks interface.
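The watermark rule described above can be modelled in a few lines of plain Java. This is only a sketch without any Flink dependency: the class name BoundedOutOfOrdernessSketch and its methods are hypothetical, and the sample timestamps are borrowed from the test data used later in this article.

```java
/** Minimal model of a periodic watermark generator (plain Java, no Flink classes). */
final class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrderness; // maximum allowed disorder, in milliseconds
    private long currentMaxTimestamp = 0L;

    BoundedOutOfOrdernessSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrderness = maxOutOfOrdernessMillis;
    }

    /** Plays the role of extractTimestamp: remembers the largest event time seen so far. */
    long onEvent(long eventTimestamp) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
        return eventTimestamp;
    }

    /** Plays the role of getCurrentWatermark: largest event time minus the allowed disorder. */
    long currentWatermark() {
        return currentMaxTimestamp - maxOutOfOrderness;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch generator = new BoundedOutOfOrdernessSketch(10_000L);
        // The third record is older than the second, so it does not move the watermark.
        for (long ts : new long[] {1538359882000L, 1538359886000L, 1538359881000L}) {
            generator.onEvent(ts);
            System.out.println("event=" + ts + " watermark=" + generator.currentWatermark());
        }
    }
}
```

The watermark only moves forward when a record with a new maximum event time arrives; an older record leaves it unchanged.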


1. Implementation of the watermark-related code

1.1 Program description
Receive data from a socket to simulate a stream, process it with map, and then call assignTimestampsAndWatermarks to extract timestamps and generate watermarks. Finally, a window operation prints information so that we can verify when the window is triggered.

The complete code is as follows:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import javax.annotation.Nullable;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;


/**
 *
 *Watermark case
 *
 * Created by xuwei.tech.
 */
public class StreamingWindowWatermark1 {

    public static void main(String[] args) throws Exception {
        //Defines the port number of the socket
        int port = 9000;
        //Get the running environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //Use event time; the default is processing time
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        //Set the parallelism to 1. The default parallelism is the number of CPUs in the current machine
        env.setParallelism(1);

        //Connect socket to get the input data
        DataStream<String> text = env.socketTextStream("192.168.59.133", port, "\n");

        //Parsing the input data
        DataStream<Tuple2<String, Long>> inputMap = text.map(new MapFunction<String, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                String[] arr = value.split(",");
                return new Tuple2<>(arr[0], Long.parseLong(arr[1]));
            }
        });

        //Extracting timestamp and generating watermark
        DataStream<Tuple2<String, Long>> waterMarkStream = inputMap.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple2<String, Long>>() {

            Long currentMaxTimestamp = 0L;
            final long maxOutOfOrderness = 10000L; // the maximum allowed out-of-order time is 10s

            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
            /**
             *Define the logic for generating watermarks
             *Called periodically; with event time enabled as above, the default interval is 200 ms
             */
            @Nullable
            @Override
            public Watermark getCurrentWatermark() {
                return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
            }

            //Defines how to extract a timestamp
            @Override
            public long extractTimestamp(Tuple2<String, Long> element, long previousElementTimestamp) {
                long timestamp = element.f1;
                currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                System.out.println("key:"+element.f0+",eventtime:["+element.f1+"|"+sdf.format(element.f1)+"],currentMaxTimestamp:["+currentMaxTimestamp+"|"+
                        sdf.format(currentMaxTimestamp)+"],watermark:["+getCurrentWatermark().getTimestamp()+"|"+sdf.format(getCurrentWatermark().getTimestamp())+"]");
                return timestamp;
            }
        });

        //Group, aggregate
        DataStream<String> window = waterMarkStream.keyBy(0)
                .window(TumblingEventTimeWindows.of(Time.seconds(3))) // windows are assigned by the event time of each record, equivalent to calling timeWindow
                .apply(new WindowFunction<Tuple2<String, Long>, String, Tuple, TimeWindow>() {
                    /**
                     *Sort the data in the window to ensure the order of the data
                     * @param tuple
                     * @param window
                     * @param input
                     * @param out
                     * @throws Exception
                     */
                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<String> out) throws Exception {
                        String key = tuple.toString();
                        List<Long> arrarList = new ArrayList<Long>();
                        Iterator<Tuple2<String, Long>> it = input.iterator();
                        while (it.hasNext()) {
                            Tuple2<String, Long> next = it.next();
                            arrarList.add(next.f1);
                        }
                        Collections.sort(arrarList);
                        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
                        String result = key + "," + arrarList.size() + "," + sdf.format(arrarList.get(0)) + "," + sdf.format(arrarList.get(arrarList.size() - 1))
                                + "," + sdf.format(window.getStart()) + "," + sdf.format(window.getEnd());
                        out.collect(result);
                    }
                });
        //Test - print the results to the console
        window.print();

        //Note: Flink programs are evaluated lazily, so execute must be called before the code above actually runs
        env.execute("eventtime-watermark");

    }



}

1.2 Detailed explanation of the program

  1. Receive socket data.
  2. Each line is split on commas and converted by map to a Tuple2<String, Long>. The first element of the tuple is the key (the specific data), and the second element is the event time of the record.
  3. Extract the timestamp and generate the watermark with a maximum allowed out-of-order time of 10s, and print the key, event time, currentMaxTimestamp and watermark.
  4. Group and aggregate with a window size of 3 seconds, and output (key, number of elements in the window, time of the earliest element in the window, time of the latest element in the window, window start time, window end time).
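To make steps 2 and 3 concrete, here is a small stand-alone sketch of how one input line is split and how its epoch-millisecond timestamp maps to a wall-clock time. The class name InputLineSketch is hypothetical. The fixed UTC+8 offset is an assumption, chosen because it reproduces the times quoted in this article (for example 10:11:33 for 1538359893000); the program above uses SimpleDateFormat with the machine's default time zone instead.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Parses one "key,epochMillis" input line and formats its event time (no Flink needed). */
final class InputLineSketch {
    // Assumption: UTC+8, because it matches the wall-clock times quoted in the article.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS").withZone(ZoneOffset.ofHours(8));

    /** First comma-separated field: the key. */
    static String key(String line) {
        return line.split(",")[0];
    }

    /** Second comma-separated field: the event time in epoch milliseconds. */
    static long eventTime(String line) {
        return Long.parseLong(line.split(",")[1].trim());
    }

    /** Renders epoch milliseconds in the same pattern the program prints. */
    static String format(long epochMillis) {
        return FORMAT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        String line = "0001,1538359882000";
        System.out.println(key(line) + " -> " + format(eventTime(line)));
    }
}
```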
2. Tracking the watermark through data

Here we focus on the watermark and the timestamps, and use the program's output to determine when the window is triggered.
First, we open the socket and enter the first record:

0001,1538359882000

The output is as follows:
[Image: console output]
For easier reading, we summarize the values in a table:
[Image: summary table]
At this point, the watermark is 10 seconds behind currentMaxTimestamp. Let's continue and enter:

0001,1538359886000

The output is as follows:
[Image: console output]
We summarize again as follows:
[Image: summary table]
Continue to enter:

0001,1538359892000

The output is as follows:
[Image: console output]
The summary is as follows:
[Image: summary table]
At this point, the window has still not been triggered, and the watermark is equal to the event time of the first record. So when is the window triggered?
We enter another record:

0001,1538359893000

The output is as follows:
[Image: console output]

The summary is as follows:
[Image: summary table]

The window has still not been triggered. Our data has now reached 10:11:33.000 on October 1, 2018. In event time, 11 seconds have passed since the earliest record, yet the window has not started to compute. When will the window be triggered?

Let's add another second and enter:

0001,1538359894000

Output:
[Image: console output]
The summary is as follows:
[Image: summary table]
Here, an explanation is in order:
Windows are divided according to natural time. With a window size of 3 seconds, one minute is divided into the following windows (left-closed, right-open):

[00:00:00,00:00:03)
[00:00:03,00:00:06)
[00:00:06,00:00:09)
[00:00:09,00:00:12)
[00:00:12,00:00:15)
[00:00:15,00:00:18)
[00:00:18,00:00:21)
[00:00:21,00:00:24)
[00:00:24,00:00:27)
[00:00:27,00:00:30)
[00:00:30,00:00:33)
[00:00:33,00:00:36)
[00:00:36,00:00:39)
[00:00:39,00:00:42)
[00:00:42,00:00:45)
[00:00:45,00:00:48)
[00:00:48,00:00:51)
[00:00:51,00:00:54)
[00:00:54,00:00:57)
[00:00:57,00:01:00)

The division into windows has nothing to do with the data itself; it is defined by the system.
Each input record is assigned to a window according to its own event time. If a window contains data, watermark >= event time is a precondition for triggering, but what finally decides whether the window fires is the window_end_time of the window to which the record's event time belongs.
In the test above, after the last record arrived, the watermark rose to 10:11:24, which is exactly the window_end_time of the window containing the earliest record, so that window was triggered.
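The alignment of windows to natural time can be checked with simple modular arithmetic. The sketch below assumes a window offset of zero and non-negative timestamps; the class name TumblingWindowSketch is hypothetical and the code does not use Flink.

```java
/** Computes the tumbling window that a timestamp falls into (zero offset assumed). */
final class TumblingWindowSketch {
    /** Inclusive start of the window containing the timestamp. */
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    /** Exclusive end of the window containing the timestamp. */
    static long windowEnd(long timestampMillis, long windowSizeMillis) {
        return windowStart(timestampMillis, windowSizeMillis) + windowSizeMillis;
    }

    public static void main(String[] args) {
        long first = 1538359882000L; // the first record entered in this article
        System.out.println("[" + windowStart(first, 3_000L) + ", " + windowEnd(first, 3_000L) + ")");
    }
}
```

For the first record (event time 10:11:22) this gives a window ending at 1538359884000, that is 10:11:24, which is the moment the watermark had to reach before the window fired.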

In order to verify the trigger mechanism of window, we continue to input data:

0001,1538359896000

Output:
[Image: console output]

The summary is as follows:
[Image: summary table]
At this point, although the watermark has reached the event time of the second record, the window has not been triggered because the watermark has not yet reached the end time of the window containing the second record. That window is:

[10:11:24,10:11:27)

That is to say, the watermark must reach 10:11:27 before the window of the second record is triggered, which requires a record with an event time of 10:11:37 or later. We continue to enter:

[Image: socket input]

Output:
[Image: console output]

[Image: summary table]

At this point, we have seen that the trigger of window should meet the following conditions:

**1. Watermark time >= window_end_time
2. There is data in [window_start_time, window_end_time). Note that the interval is left-closed and right-open**

If the above two conditions are met at the same time, the window will trigger.
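The two conditions translate directly into a predicate. The following is a sketch of the rule as stated in this article, with a hypothetical class and method name. Note that Flink's own event-time trigger compares the watermark with the last timestamp inside the window (window end minus 1 ms); for the whole-second values used in these tests both forms give the same answer.

```java
/** The trigger rule from the text: watermark reached the window end and the window has data. */
final class WindowTriggerSketch {
    static boolean shouldFire(long watermark, long windowEnd, int elementsInWindow) {
        return elementsInWindow > 0 && watermark >= windowEnd;
    }

    public static void main(String[] args) {
        long windowEnd = 1538359884000L; // 10:11:24, end of the first record's window
        System.out.println(shouldFire(1538359883000L, windowEnd, 1)); // watermark 10:11:23
        System.out.println(shouldFire(1538359884000L, windowEnd, 1)); // watermark 10:11:24
    }
}
```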

3. Processing out-of-order data with watermark + window
In the tests above, the data arrived in increasing event-time order. Now let's input some out-of-order data to see how the watermark combined with the window mechanism handles it.
Enter two lines of data:

0001,1538359899000
0001,1538359891000

Output:
[Image: console output]

The summary is as follows:
[Image: summary table]

As you can see, although we input a record of 10:11:31, currentMaxTimestamp and the watermark remain unchanged. At this point, according to the conditions above:
1. Watermark time >= window_end_time
2. There is data in [window_start_time, window_end_time)

Watermark time (10:11:29) < window_end_time (10:11:33), so the window cannot be triggered.

If we now input a record of 10:11:43, the watermark will rise to 10:11:33 and the window will be triggered. Let's try:
Input:

0001,1538359903000

Output:

[Image: console output]
The summary is as follows:

[Image: summary table]

Here, we can see that the window contains two records, 10:11:31 and 10:11:32, but not the record of 10:11:33. The reason is that the window is a left-closed, right-open interval, so the record of 10:11:33 belongs to the window [10:11:33,10:11:36).
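The whole experiment so far can be replayed with a toy model in plain Java. This is a simplified sketch, not Flink's implementation: the class WatermarkWindowReplay is hypothetical, the watermark is recomputed after every record instead of periodically, and only the input values actually shown in this article are replayed (the one input that is not shown is left out, so the second window fires one record later than in the walkthrough).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Replays arrivals through a toy model of 3s tumbling windows with a 10s watermark delay. */
final class WatermarkWindowReplay {
    static final long WINDOW_SIZE = 3_000L;
    static final long MAX_OUT_OF_ORDERNESS = 10_000L;

    /** Returns one "start-end:count" entry per fired window, in firing order. */
    static List<String> replay(long[] arrivals) {
        TreeMap<Long, List<Long>> buffered = new TreeMap<>(); // window start -> timestamps
        List<String> fired = new ArrayList<>();
        long maxTimestamp = 0L;
        long watermark = maxTimestamp - MAX_OUT_OF_ORDERNESS;
        for (long ts : arrivals) {
            long start = ts - (ts % WINDOW_SIZE);
            if (start + WINDOW_SIZE <= watermark) {
                continue; // the window has already fired, so the record is dropped
            }
            buffered.computeIfAbsent(start, k -> new ArrayList<>()).add(ts);
            maxTimestamp = Math.max(maxTimestamp, ts);
            watermark = maxTimestamp - MAX_OUT_OF_ORDERNESS;
            // Fire every buffered window whose end the watermark has reached, oldest first.
            while (!buffered.isEmpty() && buffered.firstKey() + WINDOW_SIZE <= watermark) {
                Map.Entry<Long, List<Long>> window = buffered.pollFirstEntry();
                fired.add(window.getKey() + "-" + (window.getKey() + WINDOW_SIZE)
                        + ":" + window.getValue().size());
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        long[] arrivals = {
            1538359882000L, 1538359886000L, 1538359892000L, 1538359893000L, 1538359894000L,
            1538359896000L, 1538359899000L, 1538359891000L, 1538359903000L
        };
        replay(arrivals).forEach(System.out::println);
    }
}
```

The last window fired in this replay is the one starting at 10:11:30, and it holds two records (10:11:31 and 10:11:32), in line with the result described above.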

How should the maximum out-of-order time be set in Flink?
It should be set according to your own business and data. If maxOutOfOrderness is set too small while the data is heavily out of order or late because of the network or other reasons, many windows will end up firing with only a single record, and the correctness of the results will be seriously affected.
For seriously out-of-order data, we need to measure the maximum delay of the data carefully to ensure the accuracy of the results. If the delay is set too small, accuracy suffers. If it is set too large, it not only hurts the timeliness of the results but also increases the load on the Flink job. For data without strict event-time requirements, try not to process it in event-time mode, since doing so carries a risk of losing data.
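One way to get a first estimate of the delay is to measure, on a sample of records in arrival order, how far each record lags behind the largest event time seen before it. The helper below is a hypothetical sketch of that measurement (the class DisorderStats is not part of Flink); a real job would need a representative sample and probably some safety margin on top of the measured value.

```java
/** Measures how far records lag behind the largest event time seen before them. */
final class DisorderStats {
    /** Returns the largest observed lag in milliseconds, or 0 if the input is in order. */
    static long maxObservedLag(long[] arrivalsInOrderOfArrival) {
        long maxTimestamp = Long.MIN_VALUE;
        long maxLag = 0L;
        for (long ts : arrivalsInOrderOfArrival) {
            if (maxTimestamp != Long.MIN_VALUE) {
                maxLag = Math.max(maxLag, maxTimestamp - ts); // negative when ts is a new maximum
            }
            maxTimestamp = Math.max(maxTimestamp, ts);
        }
        return maxLag;
    }

    public static void main(String[] args) {
        // The out-of-order inputs from section 3: 10:11:39 arrives before 10:11:31.
        long[] sample = {1538359899000L, 1538359891000L, 1538359903000L};
        System.out.println("max lag (ms): " + maxObservedLag(sample));
    }
}
```

For the out-of-order inputs of section 3 the measured lag is 8 seconds, which is within the 10 seconds allowed by maxOutOfOrderness, so the late record still lands in its window.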

The results above show that Flink can handle out-of-order data within a certain range through the watermark mechanism combined with window operations. So how does Flink deal with elements that arrive too late?

4. Handling late elements
There are three ways to handle late data.

4.1 Discard (default)
Input:

0001,1538359890000
0001,1538359903000

Output:
[Image: console output]
The summary is as follows:
[Image: summary table]
Note: the watermark is now 10:11:33.000 on October 1, 2018.
Next, let's enter a few records whose event time is less than the watermark.
Input (three lines):

0001,1538359890000
0001,1538359891000
0001,1538359892000

Output:
[Image: console output]
Note: the window is not triggered this time. Because the window to which these records belong has already been executed, Flink's default handling of such late data is to discard it.

4.2 allowedLateness: specify how late data is allowed to be

In some cases, we want to be more tolerant of late data.
Flink provides the allowedLateness method, which sets a grace period for late data; data arriving within that period can still trigger the window.
Modified code:

[Image: modified code, which adds allowedLateness(Time.seconds(2)) to the window definition]
Now let's verify.

Input (two lines):

0001,1538359890000
0001,1538359903000

Output:
[Image: console output]
The window is triggered normally, as expected.
Summary:
[Image: summary table]
At this point, the watermark is 2018-10-01 10:11:33.000.
Now let's input some records with event time < watermark to verify the effect.

Input (three lines):

0001,1538359890000
0001,1538359891000
0001,1538359892000

Output:
[Image: console output]

Here you can see that each record triggers the window execution.

Summary:
[Image: summary table]
Let's enter another record to raise the watermark to 10:11:34.

Input:
[Image: socket input]
Output:
[Image: console output]

The summary is as follows:
[Image: summary table]
Now that the watermark has risen to 10:11:34, we input several records with event time < watermark to verify the effect.
Input:

0001,1538359890000
0001,1538359891000
0001,1538359892000

Output:

[Image: console output]

All three input lines trigger the execution of the window.

Let's enter another record to raise the watermark to 10:11:35.

Input:

0001,1538359905000

Output:
[Image: console output]
At this point, the watermark has risen to 10:11:35.
Let's input some records with event time < watermark to verify the effect.

Input:

0001,1538359890000
0001,1538359891000
0001,1538359892000

Output:
[Image: console output]

None of these records triggers the window.

Analysis:
When the watermark equals 10:11:33, it has reached window_end_time, so the window [10:11:30, 10:11:33) is triggered.
After the window has executed, we input data belonging to the [10:11:30, 10:11:33) window and find that the window is triggered again.
When the watermark rises to 10:11:34, we input data belonging to the [10:11:30, 10:11:33) window and find that the window is still triggered.
When the watermark rises to 10:11:35, we input data belonging to the [10:11:30, 10:11:33) window and find that the window is no longer triggered.
This is because we set allowedLateness(Time.seconds(2)) earlier, which allows data that is up to 2 seconds late to keep triggering the window.
So the window can be triggered when the watermark is 10:11:34, but not when it is 10:11:35.

Conclusion:
This window accepts data that is up to 2 seconds late. The first trigger happens when watermark >= window_end_time.
The second (and any later) trigger happens when watermark < window_end_time + allowedLateness and late data arrives for this window.

Explanation:
When the watermark is 10:11:34 and we input records with event times 10:11:30, 10:11:31 and 10:11:32, the window is triggered, because the window_end_time of these records is 10:11:33 and 10:11:34 < 10:11:33 + 2 is true.
But when the watermark is 10:11:35 and we input records with event times 10:11:30, 10:11:31 and 10:11:32, the window_end_time of these records is still 10:11:33, and 10:11:35 < 10:11:33 + 2 is false. So these records are too late to trigger the window.
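The rule from the conclusion, watermark < window_end_time + allowedLateness, can be written as a small predicate. This is a plain-Java sketch of the rule as described in this article, with a hypothetical class and method name and a window offset of zero assumed; it is not Flink's internal code.

```java
/** Decides whether a record can still reach its window under the rule from the text. */
final class AllowedLatenessSketch {
    static boolean stillAccepted(long eventTime, long watermark,
                                 long windowSizeMillis, long allowedLatenessMillis) {
        long windowEnd = eventTime - (eventTime % windowSizeMillis) + windowSizeMillis;
        return watermark < windowEnd + allowedLatenessMillis;
    }

    public static void main(String[] args) {
        long late = 1538359890000L; // event time 10:11:30, window end 10:11:33
        for (long watermark : new long[] {1538359893000L, 1538359894000L, 1538359895000L}) {
            System.out.println("watermark=" + watermark + " accepted="
                    + stillAccepted(late, watermark, 3_000L, 2_000L));
        }
    }
}
```

With allowedLatenessMillis set to 0 the same predicate describes the default behaviour from section 4.1: once the watermark has reached 10:11:33, records for the window ending at 10:11:33 are no longer accepted.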

4.3 sideOutputLateData: collect late data

With sideOutputLateData, late data can be collected and stored in one place, which makes later troubleshooting easier.
The code needs to be adjusted first:

[Image: modified code using sideOutputLateData]

Let's input some data to verify.
Input:

0001,1538359890000
0001,1538359903000

Output:
[Image: console output]

At this point, the window is triggered, and the watermark is 10:11:33.
Next, let's enter some records whose event time is less than the watermark.
Input:

0001,1538359890000
0001,1538359891000
0001,1538359892000

Output:

[Image: console output]

These late records are emitted to the side output identified by the OutputTag passed to sideOutputLateData.
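The idea of collecting late records instead of dropping them can be illustrated with a toy router in plain Java. This only models the behaviour observed in the test; it is not the Flink API. The class LateDataRouter and its two lists are hypothetical, and the allowed lateness of 0 used in the example is an assumption, since the modified code is not shown in this article.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: records that are too late go to a separate list instead of being dropped. */
final class LateDataRouter {
    private final long windowSize;
    private final long allowedLateness;
    final List<Long> mainOutput = new ArrayList<>(); // records that still reach their window
    final List<Long> lateOutput = new ArrayList<>(); // stands in for the late-data side output

    LateDataRouter(long windowSizeMillis, long allowedLatenessMillis) {
        this.windowSize = windowSizeMillis;
        this.allowedLateness = allowedLatenessMillis;
    }

    /** Routes one record, given the watermark in effect when it arrives. */
    void route(long eventTime, long watermark) {
        long windowEnd = eventTime - (eventTime % windowSize) + windowSize;
        if (watermark < windowEnd + allowedLateness) {
            mainOutput.add(eventTime);
        } else {
            lateOutput.add(eventTime);
        }
    }

    public static void main(String[] args) {
        LateDataRouter router = new LateDataRouter(3_000L, 0L); // assumption: no allowed lateness
        long watermark = 1538359893000L; // 10:11:33, as in the test above
        for (long ts : new long[] {1538359890000L, 1538359891000L, 1538359892000L}) {
            router.route(ts, watermark);
        }
        System.out.println("late records: " + router.lateOutput);
    }
}
```

Keeping the late records in a separate collection makes it possible to count them or inspect them afterwards, which is the troubleshooting benefit mentioned above.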