[Mr. Zhao Qiang] Flink’s watermark mechanism (based on Flink 1.11.0)

Time:2020-10-15


How do we deal with out-of-order data when using event time? In stream processing, it takes time for an event to travel from where it is generated, through the source, to an operator. In most cases the data reaches the operator in event-time order, but network delays and other factors can cause it to arrive out of order, especially when it is read from multiple partitions, as with Kafka. When computing a window, we cannot wait indefinitely for late data; there must be a mechanism that guarantees the window is triggered after a specific point in time. That mechanism is the watermark. A watermark is a mechanism for handling out-of-order events and for measuring the progress of event time.

1、 The core principle of watermark

At its core, a watermark can be understood as a delayed trigger mechanism.
In Flink's window processing, once all the data for a window is known to have arrived, the window computation (such as aggregation or grouping) can run over that data. If some of the data has not arrived yet, the computation has to wait. Watermarks measure the progress of data arrival (that is, how complete the data is), so that Flink can still compute correct, continuous results when events arrive out of order or late. When an event enters Flink, a watermark timestamp is generated from the current maximum event time.

How does Flink calculate the value of watermark?

Watermark = maximum event time that has entered Flink (maxEventTime) – specified delay time (T)

How does a window with watermark trigger the window function?
If the end time of a window is less than or equal to maxEventTime – T (that is, the current watermark), the window is triggered and its window function is executed.
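
The two rules above can be sketched in plain Java, without any Flink dependency. This is only an illustration: the class and method names (WatermarkFormula, watermark, shouldFire) are hypothetical, and the millisecond values in main are assumed sample numbers, not taken from the article.

```java
// Illustrative sketch only: plain Java, not Flink API.
final class WatermarkFormula {

    // Watermark = maximum event time seen so far - allowed delay T.
    static long watermark(long maxEventTime, long delay) {
        return maxEventTime - delay;
    }

    // A window fires once its end time is less than or equal to the watermark.
    static boolean shouldFire(long windowEnd, long watermark) {
        return windowEnd <= watermark;
    }

    public static void main(String[] args) {
        long delay = 3_000L;      // T = 3 s (assumed value)
        long windowEnd = 10_000L; // end of the window under test (assumed value)

        long[] maxEventTimes = {9_000L, 12_000L, 13_000L};
        for (long maxEventTime : maxEventTimes) {
            long wm = watermark(maxEventTime, delay);
            System.out.println("maxEventTime=" + maxEventTime
                    + " watermark=" + wm
                    + " fire=" + shouldFire(windowEnd, wm));
        }
    }
}
```

With T = 3 s, the window ending at 10 s does not fire when the maximum event time is 12 s (watermark 9 s); it fires once an event with time 13 s or later has been seen.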

The core processing flow is shown in the figure below.

[Figure: core processing flow of the watermark mechanism]

2、 Three uses of watermark

1. Watermark in an orderly stream

If the event times of the data elements are ordered, watermark timestamps are generated in the same order as the event times. In this case the watermark advances in step with the event time: because the events are ordered, no delay is needed, so T is 0 and watermark = maxEventTime – 0 = maxEventTime. This is the ideal case. When the watermark reaches or passes a window's end time, the computation for that window is triggered, then the same happens for the next window, and so on. This situation is really just a special case of out-of-order data.

2. Watermark in out of order events

In practice, data elements often do not reach Flink in the order in which they were generated; they arrive out of order or late, and this is where watermarks are needed. For example, in the figure below, the delay time T is set to 2.
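
As a rough illustration of the out-of-order case with T = 2, the following plain-Java sketch simulates windows being filled and triggered. It does not use Flink, and everything beyond T = 2 is an assumption made for the example: the window size of 5, the sample event times, the class name OutOfOrderSketch, and the choice to simply drop events whose window has already fired.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative simulation only: plain Java, not Flink API.
final class OutOfOrderSketch {
    static final long WINDOW_SIZE = 5; // assumed window size
    static final long DELAY = 2;       // T = 2, as in the text

    // Feeds events in arrival order and returns a log of what happened.
    static List<String> run(long[] eventTimes) {
        List<String> log = new ArrayList<>();
        // Window start -> events buffered for that window.
        TreeMap<Long, List<Long>> windows = new TreeMap<>();
        long maxEventTime = Long.MIN_VALUE;
        long watermark = Long.MIN_VALUE;

        for (long t : eventTimes) {
            long start = t - (t % WINDOW_SIZE);
            long end = start + WINDOW_SIZE;
            if (end <= watermark) {
                // The window for this event has already fired.
                log.add("late event " + t + " dropped");
                continue;
            }
            windows.computeIfAbsent(start, k -> new ArrayList<>()).add(t);

            // Watermark = maximum event time seen so far - delay.
            maxEventTime = Math.max(maxEventTime, t);
            watermark = maxEventTime - DELAY;

            // Fire every buffered window whose end time <= watermark.
            while (!windows.isEmpty()
                    && windows.firstKey() + WINDOW_SIZE <= watermark) {
                Map.Entry<Long, List<Long>> w = windows.pollFirstEntry();
                log.add("window [" + w.getKey() + ","
                        + (w.getKey() + WINDOW_SIZE) + ") fired with "
                        + w.getValue());
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // Arrival order; event 3 arrives after event 6, event 4 arrives twice.
        long[] arrivals = {1, 4, 2, 6, 3, 7, 4, 9};
        for (String line : run(arrivals)) {
            System.out.println(line);
        }
    }
}
```

In this run, event 3 arrives after event 6 but is still counted, because the watermark at that moment is only 4. The window [0,5) fires when event 7 arrives (watermark 5), and the second event 4, which arrives after that, is too late.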

3. Watermark in parallel data stream

When the parallelism is greater than 1, watermarks are aligned: an operator takes the smallest watermark across all of its input channels as its current watermark.
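
The alignment rule can be sketched as follows. This is a simplified plain-Java model, not Flink's internal implementation; the class name WatermarkAlignmentSketch and the method onWatermark are hypothetical, and the sketch assumes at least one channel and that a channel which has not reported yet holds the operator's watermark back.

```java
import java.util.Arrays;

// Illustrative model only: plain Java, not Flink internals.
final class WatermarkAlignmentSketch {
    private final long[] channelWatermarks;
    private long current = Long.MIN_VALUE;

    WatermarkAlignmentSketch(int channels) {
        channelWatermarks = new long[channels];
        // No channel has reported a watermark yet.
        Arrays.fill(channelWatermarks, Long.MIN_VALUE);
    }

    // Records a watermark from one channel and returns the operator's
    // watermark, which is the minimum over all channels.
    long onWatermark(int channel, long watermark) {
        channelWatermarks[channel] =
                Math.max(channelWatermarks[channel], watermark);
        long min = Arrays.stream(channelWatermarks).min().getAsLong();
        current = Math.max(current, min); // never move backwards
        return current;
    }

    public static void main(String[] args) {
        WatermarkAlignmentSketch op = new WatermarkAlignmentSketch(3);
        op.onWatermark(0, 10L);
        op.onWatermark(1, 7L);
        System.out.println(op.onWatermark(2, 12L)); // min(10, 7, 12)
        System.out.println(op.onWatermark(1, 15L)); // min(10, 15, 12)
    }
}
```

The slowest channel determines how far event time has progressed for the operator, so a window is not triggered while one channel may still deliver older data.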

3、 Core code for setting up watermarks

1. First, set the time semantics correctly. Event time is generally used.

sEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);    

2. Next, specify how watermarks are generated, including the delay time and the field that holds the event time. The assignTimestampsAndWatermarks call in the complete example in section 4 shows how this is done.


Note: this approach works whether or not the data is ordered. Ordered data is just a special case of out-of-order data.

4、 Watermark programming case

Test data: mobile phone call data of base station, as follows:

Each line has the form stationID,from,to,duration,callTime, for example: station1,18688822219,18684812319,10,1595158485855

Requirement: for each base station, output the record with the longest call duration every 5 seconds.

  • StationLog is used to encapsulate the base station data
package watermark;

// Sample record: station1,18688822219,18684812319,10,1595158485855
public class StationLog {
    private String stationID; // base station ID
    private String from;      // calling party
    private String to;        // called party
    private long duration;    // duration of the call
    private long callTime;    // time of the call (used as the event time)
    public StationLog(String stationID, String from, 
                      String to, long duration, 
                      long callTime) {
        this.stationID = stationID;
        this.from = from;
        this.to = to;
        this.duration = duration;
        this.callTime = callTime;
    }
    public String getStationID() {
        return stationID;
    }
    public void setStationID(String stationID) {
        this.stationID = stationID;
    }
    public long getCallTime() {
        return callTime;
    }
    public void setCallTime(long callTime) {
        this.callTime = callTime;
    }
    public String getFrom() {
        return from;
    }
    public void setFrom(String from) {
        this.from = from;
    }

    public String getTo() {
        return to;
    }
    public void setTo(String to) {
        this.to = to;
    }
    public long getDuration() {
        return duration;
    }
    public void setDuration(long duration) {
        this.duration = duration;
    }
}
  • Code implementation: WaterMarkDemo performs the calculation (Note: to make testing easier, the parallelism of the job is set to 1)
package watermark;

import java.time.Duration;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Every 5 seconds, output the call log with the longest call duration in that 5-second window, per base station.
public class WaterMarkDemo {
    public static void main(String[] args) throws Exception {
        //Get the running environment of Flink stream processing
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1);
        // Set the interval at which watermarks are generated periodically. With a high-volume stream, generating a watermark for every event would hurt performance.
        env.getConfig().setAutoWatermarkInterval(100); // emit a watermark every 100 ms
        
        //Get the input stream
        DataStreamSource<String> stream = env.socketTextStream("bigdata111", 1234);
        stream.flatMap(new FlatMapFunction<String, StationLog>() {
            @Override
            public void flatMap(String data, Collector<StationLog> output) throws Exception {
                String[] words = data.split(",");
                // Fields: base station ID, from, to, duration, callTime
                output.collect(new StationLog(words[0], words[1], words[2],
                        Long.parseLong(words[3]), Long.parseLong(words[4])));
            }
        }).filter(new FilterFunction<StationLog>() {
            @Override
            public boolean filter(StationLog value) throws Exception {
                // Keep only records with a positive call duration
                return value.getDuration() > 0;
            }
        }).assignTimestampsAndWatermarks(
            WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(3)) // delay time T = 3 s
                .withTimestampAssigner(new SerializableTimestampAssigner<StationLog>() {
                    @Override
                    public long extractTimestamp(StationLog element, long recordTimestamp) {
                        return element.getCallTime(); // the field that holds the event time
                    }
                })
        ).keyBy(new KeySelector<StationLog, String>() {
            @Override
            public String getKey(StationLog value) throws Exception {
                return value.getStationID(); // group by base station
            }
        }).timeWindow(Time.seconds(5)) // 5-second tumbling window
          .reduce(new MyReduceFunction(), new MyProcessWindows())
          .print();

        env.execute();
    }
}
// Processes the data in the window incrementally: keeps the record with the longest call duration.
class MyReduceFunction implements ReduceFunction<StationLog> {
    @Override
    public StationLog reduce(StationLog value1, StationLog value2) throws Exception {
        //Find the longest call record
        return value1.getDuration() >= value2.getDuration() ? value1 : value2;
    }
}
// Formats the output once the window has been processed.
class MyProcessWindows extends ProcessWindowFunction<StationLog, String, String, TimeWindow> {
    @Override
    public void process(String key, ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
            Iterable<StationLog> elements, Collector<String> out) throws Exception {
        // Because reduce() pre-aggregates, the iterable holds only the reduced result
        StationLog maxLog = elements.iterator().next();

        StringBuilder sb = new StringBuilder();
        sb.append("window range: ").append(context.window().getStart())
          .append("----").append(context.window().getEnd()).append("\n");
        sb.append("base station ID: ").append(maxLog.getStationID()).append("\t")
          .append("call time: ").append(maxLog.getCallTime()).append("\t")
          .append("calling number: ").append(maxLog.getFrom()).append("\t")
          .append("called number: ").append(maxLog.getTo()).append("\t")
          .append("call duration: ").append(maxLog.getDuration()).append("\n");
        out.collect(sb.toString());
    }
}
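
To reason about when the demo produces output, the following plain-Java sketch works out which 5-second window the sample record from the StationLog comment falls into, and the smallest event time that would push the watermark up to that window's end with a 3-second delay. It follows the formula from section 1 rather than Flink's internals. The class name DemoWindowSketch and its methods are hypothetical, and it assumes tumbling windows aligned to multiples of the window size with no offset; Flink's exact trigger point may differ by a millisecond.

```java
// Illustrative arithmetic only: plain Java, not Flink API.
final class DemoWindowSketch {
    static final long WINDOW_SIZE_MS = 5_000L; // Time.seconds(5) in the demo
    static final long DELAY_MS = 3_000L;       // Duration.ofSeconds(3) in the demo

    // Start of the tumbling window that contains the event time
    // (assumes windows aligned to multiples of the window size).
    static long windowStart(long eventTime) {
        return eventTime - (eventTime % WINDOW_SIZE_MS);
    }

    // Exclusive end of that window.
    static long windowEnd(long eventTime) {
        return windowStart(eventTime) + WINDOW_SIZE_MS;
    }

    // Smallest event time for which maxEventTime - T reaches the window end.
    static long firingEventTime(long eventTime) {
        return windowEnd(eventTime) + DELAY_MS;
    }

    public static void main(String[] args) {
        long callTime = 1595158485855L; // callTime of the sample record
        System.out.println("window start: " + windowStart(callTime));
        System.out.println("window end:   " + windowEnd(callTime));
        System.out.println("fires once an event with callTime >= "
                + firingEventTime(callTime) + " arrives");
    }
}
```

When testing over the socket, this means the result for the first window only appears after a record with a sufficiently large callTime has been sent, not as soon as the window's 5 seconds have elapsed.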
