Big Data Development: A Complete Analysis of Flink Windows

Time:2022-1-5

Flink window background

Flink treats batch processing as a special case of stream processing, so its underlying engine is a streaming engine on which both stream processing and batch processing are implemented. Windows are the bridge from streaming to batch. Generally speaking, a window is a mechanism that cuts an infinite stream into finite sets so that computations can run on bounded data. A window delimits a set on the stream, such as “the count over the last 10 minutes” or “the sum of the last 50 elements”. Windows can be time driven (time windows, e.g. every 30 seconds) or data driven (count windows, e.g. every 100 elements). The DataStream API provides windows for both time and count.

The general skeleton structure of a Flink window application is as follows:

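// Keyed Window
stream
  .keyBy(...)                      // partition the stream by key
  .window(...)                     // required: window assigner
  .reduce/aggregate/process(...)   // required: window function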

The skeleton of a Flink window application has two required operations:

  • Use a window assigner to assign the elements of the data stream to the corresponding windows.
  • When the window's trigger condition is met, process the data in the window with a window function. The commonly used window functions are reduce, aggregate, and process.

Tumbling window

Time driven

Data is split according to a fixed window length. Tumbling windows have a fixed length and do not overlap. We can use TumblingEventTimeWindows or TumblingProcessingTimeWindows to create a tumbling time window based on event time or processing time. The window length is set with the seconds, minutes, hours, and days methods of org.apache.flink.streaming.api.windowing.time.Time.

// Keyed stream. Assumption: mapStream is a DataStream<Tuple2<String, Integer>>.
// Note: keyBy(int) and timeWindow(...) are deprecated in newer Flink releases in favor of
// a KeySelector and window(...) with an explicit assigner such as TumblingEventTimeWindows.
KeyedStream<Tuple2<String, Integer>, Tuple> keyedStream = mapStream.keyBy(0);
// Time driven: a new window every 10 s
WindowedStream<Tuple2<String, Integer>, Tuple, TimeWindow> timeWindow =
    keyedStream.timeWindow(Time.seconds(10));
// Event (count) driven: a window is evaluated for every three elements with the same key
// WindowedStream<Tuple2<String, Integer>, Tuple, GlobalWindow> countWindow =
//     keyedStream.countWindow(3);
// apply() registers the window function, which is run on the data of each window.
timeWindow.apply(new MyTimeWindowFunction()).print();
// countWindow.apply(new MyCountWindowFunction()).print();
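
To make the "fixed length, no overlap" property concrete, the following is a minimal sketch in plain Java (no Flink dependency) of how a timestamp can be mapped to the tumbling window that contains it. The class and method names are hypothetical and only illustrate the arithmetic; this is not Flink's own implementation, and window offsets are ignored.

```java
// Sketch only: plain Java, no Flink dependency. All names are hypothetical.
final class TumblingWindowSketch {

    /** Start (inclusive) of the tumbling window of length sizeMs that contains timestampMs. */
    static long windowStart(long timestampMs, long sizeMs) {
        // floorMod keeps the result correct for negative timestamps as well.
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** End (exclusive) of that window. */
    static long windowEnd(long timestampMs, long sizeMs) {
        return windowStart(timestampMs, sizeMs) + sizeMs;
    }

    public static void main(String[] args) {
        long size = 10_000L; // 10 s, the same length as Time.seconds(10)
        for (long ts : new long[] {0L, 9_999L, 10_000L, 12_345L}) {
            System.out.println(ts + " -> [" + windowStart(ts, size) + ", " + windowEnd(ts, size) + ")");
        }
    }
}
```

Because every timestamp maps to exactly one start value, each element belongs to exactly one window, which is why tumbling windows never overlap.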

Event driven

When we want to compute over the purchase behavior of every 100 users, the window is evaluated each time 100 elements with the same key have been collected in it. The following is an example implementation (shown here for the count window of three elements defined above):

// Assumption: the window elements are Tuple2<String, Integer> (key, value).
public class MyCountWindowFunction implements WindowFunction<Tuple2<String, Integer>,
  String, Tuple, GlobalWindow> {
    @Override
    public void apply(Tuple tuple, GlobalWindow window, Iterable<Tuple2<String, Integer>>
      input, Collector<String> out) throws Exception {
      SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
      // Sum the value field of all elements in this window.
      int sum = 0;
      for (Tuple2<String, Integer> tuple2 : input) {
        sum += tuple2.f1;
      }

      // This timestamp carries no information: a GlobalWindow always reports Long.MAX_VALUE,
      // because count-based windows do not depend on time.
      long maxTimestamp = window.maxTimestamp();
      out.collect("key:" + tuple.getField(0) + " value: " + sum + "| maxTimeStamp :" + maxTimestamp + "," + format.format(maxTimestamp));
    }
}
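
The behavior of a count-driven tumbling window can also be sketched without Flink. The plain-Java class below is a hypothetical stand-in: it buffers values per key, emits the sum once a key has collected `size` elements, and then clears that key's buffer. It only mirrors the semantics described above and is not how Flink stores window state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: plain Java, no Flink dependency. All names are hypothetical.
final class CountWindowSketch {
    private final int size;
    private final Map<String, List<Integer>> buffers = new HashMap<>();
    private final List<String> emitted = new ArrayList<>();

    CountWindowSketch(int size) {
        this.size = size;
    }

    /** Adds one element; when the key's buffer reaches the window size, emit the sum and purge. */
    void add(String key, int value) {
        List<Integer> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buffer.add(value);
        if (buffer.size() == size) {
            int sum = 0;
            for (int v : buffer) {
                sum += v;
            }
            emitted.add("key:" + key + " value: " + sum);
            buffer.clear(); // the window is purged after it fires
        }
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        CountWindowSketch window = new CountWindowSketch(3);
        window.add("a", 1);
        window.add("b", 10);
        window.add("a", 2);
        window.add("a", 3); // third element for key "a": its window fires
        System.out.println(window.emitted()); // [key:a value: 6]
    }
}
```

Key "b" has only one element at the end of `main`, so nothing is emitted for it: each key counts independently.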

Sliding window

A sliding window is a more general form of the fixed (tumbling) window. It is defined by a fixed window length (size) and a sliding interval (slide). Its characteristics are that the window length is fixed and windows may overlap: the window moves forward continuously by one slide at a time. When using it, we need to set both slide and size. The slide determines how often Flink creates a new window, so a small slide produces a large number of windows. When the slide is smaller than the size, adjacent windows overlap and an event is assigned to multiple windows. When the slide is larger than the size, some events fall between windows and are discarded.
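
The relationship between size and slide can be checked with a small plain-Java sketch. It computes every window that contains a given timestamp by walking backwards from the latest window start in steps of one slide. The names are hypothetical, window offsets are ignored, and this is an illustration of the assignment rule rather than Flink's own code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: plain Java, no Flink dependency. All names are hypothetical.
final class SlidingWindowSketch {

    /** Returns the {start, end} bounds (end exclusive) of every sliding window containing timestampMs. */
    static List<long[]> assignWindows(long timestampMs, long sizeMs, long slideMs) {
        List<long[]> windows = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk backwards while the window [start, start + size) still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            windows.add(new long[] {start, start + sizeMs});
        }
        return windows;
    }

    public static void main(String[] args) {
        // slide < size: the event lands in two overlapping windows.
        System.out.println(assignWindows(12_000L, 10_000L, 5_000L).size()); // 2
        // slide > size: this event falls in the hole between two windows.
        System.out.println(assignWindows(17_000L, 5_000L, 10_000L).size()); // 0
    }
}
```

With a size of 10 s and a slide of 5 s, the event at 12 s belongs to [10 s, 20 s) and [5 s, 15 s). With a size of 5 s and a slide of 10 s, the event at 17 s belongs to no window, which is the "discarded" case mentioned above.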

Time-based sliding window

// Time driven: every 5 s, compute over the data of the last 10 s.
// Assumptions: keybyed is a KeyedStream with a String key (as in the event-based example),
// and the elements are Tuple3<String, String, Integer>.
WindowedStream<Tuple3<String, String, Integer>, String, TimeWindow> timeWindow =
    keybyed.timeWindow(Time.seconds(10), Time.seconds(5));
SingleOutputStreamOperator<String> applyed = timeWindow.apply(new WindowFunction<Tuple3<String, String, Integer>, String, String, TimeWindow>() {
    @Override
    public void apply(String s, TimeWindow window, Iterable<Tuple3<String, String, Integer>> input, Collector<String> out) throws Exception {
        // Concatenate the fields of every element in this window into one string.
        StringBuilder sb = new StringBuilder();
        for (Tuple3<String, String, Integer> next : input) {
            sb.append(next.f0 + ".." + next.f1 + ".." + next.f2);
        }
        out.collect(sb.toString());
    }
});

Event-based sliding window

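/**
 * Sliding window: windows can overlap
 * 1. Time driven
 * 2. Event driven
 */
// Assumption: the elements are Tuple3<String, String, Integer>, keyed by a String.
// Every 2 elements, compute over (at most) the last 3 elements of the same key.
WindowedStream<Tuple3<String, String, Integer>, String, GlobalWindow> countWindow = keybyed.countWindow(3, 2);

SingleOutputStreamOperator<String> applyed = countWindow.apply(new WindowFunction<Tuple3<String, String, Integer>, String, String, GlobalWindow>() {
    @Override
    public void apply(String s, GlobalWindow window, Iterable<Tuple3<String, String, Integer>> input, Collector<String> out) throws Exception {
        // Concatenate the fields of every element in this window into one string.
        Iterator<Tuple3<String, String, Integer>> iterator = input.iterator();
        StringBuilder sb = new StringBuilder();
        while (iterator.hasNext()) {
            Tuple3<String, String, Integer> next = iterator.next();
            sb.append(next.f0 + ".." + next.f1 + ".." + next.f2);
        }
        out.collect(sb.toString());
    }
});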
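
What countWindow(3, 2) produces is easier to see on concrete data. The plain-Java sketch below models a single key under the assumption that the window fires after every `slide` elements and, before firing, keeps only the most recent `size` elements. The class is hypothetical and only mirrors that assumed behavior; it does not use Flink's trigger or evictor classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch only: plain Java, single key, no Flink dependency. All names are hypothetical.
final class SlidingCountWindowSketch {
    private final int size;
    private final int slide;
    private final Deque<Integer> buffer = new ArrayDeque<>();
    private final List<List<Integer>> firings = new ArrayList<>();
    private int sinceLastFire = 0;

    SlidingCountWindowSketch(int size, int slide) {
        this.size = size;
        this.slide = slide;
    }

    /** Adds one element; after every `slide` elements, fire with the last `size` elements. */
    void add(int value) {
        buffer.addLast(value);
        sinceLastFire++;
        if (sinceLastFire == slide) {
            sinceLastFire = 0;
            while (buffer.size() > size) {
                buffer.removeFirst(); // evict the oldest elements beyond the window size
            }
            firings.add(new ArrayList<>(buffer));
        }
    }

    List<List<Integer>> firings() {
        return firings;
    }

    public static void main(String[] args) {
        SlidingCountWindowSketch window = new SlidingCountWindowSketch(3, 2);
        for (int i = 1; i <= 6; i++) {
            window.add(i);
        }
        System.out.println(window.firings()); // [[1, 2], [2, 3, 4], [4, 5, 6]]
    }
}
```

Note that the first firing contains only two elements, because the window fires after the second element before three have arrived, and that consecutive firings share one element (2 and 4 here), which is the overlap.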

Session window

A session window groups a series of events and is closed by a timeout gap of a specified length, similar to a session in a web application: if no new data arrives for that period, the current window ends and the next element starts a new window. In this mode the window length is variable, and the start and end times of each window are not known in advance. We can set a fixed session gap or use a SessionWindowTimeGapExtractor to determine the session gap dynamically.

val input: DataStream[T] = ...
// event-time session windows with static gap
input
    .keyBy(...)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>)
// event-time session windows with dynamic gap
input
    .keyBy(...)
    .window(EventTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor[T] {
      override def extract(element: T): Long = {
        // determine and return session gap
      }
    }))
    .<windowed transformation>(<window function>)
// processing-time session windows with static gap
input
    .keyBy(...)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>)
// processing-time session windows with dynamic gap
input
    .keyBy(...)
    .window(DynamicProcessingTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor[T] {
      override def extract(element: T): Long = {
        // determine and return session gap
      }
    }))
    .<windowed transformation>(<window function>)
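
The way a static gap splits a stream into sessions can be sketched in plain Java for a single key. The hypothetical helper below sorts the timestamps and starts a new session whenever the distance to the previous element is larger than the gap. How a distance exactly equal to the gap is treated is an assumption of this sketch, not a statement about Flink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: plain Java, single key, no Flink dependency. All names are hypothetical.
final class SessionWindowSketch {

    /** Groups timestamps into sessions; a distance larger than gapMs to the previous element starts a new session. */
    static List<List<Long>> sessions(long[] timestampsMs, long gapMs) {
        long[] sorted = timestampsMs.clone();
        Arrays.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        long previous = 0L;
        for (long ts : sorted) {
            // Assumption: a distance exactly equal to the gap still belongs to the same session.
            if (current == null || ts - previous > gapMs) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(ts);
            previous = ts;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] events = {1_000L, 2_000L, 3_000L, 20_000L, 21_000L, 50_000L};
        System.out.println(sessions(events, 10_000L));
        // [[1000, 2000, 3000], [20000, 21000], [50000]]
    }
}
```

The three resulting sessions have different lengths and element counts, which illustrates why neither the size nor the boundaries of a session window can be known before the data arrives.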

Window function

After elements have been assigned to windows, the data in each window has to be processed. There are two approaches: incremental computation with reduce and aggregate, and full computation with process. Incremental computation means that the window keeps a single piece of intermediate data: each time a new element arrives, it is combined with the intermediate data to produce new intermediate data, which is stored back in the window. Full computation means that the window first caches all of its elements and runs the computation over the complete set once the trigger condition is met.
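
The difference in state between the two approaches can be sketched in plain Java. The two hypothetical classes below stand in for a window with a reduce-style function and a window with a process-style function. They are not Flink interfaces; they only show what each approach has to keep in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Sketch only: plain Java, no Flink dependency. All names are hypothetical.
final class WindowFunctionSketch {

    /** Incremental: only one intermediate value is stored, updated on every element. */
    static final class IncrementalWindow {
        private final BinaryOperator<Integer> reducer;
        private Integer accumulator; // the only state the window keeps

        IncrementalWindow(BinaryOperator<Integer> reducer) {
            this.reducer = reducer;
        }

        void add(int value) {
            accumulator = (accumulator == null) ? Integer.valueOf(value) : reducer.apply(accumulator, value);
        }

        Integer fire() {
            return accumulator;
        }
    }

    /** Full: every element is cached, and the function sees all of them when the window fires. */
    static final class FullWindow {
        private final List<Integer> elements = new ArrayList<>();

        void add(int value) {
            elements.add(value);
        }

        int bufferedCount() {
            return elements.size();
        }

        <R> R fire(Function<List<Integer>, R> function) {
            return function.apply(elements);
        }
    }

    public static void main(String[] args) {
        IncrementalWindow incremental = new IncrementalWindow(Integer::sum);
        FullWindow full = new FullWindow();
        for (int i = 1; i <= 5; i++) {
            incremental.add(i);
            full.add(i);
        }
        System.out.println(incremental.fire()); // 15
        System.out.println(full.fire(list -> list.stream().mapToInt(Integer::intValue).sum())); // 15
        System.out.println(full.bufferedCount()); // 5 elements held until the window fires
    }
}
```

Both windows produce the same sum, but the full window holds all five elements until it fires. In exchange, it can run computations that need the whole set at once, such as sorting or a median, which a single running accumulator cannot express.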

References

https://cloud.tencent.com/developer/article/1584926

Wu Xie, the third master, is a rookie in the field of big data and artificial intelligence.