Looking at an Infinite Data Stream Through a Window

Date: 2021-01-16

Windows are among the most commonly used operators in stream processing. A window cuts an unbounded stream into bounded chunks, and a computation function can then be applied to each chunk, enabling very flexible operations. Flink provides a rich set of window operations, and users can also customize windows for their own processing scenarios. From this article you will learn:

  • The basic concepts and simple usage of windows
  • The classification, source code, and usage of the built-in window assigners
  • The classification and usage of window functions
  • A source-level walkthrough of window components and the window life cycle
  • A complete window demo case

Quick Start

What is it?

The window is the core operator for processing unbounded streams. A window divides the data stream into "buckets" of fixed size (that is, it splits the stream into windows by a fixed time span or element count). On each window, users can apply computation functions to the data it contains, obtaining statistical results for a given time range. For example, to output the top N products with the most clicks in the latest hour every 5 minutes, you can use a one-hour time window to restrict the data to a fixed time range and then aggregate the bounded data within that range.
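To make the "bucket" idea concrete, here is a minimal pure-Java sketch (no Flink dependency; `windowStart` and `windowEnd` are illustrative helper names, not Flink API) of how a timestamp maps into a tumbling time window:

```java
public class TumblingBucket {

    // Start of the tumbling window containing the given timestamp (both in ms).
    // With no offset, every window covers [n * size, (n + 1) * size).
    public static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Windows are half-open: the end timestamp is exclusive.
    public static long windowEnd(long timestampMs, long sizeMs) {
        return windowStart(timestampMs, sizeMs) + sizeMs;
    }

    public static void main(String[] args) {
        // An event at t = 23s falls into the 5s window [20s, 25s).
        System.out.println(windowStart(23_000L, 5_000L)); // 20000
        System.out.println(windowEnd(23_000L, 5_000L));   // 25000
    }
}
```

Every event whose timestamp lands in the same bucket is aggregated together, which is what bounds the otherwise infinite stream.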

Depending on the data stream they operate on (DataStream or KeyedStream), windows fall into two types: Keyed Windows and Non-Keyed Windows. Keyed windows use the window(...) operation on a KeyedStream to produce a WindowedStream; non-keyed windows use the windowAll(...) operation on a DataStream to produce an AllWindowedStream. The transformation relationships are shown in the figure below. Note: AllWindowedStream is generally not recommended, because applying a window operation on an unkeyed stream gathers all partitions into a single task, i.e. a parallelism of 1, which hurts performance.


How to use it

What does a window look like, and how is it used? The following code snippets show the structure:

Keyed Windows

stream
       .keyBy(...)               // windows are used on a KeyedStream
       .window(...)              // required: specify the window assigner
      [.trigger(...)]            // optional: specify the trigger; the default is used if omitted
      [.evictor(...)]            // optional: specify the evictor; there is none if omitted
      [.allowedLateness(...)]    // optional: allow late data; defaults to 0 if omitted
      [.sideOutputLateData(...)] // optional: configure a side output; there is none if omitted
       .reduce/aggregate/fold/apply() // required: specify the window computation function
      [.getSideOutput(...)]      // optional: read data from the side output

Non-Keyed Windows

stream
       .windowAll(...)           // required: specify the window assigner
      [.trigger(...)]            // optional: specify the trigger; the default is used if omitted
      [.evictor(...)]            // optional: specify the evictor; there is none if omitted
      [.allowedLateness(...)]    // optional: allow late data; defaults to 0 if omitted
      [.sideOutputLateData(...)] // optional: configure a side output; there is none if omitted
       .reduce/aggregate/fold/apply() // required: specify the window computation function
      [.getSideOutput(...)]      // optional: read data from the side output

Shorthand for window operations

In the snippets above, using window(...) on a KeyedStream or windowAll(...) on a DataStream requires passing in a window assigner parameter, which is explained in detail below. For example:

// -------------------------------------------
//  Keyed Windows
// -------------------------------------------
stream
       .keyBy(id)
       .window(TumblingEventTimeWindows.of(Time.seconds(5))) // 5s tumbling window
       .reduce(MyReduceFunction)
// -------------------------------------------
//  Non-Keyed Windows
// -------------------------------------------
stream
       .windowAll(TumblingEventTimeWindows.of(Time.seconds(5))) // 5s tumbling window
       .reduce(MyReduceFunction)

The above code can be abbreviated as:

// -------------------------------------------
//  Keyed Windows
// -------------------------------------------
stream
       .keyBy(id)
       .timeWindow(Time.seconds(5)) // 5s tumbling window
       .reduce(MyReduceFunction)
// -------------------------------------------
//  Non-Keyed Windows
// -------------------------------------------
stream
       .timeWindowAll(Time.seconds(5)) // 5s tumbling window
       .reduce(MyReduceFunction)

Regarding this shorthand, take KeyedStream as an example: a look at the KeyedStream source shows that the shorthand still calls the non-abbreviated code underneath. timeWindowAll() works the same way; see the DataStream source for details, which will not be repeated here.

// Calls a different built-in window assigner depending on the stream's time characteristic
public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size) {
        if (environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime) {
            return window(TumblingProcessingTimeWindows.of(size));
        } else {
            return window(TumblingEventTimeWindows.of(size));
        }
    }

Window Assigners

Classification

A WindowAssigner is responsible for assigning each input element to one or more windows. Flink ships with many built-in WindowAssigners that cover most use cases, such as tumbling windows, sliding windows, session windows, and global windows. If the built-in assigners do not meet your needs, you can implement a custom assigner by extending the WindowAssigner class.

The WindowAssigners above are all time-based. In addition, Flink provides count-based windows, which define the window size by the number of elements; in that case, if the data arrives out of order, the window results are non-deterministic. This article focuses on time-based windows; due to limited space, count-based windows are not discussed.


Usage

Next, we will walk through the four time-based window assigners built into Flink one by one.

Tumbling Windows

  • Diagram

Tumbling windows slice the data into windows of fixed time span or size. Each window has a fixed size and windows do not overlap (as shown in the figure below). This method is simple and fits scenarios where a metric is computed per fixed period.

For the time characteristic, you can use either event time or processing time; the corresponding window assigners are TumblingEventTimeWindows and TumblingProcessingTimeWindows. Use the assigner's of(size) method to specify the window length, where the time unit can be Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.


  • Usage
// Using event time
datastream
           .keyBy(id)
           .window(TumblingEventTimeWindows.of(Time.seconds(10)))
           .process(new MyProcessFunction())
// Using processing time
datastream
           .keyBy(id)
           .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
           .process(new MyProcessFunction())

Sliding Windows

  • Diagram

Sliding windows add a slide interval to tumbling windows, so windows of this type can overlap (as shown in the figure below). A tumbling window rolls forward by its fixed window size, while a sliding window advances by the configured slide interval. The overlap between windows depends on the window size and the slide: when the slide is smaller than the window size, windows overlap; when the slide is larger than the window size, the windows are discontinuous and some data may not belong to any window; when the two are equal, the behavior is identical to a tumbling window. A typical use case for sliding windows is computing a metric over a window of a given size on a fixed reporting cycle, such as outputting the top N products with the most clicks in the latest hour every 5 minutes.
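The overlap described above can be sketched in plain Java: an element belongs to every window whose span still covers its timestamp, i.e. roughly size / slide windows (`windowStarts` is a hypothetical helper, not Flink's actual assigner code):

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingAssign {

    // Start timestamps of all sliding windows that contain the given event.
    public static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Start of the most recent window that contains the timestamp.
        long lastStart = timestampMs - (timestampMs % slideMs);
        // Walk backwards one slide at a time while the window still covers the event.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Size 10s, slide 5s: an event at t = 12s lands in [10s, 20s) and [5s, 15s).
        System.out.println(windowStarts(12_000L, 10_000L, 5_000L)); // [10000, 5000]
    }
}
```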

For the time characteristic, you can use either event time or processing time; the corresponding window assigners are SlidingEventTimeWindows and SlidingProcessingTimeWindows. Use the assigner's of(size, slide) method to specify the window length and slide interval, where the time unit can be Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.


  • Usage
// Using event time
datastream
           .keyBy(id)
           .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
           .process(new MyProcessFunction())
// Using processing time
datastream
           .keyBy(id)
           .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
           .process(new MyProcessFunction())

Session Windows

  • Diagram

Session windows aggregate data from a period of high activity into one window for computation. The trigger condition is the session gap: if no new data arrives within the specified gap, the window is considered closed and the window result is emitted. Note that as long as data keeps arriving, the window will not fire. Unlike sliding and tumbling windows, session windows need no fixed window size or slide interval; they only need a session gap defining the maximum inactive interval, as shown in the figure below. Session windows suit discontinuous or periodically generated data, for example computing user-behavior metrics over the periods in which a user is active online.
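A sketch of the gap logic, assuming timestamps arrive in ascending order (`sessions` is a hypothetical helper, not Flink's merging-window implementation): a pause of at least the session gap closes the current session.

```java
import java.util.ArrayList;
import java.util.List;

public class SessionMerge {

    // Groups ascending timestamps into sessions; each session is returned as
    // {start, end} where end = last event timestamp + gap.
    public static List<long[]> sessions(long[] sortedTs, long gapMs) {
        List<long[]> result = new ArrayList<>();
        long start = sortedTs[0];
        long end = sortedTs[0] + gapMs;
        for (int i = 1; i < sortedTs.length; i++) {
            if (sortedTs[i] >= end) {
                // Inactive for a full gap: close the session and start a new one.
                result.add(new long[]{start, end});
                start = sortedTs[i];
            }
            end = sortedTs[i] + gapMs;
        }
        result.add(new long[]{start, end});
        return result;
    }

    public static void main(String[] args) {
        // Gap = 10: events {1, 2, 3} and {20, 21} form two separate sessions.
        for (long[] s : sessions(new long[]{1, 2, 3, 20, 21}, 10)) {
            System.out.println(s[0] + " -> " + s[1]);
        }
    }
}
```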

For the time characteristic, you can use either event time or processing time; the corresponding window assigners are EventTimeSessionWindows and ProcessingTimeSessionWindows. Use the assigner's withGap() method to specify the gap, where the time unit can be Time.milliseconds(x), Time.seconds(x), Time.minutes(x), and so on.


  • Usage
// Using event time
datastream
           .keyBy(id)
           .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
           .process(new MyProcessFunction())
// Using processing time
datastream
           .keyBy(id)
           .window(ProcessingTimeSessionWindows.withGap(Time.minutes(15)))
           .process(new MyProcessFunction())

Note: the start and end times of a session window depend on the data received, and the window assigner does not immediately place all elements into their final window. For each incoming element, a session window is first created with the element's timestamp as the start time and the session gap as the size, and overlapping windows are then merged. For this reason, the session window operator needs a Trigger and a window function such as ReduceFunction, AggregateFunction, or ProcessWindowFunction.

Global Windows

  • Diagram

Global windows assign all data with the same key to a single window, which has no start or end time. The window only fires with the help of a trigger; if no trigger is specified, a global window never triggers computation. Global windows must therefore be used with great care: users need to know exactly what they want to compute over the whole window, specify a matching trigger, and also specify a state-cleanup mechanism, otherwise the window data stays in memory forever.


  • Usage
datastream
    .keyBy(id)
    .window(GlobalWindows.create())
    .process(new MyProcessFunction())

Window Functions

Classification

Flink provides two kinds of window functions: incremental aggregation functions and full window functions. Incremental aggregation performs better, because it folds each arriving element into an intermediate result state, so the window maintains only one intermediate value and does not cache the window's data. A full window function, by contrast, must cache every element that enters the window; when the window fires, it traverses all buffered data to compute the result. If the window holds a lot of data or spans a long time, caching consumes significant resources and degrades performance.

  • Incremental aggregation functions

    Including: ReduceFunction, AggregateFunction, and FoldFunction

  • Full window functions

    Including: ProcessWindowFunction
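The state-size difference between the two kinds can be sketched in plain Java (illustrative only; in Flink the accumulator or buffer lives in window state, not a local variable):

```java
import java.util.ArrayList;
import java.util.List;

public class WindowStateSketch {

    // Incremental style: the window keeps only one running accumulator,
    // updated as each element arrives.
    public static long incrementalSum(long[] events) {
        long acc = 0; // the only state the window holds
        for (long e : events) {
            acc += e;
        }
        return acc;
    }

    // Full-window style: every element is buffered until the window fires,
    // then the whole buffer is traversed to produce the result.
    public static long bufferedSum(long[] events) {
        List<Long> buffer = new ArrayList<>(); // state grows with the window
        for (long e : events) {
            buffer.add(e);
        }
        long sum = 0;
        for (long e : buffer) {
            sum += e;
        }
        return sum;
    }
}
```

Both variants produce the same result, but the buffered one holds every element in state until the trigger fires.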

Usage

ReduceFunction

A ReduceFunction takes two elements of the same type, combines them with the specified logic, and outputs one element of that same type; the input and output types must be identical. The effect is to fold the previous result with the current value. A concrete example:

public class ReduceFunctionExample {
    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        //Analog data source
        SingleOutputStreamOperator<Tuple3<Long, Integer, Long>> input = env.fromElements(
                Tuple3.of(1L, 10, 1588491228L),
                Tuple3.of(1L, 15, 1588491229L),
                Tuple3.of(1L, 20, 1588491238L),
                Tuple3.of(1L, 25, 1588491248L),
                Tuple3.of(2L, 10, 1588491258L),
                Tuple3.of(2L, 30, 1588491268L),
                Tuple3.of(2L, 20, 1588491278L)).assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Tuple3<Long, Integer, Long>>() {
            @Override
            public long extractAscendingTimestamp(Tuple3<Long, Integer, Long> element) {
                return element.f2 * 1000;
            }
        });

        input
                .map(new MapFunction<Tuple3<Long, Integer, Long>, Tuple2<Long, Integer>>() {
                    @Override
                    public Tuple2<Long, Integer> map(Tuple3<Long, Integer, Long> value) {
                        return Tuple2.of(value.f0, value.f1);
                    }
                })
                .keyBy(0)
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .reduce(new ReduceFunction<Tuple2<Long, Integer>>() {
                    @Override
                    public Tuple2<Long, Integer> reduce(Tuple2<Long, Integer> value1, Tuple2<Long, Integer> value2) throws Exception {
                        //Group according to the first element and find the cumulative sum of the second element
                        return Tuple2.of(value1.f0, value1.f1 + value2.f1);
                    }
                }).print();

        env.execute("ReduceFunctionExample");
    }
}

AggregateFunction

Like ReduceFunction, AggregateFunction is an incremental function that computes on an intermediate state. Compared with ReduceFunction, AggregateFunction is more flexible in window computation, at the cost of a slightly more complex implementation: you implement the AggregateFunction interface and override its four methods. Its biggest advantage is that the types of the intermediate result and the final result do not depend on the input type. The AggregateFunction source is as follows:

/**
 * @param <IN>  the data type of the input elements
 * @param <ACC> the data type of the intermediate aggregation result
 * @param <OUT> the data type of the final aggregation result
 */
@PublicEvolving
public interface AggregateFunction<IN, ACC, OUT> extends Function, Serializable {

    /**
     *Create a new accumulator
     */
    ACC createAccumulator();

    /**
     *Aggregate the new data with the accumulator and return a new accumulator
     */
    ACC add(IN value, ACC accumulator);

    /**
     * Calculates the final result from the accumulator and returns it
     */
    OUT getResult(ACC accumulator);

    /**
     *Merge the two accumulators and return the result
     */
    ACC merge(ACC a, ACC b);
}
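To illustrate why the independent ACC type matters, here is a minimal averaging aggregator in the same shape; the `Agg` interface is a hand-rolled stand-in so the sketch compiles without Flink on the classpath. The input is an Integer, the accumulator a (sum, count) pair, and the output a Double:

```java
// Hand-rolled stand-in for Flink's AggregateFunction shape.
interface Agg<IN, ACC, OUT> {
    ACC createAccumulator();
    ACC add(IN value, ACC accumulator);
    OUT getResult(ACC accumulator);
    ACC merge(ACC a, ACC b);
}

public class AverageAgg implements Agg<Integer, long[], Double> {

    // accumulator[0] = running sum, accumulator[1] = element count
    @Override
    public long[] createAccumulator() {
        return new long[]{0, 0};
    }

    @Override
    public long[] add(Integer value, long[] acc) {
        return new long[]{acc[0] + value, acc[1] + 1};
    }

    @Override
    public Double getResult(long[] acc) {
        return (double) acc[0] / acc[1];
    }

    @Override
    public long[] merge(long[] a, long[] b) {
        return new long[]{a[0] + b[0], a[1] + b[1]};
    }
}
```

Neither the accumulator type nor the Double result appears in the input type, which a ReduceFunction could not express.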

Specific code cases are as follows:

public class AggregateFunctionExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        //Analog data source
        SingleOutputStreamOperator<Tuple3<Long, Integer, Long>> input = env.fromElements(
                Tuple3.of(1L, 10, 1588491228L),
                Tuple3.of(1L, 15, 1588491229L),
                Tuple3.of(1L, 20, 1588491238L),
                Tuple3.of(1L, 25, 1588491248L),
                Tuple3.of(2L, 10, 1588491258L),
                Tuple3.of(2L, 30, 1588491268L),
                Tuple3.of(2L, 20, 1588491278L)).assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Tuple3<Long, Integer, Long>>() {
            @Override
            public long extractAscendingTimestamp(Tuple3<Long, Integer, Long> element) {
                return element.f2 * 1000;
            }
        });

        input.keyBy(0)
             .window(TumblingEventTimeWindows.of(Time.seconds(10)))
             .aggregate(new MyAggregateFunction()).print();
        env.execute("AggregateFunctionExample");

    }

    private static class MyAggregateFunction implements AggregateFunction<Tuple3<Long, Integer, Long>,Tuple2<Long,Integer>,Tuple2<Long,Integer>> {
        /**
         *Create an accumulator and initialize the value
         * @return
         */
        @Override
        public Tuple2<Long, Integer> createAccumulator() {
            return Tuple2.of(0L,0);
        }

        /**
         * @param value       the incoming element
         * @param accumulator the intermediate result
         * @return
         */
        @Override
        public Tuple2<Long, Integer> add(Tuple3<Long, Integer, Long> value, Tuple2<Long, Integer> accumulator) {
            return Tuple2.of(value.f0,value.f1 + accumulator.f1);
        }

        /**
         *Get the calculated value
         * @param accumulator
         * @return
         */
        @Override
        public Tuple2<Long, Integer> getResult(Tuple2<Long, Integer> accumulator) {
            return Tuple2.of(accumulator.f0,accumulator.f1);
        }

        /**
         * Merge two intermediate results
         * @param a intermediate result a
         * @param b intermediate result b
         * @return
         */
        @Override
        public Tuple2<Long, Integer> merge(Tuple2<Long, Integer> a, Tuple2<Long, Integer> b) {
            return Tuple2.of(a.f0,a.f1 + b.f1);
        }
    }
}

FoldFunction

FoldFunction defines how each input element of the window is folded into an initial external value. The interface is marked deprecated; users are advised to use AggregateFunction instead of FoldFunction.

public class FoldFunctionExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        //Analog data source
        SingleOutputStreamOperator<Tuple3<Long, Integer, Long>> input = env.fromElements(
                Tuple3.of(1L, 10, 1588491228L),
                Tuple3.of(1L, 15, 1588491229L),
                Tuple3.of(1L, 20, 1588491238L),
                Tuple3.of(1L, 25, 1588491248L),
                Tuple3.of(2L, 10, 1588491258L),
                Tuple3.of(2L, 30, 1588491268L),
                Tuple3.of(2L, 20, 1588491278L)).assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Tuple3<Long, Integer, Long>>() {
            @Override
            public long extractAscendingTimestamp(Tuple3<Long, Integer, Long> element) {
                return element.f2 * 1000;
            }
        });

        input.keyBy(0)
             .window(TumblingEventTimeWindows.of(Time.seconds(10)))
             .fold("user", new FoldFunction<Tuple3<Long, Integer, Long>, String>() {
                 @Override
                 public String fold(String accumulator, Tuple3<Long, Integer, Long> value) throws Exception {
                     // Append each first-field value to the initial "user" string and output it
                     return accumulator + value.f0;
                 }
             }).print();

        env.execute("FoldFunctionExample");

    }
}

ProcessWindowFunction

Both ReduceFunction and AggregateFunction above are window functions that compute incrementally on an intermediate state. Sometimes, however, you need all the data of the window for the computation, such as a median or a mode. In addition, the Context object of a ProcessWindowFunction can access window metadata such as the window end time and the current watermark. ProcessWindowFunction therefore supports result computation over all of the window's elements much more flexibly.

Internally, a window processed by a ProcessWindowFunction stores all assigned elements in ListState. By collecting the data and exposing the window's metadata and other features, it covers broader application scenarios than ReduceFunction and AggregateFunction. The source of the ProcessWindowFunction abstract class is as follows:

/**
 * @param <IN>  the input data type
 * @param <OUT> the output data type
 * @param <KEY> the data type of the key
 * @param <W>   the window type
 */
@PublicEvolving
public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window> extends AbstractRichFunction {
    private static final long serialVersionUID = 1L;
    /**
     * Evaluates the window and outputs zero or more elements
     * @param key      the key of the window
     * @param context  the window context
     * @param elements all elements in the window
     * @param out      the collector for output elements
     * @throws Exception
     */
    public abstract void process(KEY key, Context context, Iterable<IN> elements, Collector<OUT> out) throws Exception;
    /**
     * Clears state when the window is purged
     * @param context
     * @throws Exception
     */
    public void clear(Context context) throws Exception {}
    //Context can access the metadata information of the window
    public abstract class Context implements java.io.Serializable {
    //Returns the currently calculated window
        public abstract W window();
    // Returns the current processing time
        public abstract long currentProcessingTime();
    // Returns the current event-time watermark
        public abstract long currentWatermark();
    //State accessors for each key and window
        public abstract KeyedStateStore windowState();
    //The global state accessor of each key
        public abstract KeyedStateStore globalState();
        /**
         * Emits a record to the side output identified by the given {@code OutputTag}
         * @param outputTag the side output tag
         * @param value     the record to emit
         */
        public abstract <X> void output(OutputTag<X> outputTag, X value);
    }
}

The specific use cases are as follows:

public class ProcessWindowFunctionExample {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        //Analog data source
        SingleOutputStreamOperator<Tuple3<Long, Integer, Long>> input = env.fromElements(
                Tuple3.of(1L, 10, 1588491228L),
                Tuple3.of(1L, 15, 1588491229L),
                Tuple3.of(1L, 20, 1588491238L),
                Tuple3.of(1L, 25, 1588491248L),
                Tuple3.of(2L, 10, 1588491258L),
                Tuple3.of(2L, 30, 1588491268L),
                Tuple3.of(2L, 20, 1588491278L)).assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Tuple3<Long, Integer, Long>>() {
            @Override
            public long extractAscendingTimestamp(Tuple3<Long, Integer, Long> element) {
                return element.f2 * 1000;
            }
        });

        input.keyBy(t -> t.f0)
             .window(TumblingEventTimeWindows.of(Time.seconds(10)))
             .process(new MyProcessWindowFunction())
             .print();

        env.execute("ProcessWindowFunctionExample");
    }

    private static class MyProcessWindowFunction extends ProcessWindowFunction<Tuple3<Long, Integer, Long>,Tuple3<Long,String,Integer>,Long,TimeWindow> {
        @Override
        public void process(
                Long aLong,
                Context context,
                Iterable<Tuple3<Long, Integer, Long>> elements,
                Collector<Tuple3<Long, String, Integer>> out) throws Exception {
            int count = 0;
            for (Tuple3<Long, Integer, Long> in: elements) {
                count++;
            }
            //Statistics of the number of data in each window, plus the window output
            out.collect(Tuple3.of(aLong,"" + context.window(),count));
        }
    }
}

Combining incremental aggregation functions with ProcessWindowFunction

ProcessWindowFunction is powerful, but its one drawback is that it needs more state to store the data. Incremental aggregation is used very frequently, so how can we get both incremental aggregation and access to window metadata? ReduceFunction and AggregateFunction can be combined with ProcessWindowFunction: elements assigned to the window are aggregated immediately, and when the window fires, the aggregated result is handed to the ProcessWindowFunction. The Iterable parameter of its process method then contains exactly one value, the incrementally aggregated result.

  • Combining ReduceFunction with ProcessWindowFunction
public class ReduceProcessWindowFunction {
    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        //Analog data source
        SingleOutputStreamOperator<Tuple3<Long, Integer, Long>> input = env.fromElements(
                Tuple3.of(1L, 10, 1588491228L),
                Tuple3.of(1L, 15, 1588491229L),
                Tuple3.of(1L, 20, 1588491238L),
                Tuple3.of(1L, 25, 1588491248L),
                Tuple3.of(2L, 10, 1588491258L),
                Tuple3.of(2L, 30, 1588491268L),
                Tuple3.of(2L, 20, 1588491278L)).assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Tuple3<Long, Integer, Long>>() {
            @Override
            public long extractAscendingTimestamp(Tuple3<Long, Integer, Long> element) {
                return element.f2 * 1000;
            }
        });

        input.map(new MapFunction<Tuple3<Long, Integer, Long>, Tuple2<Long, Integer>>() {
            @Override
            public Tuple2<Long, Integer> map(Tuple3<Long, Integer, Long> value) {
                return Tuple2.of(value.f0, value.f1);
            }
        })
             .keyBy(t -> t.f0)
             .window(TumblingEventTimeWindows.of(Time.seconds(10)))
             .reduce(new MyReduceFunction(),new MyProcessWindowFunction())
             .print();

        env.execute("ProcessWindowFunctionExample");
    }

    private static class MyReduceFunction implements ReduceFunction<Tuple2<Long, Integer>> {
        @Override
        public Tuple2<Long, Integer> reduce(Tuple2<Long, Integer> value1, Tuple2<Long, Integer> value2) throws Exception {
            //Incremental summation
            return Tuple2.of(value1.f0,value1.f1 + value2.f1);
        }
    }

    private static class MyProcessWindowFunction extends ProcessWindowFunction<Tuple2<Long,Integer>,Tuple3<Long,Integer,String>,Long,TimeWindow> {
        @Override
        public void process(Long aLong, Context ctx, Iterable<Tuple2<Long, Integer>> elements, Collector<Tuple3<Long, Integer, String>> out) throws Exception {
            //Output the sum result together with the end time of the window
            out.collect(Tuple3.of(aLong,elements.iterator().next().f1,"window_end" + ctx.window().getEnd()));
        }
    }
}
  • Combining AggregateFunction with ProcessWindowFunction
public class AggregateProcessWindowFunction {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        //Analog data source
        SingleOutputStreamOperator<Tuple3<Long, Integer, Long>> input = env.fromElements(
                Tuple3.of(1L, 10, 1588491228L),
                Tuple3.of(1L, 15, 1588491229L),
                Tuple3.of(1L, 20, 1588491238L),
                Tuple3.of(1L, 25, 1588491248L),
                Tuple3.of(2L, 10, 1588491258L),
                Tuple3.of(2L, 30, 1588491268L),
                Tuple3.of(2L, 20, 1588491278L))
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Tuple3<Long, Integer, Long>>() {
                    @Override
                    public long extractAscendingTimestamp(Tuple3<Long, Integer, Long> element) {
                        return element.f2 * 1000;
                    }
                });

        input.keyBy(t -> t.f0)
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .aggregate(new MyAggregateFunction(),new MyProcessWindowFunction())
                .print();

        env.execute("AggregateFunctionExample");

    }

    private static class MyAggregateFunction implements AggregateFunction<Tuple3<Long, Integer, Long>, Tuple2<Long, Integer>, Tuple2<Long, Integer>> {
        /**
         *Create an accumulator and initialize the value
         *
         * @return
         */
        @Override
        public Tuple2<Long, Integer> createAccumulator() {
            return Tuple2.of(0L, 0);
        }

        /**
         * @param value       the incoming element
         * @param accumulator the intermediate result
         * @return
         */
        @Override
        public Tuple2<Long, Integer> add(Tuple3<Long, Integer, Long> value, Tuple2<Long, Integer> accumulator) {
            return Tuple2.of(value.f0, value.f1 + accumulator.f1);
        }

        /**
         *Get the calculated value
         *
         * @param accumulator
         * @return
         */
        @Override
        public Tuple2<Long, Integer> getResult(Tuple2<Long, Integer> accumulator) {
            return Tuple2.of(accumulator.f0, accumulator.f1);
        }

        /**
         * Merges two intermediate aggregates (needed when windows merge,
         * e.g. for session windows).
         *
         * @param a intermediate aggregate a
         * @param b intermediate aggregate b
         * @return the merged aggregate
         */
        @Override
        public Tuple2<Long, Integer> merge(Tuple2<Long, Integer> a, Tuple2<Long, Integer> b) {
            return Tuple2.of(a.f0, a.f1 + b.f1);
        }
    }

    private static class MyProcessWindowFunction extends ProcessWindowFunction<Tuple2<Long, Integer>, Tuple3<Long, Integer, String>, Long, TimeWindow> {
        @Override
        public void process(Long key, Context ctx, Iterable<Tuple2<Long, Integer>> elements, Collector<Tuple3<Long, Integer, String>> out) throws Exception {
            // Emit the pre-aggregated sum together with the end time of the window
            out.collect(Tuple3.of(key, elements.iterator().next().f1, "window_end" + ctx.window().getEnd()));
        }
        }
    }
}

## The life cycle of windows

Life cycle diagram

A window goes through a series of stages, from creation, to evaluation of the window computation, to final removal: this is the window's life cycle.

First, before an element reaches the window operator, the WindowAssigner determines which window (or windows) the element belongs to; if a window does not exist yet, it is created.

Second, what happens when data enters the window depends on whether an incremental aggregate function is used. If a ReduceFunction or AggregateFunction is specified, each element added to the window immediately triggers an incremental computation, and the running result becomes the window's contents. If no incremental aggregate function is used, elements entering the window are buffered in ListState, and when the window fires, the buffered elements are traversed and aggregated.

Then, each element entering the window is also passed to the window's Trigger. The Trigger decides when the window is evaluated and when it should purge itself and its stored contents. Based on the assigned elements or on registered timers, the Trigger can decide at specific points in time to fire the window computation or to purge the window's contents.

Finally, what happens after the Trigger fires depends on the window function in use. If an incremental aggregate function such as ReduceFunction or AggregateFunction is used, the aggregate result is emitted directly. If only a full window function such as ProcessWindowFunction is used, it is applied to all elements of the window, which are computed over and the result emitted. If a ReduceFunction is combined with a ProcessWindowFunction, i.e. an incremental aggregate function together with a full window function, the full window function is applied to the pre-aggregated value of the incremental function, and the final result is emitted.

  • Case 1: use only incremental aggregate window functions


  • Case 2: use only the full window function


  • Case 3: combine incremental aggregate window function with full window function

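The two evaluation strategies described above can be contrasted in a plain-Java sketch with no Flink dependencies; the class and method names below are made up for illustration, with a running sum standing in for the window computation:

```java
import java.util.ArrayList;
import java.util.List;

public class WindowEvaluationSketch {

    // Case 1: incremental aggregation. The accumulator is updated as each
    // element arrives; only the running aggregate is kept, never the elements.
    static int incremental(int[] elements) {
        int accumulator = 0;     // corresponds to createAccumulator()
        for (int e : elements) {
            accumulator += e;    // add(value, accumulator) on every arrival
        }
        return accumulator;      // getResult(accumulator) when the window fires
    }

    // Case 2: full window function. Every element is buffered (Flink keeps
    // them in ListState) and the whole buffer is traversed when the window fires.
    static int fullWindow(int[] elements) {
        List<Integer> buffer = new ArrayList<>(); // stands in for ListState
        for (int e : elements) {
            buffer.add(e);       // elements are only stored on arrival
        }
        int sum = 0;
        for (int e : buffer) {   // traversal happens once, at firing time
            sum += e;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] window = {10, 15, 20};
        System.out.println(incremental(window)); // 45
        System.out.println(fullWindow(window));  // 45
    }
}
```

Both variants produce the same result; the difference is that the incremental path keeps O(1) state per window while the buffered path keeps every element until the window fires, which is what makes case 3 (incremental function plus full window function) attractive when window metadata is also needed.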

### Window assigners

A WindowAssigner assigns each incoming element to one or more windows. The window is created when the WindowAssigner assigns its first element to it, so every existing window contains at least one element. Flink ships with many built-in WindowAssigners. This article focuses on the time-based ones, which all extend the WindowAssigner abstract class. The commonly used assigners were explained in detail above; their inheritance diagram is as follows:


Next, let's walk through the source code of the WindowAssigner abstract class:

/**
 * A WindowAssigner assigns an element to zero or more windows.
 *
 * Within a window operator, elements are grouped by key (KeyedStream);
 * a set of elements with the same key and window is called a pane.
 *
 * @param <T> the type of elements this assigner can assign windows to
 * @param <W> the type of window: TimeWindow, GlobalWindow
 */
@PublicEvolving
public abstract class WindowAssigner<T, W extends Window> implements Serializable {

private static final long serialVersionUID = 1L;
/**
 * Returns the collection of windows the element is assigned to.
 *
 * @param element   the element to assign
 * @param timestamp the timestamp of the element
 * @param context   the WindowAssignerContext object
 * @return the windows the element was assigned to
 */
public abstract Collection<W> assignWindows(T element, long timestamp, WindowAssignerContext context);
/**
 * Returns the default Trigger associated with this WindowAssigner.
 *
 * @param env the execution environment
 * @return the default trigger
 */
public abstract Trigger<T, W> getDefaultTrigger(StreamExecutionEnvironment env);

/**
 * Returns a serializer for the window type.
 *
 * @param executionConfig the execution config
 * @return the window serializer
 */
public abstract TypeSerializer<W> getWindowSerializer(ExecutionConfig executionConfig);
/**
 * Returns true if elements are assigned to windows based on event time.
 */
public abstract boolean isEventTime();
/**
 * A context that allows access to the current processing time.
 */
public abstract static class WindowAssignerContext {

    /**
     * Returns the current processing time.
     */
    public abstract long getCurrentProcessingTime();
}

}
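To make `assignWindows` concrete, here is a minimal sketch of how a tumbling assigner maps a timestamp to the start of its window. The formula mirrors Flink's `TimeWindow.getWindowStartWithOffset`; the class and method names here are made up for illustration:

```java
public class TumblingAssignSketch {

    // Returns the start of the tumbling window that `timestamp` falls into.
    // Adding windowSize before the modulo keeps the result correct for
    // timestamps that are smaller than the offset (including negative ones).
    static long windowStart(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    public static void main(String[] args) {
        long size = 10_000L; // 10-second tumbling windows, offset 0
        // 1588491238000 falls into the window [1588491230000, 1588491240000)
        long start = windowStart(1588491238000L, 0L, size);
        System.out.println(start + " .. " + (start + size));
    }
}
```

A tumbling assigner would return a single window `[start, start + size)` from `assignWindows`; a sliding assigner returns one such window per overlapping slide.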

### Triggers

After data arrives in a window, whether the window fires its window function depends on whether the trigger condition is met. The Trigger determines when the window is evaluated and its result emitted. Triggers can fire based on time, or on data conditions such as the number of elements in the window or the arrival of specific element values. The built-in WindowAssigners discussed above each come with a default Trigger: with processing time, the window fires when the processing time passes the window end time; with event time, it fires when the watermark passes the window end time.

Flink provides many built-in Triggers, such as EventTimeTrigger, ProcessingTimeTrigger and CountTrigger, each corresponding to a different WindowAssigner. For example, the trigger for event-time windows is EventTimeTrigger, whose basic principle is to check whether the current watermark has passed the end time of the window: if so, the computation over the window's data fires, otherwise it does not. The default triggers of the built-in WindowAssigners analyzed above can be read off their respective sources:

| Assigner | getDefaultTrigger source | Default trigger |
| --- | --- | --- |
| TumblingEventTimeWindows | `return EventTimeTrigger.create();` | EventTimeTrigger |
| TumblingProcessingTimeWindows | `return ProcessingTimeTrigger.create();` | ProcessingTimeTrigger |
| SlidingEventTimeWindows | `return EventTimeTrigger.create();` | EventTimeTrigger |
| SlidingProcessingTimeWindows | `return ProcessingTimeTrigger.create();` | ProcessingTimeTrigger |
| EventTimeSessionWindows | `return EventTimeTrigger.create();` | EventTimeTrigger |
| ProcessingTimeSessionWindows | `return ProcessingTimeTrigger.create();` | ProcessingTimeTrigger |
| GlobalWindows | `return new NeverTrigger();` | NeverTrigger |

These triggers all extend the Trigger abstract class; the inheritance relationships are as follows:

![](https://upload-images.jianshu.io/upload_images/22116987-5c716e831febced3.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)


The specific explanation of these built-in triggers is as follows:

| Trigger | Explanation |
| --- | --- |
| EventTimeTrigger | Fires the window computation when the current watermark passes the end time of the window; otherwise does not fire |
| ProcessingTimeTrigger | Fires the window computation when the current processing time passes the end time of the window; otherwise does not fire |
| ContinuousEventTimeTrigger | Fires the window periodically at a configured interval, or when the end time of the window is less than the current event time |
| ContinuousProcessingTimeTrigger | Fires the window periodically at a configured interval, or when the end time of the window is less than the current processing time |
| CountTrigger | Fires the window computation when the number of elements in the window exceeds the configured threshold |
| DeltaTrigger | Fires the window computation when a delta metric computed over the window data exceeds the specified threshold |
| PurgingTrigger | Wraps any trigger passed as an argument into a purging trigger: the window data is purged after the computation |

The source code of the abstract trigger class is explained as follows:

/**
 * @param <T> the type of elements
 * @param <W> the type of window
 */
@PublicEvolving
public abstract class Trigger<T, W extends Window> implements Serializable {

private static final long serialVersionUID = -4104633972991191369L;
/**
 * Called for every element that is assigned to a window; returns a
 * TriggerResult enum value: CONTINUE, FIRE_AND_PURGE, FIRE or PURGE.
 *
 * @param element   the element entering the window
 * @param timestamp the timestamp of the element
 * @param window    the window the element is assigned to
 * @param ctx       a context object that can register timer callbacks
 * @return the trigger result
 * @throws Exception
 */
public abstract TriggerResult onElement(T element, long timestamp, W window, TriggerContext ctx) throws Exception;
/**
 * Called when a processing-time timer registered via the TriggerContext fires.
 *
 * @param time   the timestamp of the firing timer
 * @param window the window the timer fired for
 * @param ctx    a context object that can register timer callbacks
 * @return the trigger result
 * @throws Exception
 */
public abstract TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception;
/**
 * Called when an event-time timer registered via the TriggerContext fires.
 *
 * @param time   the timestamp of the firing timer
 * @param window the window the timer fired for
 * @param ctx    a context object that can register timer callbacks
 * @return the trigger result
 * @throws Exception
 */
public abstract TriggerResult onEventTime(long time, W window, TriggerContext ctx) throws Exception;
/**
 * Returns true if this trigger supports merging of trigger state.
 */
public boolean canMerge() {
    return false;
}

/**
 * Called when several windows are merged into one window.
 *
 * @param window the merged window
 * @param ctx    a context object that can register timer callbacks and access state
 * @throws Exception
 */
public void onMerge(W window, OnMergeContext ctx) throws Exception {
    throw new UnsupportedOperationException("This trigger does not support merging.");
}
/**
 * Clears all window state held by the trigger.
 * Called when the window is destroyed.
 *
 * @param window the window
 * @param ctx    the trigger context
 * @throws Exception
 */
public abstract void clear(W window, TriggerContext ctx) throws Exception;
/**
 * Context object passed to the Trigger methods; used to register timer
 * callbacks and to work with state.
 */
public interface TriggerContext {
    //Returns the current processing time
    long getCurrentProcessingTime();
    MetricGroup getMetricGroup();
    // Returns the current watermark timestamp
    long getCurrentWatermark();
    //Register a processing time timer
    void registerProcessingTimeTimer(long time);
    //Register an eventtime timer
    void registerEventTimeTimer(long time);
    //Delete a processing time timer
    void deleteProcessingTimeTimer(long time);
    //Delete an eventtime timer
    void deleteEventTimeTimer(long time);
    /**
     * Retrieves the state for the current trigger window and key.
     */
    <S extends State> S getPartitionedState(StateDescriptor<S, ?> stateDescriptor);

    // Same as getPartitionedState; this method is marked deprecated
    @Deprecated
    <S extends Serializable> ValueState<S> getKeyValueState(String name, Class<S> stateType, S defaultState);
    // Same as getPartitionedState; this method is marked deprecated
    @Deprecated
    <S extends Serializable> ValueState<S> getKeyValueState(String name, TypeInformation<S> stateType, S defaultState);
}
// Extension of TriggerContext used when windows are merged
public interface OnMergeContext extends TriggerContext {
    // Merges the state of the merged windows; the state must support merging
    <S extends MergingState<?, ?>> void mergePartitionedState(StateDescriptor<S, ?> stateDescriptor);
}

}

As the source above shows, every invocation of a Trigger method produces a TriggerResult. TriggerResult is an enum whose value determines what is done with the window; there are four behaviors: CONTINUE, FIRE_AND_PURGE, FIRE and PURGE. For the specific meaning of each, let's look at the TriggerResult source:

/**
 * The result type of the Trigger methods; determines what is done with the
 * window, e.g. whether the window function is invoked or the window is destroyed.
 *
 * Note: if a trigger returns FIRE or FIRE_AND_PURGE but the window contains
 * no elements, the window function is not invoked.
 */
public enum TriggerResult {

// Do nothing: do not fire the computation now, keep waiting
CONTINUE(false, false),

// Invoke the window function, emit the result, then purge all window state
FIRE_AND_PURGE(true, true),

// Invoke the window function and emit the result; the window is kept and its data retained
FIRE(true, false),

// Purge the data inside the window without firing the computation
PURGE(false, true);

}
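How an operator acts on those fire/purge flags can be shown with a plain-Java sketch; the enum below mirrors the four values above, while the operator logic is a simplified stand-in (not Flink's actual window operator):

```java
import java.util.ArrayList;
import java.util.List;

public class TriggerResultSketch {

    // Mirrors TriggerResult: each value carries a fire flag and a purge flag.
    enum Result {
        CONTINUE(false, false), FIRE_AND_PURGE(true, true),
        FIRE(true, false), PURGE(false, true);

        final boolean fire;
        final boolean purge;

        Result(boolean fire, boolean purge) {
            this.fire = fire;
            this.purge = purge;
        }
    }

    // Applies a trigger result to the window's buffered contents:
    // returns the emitted sum (null if nothing fired) and may clear the buffer.
    static Integer apply(Result r, List<Integer> windowContents) {
        Integer emitted = null;
        if (r.fire && !windowContents.isEmpty()) { // FIRE / FIRE_AND_PURGE
            emitted = windowContents.stream().mapToInt(Integer::intValue).sum();
        }
        if (r.purge) { // PURGE / FIRE_AND_PURGE: drop the window state
            windowContents.clear();
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Integer> window = new ArrayList<>(List.of(10, 15, 20));
        System.out.println(apply(Result.FIRE, window));           // 45, data kept
        System.out.println(apply(Result.FIRE_AND_PURGE, window)); // 45, data cleared
        System.out.println(apply(Result.CONTINUE, window));       // null
    }
}
```

Note how FIRE leaves the window contents in place (so the window can fire again later, e.g. with a ContinuousEventTimeTrigger), while FIRE_AND_PURGE emits once and then drops the state.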

### Evictors

An Evictor is an optional component whose job is to remove elements before and/or after the window function runs. Flink ships with three built-in evictors: CountEvictor, DeltaEvictor and TimeEvictor. If the user does not specify an evictor, there is no default.

  • **CountEvictor**: keeps a fixed number of elements in the window and, before the window computation, removes the elements that exceed that count;
  • **DeltaEvictor**: given a DeltaFunction and a threshold, computes the delta between each element in the window and the newest element, and evicts the element if the delta exceeds the threshold;
  • **TimeEvictor**: given a time interval, subtracts that interval from the timestamp of the newest element in the window and evicts all elements older than the result; in essence it keeps the most recent data and drops the outdated data.
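The CountEvictor policy can be sketched in plain Java; the names below are made up for illustration, and Flink's real CountEvictor works on an `Iterable<TimestampedValue<T>>` rather than a `List`:

```java
import java.util.ArrayList;
import java.util.List;

public class CountEvictSketch {

    // Before the window computation, drop the oldest elements so that at
    // most maxCount remain (elements beyond the limit are removed from the
    // front of the buffer, i.e. the oldest ones).
    static <T> void evictBefore(List<T> windowElements, int maxCount) {
        while (windowElements.size() > maxCount) {
            windowElements.remove(0);
        }
    }

    public static void main(String[] args) {
        List<Integer> window = new ArrayList<>(List.of(1, 2, 3, 4, 5));
        evictBefore(window, 3); // keep only the 3 most recent elements
        System.out.println(window); // [3, 4, 5]
    }
}
```

In a real job the evictor is attached on the windowed stream, e.g. `.window(...).evictor(CountEvictor.of(3))`, and runs between the trigger firing and the window function.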

The inheritance diagram of Evictors is as follows:

![](https://upload-images.jianshu.io/upload_images/22116987-82253d38edd55255.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)


The source code of the Evictor interface is as follows:

/**
 * Removes window elements before or after the window function is evaluated.
 *
 * @param <T> the type of elements
 * @param <W> the type of window
 */
@PublicEvolving
public interface Evictor<T, W extends Window> extends Serializable {

/**
 * Optionally evicts elements before the window function is invoked.
 *
 * @param elements       the elements in the window
 * @param size           the number of elements in the window
 * @param window         the window
 * @param evictorContext the evictor context
 */
void evictBefore(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);
/**
 * Optionally evicts elements after the window function is invoked.
 *
 * @param elements       the elements in the window
 * @param size           the number of elements in the window
 * @param window         the window
 * @param evictorContext the evictor context
 */
void evictAfter(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);
// The context passed to the Evictor methods
interface EvictorContext {
    // Returns the current processing time
    long getCurrentProcessingTime();
    MetricGroup getMetricGroup();
    // Returns the current watermark timestamp
    long getCurrentWatermark();
}

}

## Summary

This article first gave a quick introduction to windows: their basic concepts, classification and simple usage. It then walked through Flink's built-in WindowAssigners one by one, with diagrams and code fragments, and introduced Flink's window functions, including their classification and detailed usage examples. Finally, it analyzed the components involved in the window life cycle and read through the source code of each component.