Flink window background
Flink treats batch processing as a special case of stream processing, so its underlying engine is a streaming engine on which both stream processing and batch processing are implemented. Windows are the bridge from streaming to batch. Generally speaking, a window is a mechanism that cuts an unbounded stream into finite sets so that operations can run on bounded data. A window delimits the set of elements to compute over, such as “calculate over the last 10 minutes” or “sum of the last 50 elements”. A window can be time driven (a time window, e.g. every 30 s) or data driven (a count window, e.g. every 100 elements). The DataStream API provides windows for both time and count.
The general skeleton structure of a Flink window application is as follows:
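// Keyed Window
stream
    .keyBy(...)
    .window(...)                      // required: window assigner
    .reduce/aggregate/process(...)    // required: window function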
There are two required operations in the skeleton of a Flink window application:
- Use a window assigner to assign the elements of the data stream to the corresponding windows.
- When the trigger condition of a window is met, process the data in the window with a window function. The commonly used window functions are reduce, aggregate, and process.
Tumbling window
Time driven
The data is segmented by a fixed window length. Tumbling windows have a fixed length and do not overlap. We can use TumblingEventTimeWindows and TumblingProcessingTimeWindows to create a tumbling time window based on event time or processing time. The window length can be set with seconds, minutes, hours, and days from org.apache.flink.streaming.api.windowing.time.Time.
// Keyed stream; the elements are assumed to be Tuple2<String, Integer> (key, value)
KeyedStream<Tuple2<String, Integer>, Tuple> keyedStream = mapStream.keyBy(0);
// Time driven: a new window every 10 s
WindowedStream<Tuple2<String, Integer>, Tuple, TimeWindow> timeWindow =
    keyedStream.timeWindow(Time.seconds(10));
// Count driven: a window is evaluated after every three elements with the same key
// WindowedStream<Tuple2<String, Integer>, Tuple, GlobalWindow> countWindow =
//     keyedStream.countWindow(3);
// apply() sets the window function, which is invoked with the data of each window
timeWindow.apply(new MyTimeWindowFunction()).print();
// countWindow.apply(new MyCountWindowFunction()).print();
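To make the time-driven tumbling window more concrete, here is a minimal sketch in plain Java (no Flink dependency) of how a timestamp can be mapped to one fixed, non-overlapping window. The class and method names (TumblingWindowSketch, windowStart, assign) are hypothetical and only for illustration; the arithmetic assumes non-negative millisecond timestamps and no window offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a hypothetical helper, not part of the Flink API.
final class TumblingWindowSketch {

    // Start of the tumbling window [start, start + sizeMs) that contains the timestamp.
    // Assumes timestampMs >= 0 and no window offset.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Groups timestamps by window start; every timestamp lands in exactly one window.
    static Map<Long, List<Long>> assign(List<Long> timestampsMs, long sizeMs) {
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long ts : timestampsMs) {
            windows.computeIfAbsent(windowStart(ts, sizeMs), k -> new ArrayList<>()).add(ts);
        }
        return windows;
    }

    public static void main(String[] args) {
        // 10 s windows: [0, 10000), [10000, 20000), [20000, 30000), ...
        System.out.println(assign(Arrays.asList(1_000L, 9_999L, 10_000L, 25_000L), 10_000L));
        // prints {0=[1000, 9999], 10000=[10000], 20000=[25000]}
    }
}
```

Because the window start is derived from the timestamp alone, no two windows can share an element, which is the "no overlap" property described above.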
Event driven
When we want to trigger a computation for, say, every 100 user purchase events, the window is evaluated each time it has been filled with 100 elements that share the same key. This is easy to understand. The following is an example implementation:
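import java.text.SimpleDateFormat;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

// The element type Tuple2<String, Integer> is assumed, matching the keyed stream it is applied to
public class MyCountWindowFunction implements WindowFunction<Tuple2<String, Integer>,
        String, Tuple, GlobalWindow> {
    @Override
    public void apply(Tuple tuple, GlobalWindow window, Iterable<Tuple2<String, Integer>> input,
                      Collector<String> out) throws Exception {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        // Sum the values of all elements in this window
        int sum = 0;
        for (Tuple2<String, Integer> tuple2 : input) {
            sum += tuple2.f1;
        }
        // The timestamp carries no meaning here: for a GlobalWindow it is always
        // Long.MAX_VALUE, because a count window does not depend on time.
        long maxTimestamp = window.maxTimestamp();
        out.collect("key:" + tuple.getField(0) + " value: " + sum
                + "| maxTimeStamp :" + maxTimestamp + "," + format.format(maxTimestamp));
    }
}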
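The effect of a count-driven window can be imitated without Flink. The following sketch is an assumption-laden illustration, not Flink's implementation: it buffers values per key and emits their sum as soon as a key has collected a fixed number of elements. The class name CountWindowSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: imitates the effect of a tumbling count window per key.
final class CountWindowSketch {
    private final int size;
    private final Map<String, List<Integer>> buffers = new HashMap<>();
    private final List<String> results = new ArrayList<>();

    CountWindowSketch(int size) {
        this.size = size;
    }

    // Buffers the value under its key; once `size` values are buffered for that key,
    // emits their sum and clears the buffer so the next window starts empty.
    void add(String key, int value) {
        List<Integer> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buffer.add(value);
        if (buffer.size() == size) {
            int sum = 0;
            for (int v : buffer) {
                sum += v;
            }
            results.add("key:" + key + " value: " + sum);
            buffer.clear();
        }
    }

    List<String> results() {
        return results;
    }

    public static void main(String[] args) {
        CountWindowSketch window = new CountWindowSketch(3);
        window.add("a", 1);
        window.add("b", 10);
        window.add("a", 2);
        window.add("a", 3);   // third element for key "a": the window fires
        window.add("b", 20);  // key "b" has only two elements: nothing is emitted
        System.out.println(window.results()); // prints [key:a value: 6]
    }
}
```

Note that elements of different keys never fill each other's windows, which is what "same" elements means in the description above.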
Sliding time window
A sliding window is a more general form of the fixed (tumbling) window. It is defined by a fixed window length (size) and a sliding interval (slide). Its characteristics: the window length is fixed and windows can overlap. The window moves forward continuously by one slide step at a time. When using it, we need to set both slide and size. The slide determines how often Flink creates a new window: the smaller the slide, the more windows there are. When the slide is smaller than the size, adjacent windows overlap and an event is assigned to multiple windows. When the slide is larger than the size, there are gaps between windows, and events that fall into a gap belong to no window and are dropped.
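The relationship between size and slide can be checked with a small calculation. The following plain-Java sketch lists the start times of all windows that contain a given timestamp. The names (SlidingWindowSketch, windowStarts) are hypothetical, and the code assumes non-negative millisecond timestamps and no window offset.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hypothetical helper, not part of the Flink API.
final class SlidingWindowSketch {

    // Returns the start times of all windows [start, start + sizeMs) that contain
    // the timestamp, newest window first. Window starts are multiples of slideMs.
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMs - (timestampMs % slideMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // size 10 s, slide 5 s: the event belongs to two overlapping windows
        System.out.println(windowStarts(12_000L, 10_000L, 5_000L)); // prints [10000, 5000]
        // size 5 s, slide 10 s: the event falls into a gap and belongs to no window
        System.out.println(windowStarts(17_000L, 5_000L, 10_000L)); // prints []
    }
}
```

With slide smaller than size, each event appears in roughly size / slide windows, which is why a small slide produces many windows.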
Time-based sliding window
// Time driven: every 5 s, compute over the data of the last 10 s
// WindowedStream<Tuple2<String, Integer>, Tuple, TimeWindow> timeWindow =
//     keyedStream.timeWindow(Time.seconds(10), Time.seconds(5));
Event-based sliding window
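/**
 * Sliding window: windows can overlap
 * 1. Time driven
 * 2. Event driven
 */
// Count window of 3 elements that slides by 2 elements.
// The Tuple3 field types are assumed to be String for illustration.
WindowedStream<Tuple3<String, String, String>, String, GlobalWindow> countWindow = keybyed.countWindow(3, 2);
SingleOutputStreamOperator<String> applyed = countWindow.apply(
        new WindowFunction<Tuple3<String, String, String>, String, String, GlobalWindow>() {
    @Override
    public void apply(String s, GlobalWindow window, Iterable<Tuple3<String, String, String>> input,
                      Collector<String> out) throws Exception {
        // Concatenate the fields of every element in the window into one string
        Iterator<Tuple3<String, String, String>> iterator = input.iterator();
        StringBuilder sb = new StringBuilder();
        while (iterator.hasNext()) {
            Tuple3<String, String, String> next = iterator.next();
            sb.append(next.f0 + ".." + next.f1 + ".." + next.f2);
        }
        out.collect(sb.toString());
    }
});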
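To see what a sliding count window of size 3 and slide 2 produces, the following plain-Java sketch imitates it for a single key. It assumes the behavior that the window fires after every `slide` elements and then contains at most the most recent `size` elements; this is a simplified model for illustration, and the class name SlidingCountWindowSketch is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustration only: a simplified single-key model of a sliding count window.
final class SlidingCountWindowSketch {
    private final int size;
    private final int slide;
    private final Deque<String> buffer = new ArrayDeque<>();
    private final List<List<String>> fired = new ArrayList<>();
    private int sinceLastFire = 0;

    SlidingCountWindowSketch(int size, int slide) {
        this.size = size;
        this.slide = slide;
    }

    // Adds one element; after every `slide` elements the window fires with
    // at most the `size` most recent elements.
    void add(String element) {
        buffer.addLast(element);
        sinceLastFire++;
        if (sinceLastFire == slide) {
            sinceLastFire = 0;
            while (buffer.size() > size) {
                buffer.removeFirst(); // drop the oldest elements beyond the window size
            }
            fired.add(new ArrayList<>(buffer));
        }
    }

    List<List<String>> fired() {
        return fired;
    }

    public static void main(String[] args) {
        SlidingCountWindowSketch window = new SlidingCountWindowSketch(3, 2);
        for (String e : new String[] {"e1", "e2", "e3", "e4", "e5"}) {
            window.add(e);
        }
        System.out.println(window.fired()); // prints [[e1, e2], [e2, e3, e4]]
    }
}
```

In this model the first firing holds only two elements, because the window fires by slide before it has ever been filled to its full size, and element e2 appears in both results because the windows overlap.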
Session time window
A session window groups a series of events that are separated from the next group by a timeout gap of a specified length, similar to the session of a web application: if no new data arrives within the gap, the current window closes, and the next element starts a new window. In this mode the length of a window is variable, and the start and end time of each window are not known in advance. We can set a fixed-length session gap or use a SessionWindowTimeGapExtractor to determine the length of the session gap dynamically.
val input: DataStream[T] = ...

// event-time session windows with static gap
input
    .keyBy(...)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>)

// event-time session windows with dynamic gap
input
    .keyBy(...)
    .window(EventTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor[T] {
      override def extract(element: T): Long = {
        // determine and return session gap
      }
    }))
    .<windowed transformation>(<window function>)

// processing-time session windows with static gap
input
    .keyBy(...)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>)

// processing-time session windows with dynamic gap
input
    .keyBy(...)
    .window(DynamicProcessingTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor[T] {
      override def extract(element: T): Long = {
        // determine and return session gap
      }
    }))
    .<windowed transformation>(<window function>)
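The idea of a static session gap can be sketched in plain Java without Flink. The following example splits the ascending timestamps of one key into sessions: a new session starts whenever the pause since the previous element exceeds the gap. The names (SessionWindowSketch, sessions) are hypothetical, and treating a pause of exactly the gap as the same session is a choice made in this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: a hypothetical helper, not part of the Flink API.
final class SessionWindowSketch {

    // Splits ascending timestamps into sessions. A new session starts when the
    // pause since the previous element is longer than gapMs.
    static List<List<Long>> sessions(List<Long> sortedTimestampsMs, long gapMs) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        long previous = 0L;
        for (long ts : sortedTimestampsMs) {
            if (current == null || ts - previous > gapMs) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(ts);
            previous = ts;
        }
        return result;
    }

    public static void main(String[] args) {
        // Gap of 10 s: the 17 s pause between 3000 and 20000 separates two sessions
        System.out.println(sessions(Arrays.asList(0L, 3_000L, 20_000L, 21_000L), 10_000L));
        // prints [[0, 3000], [20000, 21000]]
    }
}
```

The two sessions have different lengths, which illustrates why the start and end of a session window cannot be known in advance.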
Window function
After the elements have been assigned to windows, the data in each window has to be processed. There are two kinds of window functions. The first kind computes the result incrementally: reduce and aggregate. The second kind computes over the full window contents: process. Incremental calculation means that the window keeps a single piece of intermediate data. Every time a new element flows in, the new element is combined with the intermediate data to produce new intermediate data, which is then saved back to the window. Full calculation means that the window first caches all of its elements and runs the calculation over all of them once the trigger condition is met.
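The difference between the two kinds of calculation can be shown in plain Java. This is only a sketch of the idea, not Flink's ReduceFunction or ProcessWindowFunction API: the incremental variant keeps one running value, while the full variant buffers every element and can therefore compute results, such as a median, that need the whole window. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: contrasts incremental and full window calculation.
final class WindowFunctionSketch {

    // Incremental: the window state is a single intermediate value.
    static final class IncrementalSum {
        private long sum = 0L;

        void add(long value) {
            sum += value; // combine the new element with the intermediate data
        }

        long result() {
            return sum;
        }
    }

    // Full: the window state is the list of all elements.
    static final class FullWindow {
        private final List<Long> elements = new ArrayList<>();

        void add(long value) {
            elements.add(value); // only cache; nothing is computed yet
        }

        // Computed when the window fires; assumes at least one element.
        // For an even number of elements this returns the upper of the two middle values.
        long median() {
            List<Long> sorted = new ArrayList<>(elements);
            Collections.sort(sorted);
            return sorted.get(sorted.size() / 2);
        }

        int bufferedCount() {
            return elements.size();
        }
    }

    public static void main(String[] args) {
        IncrementalSum incremental = new IncrementalSum();
        FullWindow full = new FullWindow();
        for (long v : new long[] {5L, 1L, 3L}) {
            incremental.add(v);
            full.add(v);
        }
        System.out.println(incremental.result()); // prints 9
        System.out.println(full.median());        // prints 3
    }
}
```

The incremental variant needs constant state per window, while the state of the full variant grows with the number of elements, which is the usual trade-off between the two kinds of window function.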
References
https://cloud.tencent.com/developer/article/1584926
Wu Xie, the third master, is a rookie in the field of big data and artificial intelligence.