Many data analysis scenarios are currently moving from batch processing to stream processing. Although batch processing can be treated as a special case of stream processing, analyzing an unbounded stream of data usually requires a change in mindset and comes with its own terminology (for example, “windowing”, “at least once”, and “exactly once”).
For those who are new to stream processing, this shift and the new terminology can be very confusing. Apache Flink is a production-ready stream processor with an easy-to-use API that can be used to define advanced streaming analytics programs.
Flink’s API has a very flexible window definition on data flow, which makes it stand out from other open source stream processing frameworks.
In this article, we will discuss the concept of windows for streaming, introduce Flink’s built-in windows, and explain its support for custom window semantics.
What is a window?
Let’s illustrate it with a practical example.
Take traffic sensors as an example: how many cars in total pass a traffic light?
Suppose that at a traffic light, we count the number of cars passing through the traffic light every 15 seconds, as shown in the following figure:
The passing cars can be regarded as a stream, an infinite stream. Cars keep passing this traffic light, so it is impossible to compute a final total. However, we can change our approach: every 15 seconds, we add the new count to the previous result (a rolling sum), as follows:
This result still does not really answer our question. The fundamental reason is that the stream is unbounded. We cannot bound the stream itself, but we can process unbounded stream data within a bounded range.
Therefore, we need to rephrase the question: how many cars pass the traffic light per minute?
This question is equivalent to defining a window. The window size is 1 minute, and the data in each minute does not overlap with the data in any other minute. Such a window is therefore called a tumbling (non-overlapping) window, as shown in the following figure:
The count for the first minute is 8, for the second minute 22, and for the third minute 27… In this way, there will be 60 windows in an hour.
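The tumbling window above can be sketched in plain Java, without Flink. This is only an illustration under assumptions: the class name, the method names, and the individual 15-second readings are hypothetical, chosen so that the per-minute sums match the 8 and 22 mentioned above.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of tumbling windows. This is not Flink's API.
public class TumblingWindowSketch {

    /** Start of the tumbling window [start, start + size) that contains the timestamp. */
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    /** Sums car counts per window; each reading is {timestampMs, cars}. */
    static Map<Long, Integer> sumPerWindow(long[][] readings, long sizeMs) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (long[] reading : readings) {
            sums.merge(windowStart(reading[0], sizeMs), (int) reading[1], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Hypothetical 15-second readings: {timestamp in ms, cars counted}.
        long[][] readings = {
            {0, 3}, {15_000, 2}, {30_000, 1}, {45_000, 2},   // first minute
            {60_000, 10}, {75_000, 12}                       // second minute
        };
        // Prints {0=8, 60000=22}: one sum per one-minute window.
        System.out.println(sumPerWindow(readings, 60_000));
    }
}
```

Because every timestamp maps to exactly one window start, each reading contributes to exactly one sum, which is what makes the window non-overlapping.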
Consider another case: every 30 seconds, count the total number of cars that passed in the past minute:
Here the windows overlap, which makes this a sliding window. In this way, there will be 120 windows in an hour.
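To make the overlap concrete, here is a small plain-Java sketch (again not Flink code; the class and method names are hypothetical) that lists every sliding window a timestamp falls into, given a window size and a slide interval.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of sliding window assignment. This is not Flink's API.
public class SlidingWindowSketch {

    /** Start times of all windows [start, start + size) that contain the timestamp. */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // The most recent window start at or before the timestamp.
        long lastStart = timestampMs - (timestampMs % slideMs);
        // Walk backwards one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // A car passing at 70 s belongs to the windows starting at 60 s and at 30 s.
        System.out.println(windowStarts(70_000, 60_000, 30_000)); // prints [60000, 30000]
        // A new window starts every 30 s, so 120 windows start within one hour.
        System.out.println(3_600_000 / 30_000); // prints 120
    }
}
```

With a size of one minute and a slide of 30 seconds, every element lands in size / slide = 2 windows, which is why the same car is counted in two consecutive results.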
To take this further, we can collect the number of cars passing each traffic light in a certain area and then compute 1-minute window statistics for each traffic light separately, that is, the windows are processed in parallel.
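The per-traffic-light case can be sketched in plain Java as grouping by a key first and then by window. The Reading type, the light identifiers, and the sample values are hypothetical; in Flink the grouping step corresponds to keyBy, which this sketch only imitates.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of keyed tumbling windows. This is not Flink's API.
public class KeyedWindowSketch {

    /** One sensor reading (hypothetical type): which light, when, how many cars. */
    static final class Reading {
        final String lightId;
        final long timestampMs;
        final int cars;

        Reading(String lightId, long timestampMs, int cars) {
            this.lightId = lightId;
            this.timestampMs = timestampMs;
            this.cars = cars;
        }
    }

    /** Sums cars per traffic light and per tumbling window start. */
    static Map<String, Map<Long, Integer>> sumPerLightAndWindow(List<Reading> readings, long sizeMs) {
        Map<String, Map<Long, Integer>> result = new TreeMap<>();
        for (Reading r : readings) {
            long start = r.timestampMs - (r.timestampMs % sizeMs);
            result.computeIfAbsent(r.lightId, k -> new TreeMap<>())
                  .merge(start, r.cars, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Reading> readings = Arrays.asList(
            new Reading("A", 0, 3),
            new Reading("B", 10_000, 5),
            new Reading("A", 20_000, 5),
            new Reading("A", 65_000, 4));
        // Prints {A={0=8, 60000=4}, B={0=5}}
        System.out.println(sumPerLightAndWindow(readings, 60_000));
    }
}
```

Each key has its own independent set of windows, so the per-key work could be distributed across parallel workers without the keys affecting one another.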
What does it do?
Generally speaking, a window is a mechanism for carving an infinite stream into finite sets and operating on those bounded data sets. Windows can be divided into time-based windows and count-based windows.
Flink’s built-in windows
Flink’s DataStream API provides time and count windows, and also adds session-based windows. In addition, to cover special requirements, the DataStream API provides window operations that let users define their own custom windows.
Next, we will mainly introduce the time based window and count based window, as well as the customized window operation. The session based window operation will be discussed in the subsequent articles.
As the name suggests, a time window groups stream data by time. For example, a tumbling time window of one minute collects the elements of one minute and, once the minute has passed, applies a function to all elements in the window.
Defining tumbling time windows and sliding time windows in Flink is very simple:
Tumbling time windows
Enter a time parameter
data.keyBy(1)
    .timeWindow(Time.minutes(1)) // tumbling time window: computes the sum once per minute
    .sum(1);
Sliding time windows
Enter two time parameters
data.keyBy(1)
    .timeWindow(Time.minutes(1), Time.seconds(30)) // sliding time window: computes the sum of the past minute every 30 seconds
    .sum(1);
One thing we have not discussed yet is the exact meaning of “collecting the elements of one minute”, which comes down to the question “how does the stream processor interpret time?”
Apache Flink has three different notions of time: processing time, event time, and ingestion time.
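The choice of time notion decides which window an element ends up in. The following plain-Java sketch (not Flink code; the class name and all timestamps are hypothetical) takes one event that was created at 59 s, entered the system at 60.5 s, and reached the window operator at 61.2 s, and computes its one-minute tumbling window under each notion.

```java
// Plain-Java illustration of how the time notion changes window assignment. Not Flink's API.
public class TimeSemanticsSketch {

    /** Start of the tumbling window [start, start + size) that contains the timestamp. */
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    public static void main(String[] args) {
        long sizeMs = 60_000;
        long eventTimeMs = 59_000;      // when the event happened at the sensor (hypothetical)
        long ingestionTimeMs = 60_500;  // when the event entered the system (hypothetical)
        long processingTimeMs = 61_200; // wall clock when the operator saw it (hypothetical)

        System.out.println("event time window starts at " + windowStart(eventTimeMs, sizeMs));
        System.out.println("ingestion time window starts at " + windowStart(ingestionTimeMs, sizeMs));
        System.out.println("processing time window starts at " + windowStart(processingTimeMs, sizeMs));
    }
}
```

Under event time the car is counted in the first minute, where it actually passed; under ingestion or processing time the same car is counted in the second minute because of the delay.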
Here you can refer to my next article:
Apache Flink also provides counting window functionality. If the count window is set to 100, 100 events will be collected in the window and the value of the window will be calculated when the 100th element is added.
In Flink’s DataStream API, tumbling count windows and sliding count windows are defined as follows:
Tumbling count window
Enter a count parameter
data.keyBy(1)
    .countWindow(100) // computes the sum of every 100 elements
    .sum(1);
Sliding count window
Enter two count parameters
data.keyBy(1)
    .countWindow(100, 10) // computes the sum of the past 100 elements every 10 elements
    .sum(1);
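The behavior of the two count windows can be imitated in plain Java. This is a simplified sketch with hypothetical names, not Flink's implementation: it keeps a buffer of at most the last size elements and emits a sum every slide elements, so early results cover fewer than size elements.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Plain-Java illustration of count windows. This is not Flink's API.
public class CountWindowSketch {

    /** Emits one sum every `slide` elements, computed over at most the last `size` elements. */
    static List<Integer> slidingCountSums(int[] values, int size, int slide) {
        Deque<Integer> buffer = new ArrayDeque<>();
        List<Integer> results = new ArrayList<>();
        int sinceLastResult = 0;
        for (int value : values) {
            buffer.addLast(value);
            if (buffer.size() > size) {
                buffer.removeFirst(); // drop the oldest element
            }
            if (++sinceLastResult == slide) {
                int sum = 0;
                for (int x : buffer) {
                    sum += x;
                }
                results.add(sum);
                sinceLastResult = 0;
            }
        }
        return results;
    }

    /** A tumbling count window is a sliding count window whose slide equals its size. */
    static List<Integer> tumblingCountSums(int[] values, int size) {
        return slidingCountSums(values, size, size);
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5, 6};
        System.out.println(tumblingCountSums(values, 3));   // prints [6, 15]
        System.out.println(slidingCountSums(values, 4, 2)); // prints [3, 10, 18]
    }
}
```

In the sliding case the first result (3) covers only two elements because the buffer has not yet filled up; later results cover the full four.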
Dissecting the window mechanism of Flink
Flink’s built-in time windows and count windows cover most application scenarios, but sometimes the window logic needs to be customized, and the built-in windows cannot handle such cases. To support custom windows with different logic, the DataStream API exposes interfaces for the internals of its window mechanism.
The following figure describes the window mechanism of Flink and introduces the components involved:
Elements that arrive at the window operator are passed to a WindowAssigner. The WindowAssigner assigns each element to one or more windows and may create new windows in the process.
The window itself is just an identifier for a list of elements. It may provide some optional meta information, such as the start and end time in the case of a TimeWindow. Note that an element can be added to multiple windows, which also means that multiple windows can exist at the same time.
Each window has a Trigger that decides when the window is evaluated or purged. The trigger is called for each element that is inserted into the window and whenever a previously registered timer expires. On each event, the trigger can decide to fire (evaluate the window), to purge (delete the window and discard its contents), or to fire and then purge the window. A window can be evaluated multiple times and exists until it is purged. Note that a window consumes memory until it is purged.
When the trigger fires, the list of window elements can be handed to an optional Evictor, which can iterate over the list and decide to remove some elements from the start of the list, that is, the elements that entered the window first. The remaining elements are then passed to a calculation function. If no evictor is defined, the trigger hands all window elements directly to the calculation function.
The calculation function receives the window elements (possibly filtered by the evictor) and computes one or more result elements for the window. The DataStream API accepts different types of calculation functions, including predefined aggregate functions such as sum(), min(), and max(), as well as a ReduceFunction, FoldFunction, or WindowFunction.
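The interplay of these components can be sketched as a toy window operator in plain Java. Everything here is an assumption made for illustration: the interfaces, the enum, and the demo configuration are hypothetical and far simpler than Flink's real classes (for example, there are no timers and no keys).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model of assigner -> trigger -> evictor -> function. These are not Flink's classes.
public class WindowMechanismSketch {
    enum TriggerResult { CONTINUE, FIRE, PURGE, FIRE_AND_PURGE }

    interface Assigner { List<Long> assign(long timestamp); }               // returns window ids
    interface Trigger { TriggerResult onElement(List<Integer> contents); }  // called per element
    interface Evictor { void evict(List<Integer> contents); }               // may drop elements

    private final Assigner assigner;
    private final Trigger trigger;
    private final Evictor evictor; // optional, may be null
    private final Function<List<Integer>, Integer> function;
    private final Map<Long, List<Integer>> windows = new HashMap<>();
    final List<Integer> output = new ArrayList<>();

    WindowMechanismSketch(Assigner assigner, Trigger trigger, Evictor evictor,
                          Function<List<Integer>, Integer> function) {
        this.assigner = assigner;
        this.trigger = trigger;
        this.evictor = evictor;
        this.function = function;
    }

    void process(int element, long timestamp) {
        for (Long window : assigner.assign(timestamp)) {
            List<Integer> contents = windows.computeIfAbsent(window, k -> new ArrayList<>());
            contents.add(element);
            TriggerResult result = trigger.onElement(contents);
            if (result == TriggerResult.FIRE || result == TriggerResult.FIRE_AND_PURGE) {
                if (evictor != null) {
                    evictor.evict(contents);          // filter before evaluation
                }
                output.add(function.apply(contents)); // evaluate the window
            }
            if (result == TriggerResult.PURGE || result == TriggerResult.FIRE_AND_PURGE) {
                windows.remove(window);               // discard the window and its contents
            }
        }
    }

    /** Demo: windows of 10 time units, fire and purge at 3 elements, keep the last 2, sum them. */
    static List<Integer> demo() {
        WindowMechanismSketch op = new WindowMechanismSketch(
            ts -> Collections.singletonList(ts - ts % 10),
            contents -> contents.size() == 3 ? TriggerResult.FIRE_AND_PURGE : TriggerResult.CONTINUE,
            contents -> { while (contents.size() > 2) contents.remove(0); },
            contents -> contents.stream().mapToInt(Integer::intValue).sum());
        int[] elements = {1, 2, 3, 4, 5, 6};
        long[] timestamps = {0, 1, 2, 12, 13, 14};
        for (int i = 0; i < elements.length; i++) {
            op.process(elements[i], timestamps[i]);
        }
        return op.output;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [5, 11]
    }
}
```

In the demo, the first window collects 1, 2, 3; the trigger fires, the evictor drops the oldest element, and the function sums the remaining 2 and 3. Swapping any single component changes the window semantics without touching the others, which is the point of the design described above.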
These are the components that make up the Flink window mechanism. Next, we will demonstrate step by step how to use the DataStream API to implement custom window logic. We start with a stream of type DataStream<IN> and group it using a key selector function, which brings elements with the same key together.
SingleOutputStreamOperator<xxx> data = env.addSource(...);
data.keyBy()
How to customize a window?
1. WindowAssigner
The WindowAssigner is responsible for assigning elements to different windows.
The window API lets users supply their own WindowAssigner. To do so, extend WindowAssigner and implement its method:
public abstract Collection<W> assignWindows(T element, long timestamp)
For count-based windows, the GlobalWindows window assigner is used by default, for example:
2. Trigger
The trigger defines when, or under what conditions, a window is evaluated or purged.
We can specify a trigger to override the default trigger provided by the WindowAssigner. Note that the specified trigger does not add a further trigger condition; it replaces the current trigger.
3. Evictor (optional)
The evictor removes elements from a window when the trigger fires, so that only the elements it retains are passed to the calculation function.
4. Apply a WindowFunction, which returns the result as a DataStream.
Using Flink’s internal window mechanism and the DataStream API, you can implement custom window logic, such as session windows.
For modern stream processors, it is essential to support various types of windows on continuous data streams. Apache Flink is a stream processor with a powerful feature set, including a very flexible mechanism for building windows on continuous data streams. Flink provides built-in window operators for common scenarios and allows users to customize the window logic.
When reprinting, please cite the original address: http://www.54tianzhisheng.cn/2018/12/08/Flink-Stream-Windows/