Tips | Flink uses union instead of join and cogroup

Time: 2020-11-9

Each article in this series is relatively short and updated from time to time. Starting from a practical case, this article introduces how to use union instead of cogroup (or join) in Flink to simplify job logic and improve job performance while still meeting the original requirements. Reading time is about one minute, so let's get straight to it!
## Demand scenario analysis

Demand scenario

The data product colleague wants to count, per short video, five kinds of real-time metrics (play, like, comment, share, and report), aggregated by photo_id into a 1-minute-granularity real-time video consumption wide table (wide-table fields at least: photo_id + play_cnt + like_cnt + comment_cnt + share_cnt + negative_cnt + minute_timestamp), output to a real-time dashboard.
The problem is that, for the same video, the five kinds of consumption behavior have different trigger mechanisms and reporting times, which means that on the real-time side the five kinds of behavior logs come from five different data sources. SQL-minded engineers naturally reach for a join to merge the five kinds of consumption behavior logs. But is a real-time join (cogroup) really a good fit here? We discuss this in detail below.

Source input and characteristics

First, let's analyze the characteristics of the sources in this scenario:

  • photo_id-granularity play, like, comment, share, and report detail logs: when a user plays (likes, comments on, ...) a video n times, the client or server uploads n play (like, comment, ...) logs to the data source
  • The source schema of all five kinds of video consumption behavior logs is the same: photo_id + timestamp + other dimensions
### Sink output and characteristics

The characteristics of sink are as follows:

  • photo_id-granularity play, like, comment, share, and report counts, aggregated in 1-minute windows
  • The real-time video consumption wide table sink schema is: photo_id + play_cnt + like_cnt + comment_cnt + share_cnt + negative_cnt + minute_timestamp

Source and sink sample data

Source data:

| photo_id | timestamp | user_id | behavior |
| --- | --- | --- | --- |
| 1 | 2020/10/3 11:30:33 | 3 | play |
| 1 | 2020/10/3 11:30:33 | 4 | play |
| 1 | 2020/10/3 11:30:33 | 5 | play |
| 1 | 2020/10/3 11:30:33 | 4 | like |
| 2 | 2020/10/3 11:30:33 | 5 | like |
| 1 | 2020/10/3 11:30:33 | 5 | comment |

Sink data:

| photo_id | timestamp | play_cnt | like_cnt | comment_cnt |
| --- | --- | --- | --- | --- |
| 1 | 2020/10/3 11:30:00 | 3 | 1 | 1 |
| 2 | 2020/10/3 11:30:00 | 0 | 1 | 0 |

Having fully analyzed the input and output of the data sources, let's look at which solutions can meet the above requirements.

Implementation schemes

  • Scheme 1 (the cogroup scheme in this article): consume the raw log data directly, and perform windowed aggregation by chaining cogroup (or join) over the five kinds of video consumption behavior logs
  • Scheme 2: aggregate each of the five kinds of behavior logs separately into minute-granularity metrics, then merge the metrics downstream by photo_id
  • Scheme 3 (the union scheme in this article): since the data sources share the same schema, union the five kinds of behavior logs directly and compute all five metrics in a single subsequent window function; the design process of the union scheme is introduced later

Let’s start with the example code of the cogroup scheme.

cogroup

An example cogroup implementation is shown below. The sample code uses processing time (it could also be replaced by event time), so the timestamp field of the data source is simplified away (deleted directly).

public class Cogroup {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Long -> photo_id, one element per play
        DataStream<Long> play = SourceFactory.getDataStream(xxx);
        // Long -> photo_id, one element per like
        DataStream<Long> like = SourceFactory.getDataStream(xxx);
        // Long -> photo_id, one element per comment
        DataStream<Long> comment = SourceFactory.getDataStream(xxx);
        // Long -> photo_id, one element per share
        DataStream<Long> share = SourceFactory.getDataStream(xxx);
        // Long -> photo_id, one element per report
        DataStream<Long> negative = SourceFactory.getDataStream(xxx);

        // Tuple3<Long, Long, Long> -> photo_id + play_cnt + like_cnt: merge play and like data
        DataStream<Tuple3<Long, Long, Long>> playAndLikeCnt = play
                .coGroup(like)
                .where(KeySelectorFactory.get(Function.identity()))
                .equalTo(KeySelectorFactory.get(Function.identity()))
                .window(TumblingProcessingTimeWindows.of(Time.seconds(60)))
                .apply(xxx1);

        // Tuple4<Long, Long, Long, Long> -> photo_id + play_cnt + like_cnt + comment_cnt
        DataStream<Tuple4<Long, Long, Long, Long>> playAndLikeAndComment = playAndLikeCnt
                .coGroup(comment)
                .where(KeySelectorFactory.get(playAndLikeModel -> playAndLikeModel.f0))
                .equalTo(KeySelectorFactory.get(Function.identity()))
                .window(TumblingProcessingTimeWindows.of(Time.seconds(60)))
                .apply(xxx2);

        // Tuple5<Long, Long, Long, Long, Long> -> photo_id + play_cnt + like_cnt + comment_cnt + share_cnt
        DataStream<Tuple5<Long, Long, Long, Long, Long>> playAndLikeAndCommentAndShare = playAndLikeAndComment
                .coGroup(share)
                .where(KeySelectorFactory.get(playAndLikeAndCommentModel -> playAndLikeAndCommentModel.f0))
                .equalTo(KeySelectorFactory.get(Function.identity()))
                .window(TumblingProcessingTimeWindows.of(Time.seconds(60)))
                .apply(xxx3);

        // Tuple7<Long, Long, Long, Long, Long, Long, Long>
        // -> photo_id + play_cnt + like_cnt + comment_cnt + share_cnt + negative_cnt + minute_timestamp
        // merge the report data as well; same as above ~
        DataStream<Tuple7<Long, Long, Long, Long, Long, Long, Long>> result = ***;

        env.execute();
    }
}
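The apply functions (xxx1 and so on) are elided above. As an illustration only, here is a plain-Java sketch of what such a merge step might compute for a single photo_id key within one window; the helper name mergePlayAndLike is hypothetical and the actual Flink CoGroupFunction signature is omitted:

```java
import java.util.Arrays;
import java.util.List;

public class MergePlayAndLikeSketch {

    // For one photo_id key within one window, count the co-grouped play and
    // like events. Returns {photoId, playCnt, likeCnt}, mirroring the
    // Tuple3<Long, Long, Long> output of the first coGroup above.
    static long[] mergePlayAndLike(long photoId, Iterable<Long> plays, Iterable<Long> likes) {
        long playCnt = 0;
        for (Long ignored : plays) {
            playCnt++;
        }
        long likeCnt = 0;
        for (Long ignored : likes) {
            likeCnt++;
        }
        return new long[] {photoId, playCnt, likeCnt};
    }

    public static void main(String[] args) {
        List<Long> plays = Arrays.asList(1L, 1L, 1L); // three plays of photo 1
        List<Long> likes = Arrays.asList(1L);         // one like of photo 1
        System.out.println(Arrays.toString(mergePlayAndLike(1L, plays, likes))); // prints [1, 3, 1]
    }
}
```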

At first glance this might seem to be the end of it, but things are not that simple. Let's analyze in more detail.

Potential problems with this implementation

  • End-to-end data delay: from the moment Flink consumes a play record to the moment the final aggregated record is emitted, the data passes through several cascaded 1-minute windows, so the delay is more than 3 minutes
  • If data sources keep being added (for example, new kinds of video consumption behavior), the job gets more operators, the data link gets longer, job stability degrades, and the output delay grows with every additional window computation
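The delay estimate above can be checked with back-of-the-envelope arithmetic, under the assumption that each cascaded 60-second tumbling window can buffer an element for up to a full window length before firing (ignoring network and processing overhead):

```java
public class LatencySketch {

    // Worst case: each cascaded tumbling window can hold an element for up to
    // a full window length before it fires, so the delays add up along the chain.
    static int worstCaseDelaySeconds(int windowSeconds, int cascadedWindows) {
        return windowSeconds * cascadedWindows;
    }

    public static void main(String[] args) {
        // Chaining five sources via coGroup needs four cascaded 60 s windows.
        System.out.println(worstCaseDelaySeconds(60, 4) + " s"); // prints "240 s"
    }
}
```

This is consistent with the "more than 3 minutes" figure stated above.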

Data product colleague: Great. Since the problems have been analyzed, please help solve them~

Tech colleague: On it.

Tech colleague: Since too many windows can cause output delay and job instability, is there a way to reduce the number of windows? Starting from the requirement that the whole job contains only one window operator, we can reverse-derive the following data link.

Reverse link

Steps 1-5 below form the reverse-derived link:

  • 1. The five kinds of metrics are computed in a single window
  • 2. The window model of the five metrics is the same
  • 3. The keyBy key is the same (photo_id)
  • 4. The five data sources are all at photo_id granularity, so their models can be made identical and therefore mergeable
  • 5. The union operator can merge the five data sources!
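With the five sources merged, the single window function only needs to count each behavior tag per photo_id. As an illustration (outside Flink, with illustrative names), the core counting logic of that window function can be sketched in plain Java:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TagCountSketch {

    // Count occurrences of each behavior tag among the (photo_id, tag) events
    // that fell into one window for one photo_id key.
    static Map<String, Long> countTags(List<SimpleEntry<Long, String>> events) {
        Map<String, Long> counts = new HashMap<>();
        for (SimpleEntry<Long, String> e : events) {
            counts.merge(e.getValue(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<SimpleEntry<Long, String>> window = Arrays.asList(
                new SimpleEntry<>(1L, "play"),
                new SimpleEntry<>(1L, "play"),
                new SimpleEntry<>(1L, "play"),
                new SimpleEntry<>(1L, "like"),
                new SimpleEntry<>(1L, "comment"));
        // Prints the counts per tag; map ordering is unspecified.
        System.out.println(countTags(window));
    }
}
```

In the real job, this logic would live inside the ProcessWindowFunction passed to process(xxx) below, emitting the wide-table record for the key.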

Without further ado, here is the union version of the code.

union

public class Union {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Tuple2<Long, String> -> photo_id + "play" tag
        DataStream<Tuple2<Long, String>> play = SourceFactory.getDataStream(xxx);
        // Tuple2<Long, String> -> photo_id + "like" tag
        DataStream<Tuple2<Long, String>> like = SourceFactory.getDataStream(xxx);
        // Tuple2<Long, String> -> photo_id + "comment" tag
        DataStream<Tuple2<Long, String>> comment = SourceFactory.getDataStream(xxx);
        // Tuple2<Long, String> -> photo_id + "share" tag
        DataStream<Tuple2<Long, String>> share = SourceFactory.getDataStream(xxx);
        // Tuple2<Long, String> -> photo_id + "negative" tag
        DataStream<Tuple2<Long, String>> negative = SourceFactory.getDataStream(xxx);

        // Tuple7<Long, Long, Long, Long, Long, Long, Long>
        // -> photo_id + play_cnt + like_cnt + comment_cnt + share_cnt + negative_cnt + minute_timestamp
        DataStream<Tuple7<Long, Long, Long, Long, Long, Long, Long>> result = play
                .union(like)
                .union(comment)
                .union(share)
                .union(negative)
                .keyBy(KeySelectorFactory.get(i -> i.f0))
                .timeWindow(Time.seconds(60))
                .process(xxx);

        env.execute();
    }
}

Notice that no matter how the upstream data sources change, the union scheme always keeps a single window operator for all the processing and computation, which solves both the data-delay problem and the problem of too many Flink operators.
Whenever the data sources share the same schema (or have different schemas that can be normalized into the same shape), or share the same processing logic, union can be used to simplify the logic in this way.
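For the "different, but can be normalized" case, each source is first mapped into the common (photo_id, tag) shape before the union. A minimal plain-Java sketch, with assumed field names (in Flink this projection would be a MapFunction applied to each source stream):

```java
import java.util.AbstractMap.SimpleEntry;

public class NormalizeSketch {

    // A hypothetical raw play-log record whose schema differs from the
    // other sources (field names are assumptions for illustration).
    static class RawPlayLog {
        long photoId;
        long timestamp;
        long userId;

        RawPlayLog(long photoId, long timestamp, long userId) {
            this.photoId = photoId;
            this.timestamp = timestamp;
            this.userId = userId;
        }
    }

    // Project a raw play log down to the common (photo_id, tag) shape that
    // all five sources share after normalization, so they become union-able.
    static SimpleEntry<Long, String> toCommonShape(RawPlayLog raw) {
        return new SimpleEntry<>(raw.photoId, "play");
    }

    public static void main(String[] args) {
        SimpleEntry<Long, String> e = toCommonShape(new RawPlayLog(1L, 1601695833L, 3L));
        System.out.println(e); // prints 1=play
    }
}
```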

Summary

This article first introduced the requirement scenario, then analyzed how to solve it with cogroup (with sample code) and the problems that implementation may have, which led to the reverse derivation and design of the union scheme.
In the third part we used union instead of cogroup to optimize this scenario. If you have a better optimization approach for this scenario, please leave a comment.
