Process function: a versatile tool in the Flink DataStream API

Time:2021-1-19

The earlier article "Flink's time and watermarks" covered time and watermarks in Flink. You may now ask: how do I access timestamps and watermarks? You cannot access them through the ordinary DataStream API operators. You need the low-level process function API that Flink provides. A process function can not only access timestamps and watermarks, but also register timers that fire at a specific time in the future. In addition, it can send data to multiple output streams through side outputs. This makes it possible to split a stream, and it is also one way to handle late data. Below, we start from the source code and then use a concrete use case to show how to use a process function.


Brief introduction

Flink provides several process functions, each intended for a different scenario:

  • ProcessFunction
  • KeyedProcessFunction
  • CoProcessFunction
  • ProcessJoinFunction
  • ProcessWindowFunction
  • ProcessAllWindowFunction
  • BaseBroadcastProcessFunction
    • KeyedBroadcastProcessFunction
    • BroadcastProcessFunction

The inheritance diagram is as follows:

[Figure: inheritance diagram of the process function classes]

As the inheritance diagram shows, all of these classes implement the RichFunction interface, so they support open(), close(), getRuntimeContext(), and so on. Their names indicate that they target different scenarios, but their basic capabilities are similar. The rest of this article uses KeyedProcessFunction as an example to discuss what these functions have in common.

Source code

KeyedProcessFunction

/**
 * A low-level function for processing a KeyedStream.
 *
 * processElement() is called for each element in the input stream and can produce
 * zero or more outputs. Implementations can also access the element's timestamp and
 * register timers through the Context. When a timer fires, onTimer() is called back;
 * it can likewise produce zero or more outputs and register further timers.
 *
 * Note: to access keyed state and timers, KeyedProcessFunction must be applied on a
 * KeyedStream. In addition, its parent class AbstractRichFunction implements the
 * RichFunction interface, so open(), close(), and getRuntimeContext() are available.
 *
 * @param <K> type of the key
 * @param <I> type of the input elements
 * @param <O> type of the output elements
 */
@PublicEvolving
public abstract class KeyedProcessFunction<K, I, O> extends AbstractRichFunction {

    private static final long serialVersionUID = 1L;

    /**
     * Processes one element of the input stream.
     * This method can emit zero or more outputs, similar to flatMap.
     * It can also update internal state or set timers.
     *
     * @param value the input element
     * @param ctx a Context that gives access to the timestamp of the element and to a
     *            TimerService for registering timers and querying the time.
     *            The context is only valid during the call to this method.
     * @param out the collector for the results
     * @throws Exception
     */
    public abstract void processElement(I value, Context ctx, Collector<O> out) throws Exception;

    /**
     * Callback that is invoked when a timer registered with the TimerService fires.
     *
     * @param timestamp the timestamp of the firing timer
     * @param ctx an OnTimerContext that gives access to the timestamp, to the TimeDomain
     *            of the firing timer (EVENT_TIME or PROCESSING_TIME), and to a
     *            TimerService for registering timers and querying the time.
     *            The context is only valid during the call to this method.
     * @param out the collector for the results
     * @throws Exception
     */
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception {}

    /**
     * Only valid during a call to processElement() or onTimer().
     */
    public abstract class Context {

        /**
         * Timestamp of the element currently being processed, or of the firing timer.
         * This may be null, for example when the time characteristic of the program
         * is set to TimeCharacteristic#ProcessingTime.
         */
        public abstract Long timestamp();

        /**
         * A TimerService for querying the time and registering timers.
         */
        public abstract TimerService timerService();

        /**
         * Emits a record to a side output.
         *
         * @param outputTag the tag that identifies the side output
         * @param value the record to emit
         * @param <X> type of the record
         */
        public abstract <X> void output(OutputTag<X> outputTag, X value);

        /**
         * Gets the key of the element being processed.
         */
        public abstract K getCurrentKey();
    }

    /**
     * Only available during a call to onTimer().
     */
    public abstract class OnTimerContext extends Context {

        /**
         * The TimeDomain of the firing timer: EVENT_TIME or PROCESSING_TIME.
         */
        public abstract TimeDomain timeDomain();

        /**
         * Gets the key of the firing timer.
         */
        @Override
        public abstract K getCurrentKey();
    }
}

The source code above has two main methods, which are analyzed below:

  • processElement(I value, Context ctx, Collector<O> out)

This method is called once for each record in the stream. It can emit zero or more elements, similar to flatMap, and sends its results through the Collector. The method also has a Context parameter, through which the user can access the timestamp, the key of the current record, and the TimerService. In addition, the output() method of the Context can send data to a side output, which can be used to split a stream or to handle late data.

  • onTimer(long timestamp, OnTimerContext ctx, Collector<O> out)

This method is a callback that is invoked when a timer registered with the TimerService fires. The timestamp parameter is the timestamp of the firing timer, and the Collector can be used to emit records. If you look closely, both methods take a context parameter: processElement() receives a Context, and onTimer() receives an OnTimerContext. The two objects offer similar functionality. OnTimerContext can additionally return the time domain of the firing timer (EVENT_TIME or PROCESSING_TIME).

TimerService

The KeyedProcessFunction source code uses TimerService to access the time and the timers. Let's take a look at its source code:

@PublicEvolving
public interface TimerService {

    String UNSUPPORTED_REGISTER_TIMER_MSG = "Setting timers is only supported on a keyed streams.";
    String UNSUPPORTED_DELETE_TIMER_MSG = "Deleting timers is only supported on a keyed streams.";

    // Returns the current processing time
    long currentProcessingTime();

    // Returns the current event-time watermark
    long currentWatermark();

    /**
     * Registers a timer that fires when processing time passes the given time.
     *
     * @param time
     */
    void registerProcessingTimeTimer(long time);

    /**
     * Registers a timer that fires when the event-time watermark passes the given time.
     *
     * @param time
     */
    void registerEventTimeTimer(long time);

    /**
     * Deletes the processing-time timer with the given trigger time.
     * This method only has an effect if such a timer was previously registered
     * and has not already expired.
     *
     * @param time
     */
    void deleteProcessingTimeTimer(long time);

    /**
     * Deletes the event-time timer with the given trigger time.
     * This method only has an effect if such a timer was previously registered
     * and has not already expired.
     *
     * @param time
     */
    void deleteEventTimeTimer(long time);
}

TimerService provides the following methods:

  • currentProcessingTime()

Returns the current processing time.

  • currentWatermark()

Returns the timestamp of the current event-time watermark.

  • registerProcessingTimeTimer(long time)

Registers a processing-time timer for the current key. The timer fires when the processing time reaches the timer's timestamp.

  • registerEventTimeTimer(long time)

Registers an event-time timer for the current key. The timer fires when the watermark is greater than or equal to the timer's timestamp.

  • deleteProcessingTimeTimer(long time)

Deletes a previously registered processing-time timer for the current key. If no such timer exists, the call has no effect.

  • deleteEventTimeTimer(long time)

Deletes a previously registered event-time timer for the current key. If no such timer exists, the call has no effect.

When a timer fires, the onTimer() callback is invoked. Flink synchronizes the calls to processElement() and onTimer(), so the two methods never run concurrently.
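The event-time timer behaviour described above can be illustrated with a small toy model. This is a sketch in plain Java and not Flink code: the class ToyEventTimers and its methods are hypothetical, and unlike the real TimerService, which is scoped to the current key, the key is passed explicitly here. It assumes at most one timer per key and timestamp, that a timer fires once the watermark is greater than or equal to its timestamp, and that deleting a timer that is not pending does nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.TreeSet;

/**
 * Toy model of per-key event-time timers. NOT Flink code: the class and its
 * methods are hypothetical and only mimic the semantics described in the text.
 */
final class ToyEventTimers {
    // timer timestamp -> keys that have a timer at that timestamp (sets de-duplicate)
    private final TreeMap<Long, TreeSet<String>> keysByTime = new TreeMap<>();
    private long currentWatermark = Long.MIN_VALUE;

    /** Registers a timer for (key, time); registering the same pair again changes nothing. */
    void registerEventTimeTimer(String key, long time) {
        keysByTime.computeIfAbsent(time, t -> new TreeSet<>()).add(key);
    }

    /** Deletes the timer for (key, time); does nothing if no such timer is pending. */
    void deleteEventTimeTimer(String key, long time) {
        TreeSet<String> keys = keysByTime.get(time);
        if (keys != null && keys.remove(key) && keys.isEmpty()) {
            keysByTime.remove(time);
        }
    }

    long currentWatermark() {
        return currentWatermark;
    }

    /**
     * Advances the watermark and returns the timers that fire, as "key@time",
     * in timestamp order. A timer fires once its timestamp is <= the watermark.
     */
    List<String> advanceWatermark(long watermark) {
        currentWatermark = Math.max(currentWatermark, watermark);
        List<String> fired = new ArrayList<>();
        // View of all pending timers with timestamp <= currentWatermark
        NavigableMap<Long, TreeSet<String>> due = keysByTime.headMap(currentWatermark, true);
        for (Map.Entry<Long, TreeSet<String>> entry : due.entrySet()) {
            for (String key : entry.getValue()) {
                fired.add(key + "@" + entry.getKey());
            }
        }
        due.clear(); // clearing the view removes the fired timers from keysByTime
        return fired;
    }

    public static void main(String[] args) {
        ToyEventTimers timers = new ToyEventTimers();
        timers.registerEventTimeTimer("user-1", 1000L);
        timers.registerEventTimeTimer("user-1", 1000L); // duplicate, kept only once
        timers.registerEventTimeTimer("user-2", 2000L);
        timers.deleteEventTimeTimer("user-2", 5000L);   // never registered, no effect

        System.out.println(timers.advanceWatermark(999L));  // nothing is due yet
        System.out.println(timers.advanceWatermark(1000L)); // the user-1 timer fires once
        System.out.println(timers.advanceWatermark(3000L)); // the user-2 timer fires
    }
}
```

In a real job the same calls go through ctx.timerService() inside processElement() or onTimer(), and Flink decides when the watermark advances.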

Note: the two error messages in the source code above mean that timers can only be used on keyed streams. Timers are commonly used to clear keyed state once a key is no longer in use, or to implement custom time-based window logic. To use timers on a non-keyed stream, you can key the stream with a KeySelector that returns a fixed value (such as a constant), so that all data is sent to a single partition.
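The "clear keyed state once a key is no longer in use" pattern can be sketched as follows. This is a plain-Java sketch under stated assumptions, not Flink code: IdleStateCleanupSketch, KeyState, TIMEOUT_MS, and the explicit advanceWatermark() driver are all hypothetical stand-ins. In a real job the state would live in keyed state, the timer would be registered through ctx.timerService().registerEventTimeTimer(...), and Flink itself would call onTimer().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/**
 * Plain-Java sketch of "clear keyed state after a key goes idle".
 * NOT Flink code: every name here is a hypothetical stand-in.
 */
final class IdleStateCleanupSketch {
    static final long TIMEOUT_MS = 60_000L; // assumed idle timeout

    /** Stand-in for the per-key state a KeyedProcessFunction would keep. */
    static final class KeyState {
        long count;
        long lastSeen;
    }

    private final Map<String, KeyState> stateByKey = new HashMap<>();
    // timer timestamp -> keys with a timer at that timestamp
    private final TreeMap<Long, TreeSet<String>> timers = new TreeMap<>();
    final List<String> output = new ArrayList<>();

    /** Plays the role of processElement(): update the state, then arm a timer. */
    void processElement(String key, long eventTime) {
        KeyState state = stateByKey.computeIfAbsent(key, k -> new KeyState());
        state.count++;
        state.lastSeen = eventTime;
        timers.computeIfAbsent(eventTime + TIMEOUT_MS, t -> new TreeSet<>()).add(key);
    }

    /** Plays the role of onTimer(): act only if no newer element arrived meanwhile. */
    void onTimer(String key, long timestamp) {
        KeyState state = stateByKey.get(key);
        if (state != null && timestamp == state.lastSeen + TIMEOUT_MS) {
            output.add(key + " idle, count=" + state.count);
            stateByKey.remove(key); // the state cleanup itself
        }
        // Otherwise the timer is stale (a newer element re-armed the timeout): ignore it.
    }

    /** Fires every timer whose timestamp is <= the new watermark, in time order. */
    void advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            Map.Entry<Long, TreeSet<String>> due = timers.pollFirstEntry();
            for (String key : due.getValue()) {
                onTimer(key, due.getKey());
            }
        }
    }

    boolean hasState(String key) {
        return stateByKey.containsKey(key);
    }

    public static void main(String[] args) {
        IdleStateCleanupSketch sketch = new IdleStateCleanupSketch();
        sketch.processElement("user-1", 0L);
        sketch.processElement("user-1", 30_000L);
        sketch.advanceWatermark(60_000L); // timer from the first element is stale: ignored
        sketch.advanceWatermark(90_000L); // 60 s after the last element: state is cleared
        System.out.println(sketch.output);
        System.out.println("state kept: " + sketch.hasState("user-1"));
    }
}
```

The sketch leaves stale timers in place and filters them in onTimer(). An alternative design is to delete the previous timer each time a new element arrives, which keeps fewer timers pending at the cost of an extra call per element.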

Use cases

Next, we use the side output feature of ProcessFunction to split a stream. The code is as follows:

import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// UserBehaviors is not shown in this article; it is assumed to be a POJO with a
// (userId, behavior, product) constructor and a public String field "behavior".
public class ProcessFunctionExample {

    // Define the side output tags (anonymous subclasses keep the generic type information)
    static final OutputTag<UserBehaviors> buyTags = new OutputTag<UserBehaviors>("buy") {
    };
    static final OutputTag<UserBehaviors> cartTags = new OutputTag<UserBehaviors>("cart") {
    };
    static final OutputTag<UserBehaviors> favTags = new OutputTag<UserBehaviors>("fav") {
    };

    static class SplitStreamFunction extends ProcessFunction<UserBehaviors, UserBehaviors> {

        @Override
        public void processElement(UserBehaviors value, Context ctx, Collector<UserBehaviors> out) throws Exception {
            // Route each record to the side output that matches its behavior
            switch (value.behavior) {
                case "buy":
                    ctx.output(buyTags, value);
                    break;
                case "cart":
                    ctx.output(cartTags, value);
                    break;
                case "fav":
                    ctx.output(favTags, value);
                    break;
                default:
                    // Any other behavior goes to the main output
                    out.collect(value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);

        // Simulated data source: [userId, behavior, product]
        SingleOutputStreamOperator<UserBehaviors> splitStream = env.fromElements(
                new UserBehaviors(1L, "buy", "iphone"),
                new UserBehaviors(1L, "cart", "huawei"),
                new UserBehaviors(1L, "buy", "logi"),
                new UserBehaviors(1L, "fav", "oppo"),
                new UserBehaviors(2L, "buy", "huawei"),
                new UserBehaviors(2L, "buy", "onemore"),
                new UserBehaviors(2L, "fav", "iphone")).process(new SplitStreamFunction());

        // Get the "buy" records after splitting
        splitStream.getSideOutput(buyTags).print("data_buy");
        // Get the "cart" (add to cart) records after splitting
        splitStream.getSideOutput(cartTags).print("data_cart");
        // Get the "fav" (favorite) records after splitting
        splitStream.getSideOutput(favTags).print("data_fav");

        env.execute("ProcessFunctionExample");
    }
}
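To see which of the seven sample records ends up in which output, the routing rule of the switch statement can be replayed without a Flink cluster. The following is a plain-Java sketch, not Flink code: SplitRoutingSketch, route(), split(), and the "userId:product" string form are hypothetical and only used for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java replay of the routing rule in the example. NOT Flink code. */
final class SplitRoutingSketch {
    // The same seven records as in the example: {userId, behavior, product}
    static final String[][] SAMPLE = {
        {"1", "buy", "iphone"},
        {"1", "cart", "huawei"},
        {"1", "buy", "logi"},
        {"1", "fav", "oppo"},
        {"2", "buy", "huawei"},
        {"2", "buy", "onemore"},
        {"2", "fav", "iphone"},
    };

    /** Returns the output a behavior is routed to; "main" stands for the regular output. */
    static String route(String behavior) {
        switch (behavior) {
            case "buy":
            case "cart":
            case "fav":
                return behavior;
            default:
                return "main";
        }
    }

    /** Groups records by output, rendering each record as "userId:product". */
    static Map<String, List<String>> split(String[][] records) {
        Map<String, List<String>> outputs = new LinkedHashMap<>();
        for (String name : new String[] {"buy", "cart", "fav", "main"}) {
            outputs.put(name, new ArrayList<>());
        }
        for (String[] record : records) {
            outputs.get(route(record[1])).add(record[0] + ":" + record[2]);
        }
        return outputs;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, List<String>> entry : split(SAMPLE).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

For this input every record matches one of the three tags, so the main output of the split stream stays empty and all data arrives through the side outputs.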

Summary

This article first introduced the low-level process function APIs that Flink provides. These APIs can access timestamps and watermarks, and they support registering timers that invoke the onTimer() callback. It then walked through the parts these APIs have in common from the perspective of the source code, explaining the meaning and usage of each method. Finally, it gave a common use case of a process function: splitting a stream with side outputs. Beyond that, users can register timers and put their own processing logic in the callback, which makes these functions very flexible.
