Flink’s DataStream time- and window-based operators: ProcessFunction

Time: 2020-10-31

Process Function API

The transformations covered so far cannot access an event’s timestamp or the current watermark. For example, the MapFunction of a map transformation cannot read the timestamp or the event time of the element it is processing. For this, the DataStream API provides a family of low-level operators: the process function API. Unlike the high-level operators, these low-level functions give access to an element’s timestamp and the current watermark, and they can register timers. Process functions are used to build event-driven applications and to implement custom business logic; Flink SQL, for example, is implemented on top of process functions.

Flink provides us with 8 process functions:

  • ProcessFunction
  • KeyedProcessFunction
  • CoProcessFunction
  • ProcessJoinFunction
  • BroadcastProcessFunction
  • KeyedBroadcastProcessFunction
  • ProcessWindowFunction
  • ProcessAllWindowFunction

For this study I chose an e-commerce user behavior analysis project built on Flink, to get a comprehensive understanding of Flink’s KeyedProcessFunction.

We focus first on KeyedProcessFunction, which operates on a KeyedStream and processes every element of the stream. In addition, KeyedProcessFunction[KEY, IN, OUT] provides two methods:

  • processElement(v: IN, ctx: Context, out: Collector[OUT]) is called for every element in the stream, and results are emitted through the Collector. The Context gives access to the element’s timestamp, the element’s key, and the TimerService; it can also emit records to other streams (side outputs).
  • onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[OUT]) is a callback, invoked when a previously registered timer fires. The timestamp parameter is the trigger time that was set for the timer, and the Collector emits the results. OnTimerContext, like the Context passed to processElement(), provides contextual information, for example the time domain of the firing timer (event time or processing time). Note that timers can only be used on keyed streams. A minimal sketch of both methods follows.
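
For illustration, here is a minimal sketch of a KeyedProcessFunction, not taken from the project code: for every element of a keyed stream it registers a processing-time timer 10 seconds ahead, and onTimer() reports which key fired. The element type (String, Long) and the 10-second delay are arbitrary choices for the example.

import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Minimal sketch: register a timer per element, react when it fires
class TenSecondTimerFunction extends KeyedProcessFunction[String, (String, Long), String] {

  override def processElement(value: (String, Long),
                              ctx: KeyedProcessFunction[String, (String, Long), String]#Context,
                              out: Collector[String]): Unit = {
    // The TimerService exposes the current processing time and watermark
    val now = ctx.timerService().currentProcessingTime()
    ctx.timerService().registerProcessingTimeTimer(now + 10000L)
  }

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[String, (String, Long), String]#OnTimerContext,
                       out: Collector[String]): Unit = {
    // getCurrentKey is the key whose timer fired
    out.collect(s"timer fired for key ${ctx.getCurrentKey} at $timestamp")
  }
}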

Real-time hot page traffic statistics

  • Basic needs

    • Count the hot pages in real time from the web server log
    • Aggregate the traffic per minute, output the 5 largest addresses, updated every 5 seconds
  • Solutions

    • Convert the time field of the Apache server log to a timestamp and use it as the event time
    • Use a window of length 1 minute with a slide of 5 seconds; a sketch of the top-N KeyedProcessFunction follows the code below
import java.text.SimpleDateFormat

import model.ApacheLogEvent
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.scala._

/**
 * Count hot pages over a 10-minute sliding window, updated every 5 seconds
 */
object NetWorkFlowAnalysis {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(5)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)


    val resource = getClass.getResource("/apache.log").getPath
    val inputStream = env.readTextFile(resource)

    val dataStream  = inputStream
      .map(data => {
        val arr = data.split(" ")
        ApacheLogEvent(arr(0), arr(1), new SimpleDateFormat("dd/MM/yyyy:HH:mm:ss").parse(arr(3)).getTime, arr(5), arr(6))
      })
      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[ApacheLogEvent](Time.seconds(1)) {
        override def extractTimestamp(t: ApacheLogEvent): Long = t.timestamp
      })
      .keyBy(_.url)
      .timeWindow(Time.minutes(10), Time.seconds(5))
      .aggregate(new NetWorkFlowAggregateFunction, new NetWorkFlowWindowFunction)
      .keyBy(_.windowEnd)
      .process(new NetWorkFlowKeyedProcessFunction(3))
      .print()

    env.execute("NetWorkFlowAnalysis")
  }
}
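
The aggregate and window functions used above are not shown in this post. Below is a sketch of the top-N KeyedProcessFunction, under the assumption that the window stage emits records of a hypothetical case class UrlViewCount(url, windowEnd, count) and that the stream is keyed by windowEnd. Results of one window are buffered in ListState, and a timer at windowEnd + 1 ms (i.e. once the watermark has passed the window end) sorts them and emits the top topSize pages.

import java.sql.Timestamp

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

import scala.collection.JavaConverters._

// Hypothetical window result type; the project's own case class is not shown here
case class UrlViewCount(url: String, windowEnd: Long, count: Long)

class NetWorkFlowKeyedProcessFunction(topSize: Int)
  extends KeyedProcessFunction[Long, UrlViewCount, String] {

  // Buffers all page counts belonging to the current window
  lazy val urlState: ListState[UrlViewCount] = getRuntimeContext.getListState(
    new ListStateDescriptor[UrlViewCount]("url-state", classOf[UrlViewCount]))

  override def processElement(value: UrlViewCount,
                              ctx: KeyedProcessFunction[Long, UrlViewCount, String]#Context,
                              out: Collector[String]): Unit = {
    urlState.add(value)
    // Fires once per window, 1 ms after the watermark passes windowEnd
    ctx.timerService().registerEventTimeTimer(value.windowEnd + 1)
  }

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[Long, UrlViewCount, String]#OnTimerContext,
                       out: Collector[String]): Unit = {
    val allUrls = urlState.get().asScala.toList
    urlState.clear()

    val sb = new StringBuilder
    sb.append("window end: ").append(new Timestamp(timestamp - 1)).append("\n")
    allUrls.sortBy(-_.count).take(topSize).zipWithIndex.foreach { case (u, i) =>
      sb.append(s"No.${i + 1}: url=${u.url} count=${u.count}\n")
    }
    out.collect(sb.toString())
  }
}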

Real-time traffic statistics: PV and UV

  • Basic needs

    • Compute real-time PV and UV from the tracking log
  • Solutions

    • Count the "pv" events in the tracking log; for UV, use a Set to deduplicate user IDs (a sketch of the deduplicating window function follows the code below)
object PageView {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val resource = getClass.getResource("/UserBehavior.csv").getPath
    val inputStream = env.readTextFile(resource)

    val dataStream = inputStream
      .map(data => {
        val arr = data.split(",")
        UserBehavior(arr(0).toLong, arr(1).toLong, arr(2).toInt, arr(3), arr(4).toLong)
      })
      .assignAscendingTimestamps(_.timestamp * 1000L)
      .filter(_.behavior == "pv")
      .map(data => ("pv", 1))
      .keyBy(_._1)
      .timeWindow(Time.hours(1))
      .aggregate(new PvAggregateFunction, new PvWindowFunction)
      .keyBy(_.windowEnd)
      .process(new PvKeyedProcessFunction)
      .print()


    env.execute("PageView")
  }
}
object UniqueVisitor {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)

    val resource = getClass.getResource("/UserBehavior.csv").getPath
    val inputStream = env.readTextFile(resource)

    val dataStream = inputStream
      .map(data => {
        val arr = data.split(",")
        UserBehavior(arr(0).toLong, arr(1).toLong, arr(2).toInt, arr(3), arr(4).toLong)
      })
      .assignAscendingTimestamps(_.timestamp * 1000L)
      .filter(_.behavior == "pv")
      .timeWindowAll(Time.hours(1))
      .apply(new UvAllWindowFunction)
      .print()

    env.execute()
  }
}
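
The UvAllWindowFunction referenced above is not shown in the listing. A minimal sketch follows, assuming a hypothetical result type UvCount(windowEnd, count): it collects the window’s user IDs into a Set, so duplicates disappear.

import org.apache.flink.streaming.api.scala.function.AllWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Hypothetical result type; the project's own case class is not shown here
case class UvCount(windowEnd: Long, count: Long)

class UvAllWindowFunction extends AllWindowFunction[UserBehavior, UvCount, TimeWindow] {
  override def apply(window: TimeWindow,
                     input: Iterable[UserBehavior],
                     out: Collector[UvCount]): Unit = {
    // The Set keeps each userId once, giving the unique visitor count
    val userIds = input.map(_.userId).toSet
    out.collect(UvCount(window.getEnd, userIds.size))
  }
}

Holding every element of a one-hour window in memory is acceptable for a study project; for very large streams the usual refinement is to deduplicate with a bloom filter instead of a Set.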

Marketing analysis – App marketing statistics

  • Basic needs

    • Compute App marketing metrics from the tracking log
    • Break the statistics down by promotion channel
  • Solutions

    • Filter the user behavior data in the log and aggregate it per channel
    • A ProcessWindowFunction can produce the customized output; a sketch follows the code below
object AppMarket {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)


    val sourceStream = env.addSource(new SimulatedSourceFunction).assignAscendingTimestamps(_.timestamp)

    val processStream = sourceStream
      .keyBy(data => (data.channel, data.behavior))
      .timeWindow(Time.days(1), Time.seconds(5))
      .process(new AppMarketProcessWindowFunction)
      .print()

    env.execute()
  }
}
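
The element type produced by SimulatedSourceFunction and the window function are not shown above. Here is a sketch assuming a hypothetical element type MarketUserBehavior(userId, behavior, channel, timestamp): the ProcessWindowFunction receives all elements of one (channel, behavior) window at once and emits a single count for it.

import java.sql.Timestamp

import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Hypothetical element and result types; the project's own case classes are not shown here
case class MarketUserBehavior(userId: String, behavior: String, channel: String, timestamp: Long)
case class MarketCount(windowStart: String, windowEnd: String, channel: String, behavior: String, count: Long)

class AppMarketProcessWindowFunction
  extends ProcessWindowFunction[MarketUserBehavior, MarketCount, (String, String), TimeWindow] {

  override def process(key: (String, String),
                       context: Context,
                       elements: Iterable[MarketUserBehavior],
                       out: Collector[MarketCount]): Unit = {
    // One output record per (channel, behavior) window
    val start = new Timestamp(context.window.getStart).toString
    val end = new Timestamp(context.window.getEnd).toString
    out.collect(MarketCount(start, end, key._1, key._2, elements.size))
  }
}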

Marketing analysis – page advertising statistics

  • Basic needs

    • From the tracking log, count ad clicks per page per hour, refreshed every 5 seconds and broken down by province
    • Filter out frequent "click farming" behavior and add such users to a blacklist
  • Solutions

    • Keyed by province, use a time window of length 1 hour with a slide of 5 seconds for the statistics
    • A process function can enforce the blacklist: it counts clicks by the same user on the same advertisement, and once the limit is exceeded it emits the user to a blacklist side output; a sketch follows the code below

object BrushOrderAlert {
  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)

    val resource = getClass.getResource("/AdClickLog.csv").getPath
    val socketStream = env.readTextFile(resource)
    val dataStream = socketStream
      .map(data => {
        val arr = data.split(",")
        UserAdvertClick(arr(0).toLong, arr(1).toLong, arr(2), arr(3), arr(4).toLong)
      })
      .assignAscendingTimestamps(_.timestamp * 1000L)

    // Route users with click-farming behavior to a side output (blacklist alarm)
    val filterStream = dataStream
      .keyBy(data => (data.userId, data.advertId))
      .process(new BrushOrderKeyedProcessFunction(100))

    val aggProvinceStream = filterStream
      .keyBy(_.province)
      .timeWindow(Time.hours(1), Time.seconds(5))
      .aggregate(new BrushOrderAggregateFunction, new BrushOrderWindowFunction)

    aggProvinceStream.print("order details by province")
    filterStream.getSideOutput(new OutputTag[BlackListWarning]("blacklist")).print("malicious click details")

    env.execute()
  }
}
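
A sketch of BrushOrderKeyedProcessFunction follows, assuming a hypothetical warning type BlackListWarning(userId, advertId, msg) for the side output (the same type assumed in the getSideOutput call above). Normal clicks pass through; once a (user, advert) pair exceeds maxCount clicks in one day, it is reported once to the "blacklist" side output and its further clicks are dropped. A processing-time timer at the next midnight clears the daily state.

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Hypothetical warning type; the project's own case class is not shown here
case class BlackListWarning(userId: Long, advertId: Long, msg: String)

class BrushOrderKeyedProcessFunction(maxCount: Long)
  extends KeyedProcessFunction[(Long, Long), UserAdvertClick, UserAdvertClick] {

  lazy val countState: ValueState[Long] = getRuntimeContext.getState(
    new ValueStateDescriptor[Long]("count", classOf[Long]))
  lazy val isBlackState: ValueState[Boolean] = getRuntimeContext.getState(
    new ValueStateDescriptor[Boolean]("is-black", classOf[Boolean]))

  private val blackListTag = new OutputTag[BlackListWarning]("blacklist")

  override def processElement(value: UserAdvertClick,
                              ctx: KeyedProcessFunction[(Long, Long), UserAdvertClick, UserAdvertClick]#Context,
                              out: Collector[UserAdvertClick]): Unit = {
    val curCount = countState.value()
    if (curCount == 0) {
      // First click of the day: schedule a reset at the next midnight (UTC)
      val midnight = (ctx.timerService().currentProcessingTime() / (24 * 60 * 60 * 1000L) + 1) *
        (24 * 60 * 60 * 1000L)
      ctx.timerService().registerProcessingTimeTimer(midnight)
    }
    if (curCount >= maxCount) {
      // Over the limit: warn once, then silently drop this key's clicks
      if (!isBlackState.value()) {
        isBlackState.update(true)
        ctx.output(blackListTag,
          BlackListWarning(value.userId, value.advertId, s"clicked over $maxCount times today"))
      }
    } else {
      countState.update(curCount + 1)
      out.collect(value)
    }
  }

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[(Long, Long), UserAdvertClick, UserAdvertClick]#OnTimerContext,
                       out: Collector[UserAdvertClick]): Unit = {
    // Midnight: start a new day with fresh counters
    countState.clear()
    isBlackState.clear()
  }
}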

Malicious login monitoring

  • Basic needs

    • A user who fails to log in repeatedly within a short period may be a malicious attacker
    • If the same user (possibly from different IPs) fails to log in twice within 2 seconds, raise an alarm
  • Solutions

    • Save the user’s login failures in ListState, register a timer to fire 2 seconds later, and check the number of failures in the ListState when it fires; a sketch follows the code below
    • For more accurate detection, the CEP library can be used for pattern matching on the event stream
object ContinuousLoginFailure {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)

    val resource = getClass.getResource("/LoginLog.csv").getPath
    val socketStream = env.readTextFile(resource)
    val dataStream = socketStream
      .map(data => {
        val arr = data.split(",")
        LoginEvent(arr(0).toLong, arr(1), arr(2), arr(3).toLong)
      })
      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[LoginEvent](Time.seconds(3)) {
        override def extractTimestamp(t: LoginEvent): Long = t.timestamp * 1000L
      })

    val loginStream = dataStream
      .keyBy(_.userId)
      .process(new LoginFailAdvanceKeyedProcessFunction(2))

    dataStream.print("LoginEvent")
    loginStream.print("LoginFailWarning")

    env.execute()
  }
}
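
A sketch of LoginFailAdvanceKeyedProcessFunction along the lines of the solution above, assuming the LoginEvent fields are (userId, ip, eventType, timestamp) with eventType values "fail" and "success", and a hypothetical warning type LoginFailWarning. Failures are buffered in ListState; the first failure registers an event-time timer 2 seconds later, a successful login cancels it, and onTimer() raises an alarm if enough failures have accumulated.

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

import scala.collection.JavaConverters._

// Hypothetical warning type; the project's own case class is not shown here
case class LoginFailWarning(userId: Long, firstFailTime: Long, lastFailTime: Long, msg: String)

class LoginFailAdvanceKeyedProcessFunction(maxFailTimes: Int)
  extends KeyedProcessFunction[Long, LoginEvent, LoginFailWarning] {

  lazy val failListState: ListState[LoginEvent] = getRuntimeContext.getListState(
    new ListStateDescriptor[LoginEvent]("fail-list", classOf[LoginEvent]))
  lazy val timerTsState: ValueState[Long] = getRuntimeContext.getState(
    new ValueStateDescriptor[Long]("timer-ts", classOf[Long]))

  override def processElement(value: LoginEvent,
                              ctx: KeyedProcessFunction[Long, LoginEvent, LoginFailWarning]#Context,
                              out: Collector[LoginFailWarning]): Unit = {
    if (value.eventType == "fail") {
      failListState.add(value)
      if (timerTsState.value() == 0) {
        // First failure: check again 2 seconds later (event time)
        val ts = value.timestamp * 1000L + 2000L
        ctx.timerService().registerEventTimeTimer(ts)
        timerTsState.update(ts)
      }
    } else {
      // Successful login: cancel the pending check and forget earlier failures
      if (timerTsState.value() != 0) {
        ctx.timerService().deleteEventTimeTimer(timerTsState.value())
      }
      failListState.clear()
      timerTsState.clear()
    }
  }

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[Long, LoginEvent, LoginFailWarning]#OnTimerContext,
                       out: Collector[LoginFailWarning]): Unit = {
    val fails = failListState.get().asScala.toList
    if (fails.size >= maxFailTimes) {
      out.collect(LoginFailWarning(ctx.getCurrentKey, fails.head.timestamp, fails.last.timestamp,
        s"login failed ${fails.size} times in 2 seconds"))
    }
    failListState.clear()
    timerTsState.clear()
  }
}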

Real-time monitoring of order payment

  • Basic needs

    • After an order is placed, set an expiration time for it, to improve users’ willingness to pay and reduce system risk
    • If the user does not pay within 15 minutes of placing the order, output monitoring information
  • Solutions

    • Use the CEP library for pattern matching on the event stream, with a time constraint on the match; sketches of the timeout and select functions follow the code below
    • The same logic can also be implemented with state programming and a process function
object OrderTimeoutWithCep {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val resource = getClass.getResource("/OrderLog.csv").getPath
    val inputStream = env.readTextFile(resource)

    val orderEventStream = inputStream
      .map(data => {
        val dataArray = data.split(",")
        OrderEvent(dataArray(0).trim.toLong, dataArray(1).trim, dataArray(2).trim, dataArray(3).trim.toLong)
      })
      .assignAscendingTimestamps(_.eventTime * 1000L)
      .keyBy(_.orderId)

    // Pattern with a time constraint: a "create" must be followed by a "pay" within 15 minutes
    val orderPayPattern = Pattern.begin[OrderEvent]("begin")
      .where(_.eventType.equals("create"))
      .followedBy("follow")
      .where(_.eventType.equals("pay"))
      .within(Time.minutes(15))


    // Apply the pattern to the input stream to get a PatternStream
    val patternStream = CEP.pattern(orderEventStream, orderPayPattern)

    val orderTimeoutTag = OutputTag[OrderResult]("orderTimeoutTag")
    // Call select: timed-out partial matches go to the side output, complete matches to the main stream
    val resultStream = patternStream.select(orderTimeoutTag, new OrderTimeOutPatternTimeoutFunction, new OrderPayPatternSelectFunction)

    resultStream.print("payed")
    resultStream.getSideOutput(orderTimeoutTag).print("timeout")

    env.execute()
  }
}
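
The two functions passed to select() are not shown above. Sketches follow, assuming a hypothetical result type OrderResult(orderId, resultMsg). The timeout function receives partial matches (a "create" with no "pay" within 15 minutes), and its output is routed to the side output; the select function receives complete matches.

import java.util

import org.apache.flink.cep.{PatternSelectFunction, PatternTimeoutFunction}

// Hypothetical result type; the project's own case class is not shown here
case class OrderResult(orderId: Long, resultMsg: String)

// Invoked for partial matches that exceeded the 15-minute constraint
class OrderTimeOutPatternTimeoutFunction extends PatternTimeoutFunction[OrderEvent, OrderResult] {
  override def timeout(pattern: util.Map[String, util.List[OrderEvent]],
                       timeoutTimestamp: Long): OrderResult = {
    val createEvent = pattern.get("begin").get(0)
    OrderResult(createEvent.orderId, s"order timed out at $timeoutTimestamp")
  }
}

// Invoked for complete matches: the order was paid within the limit
class OrderPayPatternSelectFunction extends PatternSelectFunction[OrderEvent, OrderResult] {
  override def select(pattern: util.Map[String, util.List[OrderEvent]]): OrderResult = {
    val payEvent = pattern.get("follow").get(0)
    OrderResult(payEvent.orderId, "order payed successfully")
  }
}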

Side outputs

The output of most DataStream API operators is a single stream of one data type. The side output feature of process functions makes it possible to produce several additional streams, each with its own data type. A side output is identified by an OutputTag[X] object, where X is the type of the side output stream, and a process function can emit a record to one or more side outputs through its Context object.

/**
 * Alarm if the temperature keeps rising for 1 second
 */
object TempIncreaseAlert {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)

    val socketStream = env.socketTextStream("localhost", 9998)
    val dataStream = socketStream
      .map(data => {
        val arr = data.split(",")
        TempReading(arr(0).trim, arr(1).trim.toLong, arr(2).trim.toDouble)
      })
      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[TempReading](Time.seconds(1)) {
        override def extractTimestamp(t: TempReading): Long = t.timestamp * 1000L
      })

    val processStream = dataStream
      .keyBy(_.id)
      .process(new TempKeyedProcessFunction)


    dataStream.print("data")
    processStream.getSideOutput(new OutputTag[String]("alert")).print("output")
    env.execute()
  }
}
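
A sketch of TempKeyedProcessFunction, a classic combination of state, timers, and a side output: keep the previous temperature in ValueState; when a reading rises, register a processing-time timer one second ahead; when one falls, delete the timer; if the timer ever fires, the temperature has been rising for a full second, so emit an alert to the "alert" side output. The main output type (forwarding the readings unchanged) is an assumption, since only the side output is printed above.

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

class TempKeyedProcessFunction extends KeyedProcessFunction[String, TempReading, TempReading] {

  lazy val lastTempState: ValueState[Double] = getRuntimeContext.getState(
    new ValueStateDescriptor[Double]("last-temp", classOf[Double]))
  lazy val timerTsState: ValueState[Long] = getRuntimeContext.getState(
    new ValueStateDescriptor[Long]("timer-ts", classOf[Long]))

  private val alertTag = new OutputTag[String]("alert")

  override def processElement(value: TempReading,
                              ctx: KeyedProcessFunction[String, TempReading, TempReading]#Context,
                              out: Collector[TempReading]): Unit = {
    // Note: the very first reading is compared against the default 0.0
    val lastTemp = lastTempState.value()
    val timerTs = timerTsState.value()
    lastTempState.update(value.temperature)

    if (value.temperature > lastTemp && timerTs == 0) {
      // Temperature rose: start the 1-second countdown
      val ts = ctx.timerService().currentProcessingTime() + 1000L
      ctx.timerService().registerProcessingTimeTimer(ts)
      timerTsState.update(ts)
    } else if (value.temperature < lastTemp && timerTs != 0) {
      // Temperature fell: cancel the countdown
      ctx.timerService().deleteProcessingTimeTimer(timerTs)
      timerTsState.clear()
    }
    out.collect(value)
  }

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[String, TempReading, TempReading]#OnTimerContext,
                       out: Collector[TempReading]): Unit = {
    // The countdown completed: the temperature rose for one full second
    ctx.output(alertTag, s"temperature of sensor ${ctx.getCurrentKey} has risen for 1 second")
    timerTsState.clear()
  }
}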

I have uploaded the code to my GitHub; for the full project, see flink_behavior_analysis.
