Spark Streaming operator development examples

Time: 2019-11-4

Transform operator development

When the transform operation is applied to a DStream, it can perform arbitrary RDD-to-RDD transformations, including operations that the DStream API does not provide. For example, the DStream API offers no way to join each batch of a DStream with a specific RDD; the join operator on a DStream can only join other DStreams. With the transform operation, however, we can implement this function ourselves.

Example: real-time filtering of blacklisted users

package StreamingDemo

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 *Real time blacklist filtering
 */
object TransformDemo {
 def main(args: Array[String]): Unit = {
  //Set log level
  Logger.getLogger("org").setLevel(Level.WARN)
  val conf = new SparkConf()
   .setAppName(this.getClass.getSimpleName)
   .setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(2))

  //Create a blacklist RDD
  val blackRDD =
   ssc.sparkContext.parallelize(Array(("zs", true), ("lisi", true)))

  //Get data from NC through socket
  val linesDStream = ssc.socketTextStream("Hadoop01", 6666)

  /**
   *Filter the speech of blacklist users
   * zs sb sb sb sb
   * lisi fuck fuck fuck
   * jack hello
   */
  linesDStream
   .map(x => {
    val info = x.split(" ")
    (info(0), info.toList.tail.mkString(" "))
   })
   .transform(rdd => { // transform is an RDD -> RDD operation, so the return value must be an RDD
    /**
     * After the leftOuterJoin operation, the result looks like this:
     * (zs,(sb sb sb sb,Some(true)))
     * (lisi,(fuck fuck fuck,Some(true)))
     * (jack,(hello,None))
     */
    val joinRDD = rdd.leftOuterJoin(blackRDD)

    //Some(true) means the user is on the blacklist; None means the user is not. Keep only the non-blacklisted users
    val filterRDD = joinRDD.filter(x => x._2._2.isEmpty)

    filterRDD
   })
   .map(x=>(x._1,x._2._1)).print()

  ssc.start()
  ssc.awaitTermination()
 }
}

Test

Start nc and pass in users and their messages

You can see that the program filters out the statements of blacklisted users in real time

updateStateByKey operator development

The updateStateByKey operator lets you maintain arbitrary state and continuously update it with new information. It keeps a state for each key and updates that state batch by batch: for every batch, Spark applies the state update function to every previously seen key, whether or not the key has new values in that batch. If the update function returns None, the state for that key is deleted; for newly appearing keys, the update function is executed as well.

To use this operator, there are two steps:

  • Define the state: it can be any data type
  • Define the state update function: specify, as a function, how to update the state using the previous state and the new values from the input stream

Note: the updateStateByKey operation requires the checkpoint mechanism to be enabled
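For illustration, here is a minimal sketch (not part of the original example) of an update function that deletes a key's state as soon as that key receives no new values in a batch; it assumes a DStream of (word, 1) pairs like the pairsDStream built in the example below:

//Hypothetical sketch: accumulate a count per key, but return None for any key
//that has no new values in the current batch, which deletes that key's state
val expiringCounts = pairsDStream.updateStateByKey((values: Seq[Int], state: Option[Int]) => {
 if (values.isEmpty) None // returning None removes the state for this key
 else Some(state.getOrElse(0) + values.sum) // otherwise add the new values to the previous count
})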

Example: cache-based real-time wordcount

package StreamingDemo

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 *Cache based real-time wordcount, which counts the number of words globally
 */
object UpdateStateByKeyDemo {
 def main(args: Array[String]): Unit = {
  //Set log level
  Logger.getLogger("org").setLevel(Level.WARN)

  /**
   * If security authentication is not enabled, or the user obtained from Kerberos is null,
   * Spark reads the HADOOP_USER_NAME environment variable and uses its value as the Hadoop execution user.
   * In my test, with security authentication disabled, my user name was picked up automatically
   * even without setting it explicitly.
   */
  //System.setProperty("HADOOP_USER_NAME","Setsuna")

  val conf = new SparkConf()
   .setAppName(this.getClass.getSimpleName)
   .setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(2))

  //Set the path where checkpoint is stored
  ssc.checkpoint("hdfs://Hadoop01:9000/checkpoint")

  //Create input dstream
  val lineDStream = ssc.socketTextStream("Hadoop01", 6666)
  val wordDStream = lineDStream.flatMap(_.split(" "))
  val pairsDStream = wordDStream.map((_, 1))

  /**
   *State: represents the previous state value
   *Values: values corresponding to the key in the current batch
   */
  val resultDStream =
   pairsDStream.updateStateByKey((values: Seq[Int], state: Option[Int]) => {

    //When the state is None, this word has not been counted yet, so start the counter at 0
    var count = state.getOrElse(0)

    //Traverse values to accumulate the value of the new words
    for (value <- values) {
     count += value
    }

    //Returns the new state corresponding to the key, that is, the number of occurrences of the word
    Option(count)
   })

  //Output at console
  resultDStream.print()

  ssc.start()
  ssc.awaitTermination()
 }
}

Test

Start nc and input words

The console outputs results in real time

Sliding window operator development

Spark Streaming supports sliding window operations, which perform computations over the data that falls within a sliding window.
A sliding window involves three parameters: the batch interval, the window interval, and the sliding interval.

  • A window operation covers N batches of data
  • How many batches fall into a window is determined by the window interval, i.e. the duration (length) of the window
  • The sliding interval specifies how often the window slides forward to form a new window. By default, the sliding interval equals the batch interval

Note: the sliding interval and the window interval must both be integral multiples of the batch interval

Use an official chart as an illustration

The batch interval is 1 time unit, the window interval is 3 time units, and the sliding interval is 2 time units. For the initial window (time1 to time3), data processing is only triggered once a full window interval has elapsed. A sliding window operation must therefore specify two parameters: the window length and the sliding interval. Spark Streaming's support for sliding windows is more complete than Storm's.

Sliding window operators

Operator                   Description
window()                   Perform custom computations on the data in each sliding window
countByWindow()            Count the elements in each sliding window
reduceByWindow()           Reduce the data in each sliding window
reduceByKeyAndWindow()     Perform a reduceByKey operation on the data in each sliding window
countByValueAndWindow()    Perform a countByValue operation on the data in each sliding window
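As a rough sketch of how the first two operators in this table might be used (assuming a 1-second batch interval, the linesDStream from the examples above, and a checkpoint directory where required):

//Hypothetical sketch: both durations must be integral multiples of the 1-second batch interval
val windowed = linesDStream.window(Seconds(6), Seconds(2)) // the last 6 seconds of data, recomputed every 2 seconds
windowed.count().print()

//countByWindow counts the elements in each window; it requires ssc.checkpoint(...) because it uses an incremental reduce with an inverse function internally
linesDStream.countByWindow(Seconds(6), Seconds(2)).print()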

reduceByKeyAndWindow operator development

Example: real-time sliding statistics of hot search terms

Every 2 seconds, count the top 3 search terms in the last 5 seconds and the number of times they appear

package StreamingDemo

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 *Demand: every 2 seconds, count the top 3 search terms in the last 5 seconds and the number of times
 */
object ReduceByKeyAndWindowDemo {
 def main(args: Array[String]): Unit = {

  //Set log level
  Logger.getLogger("org").setLevel(Level.WARN)
  //Basic configuration
  val conf = new SparkConf()
   .setAppName(this.getClass.getSimpleName)
   .setMaster("local[2]")

  //Batch interval set to 1s
  val ssc = new StreamingContext(conf, Seconds(1))

  val linesDStream = ssc.socketTextStream("Hadoop01", 6666)
  linesDStream
   .flatMap(_.split(" ")) // split each line on spaces
   .map((_, 1))           // map each search term to (term, 1)
   .reduceByKeyAndWindow(
    //Functions that define how windows are evaluated
    //X represents the result after aggregation, Y represents the next value to be aggregated corresponding to this key
    (x: Int, y: Int) => x + y,
    //Window length is 5 seconds
    Seconds(5),
    //Sliding interval is 2 seconds
    Seconds(2)
   )
   .transform(rdd => {
    //Sort by occurrence count in descending order and take the top three search terms
    val info: Array[(String, Int)] = rdd.sortBy(_._2, false).take(3)
    //Convert array to resultrdd
    val resultRDD = ssc.sparkContext.parallelize(info)
    resultRDD
   })
   .map(x => s"${x._1} occurs ${x._2} times")
   .print()

  ssc.start()
  ssc.awaitTermination()

 }
}

Test result

Overview of DStream output operations

Spark Streaming allows data in a DStream to be output to external systems. All computations on a DStream are triggered by its output operations. Even with the foreachRDD output operation, an action must be performed on the RDD in order to trigger the computation logic of each batch.

Transformation                         Description
print()                                Prints the first 10 elements of each batch of the DStream on the driver. Mainly used for testing, or to simply trigger a job when no other output operation is needed.
saveAsTextFiles(prefix, [suffix])      Saves the contents of the DStream as text files. The file generated in each batch interval is named prefix-TIME_IN_MS[.suffix].
saveAsObjectFiles(prefix, [suffix])    Serializes the contents of the DStream and saves them as SequenceFiles. The file generated in each batch interval is named prefix-TIME_IN_MS[.suffix].
saveAsHadoopFiles(prefix, [suffix])    Saves the contents of the DStream as Hadoop files. The file generated in each batch interval is named prefix-TIME_IN_MS[.suffix].
foreachRDD(func)                       The most generic output operation: applies the function func to each RDD in the DStream, usually to write the data to an external system such as files or a database over the network. Note that func is executed in the driver process of the streaming application.
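To make the point above concrete, here is a minimal sketch (dstream is a placeholder, as in the snippets below): transformations built inside foreachRDD are lazy, and nothing runs until an action is applied to the RDD.

dstream.foreachRDD { rdd =>
  val formatted = rdd.map(record => record.toString.trim) // lazy: building this alone triggers nothing
  formatted.foreach(println) // the action is what actually triggers the batch's computation
}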

foreachRDD operator development

foreachRDD is the most commonly used output operation. It iterates over each RDD generated in the DStream and processes it, typically writing the data of each RDD to external storage such as files, databases, or caches. Inside it, an action such as foreach is usually performed on the RDD.

Using foreachRDD to write to a database

Usually a connection, such as a JDBC connection, is created inside foreachRDD, and the data is then written to external storage through that connection

Mistake 1: creating the connection outside the RDD's foreach operation


dstream.foreachRDD { rdd =>
  val connection=createNewConnection()
  rdd.foreach { record => connection.send(record)
  }
}

This approach is wrong: it requires the connection object to be serialized and shipped to each task, but connection objects are not serializable and therefore cannot be transferred

Mistake 2: creating the connection inside the RDD's foreach operation


dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}

This approach works, but it is very inefficient, because a connection object is created for every single record in the RDD, and creating connection objects is usually expensive

Reasonable approaches

  • The first is to use the RDD's foreachPartition operation and create the connection object inside it, so that only one connection is created per RDD partition, saving a lot of resources (a rough sketch follows this list)
  • The second is to hand-roll a static connection pool and, inside the RDD's foreachPartition operation, obtain a connection from the pool through a static method, returning it to the pool after use. This way connections can be reused across the partitions of multiple RDDs
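A rough sketch of the first approach, reusing the placeholder createNewConnection() from the snippets above:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    //One connection per partition, created on the executor rather than on the driver
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    //Release the connection once the whole partition has been written
    connection.close()
  }
}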

Example: real-time global wordcount with the results saved to a MySQL database

The MySQL database table creation statement is as follows


CREATE TABLE wordcount (
  word varchar(100) CHARACTER SET utf8 NOT NULL,
  count int(10) NOT NULL,
  PRIMARY KEY (word)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Add mysql-connector-java-5.1.40-bin.jar to the IDEA project

The code is as follows

When writing the connection pool code I first considered using a static block to build a single pool and access it directly, but considering that the pool might run out of connections, the approach below is better: instantiate the connection pool up front and hand out connections on request; when all connections have been handed out and the pool is empty, a new batch of connections is instantiated.

package StreamingDemo

import java.sql.{Connection, DriverManager, SQLException}
import java.util

object JDBCManager {
 var connectionQue: java.util.LinkedList[Connection] = null

 /**
  *Get connection objects from the database connection pool
  * @return
  */
 def getConnection(): Connection = {
  synchronized({
   try {
    //If the connection pool has not been created yet, or all connections have been handed out, (re)create it and fill it
    if (connectionQue == null || connectionQue.isEmpty) {
     connectionQue = new util.LinkedList[Connection]()
     for (i <- 0 until (10)) {
      //Generate 10 connections and configure related information
      val connection = DriverManager.getConnection(
       "jdbc:mysql://Hadoop01:3306/test?characterEncoding=utf-8",
       "root",
       "root")
      //Push connection into connection pool
      connectionQue.push(connection)
     }
    }
   } catch {
    //Catch exception and output
    case e: SQLException => e.printStackTrace()
   }
   //Take the head connection out of the queue and return it, removing it from the pool
   return connectionQue.poll()
  })
 }

 /**
  *When the connection object is used up, you need to call this method to return the connection
  * @param connection
  */
 def returnConnection(connection: Connection) = {
  //Push the connection back into the pool
  connectionQue.push(connection)
 }

 def main(args: Array[String]): Unit = {
  //Main method test
  getConnection()
  println(connectionQue.size())
 }
}

Wordcount code

package StreamingDemo

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, streaming}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRDDDemo {
 def main(args: Array[String]): Unit = {
  //Set the log level to avoid too much info
  Logger.getLogger("org").setLevel(Level.WARN)

  //Explicitly set the Hadoop user (it may be picked up automatically even without this)
  System.setProperty("HADOOP_USER_NAME", "Setsuna")

  //Spark basic configuration
  val conf = new SparkConf()
   .setAppName(this.getClass.getSimpleName)
   .setMaster("local[2]")
  val ssc = new StreamingContext(conf, streaming.Seconds(2))

  //Because updateStateByKey is used, a checkpoint directory is required
  ssc.checkpoint("hdfs://Hadoop01:9000/checkpoint")

  //Set socket as configured by NC
  val linesDStream = ssc.socketTextStream("Hadoop01", 6666)
  val wordCountDStream = linesDStream
   .flatMap(_.split(" ")) // split each line on spaces
   .map((_, 1))           // map each word to (word, 1)
   .updateStateByKey((values: Seq[Int], state: Option[Int]) => {
    //Update status information in real time
    var count = state.getOrElse(0)
    for (value <- values) {
     count += value
    }
    Option(count)
   })

  wordCountDStream.foreachRDD(rdd => {
   if (!rdd.isEmpty()) {
    rdd.foreachPartition(part => {
     //Get connection from connection pool
     val connection = JDBCManager.getConnection()
     part.foreach(data => {
      //Insert the wordcount record; if the word already exists, the ON DUPLICATE KEY UPDATE clause updates its count
      val sql =
       s"insert into wordcount (word,count) " +
        s"values ('${data._1}',${data._2}) on duplicate key update count=${data._2}"
      //Execute the SQL statement through a PreparedStatement
      val pstmt = connection.prepareStatement(sql)
      pstmt.executeUpdate()
      pstmt.close()
     })
     //After the partition's data has been written, return the connection to the pool
     JDBCManager.returnConnection(connection)
    })
   }
  })

  ssc.start()
  ssc.awaitTermination()
 }
}

Start nc and input data

Query the wordcount results in another terminal, and you can see them change in real time

The above is the whole content of this article. I hope it helps you in your study.
