Spark learning notes: Spark Streaming

Time: 2019-11-06

1. Spark Streaming

  • Spark Streaming is a real-time computing framework built on Spark Core that can consume and process data from many different sources
  • The most basic abstraction in Spark Streaming is the DStream (discretized stream), which is essentially a continuous series of RDDs; a DStream is a wrapper around RDDs
  • A DStream can be regarded as an RDD factory: the RDDs it produces share the same business logic, but each one reads different data
  • Within one batch interval, a DStream generates exactly one RDD
  • A DStream is therefore a "template": the RDDs generated over time are processed according to this template, and the RDD DAG is built from it (see the sketch after this list)
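
A minimal sketch (not from the original notes; host, port, and batch interval are placeholders) showing that each batch interval yields exactly one RDD, which foreachRDD exposes directly:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsTemplate {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamAsTemplate").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(3))
    //The DStream is the "template"; every 3 seconds it produces one new RDD
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time produced one RDD with ${rdd.getNumPartitions} partitions")
    }
    ssc.start()
    ssc.awaitTermination()
  }
}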

2. The current popular real-time computing engine

| Engine | Throughput | Language | Processing speed | Ecosystem |
| --- | --- | --- | --- | --- |
| Storm | Low | Clojure | Very fast (sub-second) | Alibaba (JStorm) |
| Flink | Higher | Scala | Faster (sub-second) | Less used in China |
| Spark Streaming | Very high | Scala | Fast (ms level) | Complete ecosystem |

3. Processing network data with Spark Streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

//Creating a StreamingContext requires at least two threads: one to receive data and one to process it
val conf = new SparkConf().setAppName("Ops1").setMaster("local[2]")
val ssc = new StreamingContext(conf, Milliseconds(3000))
//Read lines from a TCP socket on uplooking01:44444
val receiverDS: ReceiverInputDStream[String] = ssc.socketTextStream("uplooking01", 44444)
//Word count per 3-second batch: split on commas, pair with 1, reduce by key
val pairRetDS: DStream[(String, Int)] = receiverDS.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
pairRetDS.print()
//Start the streaming computation
ssc.start()
//Wait for the computation to terminate
ssc.awaitTermination()
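
For a quick test (my assumption, not part of the original notes), start a TCP server on uplooking01 with nc -lk 44444 and type comma-separated words; the per-batch word counts are printed every 3 seconds.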

4. Two ways for Spark Streaming to receive Kafka data

Receiver

  • Offsets are maintained by ZooKeeper
  • Uses Kafka's high-level consumer API
  • Programming is simple
  • Low efficiency (to guarantee no data loss, the write-ahead log (WAL) has to be enabled; see the sketch after this list)
  • The receiver-based approach was removed in the Kafka 0.10 integration
  • Generally not used in production environments
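
The WAL mentioned above is switched on through a Spark configuration flag. A minimal sketch (not from the original notes; the checkpoint path is a placeholder) of what enabling it looks like:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

//Enable the receiver write-ahead log: received data is also persisted to the
//checkpoint directory, which is exactly why the receiver approach is slower
val conf = new SparkConf()
  .setAppName("ReceiverWithWAL")
  .setMaster("local[2]")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(3))
//The WAL requires a fault-tolerant checkpoint directory
ssc.checkpoint("hdfs:///tmp/spark-wal-checkpoint")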

Direct

  • Offsets are maintained manually by us (see the offset sketch after the Direct code in section 5)
  • High efficiency (Spark Streaming connects directly to Kafka's partitions, with no receiver and no WAL)
  • Programming is more complex
  • This is the approach generally used in production environments

5. Spark Streaming integration with Kafka

Receiver-based Kafka integration (not recommended for production; removed from the 0.10 integration)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

//Creating a StreamingContext requires at least two threads: one to receive data and one to process it
val conf = new SparkConf().setAppName("Ops1").setMaster("local[2]")
val ssc = new StreamingContext(conf, Milliseconds(3000))
//ZooKeeper quorum that holds the consumer offsets
val zkQuorum = "uplooking03:2181,uplooking04:2181,uplooking05:2181"
val groupId = "myid"
//Topic name -> number of receiver threads
val topics = Map("hadoop" -> 3)
//Each record is a (key, message) pair
val receiverDS: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
receiverDS.flatMap(_._2.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()

Direct approach (used in production environments)

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

//Creating a StreamingContext requires at least two threads: one to receive data and one to process it
val conf = new SparkConf().setAppName("Ops1").setMaster("local[2]")
val ssc = new StreamingContext(conf, Milliseconds(3000))
//Connect directly to the Kafka brokers; offsets are not managed by ZooKeeper
val kafkaParams = Map("metadata.broker.list" -> "uplooking03:9092,uplooking04:9092,uplooking05:9092")
val topics = Set("hadoop")
val inputDS: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
inputDS.flatMap(_._2.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()
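
Both snippets above use the old spark-streaming-kafka 0.8 integration. With the Direct approach the offsets have to be tracked by us, as noted in section 4. A minimal sketch (not from the original notes; persisting the offsets is left as a comment) of how the per-partition offset ranges can be read from each batch of the direct stream:

import org.apache.spark.streaming.kafka.HasOffsetRanges

//Each RDD produced by the direct stream carries the Kafka offset ranges it read
inputDS.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
    //persist (topic, partition, untilOffset) to ZooKeeper / a database here
  }
}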

6. Architecture of real-time stream computing

1. Generate logs (simulate the access log of users visiting a web application)

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Date;
import java.util.Random;

public class GenerateAccessLog {
  public static void main(String[] args) throws IOException, InterruptedException {
    //Prepare data
    int[] ips = {123, 18, 123, 112, 181, 16, 172, 183, 190, 191, 196, 120};
    String[] requestTypes = {"GET", "POST"};
    String[] cursors = {"/vip/112", "/vip/113", "/vip/114", "/vip/115", "/vip/116", "/vip/117", "/vip/118", "/vip/119", "/vip/120", "/vip/121", "/free/210", "/free/211", "/free/212", "/free/213", "/free/214", "/company/312", "/company/313", "/company/314", "/company/315"};
    String[] courseNames = {"big data", "Python", "Java", "C++", "C", "Scala", "Android", "Spark", "Hadoop", "Redis"};
    String[] references = {"www.baidu.com/", "www.sougou.com/", "www.sou.com/", "www.google.com"};
    FileWriter fw = new FileWriter(args[0]);
    PrintWriter printWriter = new PrintWriter(fw);
    while (true) {
      //      Thread.sleep(1000);
      //Generate the fields of one log line
      String date = new Date().toLocaleString();
      String method = requestTypes[getRandomNum(0, requestTypes.length)];
      String url = "/cursor" + cursors[getRandomNum(0, cursors.length)];
      String httpVersion = "HTTP/1.1";
      String ip = ips[getRandomNum(0, ips.length)] + "." + ips[getRandomNum(0, ips.length)] + "." + ips[getRandomNum(0, ips.length)] + "." + ips[getRandomNum(0, ips.length)];
      String reference = references[getRandomNum(0, references.length)];
      String rowLog = date + " " + method + " " + url + " " + httpVersion + " " + ip + " " + reference;
      printWriter.println(rowLog);
      printWriter.flush();
    }
  }

  //Returns a random int in [start, end)
  public static int getRandomNum(int start, int end) {
    int i = new Random().nextInt(end - start) + start;
    return i;
  }
}
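
The output path is taken from args[0]; to match the Flume configuration below (my assumption tying the two steps together), run the generator with /logs/access.log as its argument so that the exec source's tail -F /logs/access.log picks the lines up.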

2. Flume collects the web application server's log data via Avro

Collect the output of command execution (tail -F on the access log) into an Avro sink (agent f1)

# The configuration file needs to define the sources, 
# the channels and the sinks.
# Sources, channels and sinks are defined per agent, 
# in this case called 'agent'
f1.sources = r1
f1.channels = c1
f1.sinks = k1

#define sources
f1.sources.r1.type = exec
f1.sources.r1.command =tail -F /logs/access.log

#define channels
f1.channels.c1.type = memory
f1.channels.c1.capacity = 1000
f1.channels.c1.transactionCapacity = 100

#Define sink collect log to uplooking03
f1.sinks.k1.type = avro
f1.sinks.k1.hostname = uplooking03
f1.sinks.k1.port = 44444

#bind sources and sink to channel 
f1.sources.r1.channels = c1
f1.sinks.k1.channel = c1
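
Each of these agents is started with the standard flume-ng command, for example (the config file name is a placeholder): flume-ng agent --name f1 --conf-file exec-to-avro.conf -Dflume.root.logger=INFO,console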

From Avro to the console (agent f2, for testing)

# The configuration file needs to define the sources, 
# the channels and the sinks.
# Sources, channels and sinks are defined per agent, 
# in this case called 'agent'
f2.sources = r2
f2.channels = c2
f2.sinks = k2

#define sources
f2.sources.r2.type = avro
f2.sources.r2.bind = uplooking03
f2.sources.r2.port = 44444

#define channels
f2.channels.c2.type = memory
f2.channels.c2.capacity = 1000
f2.channels.c2.transactionCapacity = 100

#define sink
f2.sinks.k2.type = logger

#bind sources and sink to channel 
f2.sources.r2.channels = c2
f2.sinks.k2.channel = c2

From Avro to Kafka (agent f2, topic hadoop)

# The configuration file needs to define the sources, 
# the channels and the sinks.
# Sources, channels and sinks are defined per agent, 
# in this case called 'agent'
f2.sources = r2
f2.channels = c2
f2.sinks = k2

#define sources
f2.sources.r2.type = avro
f2.sources.r2.bind = uplooking03
f2.sources.r2.port = 44444

#define channels
f2.channels.c2.type = memory
f2.channels.c2.capacity = 1000
f2.channels.c2.transactionCapacity = 100

#define sink
f2.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
f2.sinks.k2.topic = hadoop
f2.sinks.k2.brokerList = uplooking03:9092,uplooking04:9092,uplooking05:9092
f2.sinks.k2.requiredAcks = 1
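
With the full pipeline in place (log generator → Flume exec source → Avro → Kafka topic hadoop), the Spark Streaming job from section 5 can consume the log lines. A minimal sketch (not from the original notes; it reuses the direct-stream inputDS from section 5 and assumes the space-separated line format produced by GenerateAccessLog) that counts requests per URL:

//Each Kafka record value is one line: <date> <method> <url> <httpVersion> <ip> <reference>
//The date from toLocaleString() may itself contain spaces, so the url is taken
//as the 4th field counted from the end of the line
val urlCounts = inputDS
  .map(_._2)
  .map(_.split(" "))
  .filter(_.length >= 6)
  .map(fields => (fields(fields.length - 4), 1))
  .reduceByKey(_ + _)
urlCounts.print()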
