2. Spark principles: RDD and shared variables

Time: 2020-11-26

Notes

  1. This article only discusses Spark running in cluster mode.
  2. Estimated reading time: 15 minutes.
  3. All examples are run in the spark-shell interactive shell.

Spark has two important concepts: RDDs and shared variables. They are introduced in detail below.

1 RDD

RDD stands for resilient distributed dataset. Spark operations are built on this abstraction, which lets Spark handle very different big data processing scenarios in a largely uniform way, including MapReduce-style batch processing, streaming, SQL, machine learning, and graph computation.

Spark supports several programming languages, including Python, Scala, and R, and an RDD can store objects from any of them.

1.1 create RDD

Spark supports creating RDDs from multiple sources, including the local file system, HDFS, HBase, and in-memory collections.

1.1.1 local file system

// Create an immutable RDD from the local file /user/root/data.txt
val distFile = sc.textFile("file:///user/root/data.txt")

1.1.2 HDFS file system

By default, Spark reads data from the HDFS file system:

val distFile = sc.textFile("/user/root/data.txt")
// Specify the HDFS URL explicitly
val distFile = sc.textFile("hdfs://localhost:9000/user/data.txt")

1.1.3 memory

An in-memory collection can be turned into an RDD by calling the parallelize method of SparkContext:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

Once the RDD is created, the distributed dataset (distData) can be operated on in parallel across the cluster.
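For example (a minimal sketch using the distData RDD created above), the elements can be summed in parallel with reduce:

val sum = distData.reduce((a, b) => a + b)
// sum: Int = 15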

1.2 RDD operation

RDD supports two types of operations:

  • Transformations, which create a new RDD from an existing one;
  • Actions, which compute a result from an RDD and return it to the driver program, or write it to an external storage system (such as HDFS).

The important difference between transformations and actions is that transformations are evaluated lazily: they are not executed until an action needs their result.
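A minimal sketch of this laziness, assuming the data.txt file used earlier exists:

val lines = sc.textFile("file:///user/root/data.txt")  // no data is read yet
val lengths = lines.map(_.length)                      // transformation: still nothing is executed
val total = lengths.reduce(_ + _)                      // action: triggers reading the file and running the map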

1.2.1 transformation operations

The following RDD methods are transformations:

  • map
  • filter
  • flatMap
  • mapPartitions
  • sample
  • union
  • intersection
  • distinct
  • groupByKey
  • reduceByKey
  • aggregateByKey
  • sortByKey
  • join
  • cogroup
  • pipe
  • repartition

A transformation returns a new RDD.

1.2.2 action operations

The following RDD methods are actions:

  • reduce
  • collect
  • count
  • first
  • take
  • takeSample
  • takeOrdered
  • saveAsTextFile
  • saveAsSequenceFile
  • saveAsObjectFile
  • countByKey
  • foreach

An action returns a final result to the driver program (or writes it to external storage).
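A short illustration of a few of these methods, sketched on a small in-memory dataset:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _)  // transformation: returns a new RDD, nothing runs yet
summed.collect()                       // action: e.g. Array((a,4), (b,2))
summed.count()                         // action: 2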

1.3 RDD cache

Spark evaluates RDD operations lazily, but sometimes you want to use the same RDD many times. If you simply keep calling actions on an RDD, Spark recomputes it each time. This is especially expensive in iterative algorithms, which often reuse the same dataset many times.

To avoid recomputing the same RDD multiple times, you can ask Spark to persist the data. The persist() method of an RDD does this; it accepts a parameter specifying the storage level, i.e. where and how the data is persisted.

val lines = sc.textFile("file:///tmp/README.md")
import org.apache.spark.storage._
lines.persist(StorageLevel.MEMORY_ONLY)
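Once persisted, later actions reuse the cached partitions instead of recomputing them (a sketch continuing the example above):

lines.count()  // first action: reads the file and caches the partitions in memory
lines.count()  // subsequent actions read from the cache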

1.3.1 removing cached data

Spark automatically monitors cache usage and evicts old data according to an LRU (least recently used) policy. If you want to remove cached data manually, use the unpersist method. By default the call does not block and returns immediately; pass blocking = true if you want it to wait until the data has actually been removed.
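For example, using the lines RDD persisted above:

lines.unpersist()                 // asynchronous removal (default)
lines.unpersist(blocking = true)  // block until the cached blocks are actually removed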

2 Shared Variables

When a function passed to a Spark operation is executed on a remote node, Spark ships a separate copy of each variable used in the function to that node. Updates made to these copies on remote nodes are not propagated back to the driver program. Supporting general read-write shared variables across tasks would be inefficient, so Spark instead provides two restricted types of shared variables:

  • Broadcast variables
  • Accumulators

2.1 broadcast variables

Broadcast variables let the developer cache a read-only variable on each node instead of shipping a copy of it with every task. They can be used, for example, to give every node a copy of a large input dataset efficiently. Spark also uses efficient broadcast algorithms to reduce the transmission cost.

Explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.

To create a broadcast variable:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

The value of a broadcast variable is read through its value field. As the name suggests, a broadcast variable travels one way, from the driver to the tasks: it cannot be updated on the executors, and updates cannot be sent back to the driver. This ensures that all nodes see the same data.
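Inside a task, the value is accessed the same way; a minimal sketch using the broadcastVar created above:

// Each executor reads its cached copy of the broadcast value instead of
// receiving the array with every task.
sc.parallelize(Array(1, 2, 3, 4, 5, 6))
  .filter(x => broadcastVar.value.contains(x))
  .collect()  // Array(1, 2, 3)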

Call the unpersist() method to release the resources a broadcast variable occupies on the executors; if the variable is used again afterwards, it is re-broadcast. To permanently release all resources held by a broadcast variable, call the destroy() method.
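For example:

broadcastVar.unpersist()  // drop cached copies on the executors; re-broadcast on next use
broadcastVar.destroy()    // permanently release all resources; the variable cannot be used again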

2.2 accumulators

Accumulators are somewhat like static variables in C. Spark provides numeric accumulators and also supports user-defined ones; their defining feature is that multiple tasks can safely add to the same accumulator. You can create a long or double accumulator with SparkContext.longAccumulator() or SparkContext.doubleAccumulator(). Tasks add to an accumulator with its add method, but executors cannot read its value; only the driver program can read it, through the value method.

scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Long = 10

You can also define a custom accumulator by extending AccumulatorV2.
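A minimal sketch of a custom accumulator, assuming Spark 2.x or later (where custom accumulators extend AccumulatorV2):

import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Collects the distinct strings seen by tasks into a set.
class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private val set = mutable.Set.empty[String]
  override def isZero: Boolean = set.isEmpty
  override def copy(): StringSetAccumulator = {
    val acc = new StringSetAccumulator
    acc.set ++= set
    acc
  }
  override def reset(): Unit = set.clear()
  override def add(v: String): Unit = set += v
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit = set ++= other.value
  override def value: Set[String] = set.toSet
}

// Register the accumulator with the SparkContext before using it in tasks.
val wordsSeen = new StringSetAccumulator
sc.register(wordsSeen, "words seen")
sc.parallelize(Seq("a", "b", "a")).foreach(w => wordsSeen.add(w))
wordsSeen.value  // Set(a, b)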

References

  1. http://spark.apache.org/docs/…