Spark series (6): Shared variables in Spark


By studytime

What are shared variables?

Spark runs every transformation as a set of parallel tasks distributed across multiple nodes. When a custom function is passed to a Spark operator (such as map or reduce), each variable the function references is copied to the remote nodes. Writes to those copies stay local to the task: they are not sent back to the driver and are not propagated to other tasks. General read-write variables shared across tasks would be inefficient, so Spark provides two restricted kinds of shared variable: broadcast variables and accumulators.
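The copy behaviour described above can be sketched as follows. This is a minimal illustration, assuming a local Spark installation; the object name `ClosureCopySketch` and the sample data are made up for the example. The driver-side counter is deliberately printed without a stated expected value, because what it ends up holding depends on how and where the tasks run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClosureCopySketch {
  // Returns (driver-side counter after the job, sum computed with reduce).
  def run(sc: SparkContext): (Int, Int) = {
    var counter = 0 // lives on the driver
    val rdd = sc.parallelize(1 to 10, 2)

    // Each task works on its own copy of `counter`, so on a cluster
    // these additions do not reach the driver's variable.
    rdd.foreach(x => counter += x)

    // Returning a value through an action is the reliable way to get data back.
    val total = rdd.reduce(_ + _)
    (counter, total)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("closure-copy-sketch")
    val sc = new SparkContext(conf)
    try {
      val (counter, total) = run(sc)
      println(s"driver-side counter = $counter, reduce total = $total")
    } finally sc.stop()
  }
}
```

The value returned by `reduce` is well defined, while the driver-side `counter` should not be relied on.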

Broadcast variable

What is a broadcast variable?

A broadcast variable is a read-only variable that is distributed to every node of the cluster. The driver sends the value to each executor, and the executor keeps a single copy that all of its tasks share, rather than one copy per task. With many tasks, this keeps the driver's network bandwidth from becoming the bottleneck and reduces memory use on the executors. Spark uses an efficient broadcast algorithm to distribute the value.

Broadcast variables illustrated

[Figure: without broadcast variables]

[Figure: with broadcast variables]

How to create and use broadcast variables in Spark

  1. Define a broadcast variable
val a = 3
val broadcast = sc.broadcast(a)
  2. Read the value of a broadcast variable
val c = broadcast.value
  3. Usage example
val slices = 2 // number of partitions (value assumed for illustration)
val arr = (0 until 1000).toArray

// Create a broadcast variable whose data is an array
val barr = sc.broadcast(arr)

// Use the broadcast variable inside a task
val observedSizes = sc.parallelize(1 to 10, slices).map(_ => barr.value.size)
  4. Notes on using broadcast variables
  • Once a value has been broadcast, it is read-only and cannot be modified.
  • Can an RDD be broadcast? No: an RDD does not hold data itself. Collect the RDD's result to the driver first, then broadcast that result.
  • Broadcast variables can only be created on the driver, not on the executors.
  • Executors cannot change the value of a broadcast variable. The driver should not change it after broadcasting either, so that every node sees the same value; to distribute a new value, create a new broadcast variable.
  • Without a broadcast variable, an executor holds as many copies of a driver-side variable as it runs tasks that use it.
  • With a broadcast variable, each executor holds only one copy of the driver-side variable.
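A common use of the points above is shipping a small lookup table to the executors. The following is a sketch under stated assumptions rather than a definitive pattern: the object name `BroadcastLookupSketch`, the `countryNames` table, and its contents are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupSketch {
  // Hypothetical lookup table; the codes and names are illustrative only.
  val countryNames: Map[String, String] =
    Map("cn" -> "China", "us" -> "United States", "fr" -> "France")

  def resolve(sc: SparkContext, codes: Seq[String]): Array[String] = {
    // Created on the driver; each executor receives one copy of the map.
    val lookup = sc.broadcast(countryNames)
    try {
      sc.parallelize(codes, 2)
        .map(code => lookup.value.getOrElse(code, "unknown")) // read-only access in tasks
        .collect()
    } finally {
      // Drop the executor-side copies once the job no longer needs them.
      lookup.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-lookup-sketch")
    val sc = new SparkContext(conf)
    try println(resolve(sc, Seq("cn", "xx", "fr")).mkString(", "))
    finally sc.stop()
  }
}
```

The task closure captures only the small `lookup` handle, and the map itself is fetched once per executor when `lookup.value` is first read there.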


Accumulator

What is an accumulator?

An accumulator is similar to a distributed counter in MapReduce. It holds a value (typically numeric) that each task can add to independently, and Spark merges the per-task updates into a single global value. Accumulators are often used to track the state of a running job, which makes Spark programs easier to debug and monitor.
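As a rough sketch of the counter idea, the example below counts blank lines as a side channel while the job itself computes a total character count. It assumes Spark 2.0 or later, where `sc.longAccumulator` is available; the object name `AccumulatorCountSketch` and the sample input are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorCountSketch {
  // Returns (total characters across all lines, number of blank lines seen).
  def run(sc: SparkContext, lines: Seq[String]): (Long, Long) = {
    val blankLines = sc.longAccumulator("blankLines") // starts at 0 on the driver

    val totalChars = sc.parallelize(lines, 2).map { line =>
      if (line.trim.isEmpty) blankLines.add(1) // each task adds to its own partial count
      line.length.toLong
    }.reduce(_ + _) // the action runs the tasks and merges their updates

    val blank: Long = blankLines.value // read the merged value on the driver
    (totalChars, blank)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("accumulator-count-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc, Seq("a", "", "bc", "  ")))
    finally sc.stop()
  }
}
```

Note that updates made inside a transformation such as `map` can be applied more than once if a task is re-executed, so counts gathered this way are best treated as diagnostics.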

Accumulators illustrated

[Figure: without an accumulator]

[Figure: with an accumulator]

How to create and use accumulators in Spark

Define an accumulator
val a = sc.accumulator(0)
Read the value of an accumulator
val b = a.value
Usage example
import scala.math.random

// n (number of samples) and slices (number of partitions) are assumed to be defined earlier
// Define an accumulator named "total" with an initial value of 0
val totalPoints = sc.accumulator(0, "total")
// Define an accumulator named "hit" with an initial value of 0
val hitPoints = sc.accumulator(0, "hit")
val count = sc.parallelize(1 until n, slices).map { _ =>
    val x = random * 2 - 1
    val y = random * 2 - 1
    totalPoints += 1 // update accumulator
    if (x * x + y * y < 1) hitPoints += 1 // update accumulator
}.count() // an action is needed so that the map actually runs

// Get the accumulator values on the driver; convert to Double to avoid integer division
val result = hitPoints.value.toDouble / totalPoints.value
Notes on using accumulators
  • The initial value of an accumulator is defined on the driver. Its value can only be read on the driver, and executors can only update it.
  • Using an accumulator is not a performance optimization. It is needed for correctness: without it, the result is wrong.
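The division of roles in these notes (defined and read on the driver, updated in tasks) can be sketched as below. This assumes Spark 2.0 or later for `sc.longAccumulator`; the object name `AccumulatorReadSketch` is hypothetical. The sketch also shows that the driver sees no updates until an action has actually run the tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorReadSketch {
  // Returns the accumulator value read on the driver (before the action, after the action).
  def run(sc: SparkContext): (Long, Long) = {
    val seen = sc.longAccumulator("seen") // defined on the driver, initial value 0

    // The update happens inside tasks on the executors.
    val mapped = sc.parallelize(1 to 100, 4).map { x => seen.add(1); x * 2 }

    val beforeAction: Long = seen.value // map is lazy, so no task has run yet
    mapped.count()                      // the action triggers the tasks
    val afterAction: Long = seen.value  // merged updates are visible on the driver
    (beforeAction, afterAction)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("accumulator-read-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc))
    finally sc.stop()
  }
}
```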