Spark serialization

Time: 2021-02-25

Serialization is mainly used for network transmission and for persisting data, making it easier to store and ship. Spark creates two serializers:

1. Serializer

SparkEnv

// The serializer is mainly used to serialize shuffle data and RDD cache blocks
val serializer = instantiateClassFromConf[Serializer](
  "spark.serializer", "org.apache.spark.serializer.JavaSerializer")
logDebug(s"Using serializer: ${serializer.getClass}")

// The closureSerializer is mainly used to serialize tasks (closures)
val closureSerializer = instantiateClassFromConf[Serializer](
  "spark.closure.serializer", "org.apache.spark.serializer.JavaSerializer")

// The BlockManager is constructed with the data serializer (not the closure serializer)
val blockManager = new BlockManager(executorId, rpcEnv, blockManagerMaster,
  serializer, conf, memoryManager, mapOutputTracker, shuffleManager,
  blockTransferService, securityManager, numUsableCores)
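
To make the first step concrete, here is a minimal, hedged sketch of how a serializer can be instantiated from a configuration key via reflection. The helper name instantiateSerializer is made up for illustration; Spark's own instantiateClassFromConf also tries a (SparkConf, Boolean) constructor, which is omitted here.

import org.apache.spark.SparkConf
import org.apache.spark.serializer.Serializer

// Illustrative sketch only: resolve a class name from the conf and build it.
// Tries a (SparkConf) constructor first, then falls back to a no-arg one.
def instantiateSerializer(conf: SparkConf): Serializer = {
  val className = conf.get("spark.serializer",
    "org.apache.spark.serializer.JavaSerializer")
  val cls = Class.forName(className)
  try {
    cls.getConstructor(classOf[SparkConf]).newInstance(conf).asInstanceOf[Serializer]
  } catch {
    case _: NoSuchMethodException =>
      cls.getConstructor().newInstance().asInstanceOf[Serializer]
  }
}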

2. Task serialization

SparkContext

def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
// clean() cleans the closure and verifies that func can be serialized with the closure serializer
  val cleanedFunc = clean(func) 
  ......
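
clean() delegates to the ClosureCleaner, and the serializability check it performs boils down to trying to serialize the function with the closure serializer. A hedged, standalone approximation of that check (the real method lives inside ClosureCleaner and differs in detail):

import scala.reflect.ClassTag
import org.apache.spark.{SparkEnv, SparkException}

// Approximation of the check: if the closure serializer cannot serialize the
// function, fail fast on the driver instead of at task launch time.
def ensureSerializable[F: ClassTag](func: F): F = {
  try {
    SparkEnv.get.closureSerializer.newInstance().serialize(func)
    func
  } catch {
    case e: Exception => throw new SparkException("Task not serializable", e)
  }
}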

DAGScheduler.submitMissingTasks()

// Serializing the RDD (and its dependency/func) with the closureSerializer
private val closureSerializer = SparkEnv.get.closureSerializer.newInstance()
val taskBinaryBytes: Array[Byte] = stage match {
  case stage: ShuffleMapStage =>
    closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
  case stage: ResultStage =>
    closureSerializer.serialize((stage.rdd, stage.func): AnyRef).array()
}
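
Right after this, the serialized bytes are wrapped in a broadcast variable so that every task of the stage shares one copy per executor (this is the RDD-via-broadcast behaviour discussed under spark.broadcast.compress below). Paraphrased from the same method, so treat the exact line as approximate:

// The serialized (rdd, dep) / (rdd, func) pair is broadcast once per stage;
// each executor fetches it a single time instead of receiving it per task.
val taskBinary: Broadcast[Array[Byte]] = sc.broadcast(taskBinaryBytes)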

CoarseGrainedSchedulerBackend.launchTasks()

// Serializing the TaskDescription with the closureSerializer
private val ser = SparkEnv.get.closureSerializer.newInstance()
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    val serializedTask = ser.serialize(task)
    ......
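
To see what such a serialize/deserialize pair looks like outside of the scheduler, here is a small self-contained round trip with the default JavaSerializer; the payload is just an ordinary Scala collection standing in for a TaskDescription:

import java.nio.ByteBuffer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.JavaSerializer

// Serialize to a ByteBuffer (as launchTasks does), then deserialize it back
// (as the executor side does below).
val ser = new JavaSerializer(new SparkConf()).newInstance()
val payload = Seq("stage 0", "partition 3", "attempt 0")
val bytes: ByteBuffer = ser.serialize(payload)
val restored = ser.deserialize[Seq[String]](bytes)
assert(restored == payload)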

3. Task deserialization

CoarseGrainedExecutorBackend.receive()

// Deserializing the TaskDescription with the closureSerializer
private[this] val ser: SerializerInstance = env.closureSerializer.newInstance()
case LaunchTask(data) =>
  if (executor == null) {
    logError("Received LaunchTask command but executor was null")
    System.exit(1)
  } else {
    val taskDesc = ser.deserialize[TaskDescription](data.value)
    ......

TaskRunner.run()

override def run(): Unit = {
  val taskMemoryManager = new TaskMemoryManager(env.memoryManager, taskId)
  val deserializeStartTime = System.currentTimeMillis()
  Thread.currentThread.setContextClassLoader(replClassLoader)
  val ser = env.closureSerializer.newInstance()
  ......
// Deserializing the Task with the closureSerializer
  task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
  ......

ShuffleMapTask.runTask()

override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val deserializeStartTime = System.currentTimeMillis()
  val ser = SparkEnv.get.closureSerializer.newInstance()
// Deserializing the RDD and ShuffleDependency with the closureSerializer
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
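
For context, the method then continues roughly as follows (paraphrased from the 1.x source, so the exact lines may differ): the deserialized rdd and dep are used to obtain a shuffle writer, and the records it writes are handled by the data serializer and the shuffle compression settings discussed in the next section, not by the closure serializer.

// Approximate continuation: write this partition's records through the
// shuffle writer; record-level serialization here uses spark.serializer.
val writer = SparkEnv.get.shuffleManager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
writer.stop(success = true).get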

4. Spark serialization-related configuration

spark.serializer

The default is org.apache.spark.serializer.JavaSerializer, with org.apache.spark.serializer.KryoSerializer as the optional alternative. In fact, any subclass of org.apache.spark.serializer.Serializer will do, but if you are just writing an application you will probably not implement one yourself.

Serialization has a large impact on the performance of Spark applications. For certain data formats, KryoSerializer can be more than ten times faster than JavaSerializer. Of course, across a whole Spark program the share of time spent in serialization is not that large, but taking WordCount as an example, a performance improvement of more than 30% is usually easy to achieve. For basic types such as Int, the improvement is almost negligible. KryoSerializer relies on Twitter's chill library; compared with JavaSerializer, its main limitation is that not every Java-serializable object is supported.

Note that the configurable serializer is used for shuffle data, RDD cache and similar cases, while Spark tasks are serialized with the serializer configured by spark.closure.serializer, which currently only supports JavaSerializer.

So there is effectively nothing to configure for task serialization. For more Kryo-related tuning options, refer to the corresponding section of http://spark.apache.org/docs/…
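
As a hedged example of switching the data serializer to Kryo and registering application classes (registration avoids writing full class names into the stream and usually helps both size and speed; the exact Kryo buffer key and its default depend on the Spark version):

import org.apache.spark.SparkConf

// Example domain class; registration is optional but recommended for Kryo.
case class Click(userId: Long, url: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64k")   // tuning knob; the key name varies by version
  .registerKryoClasses(Array(classOf[Click]))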

spark.rdd.compress

The default value is false. This parameter determines whether, during RDD caching, the RDD data is further compressed after serialization before being stored in memory or on disk.

The purpose is to further reduce the size of cached data. For a cache on disk, the absolute size probably does not matter much; the main concern is disk IO bandwidth. For a cache in memory, the main consideration is the effect of size: whether more data can be cached, and whether the pressure the cached data puts on GC can be reduced.
The former is usually not the main problem, especially when the RDD cache exists precisely to pursue speed, reduce recomputation and trade IO for CPU. The latter does require thinking about GC: with a smaller data volume and a smaller footprint, GC pressure will probably ease. However, whether it is really worth adding an RDD cache compression step, or whether the problem can be solved in some other, possibly more effective way, is another question.

So this value is off by default; but if disk IO really does become a problem, or there really is no better solution for GC pressure, you can consider enabling RDD compression.
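
A hedged example of enabling it: spark.rdd.compress only applies to serialized cache data, so it is paired here with a serialized storage level (the input path is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("rdd-compress-example")
  .set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)

// Cached blocks are serialized first, then compressed before storage.
val cached = sc.textFile("hdfs:///path/to/input")      // placeholder path
  .persist(StorageLevel.MEMORY_ONLY_SER)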

spark.broadcast.compress

Whether to compress broadcast data. The default value is true.

The broadcast mechanism is used to reduce the size of the data that has to be sent along with each task when it is launched. An executor only needs to fetch one copy of the broadcast data when its first task starts; subsequent tasks read it from the local BlockManager. In the 1.1 code, the RDD itself is also sent to executors via broadcast (previously the RDD was shipped with every task), so there is no longer much need to decide explicitly which data should be broadcast.
Because broadcast data has to be sent over the network, and the executor has to store it in its local BlockManager, and because with the newer implementation RDDs are sent via the broadcast mechanism by default, the share of broadcast data has grown considerably.

Therefore, reducing the size through compression, to cut both network transmission overhead and memory usage, is usually good for overall performance. When would it be better not to compress? Generally speaking, when network bandwidth and memory are not a problem but CPU resources on the driver side are (after all, compression is basically performed on the driver side), it may be worth turning it off.
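
A hedged sketch of the setting together with an explicit broadcast variable (the lookup table is made up for illustration; remember the RDD itself also travels through the same mechanism):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("broadcast-compress-example")
  .set("spark.broadcast.compress", "true")   // true is already the default
val sc = new SparkContext(conf)

// Shipped (compressed) once per executor; tasks read it from the local BlockManager.
val lookup = sc.broadcast((1 to 100000).map(i => i -> i.toString).toMap)
val hits = sc.parallelize(1 to 1000).filter(i => lookup.value.contains(i)).count()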

spark.io.compression.codec

The codec used to compress RDD cache data and shuffle data. lzf used to be the default codec; recently, because of lzf's memory overhead, the default was changed to snappy.

Compared with snappy, lzf has a relatively high compression ratio (it depends on the specific data, but usually around 20% higher), but besides the memory issue its CPU cost is also higher (roughly 20%–50%?).

For shuffle data, memory may become a problem when HashShuffleManager is used: with a large number of reduce partitions, a large number of compressed output streams have to be open at the same time for writing files, and the codec then needs a large number of buffers. With SortShuffleManager, however, the number of shuffle files is greatly reduced, so there is no longer a large number of compressed streams and memory overhead is probably no longer a major problem.
What remains is the trade-off between CPU and compression ratio. As before, it depends on the capacity and load of CPU, network and disk. The CPU is usually the easier one to become a bottleneck, so when tuning for performance it is more likely to come down to either not compressing at all or using snappy.

For RDD cache, most of the work is in-memory operation or local IO, so CPU load is probably a more prominent problem than IO, which is why spark.rdd.compress itself defaults to no compression. If you do want to compress, snappy is probably the appropriate choice.
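
Finally, a hedged example of selecting the codec. Recent 1.x versions accept the short aliases "snappy", "lzf" and "lz4"; older versions need the full class name such as org.apache.spark.io.SnappyCompressionCodec:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.io.compression.codec", "snappy")
  // or, trading extra CPU for a somewhat better compression ratio:
  // .set("spark.io.compression.codec", "lzf")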