Analysis of Apache Spark caching and checkpointing

Time: 2021-3-22

Abstract

For an Apache Spark application developer, memory management is one of the most important concerns, yet the difference between caching and checkpointing is a common source of confusion. Both operations avoid the time and space wasted by recomputing an RDD (resilient distributed dataset) every time it is referenced. So what is the difference between them?


Caching

The cache mechanism lets applications that access the same data repeatedly (such as iterative algorithms and interactive applications) run faster. Developers can choose among several persistence levels to balance space against computing cost, and to specify what happens to an RDD when memory runs out (cache in memory or on disk; when memory is insufficient, blocks are evicted in least-recently-used (LRU) order, and are written to disk only if the storage level includes disk). As a result, Spark not only avoids recomputing an RDD, but can also recompute lost partitions when a node fails. Finally, a cached RDD lives only as long as the running application. When the application terminates, its cached RDDs are deleted with it.
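To make the persistence levels concrete, here is a minimal sketch of caching an RDD that is used by two actions. It assumes Spark is on the classpath and uses a local master. The object name CachingSketch, the method sumAndCount, and the application name are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  /** Runs two actions over the same persisted RDD and returns (sum, count). */
  def sumAndCount(sc: SparkContext): (Long, Long) = {
    // Hypothetical transformation whose result we want to reuse.
    val squares = sc.parallelize(1 to 100, numSlices = 4).map(x => x.toLong * x)

    // Keep partitions in memory; partitions that do not fit go to local disk.
    // cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY).
    squares.persist(StorageLevel.MEMORY_AND_DISK)

    val sum = squares.reduce(_ + _) // first action: computes and stores the partitions
    val count = squares.count()     // second action: can be served from the stored blocks

    // Release the blocks explicitly instead of waiting for the application to end.
    squares.unpersist()
    (sum, count)
  }

  def main(args: Array[String]): Unit = {
    // The local master is an assumption made so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("caching-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(sumAndCount(sc)) finally sc.stop()
  }
}
```

Whether MEMORY_AND_DISK or another level is the right choice depends on how expensive the RDD is to recompute compared with reading it back from disk.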

Checkpointing

Checkpointing stores an RDD in a reliable storage system (such as HDFS or S3). Checkpointing an RDD is somewhat similar to how Hadoop MapReduce stores intermediate results on disk: it gives up some execution performance in exchange for better recovery when failures occur during a run. Because a checkpointed RDD lives in an external storage system (disk, HDFS, S3, etc.), it can be reused by other applications.
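As a rough illustration, the sketch below marks an RDD for checkpointing and shows that nothing is written until an action runs. It assumes Spark is on the classpath and a local master. The object name CheckpointSketch and the method checkpointStates are hypothetical, and the temporary local directory stands in for what would normally be an HDFS or S3 path.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  /** Returns (isCheckpointed before the action, isCheckpointed after the action). */
  def checkpointStates(sc: SparkContext, checkpointDir: String): (Boolean, Boolean) = {
    // The directory must be set before checkpoint() is called on any RDD.
    sc.setCheckpointDir(checkpointDir)

    val counts = sc.parallelize(Seq("a", "b", "a", "c"), 2)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.checkpoint()               // only marks the RDD; nothing is written yet
    val before = counts.isCheckpointed

    counts.count()                    // the action triggers the checkpoint write
    val after = counts.isCheckpointed

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // A temporary local directory is used here purely for illustration.
      val dir = Files.createTempDirectory("spark-checkpoint").toString
      println(checkpointStates(sc, dir))
    } finally sc.stop()
  }
}
```

After the action, counts.getCheckpointFile should report where the data was written, which is the location that outlives the application.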

The relationship between caching and checkpointing

Let’s first look at the computation path of an RDD to understand how caching and checkpointing interact.
The core of the Spark engine is the DAGScheduler. It decomposes a Spark job into a DAG of stages. Each shuffle or result stage is decomposed into tasks that run independently, one per RDD partition. The iterator method of an RDD is the entry point through which a task accesses the underlying data partition:

/**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   * This should ''not'' be called by users directly, but is available for implementors of custom
   * subclasses of RDD.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }

We can see from the code that if a storage level is set, the RDD may be cached, so iterator first calls getOrCompute to try to fetch the partition from the block manager.

/**
   * Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
   */
  private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
    val blockId = RDDBlockId(id, partition.index)
    var readCachedBlock = true
    // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
    SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
      readCachedBlock = false
      computeOrReadCheckpoint(partition, context)
    }) match {
      case Left(blockResult) =>
        if (readCachedBlock) {
          val existingMetrics = context.taskMetrics().registerInputMetrics(blockResult.readMethod)
          existingMetrics.incBytesReadInternal(blockResult.bytes)
          new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
            override def next(): T = {
              existingMetrics.incRecordsReadInternal(1)
              delegate.next()
            }
          }
        } else {
          new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
        }
      case Right(iter) =>
        new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
    }
  }

If the block manager does not hold the RDD partition, getOrCompute falls through to computeOrReadCheckpoint:

/**
   * Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
   */
  private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
  {
    if (isCheckpointedAndMaterialized) {
      firstParent[T].iterator(split, context)
    } else {
      compute(split, context)
    }
  }

As you might guess, computeOrReadCheckpoint reads the partition from the checkpoint if the RDD has been checkpointed and materialized (through the parent RDD, which by then is the checkpoint RDD). Otherwise, it computes the partition by calling compute.
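The lookup order described above can be modelled without Spark. The following is a toy sketch only: ToyRdd, its fields, and the trace labels are hypothetical stand-ins, not Spark classes, and the in-memory map merely plays the role of the block manager.

```scala
import scala.collection.mutable

object IteratorDispatchSketch {
  // cachingEnabled stands in for storageLevel != StorageLevel.NONE,
  // checkpointed stands in for isCheckpointedAndMaterialized.
  final class ToyRdd(cachingEnabled: Boolean, checkpointed: Boolean) {
    private val blockStore = mutable.Map.empty[Int, Vector[Int]]

    // Records which path each lookup took, in order.
    val trace = mutable.ListBuffer.empty[String]

    def iterator(partition: Int): Iterator[Int] =
      if (cachingEnabled) getOrCompute(partition)
      else computeOrReadCheckpoint(partition)

    private def getOrCompute(partition: Int): Iterator[Int] = {
      if (blockStore.contains(partition)) {
        trace += "cache-hit"
      } else {
        trace += "cache-miss"
        // On a miss, fall through to the checkpoint or to a fresh computation,
        // then keep the result for later lookups.
        blockStore(partition) = computeOrReadCheckpoint(partition).toVector
      }
      blockStore(partition).iterator
    }

    private def computeOrReadCheckpoint(partition: Int): Iterator[Int] = {
      if (checkpointed) trace += "read-checkpoint" else trace += "compute"
      Iterator(partition) // placeholder data for the partition
    }
  }
}
```

Calling iterator twice on a cached, non-checkpointed ToyRdd records a miss followed by a compute on the first call and a hit on the second, which mirrors why the cache is consulted before the checkpoint.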

It’s time to talk about the difference

With caching, each partition is stored (in memory, by default) as soon as it has been computed. Checkpointing does not store the result of that first computation. Instead, after the job finishes, Spark launches a separate job to write the checkpoint. In other words, an RDD that is to be checkpointed is computed twice. It is therefore recommended to call rdd.cache() before rdd.checkpoint(), so that the second job does not recompute the RDD but reads it from the cache and writes it to the checkpoint directory.

Spark also provides rdd.persist(StorageLevel.DISK_ONLY), which caches to disk, so that the RDD is stored on disk the first time it is computed. But persist and checkpoint differ in important ways. persist can write RDD partitions to disk, but those partitions are managed by the BlockManager. Once the driver program finishes and the executor process (CoarseGrainedExecutorBackend) stops, the BlockManager stops too, and the RDD cached on disk is cleared (the local directory used by the BlockManager is deleted). checkpoint persists the RDD to HDFS or a local directory, where it stays until it is removed manually. In other words, a checkpointed RDD can be used by the next driver program, while a cached RDD cannot be used by other driver programs.
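The double computation can be observed with an accumulator that counts how often the map function runs. This is a sketch under stated assumptions: Spark on the classpath, a local master, no task retries (accumulator updates inside transformations are not guaranteed to be applied exactly once), and a checkpoint directory that has already been set. The object name CacheBeforeCheckpointSketch and the method mapInvocations are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CacheBeforeCheckpointSketch {
  /** Counts how many times the map function runs for an RDD of n elements. */
  def mapInvocations(sc: SparkContext, n: Int, cacheFirst: Boolean): Long = {
    val calls = sc.longAccumulator("map-calls")
    val doubled = sc.parallelize(1 to n, 2).map { x =>
      calls.add(1) // one increment per element that is actually computed
      x * 2
    }

    if (cacheFirst) doubled.cache() // lets the checkpoint job read cached blocks
    doubled.checkpoint()
    doubled.count()                 // runs the job, then the separate checkpoint job

    calls.value.longValue()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-before-checkpoint").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Temporary local directory, used only for illustration.
      sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)
      println(s"without cache: ${mapInvocations(sc, 8, cacheFirst = false)}")
      println(s"with cache:    ${mapInvocations(sc, 8, cacheFirst = true)}")
    } finally sc.stop()
  }
}
```

In a local run without failures, one would expect the uncached variant to report twice as many map calls as the cached one, since the checkpoint job recomputes every partition.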

Summary

Checkpointing spends more time reading and writing the RDD (because it uses external storage such as HDFS, S3, or disk), but a Spark worker failure does not necessarily force recomputation. A cached RDD, on the other hand, does not occupy storage permanently, but must be recomputed when certain failures occur on a Spark worker. In the end, the choice depends on the developer's judgment and the business scenario, and is generally driven by the performance of the computing tasks. (Tip: caching is enough in most cases. If you think a job may fail, you can manually checkpoint a few critical RDDs.)

Author: Tan Yang, member of the data analysis group, MaxLeap team
Original blog: https://blog.maxleap.cn/archives/617
