[Spark Learning Notes] RDD


What is RDD?

RDD is the computational cornerstone of Spark. It is an abstraction over distributed data that shields users from the complexity of the underlying computation and data mapping.

  1. RDD is immutable. Performing a transformation on an RDD produces a new RDD.
  2. RDD is partitioned. The data in an RDD is distributed across executors on multiple machines, in on-heap and off-heap memory plus disk.
  3. RDD is resilient.

    • Storage: Spark automatically caches RDD data in memory or on disk according to the user’s configuration or the current state of the application. This is an encapsulated feature that is invisible to users.
    • Fault tolerance: when RDD data is deleted or lost, it can be recovered through the lineage or checkpoint mechanism. This is transparent to the user.
    • Computation: computation is hierarchical: application -> job -> stage -> taskset -> task. Each layer has its own guarantees and retry mechanism, ensuring that computation is not terminated by unexpected failures.
    • Repartitioning: the data distribution within an RDD can be readjusted, via certain operators, to match business requirements.

What Spark Core does is operate on RDDs:
creation of an RDD -> transformation of the RDD -> caching of the RDD -> action on the RDD -> output of the RDD.

RDD persistence

An RDD can cache its computed results via the persist or cache method. By default, persist() stores the data as deserialized objects in the JVM heap (storage level MEMORY_ONLY).
However, calling either method does not cache anything immediately: the RDD is cached in the memory of the compute nodes only when a subsequent action is triggered, and is then reused by later actions.

Looking at the source code reveals that cache simply calls persist with the default storage level, which stores a single copy in memory.
Spark defines the full range of storage levels in the object StorageLevel.

RDD checkpoint mechanism

Checkpointing (in essence, writing an RDD to disk) complements lineage-based fault tolerance: when the lineage grows too long, the cost of recovering from it becomes too high, so it is better to checkpoint at an intermediate stage. If a node fails and a partition is lost, the lineage can be replayed from the checkpointed RDD rather than from the beginning, reducing the overhead. Spark implements RDD checkpointing by writing the data to files on HDFS.
A checkpoint saves the RDD in HDFS, which is reliable multi-replica storage, so the dependency chain can be cut off (the parent RDDs are no longer needed). It is thus fault tolerance implemented by replication rather than by recomputation.

The checkpoint mechanism is more appropriate in scenarios such as:
1) The lineage in the DAG is too long, so recomputing it would be too expensive (for example, in PageRank).
2) Checkpointing on top of a wide dependency brings greater benefit.