“Spark” 2. Analysis of the basic concept of spark

Time: 2021-3-29


Preface

This series combines my notes from learning Spark, my understanding of reference articles, and some personal experience from practicing Spark. I am writing it only to organize my own study notes, not as a tutorial, so everything is based on my personal understanding and not every detail is recorded. If you want to learn more, you are better off reading the reference articles and the official documentation.

Secondly, this series is based on the latest Spark 1.6.0 release; Spark evolves quickly, so it is worth noting the version.
Finally, if you think something here is wrong, please leave a comment. All messages will be answered within 24 hours. Thank you very much.
Tips: if an illustration is hard to read, you can: 1. enlarge the web page; 2. open the picture in a new tab and view the original image.

1. Application

A program built by the user on Spark, consisting of the driver program and the code that runs on the cluster. Physically, three kinds of nodes are involved: driver, master, and worker.

2. Driver Program

The driver program creates the SC (SparkContext), defines UDF functions, and defines the logic of the three steps every Spark application needs: load the data set, process the data, and output the results.
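
Below is a minimal sketch of such a driver program, assuming Spark 1.6 with the Scala API; the application name and the HDFS path `hdfs:///tmp/input.txt` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver-program sketch: create the SC, then load -> process -> output.
object WordCountDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountDriver")
    val sc   = new SparkContext(conf)                   // create the SC

    val lines  = sc.textFile("hdfs:///tmp/input.txt")   // 1. load the data set (hypothetical path)
    val counts = lines.flatMap(_.split(" "))            // 2. process the data
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.take(10).foreach(println)                    // 3. output (part of) the results

    sc.stop()
  }
}
```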

3. Cluster Manager

The cluster's resource manager, an external service for acquiring resources on the cluster.
Take YARN as an example: the client program applies to YARN for the memory, CPU, and other resources the job needs.
Through scheduling, the cluster manager then tells the client which resources it can use, and the client can ship the program to the worker nodes for execution.

4. Worker Node

Any node in the cluster that can run Spark application code. A worker node is a physical node on which executor processes can be started.

5. Executor

A process started on a worker node for an application; it is responsible for running tasks and keeping data in memory or on disk. Each application has its own independent executors.
The executor is a container for executing tasks. Its main responsibilities are:

  • Initialize the execution context SparkEnv for the program, resolve the jar dependencies the application needs to run, and load the classes.

  • At the same time, an ExecutorBackend reports the current task status to the cluster manager, somewhat like Hadoop’s TaskTracker and Task.

Summary: the executor is a container that runs and monitors the tasks of an application.

6. Jobs

A job is a parallel computation consisting of many tasks; it corresponds to an action on a Spark RDD, and each action triggers one job.
The jobs submitted by the user go to the DAGScheduler, which decomposes jobs into stages and refines stages into tasks. A task is simply a single data-processing flow over one data partition. For details about jobs, stages, and tasks, please refer to this article:“Spark” 6. In depth study of the working principle of spark: job, stage, task

A job is triggered by an action, like count() or saveAsTextFile(); click on a job in the UI to see information about the stages of tasks inside it.
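
A small illustration of the "one action, one job" rule, as a sketch (the HDFS paths are hypothetical):

```scala
val rdd = sc.textFile("hdfs:///tmp/input.txt")   // lazy: no job yet
            .filter(_.nonEmpty)                  // still lazy: transformations only

val n = rdd.count()                              // action -> triggers job 1
rdd.saveAsTextFile("hdfs:///tmp/output")         // action -> triggers job 2
```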

7. Stage

A job is divided into groups of tasks, and each group of tasks is called a stage, similar to the map stage and reduce stage in MapReduce.

The division of stages is introduced in detail in the RDD paper. In short, stages are split at shuffle boundaries into shuffle stages and a result stage.
Accordingly, there are two types of tasks in Spark:

  • ShuffleMapTask

    Its output is the data needed by a shuffle, and stage division is based on it: all transformations before a shuffle form one stage, and the operations after the shuffle form another.
  • ResultTask

    Its output is the result. For example, sc.parallelize(1 to 10).foreach(println) has no shuffle and outputs directly, so its tasks are ResultTasks and there is only one stage. If the job is rdd.map(x => (x, 1)).reduceByKey(_ + _).foreach(println), it contains a shuffle because of the reduce, so the stage before reduceByKey runs ShuffleMapTasks and outputs the data the shuffle needs, while reduceByKey is the final stage and outputs the result directly. If a job contains multiple shuffles, each shuffle is preceded by a stage. The stage layout of such a job can be inspected with toDebugString, as sketched after this list.
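
As a rough sketch, the RDD's toDebugString shows the lineage and where the shuffle boundary (and therefore the stage boundary) falls:

```scala
val pairs  = sc.parallelize(1 to 10).map(x => (x, 1))
val counts = pairs.reduceByKey(_ + _)     // introduces a shuffle

// The indentation in the debug string marks the shuffle boundary: the ShuffledRDD
// produced by reduceByKey sits in a later stage than the RDDs it was built from.
println(counts.toDebugString)

counts.foreach(println)                   // the action runs ShuffleMapTasks, then ResultTasks
```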

8. Task

Unit of work sent to the executor.

9. Partition

A partition is similar to a split in Hadoop, and computation is performed per partition. Partitions can be defined according to many criteria, including your own; for example, an HDFS file is partitioned the same way MapReduce splits it, i.e. by the file’s blocks. In short, Spark’s partition is conceptually similar to Hadoop’s split: a way of dividing the data.
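
A small sketch of how the number of partitions can be observed and controlled (the HDFS path and the numbers are illustrative):

```scala
import org.apache.spark.HashPartitioner

val fromHdfs  = sc.textFile("hdfs:///tmp/input.txt")   // partitions follow the HDFS blocks / input splits
val fromLocal = sc.parallelize(1 to 100, 4)            // explicitly ask for 4 partitions

println(fromHdfs.partitions.length)
println(fromLocal.partitions.length)                   // 4

// Key-value RDDs can also be repartitioned by key:
val byKey = fromLocal.map(x => (x % 10, x)).partitionBy(new HashPartitioner(10))
println(byKey.partitions.length)                       // 10
```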

10. RDD

Let’s look at how the original paper, [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](http://litaotao.github.io/files/spark-rd…), introduces RDDs.


a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

RDDs are motivated by two types of applications that current computing frameworks handle inefficiently:

  • iterative algorithms;

  • interactive data mining tools;

In both cases, keeping data in memory can improve performance by an order of magnitude.

To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.


Each RDD has five main attributes (see the code sketch after this list):

  • A list of partitions, the basic units of the data set

  • A function for computing each split (partition)

  • Dependencies on parent RDDs, which describe the lineage between RDDs

  • Optionally, a partitioner for key-value RDDs

  • Optionally, a list of preferred locations for computing each partition; for an HDFS file, these are the locations of each partition’s blocks
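
As a rough, simplified paraphrase of how these five attributes surface in Spark's RDD abstract class (see the actual org.apache.spark.rdd.RDD source for the exact signatures):

```scala
import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Simplified sketch, not the real class: one member per attribute.
abstract class SketchRDD[T] {
  protected def getPartitions: Array[Partition]                        // 1. list of partitions
  def compute(split: Partition, context: TaskContext): Iterator[T]     // 2. function computing each partition
  protected def getDependencies: Seq[Dependency[_]]                    // 3. dependencies on parent RDDs
  val partitioner: Option[Partitioner] = None                          // 4. optional partitioner (key-value RDDs)
  protected def getPreferredLocations(split: Partition): Seq[String]   // 5. optional preferred locations
}
```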

By putting the five attributes above together, we can get a general idea of what an RDD is. First of all, an RDD represents a data set and has the attributes listed above; it is the data structure the Spark project team designed to represent data sets. To let RDDs handle a wider range of problems, the team also stipulated that an RDD should be read-only and should be a partitioned collection of records. There are two ways to create an RDD: one is from data in physical storage, such as files on disk; the other, and the most common, is from other RDDs, i.e. through transformations. Because an RDD satisfies all these properties, Spark calls it a resilient distributed dataset. Many articles first give the definition and concept of an RDD and then its characteristics; I think it can also be done in reverse, understanding the definition through the characteristics. Working backwards from the result to the cause is not a bad way to understand RDDs; at least it works well for me.
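
A minimal sketch of the two ways to create an RDD mentioned above (the HDFS path is hypothetical):

```scala
// 1. From data in physical storage, e.g. a file on HDFS or local disk:
val fromStorage = sc.textFile("hdfs:///tmp/input.txt")

// 2. By transforming an existing RDD:
val transformed = fromStorage.map(_.toUpperCase).filter(_.nonEmpty)
```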

RDD is not only the core of Spark, but also the architectural foundation of Spark as a whole:

  • It is an immutable data structure

  • It is a distributed data structure that spans the cluster

  • It can be partitioned according to the key of each data record

  • It provides coarse-grained operations, and these operations are partition-aware

  • It stores data in memory, providing low latency

For more details on RDD, please refer to this article:“Spark” 4. RDD of spark

11. sc.parallelize

First, let’s see what the API documentation says about parallelize:


  • parallelize(c, numSlices=None)

Distribute a local Python collection to form an RDD. Using xrange is recommended if the input represents a range for performance.


To put it simply, parallelize distributes a data set defined on the driver side (or a generator that produces the data set) to the executors on the workers for subsequent analysis. This method is often used when testing code logic, but rarely when building real Spark applications, which generally read their data from HDFS or a database.
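
The quote above is from the Python API; the Scala API is analogous. A quick sketch of typical usage:

```scala
// Distribute a driver-side collection across the executors as an 8-partition RDD.
val data = 1 to 1000
val rdd  = sc.parallelize(data, numSlices = 8)

println(rdd.partitions.length)   // 8
println(rdd.sum())               // computed on the workers, result returned to the driver
```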

12. code distribute

When you submit a Spark application, Spark distributes the application code to all workers. Packages the application depends on must also exist on all workers. There are two ways to solve this dependency problem on the workers:

  • Use some tool to deploy the dependencies uniformly across the Spark cluster;

  • Specify the packages the application depends on when submitting it, so that the application code and the packages are distributed to the workers together;

13. cache priority

Does the cache support priorities? Not at present. Note that Spark’s RDD cache is different from the cache systems we commonly use; feel free to discuss the details with me.

14. cores


The number of cores to use on each executor. For YARN and standalone mode only. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.


Each core corresponds to a process on a worker, and these processes execute the tasks assigned to that worker concurrently. In short, the Spark manager splits a job into several tasks and distributes them to the workers for parallel execution, and each worker divides the tasks assigned to it among the different processes on that worker.
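
A hedged configuration sketch of the core-related settings (spark.executor.cores and spark.cores.max are standard Spark configuration keys; the numbers are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cores-example")
  .set("spark.executor.cores", "4")   // cores per executor (YARN and standalone)
  .set("spark.cores.max", "16")       // total cores the application may use (standalone)
val sc = new SparkContext(conf)
```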

15. Memory

Is the memory allocated to the spark application only for cache data?

No, the memory allocated to a Spark application has three uses (see the configuration sketch after this list):

  • Spark itself

  • The Spark application

    • Runtime use within the Spark application, such as UDF functions

    • The cache within the Spark application
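
As a rough Spark 1.6 sketch: the unified memory manager splits executor memory between Spark's own/user objects, execution (runtime), and storage (cache). The keys below are standard 1.6 settings; the values are illustrative.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")          // total heap per executor
  .set("spark.memory.fraction", "0.75")        // fraction shared by execution + storage; the rest is for Spark/user objects
  .set("spark.memory.storageFraction", "0.5")  // portion of that fraction reserved for cached (storage) data
```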

16. RDD narrow/wide dependences

The classes of dependency between RDDs (or, equivalently, the different ways one RDD is created from another): in a narrow dependency each partition of the parent RDD is used by at most one partition of the child RDD (e.g. map, filter), while in a wide dependency multiple child partitions may depend on the same parent partition, which requires a shuffle (e.g. reduceByKey). The sketch below shows how to inspect them.
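
A small sketch that inspects the dependency objects directly (the numbers are illustrative):

```scala
val base    = sc.parallelize(1 to 100, 4)
val mapped  = base.map(x => (x % 10, x))    // narrow dependency: one parent partition per child partition
val reduced = mapped.reduceByKey(_ + _)     // wide dependency: a shuffle redistributes the data by key

println(mapped.dependencies)    // e.g. List(org.apache.spark.OneToOneDependency@...)
println(reduced.dependencies)   // e.g. List(org.apache.spark.ShuffleDependency@...)
```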

17. Local memory and cluster memory

So-called local memory is the memory needed by the program on the driver side; it is provided by the driver machine and is generally used to generate test data and receive computation results.
Cluster memory is the maximum amount of memory a job submitted to the cluster can request from the cluster; it is generally used to store the key data.
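
Driver-side (local) and cluster-side memory are configured separately. A hedged sketch using standard settings (the sizes and executor count are illustrative; in practice spark.driver.memory is usually passed to spark-submit so it takes effect before the driver JVM starts):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.memory", "2g")       // local memory on the driver machine
  .set("spark.executor.memory", "8g")     // cluster memory per executor
  .set("spark.executor.instances", "10")  // number of executors requested (YARN)
```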

18. Limit user memory

Yes: memory is requested when the Spark application is started, so it is completely controllable.

19. When the total resources requested by users exceed the total resources of the current cluster

FIFO resource allocation mode.

20. Sparkcontext [often referred to as SC]

The starting point and entry of a Spark app; it is generally used to load data sets and generate the first RDD.
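
A minimal sketch of creating the SC and the first RDD (the master URL, app name, and path are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("my-app").setMaster("local[2]")
val sc   = new SparkContext(conf)

val firstRdd = sc.textFile("hdfs:///tmp/input.txt")   // load a data set, producing the first RDD
```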

21. What happens if you cache an RDD multiple times?

Nothing; it is cached only once (see the related discussion on stackoverflow).
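
A tiny sketch of this behaviour:

```scala
val rdd = sc.parallelize(1 to 100).map(_ * 2)

rdd.cache()   // marks the RDD to be persisted (MEMORY_ONLY)
rdd.cache()   // calling cache() again is effectively a no-op; the RDD is cached only once
rdd.count()   // the first action actually materializes the cache
```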

4. Next

Next, I will introduce the basic programming model of Spark through several simple examples.


Reference article

Links to this series