Understanding stage, executor and driver in Spark

Time: 2021-1-26

1. Introduction

For Spark novices, the first problem is that they do not understand how Spark actually runs, so when they talk with you it is hard to tell what they are talking about; for example, deployment mode and run mode are easily confused. For Spark veterans with some development experience, even if they know the execution model, they may not understand all of Spark's terminology well. Understanding Spark's terms is therefore a necessary step for Spark developers to communicate with each other. This article starts from Spark's execution model and then uses a wordcount case to explain the various terms in Spark.

2. Operation mechanism of Spark

First of all, here is a picture from the official website that shows the general execution framework of a Spark application on a distributed cluster. It is mainly composed of the SparkContext, the cluster manager (resource manager) and the executors (the per-node worker processes). The cluster manager is responsible for unified resource management of the whole cluster, while the executor is the main process in which the application runs; it contains multiple task threads and its own memory space.

[Figure: overview of a Spark application running on a cluster, from the official Spark documentation]
The main execution flow of Spark is as follows:

  1. After the application is submitted with spark-submit, the SparkContext (i.e. Spark's running environment) is initialized in the location determined by the deploy mode chosen at submit time, and a DAG scheduler and a task scheduler are created. Based on the application code, the driver splits the whole program into multiple jobs, one per action operator. For each job it builds a DAG; the DAG scheduler divides that DAG into multiple stages, and each stage is internally divided into multiple tasks. The DAG scheduler hands the task sets to the task scheduler, which is responsible for scheduling the tasks on the cluster. How stages and tasks relate to each other and how they are divided is discussed in detail later.

  2. The driver applies to the resource manager for resources according to the resource requirements recorded in the SparkContext, including the number of executors and the amount of memory.

  3. After receiving the request, the resource manager creates executor processes on worker nodes that meet the requirements.

  4. Once created, each executor reverse-registers with the driver, so that the driver can assign tasks to it for execution.

  5. After the program finishes executing, the driver releases the requested resources back to the resource manager.

3. Understanding the terms in Spark

Building on the execution model, let's go through the following terms.

3.1 Driver program

The driver is the Spark application we write, i.e. the program that creates the SparkContext or SparkSession. The driver communicates with the cluster manager and assigns tasks to the executors for execution.
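For example, a minimal driver program might look like the sketch below (the master and app name are placeholders; the structure mirrors the wordcount example later in this article). Creating the SparkContext is what turns this JVM into the driver.

import org.apache.spark.{SparkConf, SparkContext}

object MyDriver {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext makes this process the driver; from here on
    // it talks to the cluster manager and assigns tasks to the executors.
    val conf = new SparkConf().setMaster("yarn").setAppName("MyDriver")
    val sc = new SparkContext(conf)
    // ... transformations and actions go here ...
    sc.stop()
  }
}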

3.2 Cluster Manager

It is responsible for resource scheduling for the whole application. Common cluster managers include:

YARN

Spark Standalone

Mesos

3.3 Executors

An executor is an independent JVM process that runs on each worker node and is mainly used to execute tasks. Within a single executor, multiple tasks can be executed in parallel at the same time.
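The number and size of the executors are usually controlled through configuration. A minimal sketch, with purely illustrative values:

import org.apache.spark.SparkConf

// Illustrative values only; tune them for your own cluster.
val conf = new SparkConf()
  .setAppName("ExecutorConfigDemo")
  .set("spark.executor.instances", "4") // number of executor processes
  .set("spark.executor.cores", "2")     // task threads per executor
  .set("spark.executor.memory", "4g")   // memory per executor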

3.4 Job

A job is one complete processing flow of the user program; it is a logical concept.
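Since each action submits a separate job, a program with two actions produces two jobs. A minimal sketch, assuming sc is an existing SparkContext:

// Assuming sc is an existing SparkContext.
val nums = sc.parallelize(1 to 100)

val total = nums.reduce(_ + _) // action -> job 1
val cnt   = nums.count()       // action -> job 2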

3.5 Stage

A job can contain multiple stages, and the stages generally run serially (independent stages can run in parallel). Stage boundaries are triggered by shuffles, which are typically caused by operations such as reduceByKey or by actions that save results.
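For instance, in the sketch below (assuming sc is an existing SparkContext), the map involves no shuffle while reduceByKey does, so the job triggered by collect() is split into two stages:

// Assuming sc is an existing SparkContext.
val words = sc.parallelize(Seq("a", "b", "a", "c"))

// Stage 0: parallelize + map (the map side of the shuffle).
// Stage 1: reduceByKey (the reduce side) + collect.
val counts = words.map((_, 1)).reduceByKey(_ + _).collect()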

3.6 Task

A stage can contain multiple tasks. Take sc.textFile("/XXXX").map(...).filter(...) as an example: the map and filter steps are executed as tasks, one task per partition of the data, and the output of each step becomes the input of the next.

3.7 Partition

A partition is one slice of a data source in Spark. A complete data source is divided by Spark into multiple partitions so that the data can be distributed across multiple executors and the tasks can be executed in parallel.
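A small sketch, assuming sc is an existing SparkContext and the path is a placeholder:

// Ask for at least 4 partitions when reading the file.
val rdd = sc.textFile("data/spark/wc.txt", minPartitions = 4)
println(rdd.getNumPartitions) // how many partitions Spark actually created

// repartition() redistributes the data into a different number of partitions.
val rdd8 = rdd.repartition(8)
println(rdd8.getNumPartitions) // 8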

3.8 RDD

RDD stands for Resilient Distributed Dataset. In Spark, a data source can be regarded as one large RDD, and an RDD is composed of multiple partitions: the data Spark loads is stored in the RDD, which internally is indeed cut into multiple partitions.
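A short sketch (assuming sc is an existing SparkContext) showing the two usual ways to create an RDD; every transformation returns a new RDD, and the original one is immutable:

import org.apache.spark.rdd.RDD

// From an in-memory collection, explicitly cut into 3 partitions.
val nums: RDD[Int] = sc.parallelize(1 to 1000, 3)

// From an external file (path is a placeholder); Spark decides the partitioning.
val lines: RDD[String] = sc.textFile("data/spark/wc.txt")

// Transformations produce new RDDs instead of modifying existing ones.
val even: RDD[Int] = nums.filter(_ % 2 == 0)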

So the question is: how exactly does a Spark job execute?

(1) Our Spark program, i.e. the driver, submits a job to the cluster manager

(2) The cluster manager checks data locality and finds the most suitable nodes on which to schedule the tasks

(3) The job is split into different stages, and each stage is split into multiple tasks

(4) The driver sends the tasks to the executors, which execute them

(5) The driver tracks the execution status of each task and reports it to the master node, which we can check in the Spark master UI

(6) When the job completes, the metrics from all nodes are aggregated back to the master node, including average execution time, maximum execution time, median and other indicators.

3.9 Deployment mode and run mode

The deployment mode refers to the choice of cluster manager, typically standalone or YARN, while the run mode refers to where the driver runs: on the cluster or on the machine that submits the task, corresponding to cluster mode and client mode respectively. The two differ in where the results and logs end up, in stability, and so on.
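The run mode is normally chosen when the application is submitted (the --deploy-mode flag of spark-submit); it can also be set through configuration. A minimal sketch with illustrative values:

import org.apache.spark.SparkConf

// "yarn" picks the cluster manager (deployment mode); "cluster" vs "client"
// decides where the driver runs (run mode). Equivalent to the --master and
// --deploy-mode options of spark-submit.
val conf = new SparkConf()
  .setMaster("yarn")
  .set("spark.submit.deployMode", "cluster") // or "client"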

4. Understanding the terms through a wordcount case

Let's go over the related concepts once more:

  • Job: a job is triggered by an action, so a job contains exactly one action and N transformations;

  • Stage: a stage is a set of tasks created when a job is split at shuffle operations; stages are divided according to wide and narrow dependencies;

  • Task: the smallest execution unit. Each task is responsible for the data of exactly one partition, so there are generally as many tasks as there are partitions; these tasks all perform the same operations, just on different partitions;

Here is a wordcount program:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("yarn").setAppName("WordCount")
    val sc = new SparkContext(conf)
    val lines1: RDD[String] = sc.textFile("data/spark/wc.txt")
    val lines2: RDD[String] = sc.textFile("data/spark/wc2.txt")
    val j1 = lines1.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)
    val j2 = lines2.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)
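    // collect() is the only action in this program, so it triggers the one and only job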
    j1.join(j2).collect()
    sc.stop()
  }
}

The YARN deploy mode is widely used in production, so assume YARN here. There is only one action in the code, collect, so there is only one job. Because of the shuffles, that job is divided into three stages: the flatMap, map and reduceByKey computation on lines1 forms stage 0, the same computation on lines2 forms stage 1, and stage 2 joins the two results and collects them. Stage 2 depends on stage 0 and stage 1, while stage 0 and stage 1 can run in parallel. In a real environment you can look at the stage dependency graph (the DAG visualization in the Spark UI) to see these dependencies clearly.
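If you want to confirm the stage layout without opening the UI, one option (a sketch reusing j1 and j2 from the code above, placed before sc.stop()) is to print the RDD lineage; each shuffle boundary in the output corresponds to a stage boundary:

// Print the RDD lineage; ShuffledRDD entries mark the stage boundaries.
println(j1.join(j2).toDebugString)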

Wu Xie, "little third master", is a rookie in the field of big data and artificial intelligence.
Please follow for more.