A walkthrough of the spark-submit execution process

Time: 2021-1-19

Abstract: This article walks through the Spark source code to understand the spark-submit process.

1. Submitting the task command

When we submit a Spark task, we use a command of the form "spark-submit --class ...". spark-submit is a shell script in Spark's bin directory; its job is to locate SPARK_HOME and then call the spark-class command. (An example submission command is shown after the script below.)

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "[email protected]"

spark-class is then executed with the SparkSubmit class as its argument to submit the task to the Spark program. The spark-class shell script mainly performs the following steps:

(1) Load the Spark environment parameters (reading them from conf) and find the Java runtime

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary via JAVA_HOME
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

(2) Locate the Spark jar packages

# Find Spark jars.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

(3) Call org.apache.spark.launcher.Main to perform parameter injection

build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "[email protected]"
  printf "%d0" $?
}

(4) The shell script checks the launcher's exit code to decide whether to abort or to run the generated command
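
Before performing this check, spark-class first collects the null-delimited output of build_command into the CMD array. A simplified sketch of that loop, based on the Spark 2.x version of the script, is shown here; the exit-code check from the article follows it.

# Read build_command's null-delimited output into the CMD array (simplified sketch)
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
# The last element printed by build_command is the launcher's exit code
LAUNCHER_EXIT_CODE=${CMD[$LAST]}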

if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"

2. Checking the arguments and submitting the task to Spark

org.apache.spark.launcher.Main builds CMD for both spark-class and spark-submit invocations: it checks the parameters with SparkSubmitOptionParser (for submit), builds the command line, and prints it back to spark-class, which finally calls exec to run the command line and submit the task. The contents of CMD are as follows:

/usr/local/java/jdk1.8.0_91/bin/java -cp
/data/spark-1.6.0-bin-hadoop2.6/conf/:/data/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/data/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/data/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/data/hadoop-2.6.5/etc/hadoop/
-Xms1g -Xmx1g -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=1234
org.apache.spark.deploy.SparkSubmit
--class org.apache.spark.repl.Main
--name Spark shell
--master spark://localhost:7077
--verbose /tool/jarDir/maven_scala-1.0-SNAPSHOT.jar

3. Execution of the SparkSubmit class

(1) After the Spark task is submitted, the main method of SparkSubmit is executed

 def main(args: Array[String]): Unit = {
    val submit = new SparkSubmit()
    submit.doSubmit(args)
  }

(2) doSubmit() initializes logging, parses the Spark task parameters, and dispatches the task according to the action type

 def doSubmit(args: Array[String]): Unit = {
    // Initialize logging if it hasn't been done yet. Keep track of whether logging needs to
    // be reset before the application starts.
    val uninitLog = initializeLogIfNecessary(true, silent = true)

    val appArgs = parseArguments(args)
    if (appArgs.verbose) {
      logInfo(appArgs.toString)
    }
    appArgs.action match {
      case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
      case SparkSubmitAction.KILL => kill(appArgs)
      case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
      case SparkSubmitAction.PRINT_VERSION => printVersion()
    }
  }

SUBMIT: submit the application using the parameters provided

KILL (standalone and Mesos cluster mode only): terminate the task through the REST protocol (see the example after this list)

REQUEST_STATUS (standalone and Mesos cluster mode only): request the status of a submitted task through the REST protocol

PRINT_VERSION: write version information to the log
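
On the command line, the KILL and REQUEST_STATUS actions correspond to the --kill and --status options of spark-submit. For example (the submission ID and REST master URL below are placeholders):

# Kill a driver and query its status in standalone cluster mode (placeholder ID and master)
./bin/spark-submit --master spark://localhost:6066 --kill driver-20210119123456-0001
./bin/spark-submit --master spark://localhost:6066 --status driver-20210119123456-0001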

(3) The submit function then defines and calls doRunMain():

def doRunMain(): Unit = {
      if (args.proxyUser != null) {
        val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
          UserGroupInformation.getCurrentUser())
        try {
          proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
            override def run(): Unit = {
              runMain(args, uninitLog)
            }
          })
        } catch {
          case e: Exception =>
            // Hadoop's AuthorizationException suppresses the exception's stack trace, which
            // makes the message printed to the output by the JVM not very helpful. Instead,
            // detect exceptions with empty stack traces here, and treat them differently.
            if (e.getStackTrace().length == 0) {
              error(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
            } else {
              throw e
            }
        }
      } else {
        runMain(args, uninitLog)
      }
    }

doRunMain runs the submission as a proxy user when one is configured; in either case it calls runMain(), which prepares the child main class and arguments for the chosen cluster manager and then invokes that class's main method.

4. Summary

Spark accepts a variety of parameters and deploy modes at job submission time and selects different execution branches according to them. The resolved parameters are ultimately passed to runMain, which performs the final submission.

This article is shared from the Huawei Cloud community post "spark submit for spark kernel analysis". Original author: stupid bear likes to drink cola.

