Big Data Hadoop — Spark Cluster Deployment (Standalone)

Time: 2022-05-13

1、 Spark overview

For an explanation of Spark's basic concepts and principles, please refer to my previous blog post: Big Data Hadoop — Computing Engine Spark.

2、 Operation mode of spark

1) Standalone (explained in this chapter)

Independent mode: a self-contained cluster (master/worker) run by Spark's own simple, native cluster manager. It ships with a complete set of services and can be deployed in a cluster on its own, without relying on any other resource management system. Standalone makes it easy to build a cluster and is generally used when the company has no other resource management framework. Disadvantage: resources cannot be fully utilized.

2)Mesos

A powerful distributed resource management framework that allows a variety of different frameworks, including YARN, to be deployed on top of it. Since Mesos is rarely used at present, its deployment is not covered here.

3) Yarn (recommended)

A unified resource management mechanism on which multiple computing frameworks can run, such as MapReduce, Storm, Spark, and Flink. Depending on where the driver sits in the cluster, it is divided into yarn-client and yarn-cluster modes; in essence, the difference is where the driver runs. This is one of the most widely used modes in enterprises. The environment deployment for this mode was already covered in the blog post Big Data Hadoop — Computing Engine Spark, so it is not repeated here.

  • yarn-client mode: the driver runs on the local submitting machine; suitable for interactive debugging
  • yarn-cluster mode: the driver runs inside the cluster (in the ApplicationMaster); this is the mode used for formal (remote) task submission (a minimal sketch of both follows this list)
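
As a minimal sketch of the difference (for illustration only; the on-YARN deployment itself is covered in the earlier post, and the example jar path assumes the package layout used later in this post):

# yarn-client: the driver stays on the submitting machine
$ ./bin/spark-submit --master yarn --deploy-mode client \
--class org.apache.spark.examples.SparkPi \
examples/jars/spark-examples_2.12-3.2.0.jar 10
# yarn-cluster: the driver runs inside the ApplicationMaster on the cluster
$ ./bin/spark-submit --master yarn --deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
examples/jars/spark-examples_2.12-3.2.0.jar 10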

4) K8s (new mode)

Kubernetes (K8s) is a new cluster management and scheduling backend for Spark. Since the vast majority of cluster managers in real production environments run in on-YARN mode, we mainly focus on the on-YARN mode for now; a general understanding of the on-K8s mode is enough, and interested readers can try it out. (Figure: Spark on K8s working mode.)

Spark's run mode depends on the value of the master variable passed to the SparkContext; some modes also require auxiliary programs. The currently supported master strings and URLs include:

--deploy-mode: whether the driver is deployed on a worker node (cluster) or locally as an external client (default: client).

Master URL          Meaning
local               Run locally with a single worker thread and no parallelism.
local[K]            Run locally with K worker threads; K is usually set to the number of CPU cores on the machine.
local[*]            Run locally with as many worker threads as the machine has CPU cores.
spark://HOST:PORT   Run in standalone mode, the cluster mode that Spark itself provides. The default port is 7077.
mesos://HOST:PORT   Run on a Mesos cluster; the driver and worker processes run on Mesos. The deploy mode must be fixed to --deploy-mode cluster.
yarn                Run on a YARN cluster; depends on a Hadoop cluster. The application is submitted to the YARN resource scheduling framework; the driver runs inside the ApplicationMaster (the counterpart of the master in standalone mode), which schedules resources on the cluster and starts the executors that run the tasks.
k8s                 Run on a Kubernetes cluster.
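
To make the table concrete, here is how a few of these master URLs are passed on the command line (a sketch; the install path matches the one used in the installation steps below):

$ cd /opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2
# Local mode, one worker thread per CPU core
$ ./bin/spark-shell --master local[*]
# Standalone cluster mode, default master port 7077
$ ./bin/spark-shell --master spark://hadoop-node1:7077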

3、 Standalone mode operation mechanism

The standalone cluster has four important components:

  • Driver: a process. The Spark application we write runs in the driver and is executed by the driver process;
  • Master: a process, mainly responsible for scheduling and allocating resources and for monitoring the cluster;
  • Worker: a process. A worker runs on one server in the cluster and has two main responsibilities: one is to use its own memory to store one or more partitions of an RDD; the other is to start other processes and threads (executors) that process and compute the RDD partitions in parallel.
  • Executor: a process. Multiple executors can run on one worker. An executor computes RDD partitions in parallel by starting multiple threads (tasks), i.e., it executes the operators defined on the RDD, such as map, flatMap, reduce, and so on.

1) Standalone client mode

  • In standalone client mode, the driver runs on the local machine from which the task is submitted.
  • After the driver starts, it registers the application with the master. Based on the resource requirements in the submit script, the master finds all workers that have enough resources to launch at least one executor.
  • It then allocates executors across those workers. Once an executor on a worker has started, it reverse-registers with the driver.
  • After all executors have registered, the driver starts executing the main function; when an action operator is reached, it divides the job into stages, generates a TaskSet for each stage, and distributes the tasks to the executors for execution.

2) Standalone cluster mode

  • In standalone cluster mode, after the task is submitted, the master finds a worker and launches the driver process on it.
  • After the driver starts, it registers the application with the master.
  • Based on the resource requirements in the submit script, the master finds all workers that have enough resources to launch at least one executor.
  • It then allocates executors across those workers; once an executor on a worker has started, it reverse-registers with the driver.
  • After all executors have registered, the driver starts executing the main function; when an action operator is reached, it divides the job into stages, generates a TaskSet for each stage, and distributes the tasks to the executors for execution.

[Note] In both standalone modes (client/cluster), after the master receives the driver's request to register the Spark application, it checks the remaining resources it manages to find all workers that can launch at least one executor, and then distributes executors among those workers. The distribution at this stage only considers whether a worker has enough free resources, and it continues until all executors required by the current application have been allocated. After the executors reverse-register, the driver starts executing the main program.

4、 Spark cluster installation (standalone)

1) Machine and role division

Machine IP      Machine name   Node type
192.168.0.113   hadoop-node1   Master/Worker
192.168.0.114   hadoop-node2   Worker
192.168.0.115   hadoop-node3   Worker
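
For the hostnames in the table to resolve, every node needs matching entries in /etc/hosts (a sketch based on the table above; adjust to your own network):

# /etc/hosts on every node
192.168.0.113 hadoop-node1
192.168.0.114 hadoop-node2
192.168.0.115 hadoop-node3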

2) Install JDK environment on three machines

The Hadoop cluster (including the JDK) was installed earlier, so this step is omitted here. If anything is unclear, refer to my previous article: Introduction to the principle of big data Hadoop + installation + actual operation (HDFS + YARN + MapReduce).

3) Download

Spark download address: http://spark.apache.org/downloads.html

Pay attention to the versions here. My Hadoop version is 3.3.1, so I download the latest Spark, 3.2.0. The prebuilt Spark 3.2.0 package used here bundles Scala 2.12 (a separate build for Scala 2.13 is also offered), so pay attention to the Scala version when programming in Scala later.

$ cd /opt/bigdata/hadoop/software
# Download
$ wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
# Unpack
$ tar -zxvf spark-3.2.0-bin-hadoop3.2.tgz -C /opt/bigdata/hadoop/server/
# Copy the installation directory to a name that marks it as the standalone deployment
$ cp -r /opt/bigdata/hadoop/server/spark-3.2.0-bin-hadoop3.2 /opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2
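
Optionally (an extra convenience, not part of the original steps), you can export SPARK_HOME so the bin/ and sbin/ tools are on the PATH:

# Optional: put Spark on the PATH (assumes bash is the login shell)
$ echo 'export SPARK_HOME=/opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2' >> ~/.bashrc
$ echo 'export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH' >> ~/.bashrc
$ source ~/.bashrc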

4) Configure spark

1. Configure the workers file

$ cd /opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2/conf
$ cp workers.template workers
# The contents of the workers file are as follows:
hadoop-node1
hadoop-node2
hadoop-node3

hadoop-node1 is both a master and a worker.

2. Configure spark-env.sh

$ cd /opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2/conf
# Create a data directory (every node must create this directory)
$ mkdir -p /opt/bigdata/hadoop/data/spark-standalone
# Copy the environment variable template
$ cp spark-env.sh.template spark-env.sh
# Add the following to spark-env.sh:
export SPARK_MASTER_HOST=hadoop-node1
export SPARK_LOCAL_DIRS=/opt/bigdata/hadoop/data/spark-standalone
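
Since every node needs that data directory, a small loop can create it on the other nodes remotely (a sketch assuming passwordless SSH between the nodes, as is usual for a Hadoop cluster):

# Create SPARK_LOCAL_DIRS on the remaining nodes (assumes passwordless SSH)
$ for host in hadoop-node2 hadoop-node3; do ssh $host "mkdir -p /opt/bigdata/hadoop/data/spark-standalone"; done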

3. Configure spark-defaults.conf
Nothing is modified here; change it yourself if needed. The default master port is 7077.

$ cp spark-defaults.conf.template spark-defaults.conf
$ cat spark-defaults.conf
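
For reference, properties could be set in this file instead of on the command line; a sketch of commonly used entries (illustration only, this deployment keeps the defaults):

# Example spark-defaults.conf entries (not applied in this post)
spark.master            spark://hadoop-node1:7077
spark.executor.memory   1g
spark.serializer        org.apache.spark.serializer.KryoSerializer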

5) Copy the configured package to the other two nodes

$ scp -r spark-standalone-3.2.0-bin-hadoop3.2 hadoop-node2:/opt/bigdata/hadoop/server/
$ scp -r spark-standalone-3.2.0-bin-hadoop3.2 hadoop-node3:/opt/bigdata/hadoop/server/

6) Start

1. Start master (execute on hadoop-node1 node)

$ cd /opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2/sbin
$ ./start-master.sh
# Check the listening port; the Spark master web UI defaults to port 8080
$ ss -tnlp|grep :8080
# If the port conflicts, just change SPARK_MASTER_WEBUI_PORT in the start-master.sh script
$ grep SPARK_MASTER_WEBUI_PORT start-master.sh


To access the Spark master web UI: http://hadoop-node1:8080

2. Start the worker nodes (execute on all nodes)

$ cd /opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2/sbin
$ ./start-worker.sh spark://hadoop-node1:7077
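
Alternatively, since conf/workers already lists all three nodes, the bundled sbin/start-all.sh script can start the master plus every worker from hadoop-node1 in one step (it reaches the other nodes over SSH, so passwordless SSH is assumed):

$ cd /opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2/sbin
# Starts the master locally and one worker per entry in conf/workers
$ ./start-all.sh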

5、 Test verification

Detailed description of spark-submit parameters

Parameter                Description
--master                 Master address, i.e. where the submitted task runs, e.g. spark://host:port, yarn, local
--deploy-mode            Launch the driver locally (client) or on the cluster (cluster). Default: client
--class                  Main class of the application (Java or Scala applications only)
--name                   Name of the application
--jars                   Comma-separated local jars to include on the driver and executor classpaths
--packages               Maven coordinates of jars to include on the driver and executor classpaths
--exclude-packages       Packages to exclude, to avoid conflicts
--repositories           Remote repositories
--conf PROP=VALUE        Set a Spark configuration property, e.g. --conf spark.executor.extraJavaOptions="-XX:MaxPermSize=256m"
--properties-file        Configuration file to load; conf/spark-defaults.conf by default
--driver-memory          Driver memory, 1g by default
--driver-java-options    Extra Java options passed to the driver
--driver-library-path    Extra library path passed to the driver
--driver-class-path      Extra classpath passed to the driver
--driver-cores           Number of driver cores, 1 by default. Used under YARN or standalone
--executor-memory        Memory per executor, 1g by default
--total-executor-cores   Total number of cores across all executors. Only used under Mesos or standalone
--num-executors          Number of executors to start, 2 by default. Used under YARN
--executor-cores         Number of cores per executor. Used under YARN or standalone

1) Driver client mode (--deploy-mode client)

$ cd /opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2/bin
$ ./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop-node1:7077 \
--deploy-mode client \
--driver-memory 1G \
--executor-memory 1G \
--total-executor-cores 2 \
--executor-cores 1 \
/opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.0.jar 10

In this mode, the running results are printed directly on the client.
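
For example, the SparkPi result line can be picked out of the client output like this (a sketch: spark-submit's own logs go to stderr, while the example's println goes to stdout):

$ ./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop-node1:7077 \
--deploy-mode client \
/opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.0.jar 10 \
2>/dev/null | grep "Pi is roughly"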

2) Driver cluster mode (--deploy-mode cluster)

$ cd /opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2/bin
$ ./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop-node1:7077 \
--deploy-mode cluster \
--driver-memory 1G \
--executor-memory 1G \
--total-executor-cores 2 \
--executor-cores 1 \
/opt/bigdata/hadoop/server/spark-standalone-3.2.0-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.0.jar 10

This mode produces basically no output on the client; you need to log in to the web UI to view the result.

View driver log information

Finally, check the running results in the driver log.

At present, this cluster mode is the most popular of the [standalone] submission modes.
