- This article only discusses Spark running in cluster mode.
- The reading time is 10 minutes.
Where Spark fits among the Hadoop components
Spark is a computing framework for big data clusters. Its position among the big data components is as follows.
Note that Spark is a replacement for MapReduce, not for Hadoop as a whole.
1. Spark architecture
Spark uses a master/worker architecture to run across the nodes of a cluster. Its structure is as follows:
- Master node: runs only the driver program.
- Worker node: runs only executor programs.
Run the jps command on the host of each node to view the related processes. The meaning of each part is described below.
2. Architecture description
Spark includes two types of programs:
- Driver program: exactly one.
- Executor program: one or more.
At the operating system level, each program is a separate process, and the processes run on different nodes.
2.1 Driver program
The driver program contains the main function of the application, defines the distributed data sets on the cluster, and applies operations to those data sets. The driver program accesses Spark through a SparkContext object, which represents a connection to the computing cluster.
Note: when you start the Spark shell, a SparkContext object is created automatically and bound to a variable called sc. So do not be surprised to see an sc variable that nobody instantiated by hand.
With the SparkContext you can create RDDs, the core class of Spark. In cluster mode, the driver also manages multiple executors. The two main tasks of the driver program are as follows:
- Turn the user program into tasks
- Schedule those tasks on the executor programs
The driver program exposes runtime information about the Spark application through a web interface, which listens on port 4040 by default. For example, in local mode you can see this page at http://localhost:4040.
2.2 Executor program
The executor program has two main tasks:
- Run the tasks sent by the driver program and return the results to the driver program;
- Through its own block manager, provide in-memory storage for the RDDs that the user program asks to cache.
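The division of labour between the driver and the executors can be illustrated with a small, Spark-free analogy. The sketch below is only a toy model built on the Python standard library: the function names (split_into_partitions, run_task, driver) are hypothetical, threads stand in for executor processes, and nothing here reflects how Spark is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: cut the data set into roughly equal partitions."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(partition):
    """Executor side: run one task on one partition and return its result."""
    return sum(x * x for x in partition)


def driver(data, num_executors=3):
    # 1. Turn the user program into tasks, one per partition.
    partitions = split_into_partitions(data, num_executors)
    # 2. Schedule the tasks on the executors and collect what they return.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(run_task, partitions))
    # 3. Combine the partial results on the driver.
    return sum(partial_results)


if __name__ == "__main__":
    print(driver(list(range(10))))  # sum of squares of 0..9
```

In this analogy the thread pool plays the role of the executors, and the function that owns the pool plays the role of the driver: it decides how the work is split and where each piece runs, while the workers only compute and report back.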
2.3 Cluster manager
The cluster manager is responsible for managing the life cycle of the executors. Spark currently supports four kinds of cluster managers:
- Local cluster manager: the driver and the executor run on the same server, which suits testing or low-complexity jobs.
- Spark standalone cluster manager: a simple cluster manager that ships with Spark and makes it easy to set up a cluster.
- Hadoop YARN: the resource manager of Hadoop 2.
- Apache Mesos: a cluster manager that can also run Hadoop MapReduce and service applications.
If the cluster only runs Spark programs, you can use the Spark standalone cluster manager. If it also runs other programs such as MapReduce, you need to use either YARN or Mesos.
2.3.1 Local cluster manager
With the local cluster manager there is only one executor. It is invoked as follows:

    # Use a single thread
    /usr/local/spark-2.1.1-bin-hadoop2.7/bin/pyspark --master local
    # Use two threads
    /usr/local/spark-2.1.1-bin-hadoop2.7/bin/pyspark --master local[2]
    # Use one thread per CPU core
    /usr/local/spark-2.1.1-bin-hadoop2.7/bin/pyspark --master local[*]
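To make the meaning of the local master strings concrete, here is a minimal sketch that maps the local, local[N] and local[*] forms to a thread count. The function local_thread_count is hypothetical and is not Spark's own parser; it only covers these three forms and assumes that one thread per CPU corresponds to os.cpu_count().

```python
import os
import re


def local_thread_count(master):
    """Return the number of worker threads implied by a local master string."""
    if master == "local":
        return 1  # plain "local" means a single thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local master string: {master!r}")
    value = match.group(1)
    if value == "*":
        # Assumption: "one thread per CPU" maps to the logical CPU count.
        return os.cpu_count() or 1
    return int(value)


if __name__ == "__main__":
    for master in ("local", "local[2]", "local[*]"):
        print(master, "->", local_thread_count(master))
```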
2.3.2 Spark standalone cluster manager
Enabling the standalone cluster manager is simple: you only need to run the start-all.sh script in Spark's sbin directory.
You can see the cluster manager's web user interface at http://masternode:8080, which lists all the worker nodes. When you submit an application, you can configure the amount of memory used by each executor process, as well as the total number of CPU cores used by all executor processes. By default, the cluster manager picks suitable values and automatically allocates CPU cores and memory across the worker nodes.
Using the standalone cluster manager
To use the standalone cluster manager, you only need to specify the master node when submitting the job.

    # Submit an application
    bin/spark-submit --master spark://masternode:7077 yourapp
    # Start the Spark shell with the master specified; everything you run is submitted to the cluster
    bin/spark-shell --master spark://masternode:7077
    # Start pyspark
    bin/pyspark --master spark://masternode:7077
The standalone cluster manager mainly manages CPU and memory resources:
- Executor process memory: you can configure this through the --executor-memory parameter of spark-submit. Each application has at most one executor process on each worker node, so this setting controls how much of a worker node's memory the executor takes up. The default value is 1g.
- Maximum number of cores: this is the total number of cores occupied by all the executor processes of an application. The default is unlimited, which means the application can start executor processes on every available node in the cluster. On a cluster shared by several users, users should be asked to limit their usage. This value can be set through the --total-executor-cores parameter of spark-submit.
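As an illustration of how the two settings above fit into a submission, the following sketch assembles a spark-submit command line for the standalone cluster manager. The helper standalone_submit_command is hypothetical, the host name masternode and the application yourapp are the placeholders used earlier in this article, and the values 2g and 8 are arbitrary examples.

```python
import shlex


def standalone_submit_command(app, master_host, executor_memory="1g",
                              total_executor_cores=None, port=7077):
    """Build the argument list for submitting to a standalone master."""
    args = [
        "bin/spark-submit",
        "--master", f"spark://{master_host}:{port}",
        "--executor-memory", executor_memory,
    ]
    # Leaving the core limit out keeps the default (no upper bound).
    if total_executor_cores is not None:
        args += ["--total-executor-cores", str(total_executor_cores)]
    args.append(app)
    return args


if __name__ == "__main__":
    command = standalone_submit_command("yourapp", "masternode", "2g", 8)
    print(shlex.join(command))
```

Building the command as a list rather than a single string makes it easy to pass to subprocess.run without worrying about shell quoting.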
2.3.3 Hadoop Yarn
YARN is the cluster manager that comes with Hadoop. By default, you do not need to start it manually. To use YARN as the cluster manager, you only need to specify yarn as the master when submitting a job.
    /usr/local/spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
      --master yarn \
      --deploy-mode client \
      --name "Example Program" \
      --num-executors 40 \
      --executor-memory 10g \
      my_script.py
Compared with the standalone cluster manager, YARN lets you control three resource settings:
- --num-executors: a Spark application uses a fixed number of executors. By default, this value is only 2;
- --executor-memory: sets the memory used by each executor;
- --executor-cores: sets the number of cores each executor process requests from YARN.
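Because the three settings multiply together, it can help to estimate what an application asks YARN for in total. The sketch below is simple arithmetic under stated assumptions: the function yarn_request_totals is hypothetical, the default of one core per executor is an assumption not stated in this article, and any extra memory that YARN containers add on top of the executor memory is ignored.

```python
def yarn_request_totals(num_executors=2, executor_memory_gb=1, executor_cores=1):
    """Estimate the total resources an application requests from YARN."""
    return {
        "executors": num_executors,
        # Total executor memory, ignoring any per-container overhead.
        "memory_gb": num_executors * executor_memory_gb,
        "cores": num_executors * executor_cores,
    }


if __name__ == "__main__":
    # The spark-submit example above: 40 executors with 10g each.
    print(yarn_request_totals(num_executors=40, executor_memory_gb=10))
```

For the example submission with 40 executors of 10g each, this comes to 400 GB of executor memory, which is worth checking against the capacity of the cluster before submitting.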
2.3.4 Apache Mesos
Mesos needs to be downloaded separately: https://open.mesosphere.com/d… After downloading it, select Mesos by specifying the master:
./bin/spark-shell --master mesos://host:5050
2.4 Deployment mode
All cluster managers support two deployment modes. The driver program runs in a different place in each mode:
- Client mode, the default deployment mode. The driver program runs on the machine where you execute spark-submit, as part of the spark-submit command. This means you can see the output of the driver program directly, or type input directly (through an interactive shell).
- Cluster mode. With the standalone cluster manager, the driver is launched as a separate process on a worker node. It also connects to the master node to request executors.
Specify the deployment mode with the --deploy-mode parameter.
    # --deploy-mode accepts client or cluster
    /usr/local/spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
      --master yarn \
      --deploy-mode client \
      my_script.py