Spark Learn Part01: introduction and installation


Chap 0 Introduction

From a historical perspective:

  1. Spark originated from the big data analysis platform of AMPLab at the University of California, Berkeley

  2. Spark is based on in-memory computing and multi-iteration batch processing

  3. Spark covers data warehousing, stream processing, graph computing and other computing paradigms, and is a full-stack computing platform in the field of big data systems

Chap 1 Introduction to Spark

Key points of this chapter:

  1. Spark framework, architecture, computing model and data management strategy

  2. A brief introduction to the Spark BDAS project and its subprojects

  3. The Spark ecosystem includes many subprojects: Spark SQL, Spark Streaming, GraphX, MLlib

1-1 What is Spark

1.1.1 History and development of Spark

  1. 2009: Spark was born in AMPLab

  2. February 2014: became an Apache top-level project

  3. May 2014: Spark 1.0.0 released

1.1.2 Spark and Hadoop

Spark is an alternative to MapReduce and is compatible with distributed storage layers such as HDFS and Hive.

Compared with Hadoop MapReduce, Spark has advantages in the following areas:

  1. Intermediate result output

  2. Data format and memory layout

  3. Execution strategy

  4. Cost of task scheduling

    (Spark uses the event-driven library Akka to start tasks, and reuses threads through a thread pool to avoid the overhead of starting and switching threads.)
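
To make the thread reuse point concrete, here is a minimal sketch in plain Scala using the JDK's java.util.concurrent executors. It is not Spark's or Akka's actual code; the object name ThreadPoolReuseSketch and the method distinctWorkerThreads are hypothetical and exist only for this illustration. It submits several small tasks to a fixed-size pool and records which threads ran them.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

// Hypothetical illustration, not Spark code: tasks share a small set of
// pooled threads instead of each task starting a thread of its own.
object ThreadPoolReuseSketch {
  // Runs `numTasks` small tasks on a fixed pool of `poolSize` threads and
  // returns the names of the distinct threads that executed them.
  def distinctWorkerThreads(poolSize: Int, numTasks: Int): Set[String] = {
    val pool = Executors.newFixedThreadPool(poolSize)
    try {
      val futures = (1 to numTasks).map { _ =>
        pool.submit(new Callable[String] {
          override def call(): String = Thread.currentThread().getName
        })
      }
      // Wait for every task and collect the thread name each one reported.
      futures.map(_.get()).toSet
    } finally {
      pool.shutdown()
      pool.awaitTermination(5, TimeUnit.SECONDS)
    }
  }
}
```

Calling ThreadPoolReuseSketch.distinctWorkerThreads(2, 8) runs eight tasks but returns at most two thread names, because the pool never creates more threads than its fixed size.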

1.1.3 What can Spark bring?

  1. An efficient data pipeline with a full stack and multiple computing paradigms

  2. Lightweight and fast processing

  3. Easy to use, spark supports multiple languages

  4. Compatible with storage layers such as HDFS

  5. High community activity

1-2 The Spark ecosystem: BDAS

Spark ecosystem = BDAS = Berkeley Data Analytics Stack

(1) Spark
Spark is the core of BDAS and a distributed programming framework for big data.

1-3 spark architecture

  1. The code structure of spark

  2. The architecture of spark

  3. Spark operation logic

(1) The code structure of spark


Scheduler: contains the code responsible for overall Spark application and task scheduling.
Broadcast: contains the implementation code of broadcast variables.
API: contains the implementation of the Java and Python APIs.
Deploy: contains the code to deploy and start Spark.
Common: not a folder, but stands for the general classes and logic of Spark, about 5000 lines of code.
Metrics: contains the logic code for runtime state monitoring.
Executor: contains the logic code of the worker nodes responsible for computation.
Partial: contains the approximate evaluation code.

(2) The architecture of spark

Spark adopts the master-slave model common in distributed computing. The master is the node that runs the Master process in the cluster, and a slave is a node that runs a Worker process. The Master, as the controller of the whole cluster, is responsible for its normal operation. A Worker, as a computing node, receives commands from the Master and reports its status. An Executor is responsible for executing tasks. The Client, as the user's client, is responsible for submitting applications, and the Driver is responsible for controlling the execution of an application.


Spark starts the Master process and the Worker processes separately to control the whole cluster. During the execution of a Spark application, the Driver and the Workers are the two important roles.

The Driver program is the starting point of application logic execution and is responsible for job scheduling, that is, task distribution.
Workers manage the computing nodes and create Executors that process tasks in parallel.

In the execution phase, the Driver serializes each task together with the files and jars it depends on and passes them to the corresponding Worker machine. The Executors then process the tasks for their corresponding data partitions.

The basic components of spark architecture are as follows:

  • Cluster Manager: the Master in standalone mode, which controls the whole cluster and monitors the Workers.

  • Worker: a slave node, responsible for controlling the computing node and starting the Executor or Driver. In YARN mode this role is played by the NodeManager.

  • Driver: runs the main() function of the application and creates the SparkContext.

  • Executor: a component that executes tasks on a Worker node and starts a thread pool to run them. Each application has its own independent set of Executors.

  • SparkContext: the context of the whole application, which controls the life cycle of the application.

  • RDD: the basic computing unit of Spark. A group of RDDs can form a directed acyclic graph to be executed, the RDD graph.

  • DAGScheduler: builds a stage-based DAG for each job and submits the stages to the TaskScheduler.

  • TaskScheduler: distributes tasks to the Executors for execution.

  • SparkEnv: a thread-level context that stores references to important runtime components. SparkEnv creates and holds references to the following components:

      • MapOutputTracker: responsible for storing shuffle metadata.
      • BroadcastManager: responsible for controlling broadcast variables and storing their metadata.
      • BlockManager: responsible for storage management, creating and finding blocks.
      • MetricsSystem: monitors runtime performance metrics.
      • SparkConf: responsible for storing configuration information.

The overall process of Spark is as follows. The Client submits the application, and the Master finds a Worker to start the Driver. The Driver requests resources from the Master or the resource manager and then converts the application into an RDD graph. The DAGScheduler converts the RDD graph into a directed acyclic graph of stages and submits it to the TaskScheduler, and the TaskScheduler submits the tasks to the Executors for execution. During task execution, the other components work together to make sure the whole application runs smoothly.

(3) Spark operation logic

For a Spark application, the whole execution process logically forms a directed acyclic graph (DAG).

Operators accumulate without being executed until an action operator is triggered. At that point all the accumulated operators form a directed acyclic graph, and the scheduler then schedules the tasks on that graph for execution.
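
The "accumulate first, run on action" behaviour can be imitated in plain Scala with a lazy Iterator. This is only an analogy, not the Spark API; the object name LazyPipelineSketch and its run method are hypothetical. The counter shows that the mapping function does not run while the pipeline is being described, only once a result is requested.

```scala
// Hypothetical analogy in plain Scala (no Spark involved): describing a
// pipeline does no work; asking for a result runs it.
object LazyPipelineSketch {
  // Returns (elements mapped before the action, elements mapped after it, result).
  def run(): (Int, Int, Int) = {
    var evaluated = 0

    // "Transformations": the iterator only records what should happen
    // to each element, nothing is computed yet.
    val doubled = Iterator.range(1, 11).map { x => evaluated += 1; x * 2 }
    val multiplesOfFour = doubled.filter(_ % 4 == 0)
    val evaluatedBeforeAction = evaluated

    // "Action": asking for the sum forces the whole pipeline to run.
    val result = multiplesOfFour.sum

    (evaluatedBeforeAction, evaluated, result)
  }
}
```

Here run() reports that zero elements were mapped before the sum was requested and all ten afterwards, which mirrors how Spark defers work until an action such as count() or first() is called.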

Spark's scheduling differs from that of MapReduce. Spark divides a job into stages according to the dependencies between RDDs, and one stage contains a pipeline of functions executed in sequence. In the diagram, A, B, C, D, E and F represent different RDDs, and the boxes inside each RDD represent partitions. Data is read from HDFS into Spark to form RDD A and RDD C. RDD C is converted to RDD D by a map operation. RDD B and RDD E are joined to produce RDD F, and a shuffle is executed during this join. Finally, RDD F is saved to HDFS through the function saveAsSequenceFile.
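
Why a join needs a shuffle can be shown with a toy model in plain Scala. This is a simplified sketch under stated assumptions, not Spark's implementation: the object ShuffleJoinSketch, its methods and the hash-based routing are all hypothetical choices made for illustration. Records from both inputs are first routed to partitions by key, so that matching keys meet in the same partition, and only then joined partition by partition.

```scala
// Hypothetical toy model of a shuffle followed by a join (no Spark involved).
object ShuffleJoinSketch {
  // "Shuffle": route each (key, value) record to a partition chosen by
  // hashing its key, so equal keys always land in the same partition.
  def shuffle[V](records: Seq[(String, V)], numPartitions: Int): Map[Int, Seq[(String, V)]] =
    records.groupBy { case (key, _) => Math.floorMod(key.hashCode, numPartitions) }

  // "Join": after both inputs are shuffled the same way, each partition
  // can be joined independently of the others.
  def join(b: Seq[(String, Int)],
           e: Seq[(String, String)],
           numPartitions: Int): Seq[(String, (Int, String))] = {
    val bParts = shuffle(b, numPartitions)
    val eParts = shuffle(e, numPartitions)
    (0 until numPartitions).flatMap { p =>
      val left = bParts.getOrElse(p, Seq.empty)
      val right = eParts.getOrElse(p, Seq.empty)
      for {
        (k1, v1) <- left
        (k2, v2) <- right
        if k1 == k2
      } yield (k1, (v1, v2))
    }
  }
}
```

With inputs Seq("a" -> 1, "b" -> 2) and Seq("a" -> "x", "c" -> "y"), only the key "a" appears on both sides, so the join yields the single pair ("a", (1, "x")). The data movement in the shuffle step is what forces a stage boundary, whereas a map such as C to D can stay inside one stage.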


Chap 2 Installation and deployment of Spark

Spark is easy to install.

Spark website

Spark mainly uses HDFS as its persistence layer, so you need to install Hadoop before installing Spark.

2-1 Installation and deployment of Spark

Spark is a computing framework that mainly relies on storage layers such as HDFS and Hive for persistence.

1. Installing spark in Linux Cluster

  1. Install JDK

  2. Install Scala

  3. Configure SSH password free login (optional)

  4. Install Hadoop

  5. Install spark

  6. Start spark cluster

Download from spark website

5. Install spark

(1). download  spark-1.5.2-bin-hadoop2.6.tgz

(2). tar -xzvf spark-1.5.2-bin-hadoop2.6.tgz

(3) . configure conf / spark-
    1) Please refer to the official website for details of complex parameter configuration
    2) vim conf/
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home
    export SCALA_HOME=/usr/local/Cellar/scala/2.11.5
    export SPARK_HOME=/usr/local/xSoft/spark
    export SPARK_MASTER_IP=192.168.181.**
    export MASTER=spark://192.168.181.**:7077


    export SPARK_WORKER_MEMORY=1000m


(4). Configure conf/slaves (optional for testing)
(5). A running SSH server is generally required

6. Start spark cluster

  • Start spark in the spark root directory


After startup, running jps shows the Master and Worker processes:

➜  spark-1.5.2-bin-hadoop2.6  jps
11262 Jps
11101 Master
11221 Worker

2-2 preliminary trial of spark cluster

You can run the Spark examples in two ways:

  • Run with ./bin/run-example

$ cd /opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/spark

$ ./bin/run-example org.apache.spark.examples.SparkPi

  • Run in ./bin/spark-shell mode

scala> import org.apache.spark._
import org.apache.spark._


scala> object SparkPi {
     |   def main(args: Array[String]): Unit = {
     |     val slices = 2
     |     val n = 100000 * slices
     |     val count = sc.parallelize(1 to n, slices).map { i =>
     |       val x = math.random * 2 - 1
     |       val y = math.random * 2 - 1
     |       if (x * x + y * y < 1) 1 else 0
     |     }.reduce(_ + _)
     |     println("Pi is roughly " + 4.0 * count / n)
     |   }
     | }
defined module SparkPi


// By default the Spark shell has already initialized a SparkContext as the object sc, which user code can use directly.

// Spark comes with an interactive shell program to facilitate interactive programming.

  • Viewing cluster status through the Web UI


2-3 Spark quick start

quick-start :


scala> val textFile = sc.textFile("")
textFile: spark.RDD[String] = [email protected]

RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let's start with a few actions:

scala> textFile.count() // Number of items in this RDD
res0: Long = 126

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark

2-4 Summary of this chapter

Because Spark mainly uses HDFS as its persistence layer, Hadoop needs to be installed in advance to make full use of Spark.

Spark abstracts distributed in-memory data as resilient distributed datasets (RDDs) and implements a rich set of operators on them for computation. Finally, the sequence of operators is transformed into a DAG for scheduling and execution.
