Introduction to Spark concepts and framework


1. What is Spark

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It was developed by the AMPLab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, and can be used to build large, low-latency data analysis applications. Spark entered the Apache Incubator in June 2013 and became a top-level Apache project eight months later. Thanks to its speed and its advanced design, Spark quickly became a popular project in the community.

Spark is implemented in Scala, an object-oriented and functional programming language, and it can operate on distributed datasets as easily as on local collection objects. (Scala provides a concurrency model called actors, in which an actor sends and receives asynchronous messages through its inbox instead of sharing data. This approach is called the shared-nothing model.)
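The actor idea mentioned above can be illustrated with a small sketch. It is written in plain Python rather than Scala and is not Scala's actor library; the class name CounterActor and the message format ("add"/"get" tuples) are assumptions made only for this illustration. The point it shows is that the actor's state is touched by one thread only, and everything else communicates with it through its inbox.

```python
import queue
import threading


class CounterActor:
    """Toy actor: private state that is reached only through messages."""

    def __init__(self):
        self.inbox = queue.Queue()
        self._total = 0  # private state, only the actor's own thread touches it
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        # Asynchronous: the sender returns immediately after enqueueing.
        self.inbox.put(message)

    def _run(self):
        while True:
            message = self.inbox.get()
            if message is None:  # stop signal
                break
            kind, payload = message
            if kind == "add":
                self._total += payload
            elif kind == "get":
                # Reply through a queue owned by the caller, not shared state.
                payload.put(self._total)

    def stop(self):
        self.inbox.put(None)
        self._thread.join()


def demo_actor():
    actor = CounterActor()
    for n in (1, 2, 3):
        actor.send(("add", n))
    reply = queue.Queue()
    actor.send(("get", reply))
    total = reply.get()  # blocks until the actor has answered
    actor.stop()
    return total
```

Because the inbox is a FIFO queue with a single consumer, the three "add" messages are processed before the "get" request, so no lock around the counter is needed.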

Official website:
Official documentation:
Source code:

2. Spark features

1: fast operation
Spark has a DAG execution engine that supports iterative computation on data held in memory. It is reported to be more than 10 times faster than Hadoop MapReduce when the data is read from disk, and up to 100 times faster when the data is read from memory.

2: easy to use
Applications can be written not only in Scala but also in Java, Python, and other languages. Scala in particular is an efficient and extensible language that can express complex processing in concise code.

3: generality
Spark provides a large number of libraries, including Spark SQL, DataFrames, MLlib, GraphX, and Spark Streaming. Developers can combine these libraries seamlessly in the same application.

4: support multiple resource managers
Spark supports Hadoop YARN, Apache Mesos, and its own standalone cluster manager.
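To make the "DAG execution engine" from the first feature more concrete, here is a minimal sketch in plain Python. It is not Spark's scheduler: the function run_dag, the stage names, and the dictionary layout are assumptions chosen for the illustration, and it relies on graphlib from the standard library (Python 3.9+). It shows stages running in dependency order while intermediate results stay in memory for later stages.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(stages, dependencies):
    """Run every stage once, after all the stages it depends on.

    stages:       stage name -> function taking a dict of upstream results
    dependencies: stage name -> set of upstream stage names
    """
    results = {}  # intermediate results are kept in memory between stages
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies[name]}
        results[name] = stages[name](upstream)
    return order, results


# A small hypothetical job: one load stage feeding two branches that are merged.
stages = {
    "load": lambda up: [1, 2, 3, 4],
    "square": lambda up: [x * x for x in up["load"]],
    "evens": lambda up: [x for x in up["load"] if x % 2 == 0],
    "combine": lambda up: sum(up["square"]) + sum(up["evens"]),
}
dependencies = {
    "load": set(),
    "square": {"load"},
    "evens": {"load"},
    "combine": {"square", "evens"},
}
```

Calling run_dag(stages, dependencies) runs "load" first and "combine" last; the result of "load" is computed once and reused by both branches.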

3. Introduction to the Spark ecosystem

The Spark ecosystem, also known as BDAS (Berkeley Data Analytics Stack), is a platform built by the AMPLab at Berkeley that aims to support big data applications through large-scale integration among algorithms, machines, and people. The following figure gives an overview of the ecosystem.

[Figure: the Spark ecosystem (BDAS)]

1:Spark Core

Spark Core implements the basic functions of Spark, including task scheduling, memory management, fault recovery, and interaction with storage systems. It also contains the API definition of the resilient distributed dataset (RDD). An RDD represents a collection of elements that is distributed across multiple compute nodes and can be operated on in parallel; it is Spark's main programming abstraction.
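The idea of a collection split into partitions that are processed independently can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the RDD API: the helper names partition, map_partitions, and reduce_partitions are made up for this example, and everything runs in one process.

```python
from functools import reduce


def partition(data, num_partitions):
    """Distribute the elements round-robin over num_partitions lists."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def map_partitions(partitions, fn):
    """Apply fn to every element; each partition is handled on its own."""
    return [[fn(x) for x in part] for part in partitions]


def reduce_partitions(partitions, fn):
    """Reduce inside each partition first, then combine the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)
```

For example, partition(list(range(1, 11)), 3) yields three partitions; doubling every element with map_partitions and summing with reduce_partitions gives the same result as doing the work on the whole list at once, which is what makes partition-wise parallel execution possible.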

2:Spark SQL

Spark SQL grew out of Shark. In essence, Shark used Hive's HQL parser to translate HQL into RDD operations on Spark and used Hive's metadata to obtain the table information in the database; it then fetched the actual data and files from HDFS and loaded them into Spark for computation. Spark SQL supports a variety of data sources, such as Hive and JSON.

3:Spark Streaming

Spark Streaming is a high-throughput, fault-tolerant stream processing system for real-time data streams. It can perform complex operations such as map, reduce, and join on data from multiple sources (such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets) and save the results to external file systems, databases, or real-time dashboards. Spark Streaming provides an API for operating on data streams that closely mirrors the RDD API, which greatly lowers the barrier and cost of learning and development.

Internal principle: Spark Streaming decomposes stream computation into a series of short batch jobs, with Spark Core as the batch engine. The input data is divided into segments according to the batch size (for example, one second), forming a discretized stream (DStream). Each segment is converted into an RDD in Spark, and each transformation on a DStream in Spark Streaming becomes a transformation on RDDs in Spark, whose intermediate results are kept in memory. Depending on business requirements, the overall stream computation can accumulate the intermediate results or store them on external devices.
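The micro-batch principle described above can be sketched in plain Python. This is not the Spark Streaming API; the function names and the (timestamp, text) event format are assumptions made for the illustration. Events are grouped into fixed-length time slices, and the same batch function (here a word count) is applied to each slice.

```python
from collections import defaultdict


def to_micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Intervals that received no events are simply skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    return [batches[key] for key in sorted(batches)]


def word_count(batch):
    """The batch job that is applied to every micro-batch."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def process_stream(events, batch_seconds=1.0):
    """Stream processing expressed as a series of small batch jobs."""
    return [word_count(batch) for batch in to_micro_batches(events, batch_seconds)]
```

With a one-second batch size, events stamped 0.1 and 0.7 end up in the first batch and an event stamped 1.2 in the second, so each batch produces its own independent result.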


4:MLlib

MLlib is a machine learning library that provides a variety of algorithms for classification, regression, clustering, collaborative filtering, and so on. Some of these algorithms can also be applied to streaming data, such as linear regression using ordinary least squares or k-means clustering (and more). Apache Mahout (a machine learning library for Hadoop) has moved away from MapReduce in favor of Spark.
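As a reminder of what linear regression by ordinary least squares computes, here is a minimal single-variable sketch in plain Python. It is not MLlib code; the function name fit_line is an assumption for this example. It uses the closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the mean point.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For points that lie exactly on a line, such as xs = [0, 1, 2, 3] and ys = [1, 3, 5, 7], the fit recovers that line (slope 2, intercept 1).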

5:GraphX

GraphX is Spark's API for graphs and graph-parallel computation. It can be regarded as a rewrite and optimization of GraphLab (C++) and Pregel (C++) on Spark (Scala). Compared with other distributed graph computing frameworks, GraphX's greatest contribution is a one-stack data solution on Spark, in which a complete graph computing pipeline can be carried out conveniently and efficiently. GraphX began as a distributed graph computing framework project at the AMPLab in Berkeley and was later integrated into Spark as a core component.

4. Applicable scenarios of Spark

1: Complex batch data processing, where the focus is on the ability to process massive amounts of data and a slower processing speed can be tolerated. Typical run times range from tens of minutes to hours (similar to Hadoop MapReduce jobs).
2: Workloads where the amount of data is not very large but real-time statistical analysis is required (real-time computation).
3: Spark is a memory-based iterative computing framework, so it suits applications that operate on a specific dataset many times. The more often the operations are repeated and the more data has to be read, the greater the benefit; when the data volume is small but the computation is intensive, the benefit is smaller.

The current, officially recommended usage patterns are as follows:

[Figure: officially recommended usage patterns]

5. Spark's operation modes

1: Local mode is commonly used for local development and testing. It is further divided into local single-thread mode and local-cluster multi-thread mode.

2: Standalone cluster mode is a typical master/slave architecture, so the master is a single point of failure; Spark supports ZooKeeper to implement high availability (HA).

3: On-YARN cluster mode runs on the YARN resource manager framework. YARN is responsible for resource management, and Spark is responsible for task scheduling and computation.

4: On-Mesos cluster mode runs on the Mesos resource manager framework. Mesos is responsible for resource management, and Spark is responsible for task scheduling and computation.
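In practice, the operation mode is selected through the master URL that an application is started with. The sketch below maps the common master URL forms (local, local[N], spark://..., yarn, mesos://...) to the modes listed above. The function describe_master is hypothetical and written in plain Python only for illustration; any host names or ports passed to it are placeholders, not values from this article.

```python
def describe_master(master):
    """Map a Spark master URL to the operation mode it selects."""
    if master == "local":
        return "local mode, single thread"
    if master.startswith("local[") and master.endswith("]"):
        # e.g. local[4] for four threads, local[*] for one per core
        return "local mode, multiple threads"
    if master.startswith("spark://"):
        return "standalone cluster mode"
    if master == "yarn":
        return "on-YARN cluster mode"
    if master.startswith("mesos://"):
        return "on-Mesos cluster mode"
    raise ValueError(f"unrecognized master URL: {master!r}")
```

For example, describe_master("local[4]") reports the multi-threaded local mode, while a URL of the form spark://host:port selects the standalone cluster.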

6. Basic principles of Spark

The Spark operation framework is shown in the figure below. At the cluster level, there is a cluster resource management service (the cluster manager) and worker nodes that run job tasks. At the application level, each application has a task control node (the driver), and each machine node hosts the processes that execute the actual tasks (executors).
When an application runs, the driver program starts multiple workers. The workers load data from the file system and generate RDDs (that is, the data is placed into RDDs, which are data structures), and cache them in memory partition by partition.
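The division of labor between the driver and the executors can be imitated on a single machine with a thread pool. The sketch below is plain Python, not Spark: run_job plays the role of the driver, the pool threads stand in for executors, and the names run_job and count_words are assumptions made for this illustration. One task is submitted per partition, and the results are collected in partition order.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(partitions, task, num_executors=2):
    """Driver side: submit one task per partition and gather the results."""
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        # pool.map preserves the order of the partitions in its results.
        return list(pool.map(task, partitions))


def count_words(lines):
    """Executor side: the work performed on a single partition."""
    return sum(len(line.split()) for line in lines)
```

The driver never processes the data itself; it only decides which task runs on which partition and combines the per-partition results afterwards, for instance by summing them.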

[Figure: Spark operation framework]

7. Introduction to Spark RDD

The RDD is one of the core concepts of Spark (before version 2.0). RDD stands for resilient distributed dataset (sometimes translated as "elastic distributed dataset"); the underlying object is a dataset, which can be thought of as an in-memory database. An RDD is read-only and partitioned. All or part of the dataset can be cached in memory and reused across multiple computations. "Resilient" (elastic) here means that when memory is insufficient, data can be swapped to disk. This relates to another feature of RDDs, in-memory computing, in which data is kept in memory. At the same time, to cope with limited memory capacity, Spark gives users a great deal of freedom: how each dataset is handled is configurable, including whether to cache it and how.
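Two of the properties described above, read-only data and optional caching for reuse, can be sketched with a toy class in plain Python. This is not the RDD API: the class ToyDataset, its methods, and the computations counter are assumptions introduced for this illustration, and partitioning and spilling to disk are left out.

```python
class ToyDataset:
    """Read-only dataset: transformations return new objects, never mutate."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing the elements
        self._use_cache = False
        self._cached = None
        self.computations = 0     # how often the data was actually produced

    def cache(self):
        """Ask for the elements to be kept in memory after the first use."""
        self._use_cache = True
        return self

    def map(self, fn):
        """Return a new dataset; this one stays unchanged."""
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def collect(self):
        if self._use_cache and self._cached is not None:
            return list(self._cached)
        self.computations += 1
        data = tuple(self._compute())  # stored as a tuple, so it stays read-only
        if self._use_cache:
            self._cached = data
        return list(data)
```

Without cache(), every collect() on a derived dataset recomputes its parent; with cache(), the parent is produced once and then reused, which is where repeated, iterative workloads gain the most.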

8. Spark task submission

spark-submit accepts a variety of parameters:

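./bin/spark-submit \
  --class <main-class> --master <master-url> --deploy-mode <deploy-mode> --conf <key>=<value> \
  … # other options
  <application-jar> [application-arguments]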

The parameters are explained as follows:
--class: the entry point of the Spark application, that is, the class containing the main method (for example, org.apache.spark.examples.SparkPi)
--master: the master URL of the cluster (for example, a URL beginning with spark://)
--deploy-mode: the deployment mode, either cluster or client; the default is client
--conf: additional Spark configuration properties
application-jar: the path to the application JAR; the path must be visible throughout the cluster
application-arguments: the arguments passed to the main method
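To show how the parameters above fit together, the following plain-Python sketch assembles a spark-submit argument list. The helper build_spark_submit is hypothetical and only builds the command; it does not run it. The flags --class, --master, --deploy-mode, and --conf are the ones explained above, while the JAR path, the application argument, and the configuration value in the usage note are placeholders chosen for this example.

```python
import shlex


def build_spark_submit(main_class, master, app_jar, app_args=(),
                       deploy_mode="client", conf=None,
                       spark_submit="./bin/spark-submit"):
    """Assemble the argument list for a spark-submit invocation."""
    argv = [spark_submit,
            "--class", main_class,
            "--master", master,
            "--deploy-mode", deploy_mode]
    # Each additional property becomes its own --conf key=value pair.
    for key, value in (conf or {}).items():
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_jar)                        # the application JAR comes after the options
    argv.extend(str(arg) for arg in app_args)   # then the arguments for main
    return argv


def as_shell_command(argv):
    """Render the argument list as one safely quoted shell command line."""
    return shlex.join(argv)
```

For instance, build_spark_submit("org.apache.spark.examples.SparkPi", "local[2]", "examples.jar", ["100"], conf={"spark.executor.memory": "2g"}) returns a list that can be passed to subprocess.run, and as_shell_command turns it into a line that can be pasted into a terminal.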

For a more detailed description of the parameters, see the official Spark documentation.