Everyone is talking about Spark these days. In the age of big data, we think it is worth publishing a series of technical articles on Spark to help you understand and master it, from beginner to proficient and from concepts to programming, and to deeply experience its charm :)
Spark, here we go!
1. Why Spark
First of all, Spark is a highly effective open-source data processing engine designed for fast processing, ease of use, and advanced analytics. It has contributors from more than 250 organizations, and its community keeps attracting new developers and users.
Second, as a general-purpose computing engine designed for large-scale distributed data processing, Spark supports multiple workloads through a unified engine made up of Spark components, accessible as libraries through its APIs. It supports popular programming languages, including Scala, Java, Python, and R.
Finally, it can be deployed in different environments, read data from a wide variety of data sources, and interact with numerous applications.
At the same time, this unified computing engine ideally lets different workloads – ETL, Spark SQL, machine learning, GraphX/GraphFrames, and Spark Streaming – all run on the same engine.
In the next sections, you'll get an introduction to some of these components, but first let's go over Spark's key concepts and terminology.
2. Concepts, key terms, and keywords of Apache Spark
In June this year, KDnuggets published an explanation of the key terms of Apache Spark (http://www.kdnuggets.com/2016…), which is a very good introduction. The following is a supplement to Spark's glossary, with terms that will appear often in this article.
Spark cluster
A set of machines or nodes, provisioned in the cloud or in a data center, on which Spark is installed. Those machines run the Spark workers, the Spark master (the cluster manager in standalone mode), and at least one Spark driver.
Spark master
As the name suggests, the Spark master JVM acts as the cluster manager in standalone deployment mode, and the Spark workers register themselves with it as part of the cluster. Depending on the deployment mode, it acts as a resource manager, deciding how many executors to launch, and on which machines in the cluster.
Spark worker
Upon receiving instructions from the Spark master, the Spark worker JVM launches executors on behalf of the Spark driver. A Spark application is decomposed into units of tasks, which are executed by each worker's executors. In short, a worker's job is to launch executors on behalf of the master.
Spark executor
An executor is a JVM container, with an allocated amount of cores and memory, on which Spark runs its tasks. Each worker node launches its own Spark executor with a configurable number of cores (threads). Besides executing Spark tasks, each executor also stores and caches data partitions in memory.
Spark driver
Once it has obtained information about all the workers in the cluster from the Spark master, the driver assigns Spark tasks to each worker's executors. The driver also receives the computed results from each executor's tasks.
SparkSession and SparkContext
As shown in the diagram, the SparkContext is the channel through which all Spark functionality is accessed; there is only one SparkContext per JVM. The Spark driver uses it to connect and communicate with the cluster manager and to submit Spark jobs, and it lets you configure Spark's runtime parameters. Through the SparkContext, the driver can instantiate the other contexts, such as SQLContext, HiveContext, and StreamingContext.
As of Apache Spark 2.0, the SparkSession gives access to all of the Spark functionality mentioned above through a single unified entry point. Besides making Spark features simpler to reach, it also lets you operate on data without going through the underlying contexts by hand.
Spark deployment modes
Spark supports four cluster deployment modes, which differ in where Spark's components run within the cluster; each has its own characteristics. Of all the modes, local mode, which runs on a single host, is by far the simplest.
As a junior or intermediate developer, you don't need to memorize these details; they are here for your reference. In addition, the fifth part of this article will provide an in-depth look at all aspects of the Spark architecture.
Spark apps, jobs, stages, and tasks
A Spark application usually comprises several Spark operations, which can be decomposed into transformations or actions on datasets, using Spark's RDDs, DataFrames, or Datasets. For example, when you invoke an action in a Spark application, that action generates a job. A job is decomposed into one or more stages; stages are further divided into individual tasks; a task is the unit of execution that the Spark driver's scheduler ships to a Spark executor on a Spark worker node. Usually, multiple tasks run in parallel on the same executor, each processing its partition of the dataset in memory.