Real time computing framework: Spark cluster construction and introduction case


1、 Spark overview

1. Introduction to Spark

Spark is a fast, general-purpose, and scalable memory-based cluster computing engine designed for large-scale data processing. It implements an efficient DAG execution engine and can process data efficiently in memory; compared with MapReduce, its computing speed is significantly improved.

2. Operation structure



Driver: runs the main() function of the Spark application and creates the SparkContext, which is responsible for communicating with the Cluster Manager, applying for resources, and assigning and monitoring tasks.


Cluster Manager: responsible for applying for and managing the resources needed to run the application on the WorkerNodes; it can scale computation efficiently from one node to thousands of nodes. Currently supported managers include Spark's native standalone cluster manager, Apache Mesos, and Hadoop YARN.


Executor: a process running on a WorkerNode. As the work process, it is responsible for running tasks and keeping data in memory or on disk. Each application has its own batch of Executors, and their tasks are independent of each other.

2、 Environment deployment

1. Scala environment

Installation package management

[root@hop01 opt]# tar -zxvf scala-2.12.2.tgz
[root@hop01 opt]# mv scala-2.12.2 scala2.12

Configuration variables

[root@hop01 opt]# vim /etc/profile

export SCALA_HOME=/opt/scala2.12
export PATH=$PATH:$SCALA_HOME/bin

[root@hop01 opt]# source /etc/profile

Version view

[root@hop01 opt]# scala -version

The Scala environment needs to be deployed on every node that runs Spark services.

2. Spark basic environment

Installation package management

[root@hop01 opt]# tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz
[root@hop01 opt]# mv spark-2.1.1-bin-hadoop2.7 spark2.1

Configuration variables

[root@hop01 opt]# vim /etc/profile

export SPARK_HOME=/opt/spark2.1
export PATH=$PATH:$SPARK_HOME/bin

[root@hop01 opt]# source /etc/profile

Version view

[root@hop01 opt]# spark-shell


3. Spark cluster configuration

Service node

[root@hop01 opt]# cd /opt/spark2.1/conf/
[root@hop01 conf]# cp slaves.template slaves
[root@hop01 conf]# vim slaves
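The contents of the slaves file were not captured in the original page; it simply lists the hostnames of the worker nodes, one per line. A sketch assuming a three-node cluster where hop01 also runs a Worker (hop02 and hop03 are illustrative names):

```
hop01
hop02
hop03
```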


Environment configuration

[root@hop01 conf]# cp spark-env.sh.template spark-env.sh
[root@hop01 conf]# vim spark-env.sh

export JAVA_HOME=/opt/jdk1.8
export SCALA_HOME=/opt/scala2.12
export SPARK_MASTER_IP=hop01
export SPARK_LOCAL_IP=<installation node IP>
export HADOOP_CONF_DIR=/opt/hadoop2.7/etc/hadoop

Pay attention to the SPARK_LOCAL_IP configuration: on each node it must be set to that node's own IP.

4. Starting Spark

Spark depends on the Hadoop environment, so Hadoop needs to be started first.

Start: /opt/spark2.1/sbin/start-all.sh
Stop: /opt/spark2.1/sbin/stop-all.sh

After starting, the master node runs two processes, Master and Worker, while each of the other nodes runs only a Worker process.

5. Access the Spark cluster

The master Web UI listens on port 8080 by default.



Basic operation cases:

[root@hop01 spark2.1]# cd /opt/spark2.1/
[root@hop01 spark2.1]# bin/spark-submit --class org.apache.spark.examples.SparkPi --master local examples/jars/spark-examples_2.11-2.1.1.jar

Result: Pi is roughly 3.1455357276786384

3、 Development case

1. Core dependencies

The project depends on Spark version 2.1.1:
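The dependency block itself was lost from the original page; a typical Maven declaration matching the Scala 2.11 / Spark 2.1.1 artifacts used elsewhere in this article would be:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
```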


Introduce the Scala compiler plug-in:
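The plug-in block is likewise missing; a common choice is the scala-maven-plugin (the version shown is an assumption — any recent release works):

```xml
<plugin>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>3.2.2</version>
    <executions>
        <execution>
            <goals>
                <goal>compile</goal>
                <goal>testCompile</goal>
            </goals>
        </execution>
    </executions>
</plugin>
```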


2. Case code development

Read a file from a specified location and output word-count statistics for its contents.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.io.Serializable;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class WordWeb implements Serializable {

    public String getWeb() {
        //1. Create the Spark configuration object
        SparkConf sparkConf = new SparkConf().setAppName("LocalCount").setMaster("local[*]");

        //2. Create the JavaSparkContext object
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        //3. Read the test file
        JavaRDD<String> lineRdd = sc.textFile("/var/spark/test/word.txt");

        //4. Split each line into words
        JavaRDD<String> wordsRdd = lineRdd.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String line) throws Exception {
                String[] words = line.split(",");
                return Arrays.asList(words).iterator();
            }
        });

        //5. Mark each split word with a count of one
        JavaPairRDD<String, Integer> wordAndOneRdd = wordsRdd.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<>(word, 1);
            }
        });

        //6. Sum the counts for each word
        JavaPairRDD<String, Integer> wordAndCountRdd = wordAndOneRdd.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        //7. Sort by word
        JavaPairRDD<String, Integer> sortedRdd = wordAndCountRdd.sortByKey();
        List<Tuple2<String, Integer>> finalResult = sortedRdd.collect();

        //8. Print the results
        for (Tuple2<String, Integer> tuple2 : finalResult) {
            System.out.println(tuple2._1 + " ===> " + tuple2._2);
        }

        //9. Save the statistical results (the output directory must not already exist)
        sortedRdd.saveAsTextFile("/var/spark/output");
        sc.close();
        return "success";
    }
}
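The same split → count → sort pipeline can be checked locally with plain Java streams, without a Spark cluster. A minimal sketch with hypothetical sample input (the class name and input lines are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class LocalWordCount {
    public static void main(String[] args) {
        // Sample comma-separated lines, mirroring the split(",") in the RDD version.
        List<String> lines = Arrays.asList("hello,spark", "hello,scala");

        // Split into words, count occurrences, and keep keys sorted via TreeMap
        // (the stream equivalent of flatMap -> mapToPair -> reduceByKey -> sortByKey).
        Map<String, Integer> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(",")))
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum, TreeMap::new));

        // Prints: hello ===> 2, scala ===> 1, spark ===> 1 (sorted by word)
        counts.forEach((word, count) -> System.out.println(word + " ===> " + count));
    }
}
```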

Package the project and execute it.


To view the file output:

[root@hop01 output]# vim /var/spark/output/part-00000

4、 Source code address

GitHub · address
Gitee · address


