Real-time computing framework: Spark cluster construction and introductory case

Time: 2021-04-28

1、 Spark overview

1. Introduction to Spark

Spark is a fast, general-purpose, and scalable in-memory cluster computing engine designed for large-scale data processing. It implements an efficient DAG execution engine and processes data in memory, which gives it a significant speed improvement over MapReduce.

2. Runtime architecture

Driver

Runs the main() function of the Spark application and creates the SparkContext, which is responsible for communicating with the cluster manager, requesting resources, assigning tasks, and monitoring execution.

ClusterManager

Responsible for requesting and managing the resources needed to run applications on the worker nodes, and can scale computation from a single node to thousands of nodes. Currently supported cluster managers include Spark's native standalone manager, Apache Mesos, and Hadoop YARN.

Executor

An executor is a process launched for an application on a worker node. It runs tasks and keeps data in memory or on disk. Each application has its own set of executors, and tasks run independently of one another.

2、 Environment deployment

1. Scala environment

Installation package management

[root@hop01 opt]# tar -zxvf scala-2.12.2.tgz
[root@hop01 opt]# mv scala-2.12.2 scala2.12

Configuration variables

[root@hop01 opt]# vim /etc/profile

export SCALA_HOME=/opt/scala2.12
export PATH=$PATH:$SCALA_HOME/bin

[root@hop01 opt]# source /etc/profile

Check the version

[root@hop01 opt]# scala -version

The Scala environment needs to be deployed on every node that runs Spark services; a distribution example is shown below.
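
One way to replicate the Scala installation to the other nodes, assuming password-free SSH is configured and the remaining nodes are hop02 and hop03 (the node names used later in the slaves file):

[root@hop01 opt]# scp -r /opt/scala2.12 root@hop02:/opt/
[root@hop01 opt]# scp -r /opt/scala2.12 root@hop03:/opt/
[root@hop01 opt]# scp /etc/profile root@hop02:/etc/profile
[root@hop01 opt]# scp /etc/profile root@hop03:/etc/profile

Remember to run source /etc/profile on each target node afterwards.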

2. Spark basic environment

Installation package management

[root@hop01 opt]# tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz
[root@hop01 opt]# mv spark-2.1.1-bin-hadoop2.7 spark2.1

Configuration variables

[root@hop01 opt]# vim /etc/profile

export SPARK_HOME=/opt/spark2.1
export PATH=$PATH:$SPARK_HOME/bin

[root@hop01 opt]# source /etc/profile

Check the version

[root@hop01 opt]# spark-shell

3. Spark cluster configuration

Service node

[root@hop01 opt]# cd /opt/spark2.1/conf/
[root@hop01 conf]# cp slaves.template slaves
[root@hop01 conf]# vim slaves

hop01
hop02
hop03

Environment configuration

[root@hop01 conf]# cp spark-env.sh.template spark-env.sh
[root@hop01 conf]# vim spark-env.sh

export JAVA_HOME=/opt/jdk1.8
export SCALA_HOME=/opt/scala2.12
export SPARK_MASTER_IP=hop01
export SPARK_LOCAL_IP=<IP of the installation node>
export SPARK_WORKER_MEMORY=1g
export HADOOP_CONF_DIR=/opt/hadoop2.7/etc/hadoop

Pay attention to the SPARK_LOCAL_IP setting: it must point to the IP address of the node on which Spark is installed, so its value differs on each node; see the example below.
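
For example, on the hop01 node the entry could look like this (assuming the host name resolves to the node's own IP via /etc/hosts; each node sets its own value):

export SPARK_LOCAL_IP=hop01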

4. Start Spark

Spark depends on the Hadoop environment, so Hadoop needs to be started first.

Start: /opt/spark2.1/sbin/start-all.sh
Stop: /opt/spark2.1/sbin/stop-all.sh

Two processes are started on the master node, Master and Worker, while the other nodes start only a Worker process; this can be verified as shown below.
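
A quick way to verify is the JDK's jps tool on each node (the process IDs below are illustrative):

[root@hop01 ~]# jps
2163 Master
2235 Worker
[root@hop02 ~]# jps
1987 Worker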

5. Access the Spark cluster

The default port is 8080.

http://hop01:8080/

Basic operation case:

[root@hop01 spark2.1]# cd /opt/spark2.1/
[root@hop01 spark2.1]# bin/spark-submit --class org.apache.spark.examples.SparkPi --master local examples/jars/spark-examples_2.11-2.1.1.jar

Result: Pi is roughly 3.1455357276786384
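
The command above runs the example in local mode. To submit the same job to the standalone cluster instead, the master URL can point at the Master node (assuming the standalone master listens on its default port 7077):

[root@hop01 spark2.1]# bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://hop01:7077 examples/jars/spark-examples_2.11-2.1.1.jar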

3、 Development case

1. Core dependency

Depend on Spark version 2.1.1:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

Introduce the Scala compiler plugin:

<plugin>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>3.2.2</version>
    <executions>
        <execution>
            <goals>
                <goal>compile</goal>
                <goal>testCompile</goal>
            </goals>
        </execution>
    </executions>
</plugin>

2. Case code development

Read a file from a specified location and output word-count statistics for its contents.
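
Since the code below splits each line on commas, the test file is expected to contain comma-separated words. A minimal way to prepare one (the file content here is purely illustrative):

[root@hop01 ~]# mkdir -p /var/spark/test
[root@hop01 ~]# echo "spark,hadoop,hive,spark,java" > /var/spark/test/word.txt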

import java.io.Serializable;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import scala.Tuple2;

@RestController
public class WordWeb implements Serializable {

    @GetMapping("/word/web")
    public String getWeb() {
        //1. Create the Spark configuration object
        SparkConf sparkConf = new SparkConf().setAppName("LocalCount")
                                             .setMaster("local[*]");

        //2. Create the JavaSparkContext object
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        sc.setLogLevel("WARN");

        //3. Read the test file
        JavaRDD<String> lineRdd = sc.textFile("/var/spark/test/word.txt");

        //4. Split each line into words (comma-separated)
        JavaRDD<String> wordsRdd = lineRdd.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String line) throws Exception {
                String[] words = line.split(",");
                return Arrays.asList(words).iterator();
            }
        });

        //5. Map each word to a (word, 1) pair
        JavaPairRDD<String, Integer> wordAndOneRdd = wordsRdd.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<>(word, 1);
            }
        });

        //6. Sum the counts for each word
        JavaPairRDD<String, Integer> wordAndCountRdd = wordAndOneRdd.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer count1, Integer count2) throws Exception {
                return count1 + count2;
            }
        });

        //7. Sort the results by word
        JavaPairRDD<String, Integer> sortedRdd = wordAndCountRdd.sortByKey();
        List<Tuple2<String, Integer>> finalResult = sortedRdd.collect();

        //8. Print the results
        for (Tuple2<String, Integer> tuple2 : finalResult) {
            System.out.println(tuple2._1 + " ===> " + tuple2._2);
        }

        //9. Save the statistical results
        sortedRdd.saveAsTextFile("/var/spark/output");
        sc.stop();
        return "success";
    }
}

Package and execute the application:
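
One possible way to package and trigger the case, assuming the module builds as a standard Spring Boot executable jar (the jar name and web port below are illustrative; the application port must not clash with the Spark Web UI on 8080 if both run on hop01):

[root@hop01 ~]# mvn clean package -DskipTests
[root@hop01 ~]# java -jar word-web-case.jar --server.port=8081
[root@hop01 ~]# curl http://hop01:8081/word/web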

To view the file output:

[root@hop01 output]# vim /var/spark/output/part-00000

4、 Source code address

GitHub address:
https://github.com/cicadasmile/big-data-parent
Gitee address:
https://gitee.com/cicadasmile/big-data-parent
