Big data series: Spark Learning Notes

Time: 2019-10-21

1. About Spark

  • In 2009, Spark was born in UC Berkeley's AMPLab. At first, Spark was just an experimental project with a very small amount of code; it was a lightweight framework.
  • In 2010, UC Berkeley officially open-sourced the Spark project.
  • In June 2013, Spark became a project under the Apache Software Foundation and entered a period of rapid development; third-party developers contributed a lot of code and the community was very active.
  • In February 2014, Spark became an Apache top-level project. At the same time, the big data company Cloudera announced that it would increase its investment in the Spark framework to replace MapReduce.
  • In April 2014, the big data company MapR joined the Spark camp, and Apache Mahout abandoned MapReduce in favor of Spark as its computing engine.
  • In May 2014, Spark 1.0.0 was released.
  • In 2015, Spark became increasingly popular in China's IT industry. More and more companies began deploying or using Spark to replace MapReduce v2, Hive, Storm, and other traditional big data parallel computing frameworks.

2. What is Spark?

  • Apache Spark™ is a unified analytics engine for large-scale data processing.
  • Spark is a general-purpose, memory-based parallel computing framework designed to make data analysis faster.
  • Spark covers the common computing workloads of the big data field (a small sketch follows this list):

    • Spark Core (offline/batch computing)
    • Spark SQL (interactive queries)
    • Spark Streaming (real-time computing)
    • Spark MLlib (machine learning)
    • Spark GraphX (graph computation)
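
A minimal Scala sketch of what "unified" means in practice: the same SparkSession drives both the core RDD API and Spark SQL. This is only an illustration (Spark 2.x, local mode, made-up app name), not part of the cluster setup described later:

    import org.apache.spark.sql.SparkSession

    // Hypothetical app name; local[*] is used only for illustration.
    val spark = SparkSession.builder()
      .appName("unified-demo")
      .master("local[*]")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 10)     // Spark Core: RDD API
    println(rdd.sum())

    val df = spark.range(0, 10)                           // Spark SQL: DataFrame API
    df.createOrReplaceTempView("numbers")
    spark.sql("SELECT count(*) AS n FROM numbers").show()

    spark.stop()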

3. Can Spark replace Hadoop?

Not exactly.

Spark Core can replace MapReduce for offline computing, but data storage still depends on HDFS.

The combination of Spark + Hadoop is currently the most popular and the most promising combination in the big data field.

4. Features of Spark

  • Speed

    • In-memory computation is up to 100x faster than MapReduce
    • On-disk computation is more than 10x faster than MapReduce
  • Ease of use

    • Provides APIs in Java, Scala, Python, and R
  • One-stop solution

    • Spark Core (offline computing)
    • Spark SQL (interactive queries)
    • Spark Streaming (real-time computing)
    • …..
  • Runs on a variety of cluster managers

    • YARN
    • Mesos
    • Standalone

5. Shortcomings of Spark

  • The JVM's memory overhead is large: 1 GB of data typically consumes about 5 GB of memory (Project Tungsten is trying to solve this problem)
  • There is no effective shared-memory mechanism between different Spark applications (Project Tachyon, later renamed Alluxio, introduces distributed memory management so that different Spark applications can share cached data)

6. Spark vs MR

6.1 Limitations of MR

  • Low level of abstraction: code must be written by hand, which makes it hard to use
  • Only two operations, map and reduce, are provided, so expressiveness is limited
  • A job has only two phases, map and reduce; complex computations require chaining many jobs, and the dependencies between jobs must be managed by the developers themselves
  • Intermediate results (the output of each reduce) are written back to HDFS
  • High latency: only suitable for batch processing, with insufficient support for interactive and real-time data processing
  • Poor performance for iterative data processing

6.2 What problems of MR does Spark solve?

  • Low level of abstraction, hand-written code, hard to use

    • Spark abstracts data through RDDs (Resilient Distributed Datasets)
  • Only two operations, map and reduce, limited expressiveness

    • Spark provides a rich set of operators
  • A job has only two stages, map and reduce

    • A Spark job can have many stages
  • Intermediate results are written to HDFS (slow)

    • Spark keeps intermediate results in memory; when memory cannot hold them, they spill to the local disk rather than HDFS
  • High latency, only suitable for batch processing, insufficient support for interactive and real-time processing

    • Spark SQL and Spark Streaming address interactive queries and real-time processing
  • Poor performance for iterative data processing

    • Spark improves the performance of iterative computation by caching data in memory (see the sketch after this list)
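
A minimal Scala sketch tying these points together (runnable in spark-shell, where a SparkContext `sc` already exists; the HDFS path is the one used later in these notes): it shows the RDD abstraction, operators beyond map/reduce, a multi-stage job, and in-memory caching for reuse.

    // Assumes spark-shell, so `sc` (SparkContext) is already available.
    val counts = sc.textFile("hdfs://ns1/sparktest/")   // RDD abstraction over HDFS data
      .flatMap(_.split(","))                            // operators beyond map/reduce
      .map((_, 1))
      .reduceByKey(_ + _)                               // the shuffle introduces a new stage
    counts.cache()                                      // keep the intermediate result in memory

    // Both jobs below reuse the cached RDD instead of re-reading HDFS (iterative-style reuse).
    val top10 = counts.sortBy(_._2, ascending = false).take(10)
    val totalWords = counts.map(_._2).reduce(_ + _)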

==Therefore, replacing Hadoop MapReduce with a new generation of big data processing platforms is the direction technology is heading, and among these new platforms Spark is currently the most widely recognized and supported.==

7. Spark version

  • Spark 1.6.3: built against Scala 2.10.5
  • Spark 2.2.0: built against Scala 2.11.8 (recommended for new projects; see the build sketch below)
  • Hadoop 2.7.5
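
If the application is built with sbt (an assumption; these notes themselves only use spark-shell), the project's Scala version must match the Spark build. A minimal, hypothetical build.sbt sketch for the recommended combination:

    // The %% operator appends the Scala binary version (2.11) to the artifact name,
    // so scalaVersion must match the version Spark was built against.
    scalaVersion := "2.11.8"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"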

8. Installation of Spark standalone

  • Prepare the installation package spark-2.2.0-bin-hadoop2.7.tgz

    tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /opt/
    mv /opt/spark-2.2.0-bin-hadoop2.7/ /opt/spark
  • Modify conf/spark-env.sh

    export JAVA_HOME=/opt/jdk
    export SPARK_MASTER_IP=uplooking01
    export SPARK_MASTER_PORT=7077
    export SPARK_WORKER_CORES=4
    export SPARK_WORKER_INSTANCES=1
    export SPARK_WORKER_MEMORY=2g
    export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
  • Configure environment variables

    #Configure environment variables for spark
    export SPARK_HOME=/opt/spark
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
  • Launch standalone Spark

    start-all-spark.sh
  • Verify startup via the master's web UI

    http://uplooking01:8080

9. Installation of a Spark distributed cluster

  • Configure conf/spark-env.sh

    [uplooking01 /opt/spark/conf]
            export JAVA_HOME=/opt/jdk
            #Host of the master
            export SPARK_MASTER_IP=uplooking01
            #Port used to communicate with the master
            export SPARK_MASTER_PORT=7077
            #Number of CPU cores Spark may use on each worker
            export SPARK_WORKER_CORES=4
            #One worker instance per host
            export SPARK_WORKER_INSTANCES=1
            #Memory used by each worker: 2 GB
            export SPARK_WORKER_MEMORY=2g
            #Directory containing Hadoop's configuration files
            export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
  • Configure conf/slaves

    [uplooking01 /opt/spark/conf]
            uplooking03
            uplooking04
            uplooking05
  • Distribute Spark to the other nodes

    [uplooking01 /opt/spark/conf]
            scp -r /opt/spark  uplooking02:/opt/
            scp -r /opt/spark  uplooking03:/opt/
            scp -r /opt/spark  uplooking04:/opt/
            scp -r /opt/spark  uplooking05:/opt/
  • Distribute the environment variables configured on uplooking01

    [uplooking01 /]
            scp -r /etc/profile  uplooking02:/etc/
            scp -r /etc/profile  uplooking03:/etc/
            scp -r /etc/profile  uplooking04:/etc/
            scp -r /etc/profile  uplooking05:/etc/
  • Start Spark

    [uplooking01 /]
        start-all-spark.sh

10. Spark high availability cluster

First, stop the running Spark cluster.

  • Modify conf/spark-env.sh

    #Comment out the following two lines
    #export SPARK_MASTER_IP=uplooking01
    #export SPARK_MASTER_PORT=7077
  • Add the following line

    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=uplooking03:2181,uplooking04:2181,uplooking05:2181 -Dspark.deploy.zookeeper.dir=/spark"
  • Distribute the modified configuration

    scp /opt/spark/conf/spark-env.sh uplooking02:/opt/spark/conf
    scp /opt/spark/conf/spark-env.sh uplooking03:/opt/spark/conf
    scp /opt/spark/conf/spark-env.sh uplooking04:/opt/spark/conf
    scp /opt/spark/conf/spark-env.sh uplooking05:/opt/spark/conf
  • Start the cluster

    [uplooking01 /]
        start-all-spark.sh
    [uplooking02 /]
        start-master.sh
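
With ZooKeeper-based recovery enabled, an application should list every master in its master URL so the driver can reconnect after a failover. A minimal Scala sketch (the application name is hypothetical; the hosts follow this cluster's naming):

    import org.apache.spark.SparkConf

    // Comma-separate all masters; if the active master dies, the driver
    // re-registers with the newly elected one.
    val conf = new SparkConf()
      .setAppName("ha-demo")
      .setMaster("spark://uplooking01:7077,uplooking02:7077")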

11. The first spark-shell program

spark-shell --master spark://uplooking01:7077
#At startup, spark-shell can specify the resources (total cores and memory per executor) used by the application
spark-shell --master spark://uplooking01:7077 --total-executor-cores 6 --executor-memory 1g

#If not specified, the application by default uses all available cores on each worker and 1 GB of memory per executor
sc.textFile("hdfs://ns1/sparktest/").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_).collect
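
The same word count, broken into steps with comments (a sketch; the commented-out output path is hypothetical):

val lines  = sc.textFile("hdfs://ns1/sparktest/")     // read the input files from HDFS
val words  = lines.flatMap(_.split(","))              // split each line into words
val pairs  = words.map((_, 1))                        // pair each word with a count of 1
val counts = pairs.reduceByKey(_ + _)                 // sum the counts for each word (shuffle)
counts.collect().foreach(println)                     // bring the results back to the driver and print them
// counts.saveAsTextFile("hdfs://ns1/sparktest-out")  // hypothetical output path: write back to HDFS instead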

12. Roles in Spark

  • Master

    • Receives application submission requests
    • Schedules resources (starts CoarseGrainedExecutorBackend processes on the workers)
  • Worker

    • The executors running inside a worker execute the tasks
  • Spark-Submitter ===> Driver

    • Submits the Spark application to the master (see the sketch after this list)
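
A minimal sketch (hypothetical object and application name) of where each role appears when an application is submitted:

    import org.apache.spark.{SparkConf, SparkContext}

    object RolesDemo {
      def main(args: Array[String]): Unit = {
        // The JVM running this main() is the Driver (launched by spark-submit or spark-shell).
        val conf = new SparkConf()
          .setAppName("roles-demo")
          .setMaster("spark://uplooking01:7077")   // the Driver registers the application with the Master
        val sc = new SparkContext(conf)            // the Master then launches executors on the Workers

        // The tasks of this job run inside the executors (CoarseGrainedExecutorBackend) on the Workers;
        // only the summed result is returned to the Driver.
        println(sc.parallelize(1 to 100).sum())
        sc.stop()
      }
    }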

13. General process of Spark’s submission
