Spark installation

Time: 2020-10-24

Install Hadoop and Spark (Ubuntu 16.04)

Install JDK

  • Download the JDK (using jdk-8u91-linux-x64.tar.gz as an example)

  • Create a new folder

    sudo mkdir /usr/lib/jvm

  • Extract the downloaded JDK archive into the new folder

    sudo tar -xzvf jdk-8u91-linux-x64.tar.gz -C /usr/lib/jvm

  • Enter the JVM folder and rename the extracted folder

    cd /usr/lib/jvm
    sudo mv jdk1.8.0_91 jdk
  • Add environment variables

    sudo vim /etc/profile
    #Add the following configuration
    export JAVA_HOME=/usr/lib/jvm/jdk
    export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$CLASSPATH
    export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
  • Make the configuration take effect

    source /etc/profile

  • Test

    java -version
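
Optionally, you can also register the manually installed JDK with update-alternatives so that java resolves even in shells that have not sourced /etc/profile. A minimal sketch, assuming the paths used above (the priority value 100 is arbitrary):

    #Register the manually installed JDK with the alternatives system
    sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk/bin/java 100
    sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk/bin/javac 100
    #java -version should now report version 1.8.0_91 for this download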

Install Scala

  • The installation is similar to the JDK's

  • Download Scala (using scala-2.11.8.tgz as an example)

  • Extract the downloaded Scala archive

    sudo tar -xzvf scala-2.11.8.tgz -C /usr/local

  • Rename

    cd /usr/local
    sudo mv scala-2.11.8 scala
  • Add environment variables

    sudo vim /etc/profile
    #Add the following at the end
    export SCALA_HOME=/usr/local/scala
    export PATH=$SCALA_HOME/bin:$PATH
  • Make the configuration take effect

    source /etc/profile

  • Test

    scala -version
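
Beyond checking the version, you can confirm that the compiler and runtime work together by evaluating a one-line expression. A quick sanity check, assuming scala is on the PATH as configured above:

    #Evaluate a one-line expression without opening the REPL
    scala -e 'println("Scala is working: " + (1 + 1))'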

Install Hadoop

Spark uses HDFS as its persistence layer by default, so Hadoop should be installed (although Spark can also run without it).

Install

  • Install SSH

    sudo apt install openssh-server

  • Configure passwordless SSH login

    ssh-keygen -t rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  • Test passwordless SSH login

    ssh localhost   # if no password prompt appears, the configuration succeeded

  • Download Hadoop (using hadoop-2.7.2.tar.gz as an example)

  • Extract

    sudo tar -xzvf hadoop-2.7.2.tar.gz -C /usr/local

  • Rename

    cd /usr/local
    sudo mv hadoop-2.7.2 hadoop
  • Modify permissions (replace yourusername with your login name)

    cd /usr/local
    sudo chown -R yourusername:yourusername hadoop
  • Configure environment variables

    sudo vim /etc/profile
    #Add the following code at the end
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
  • Test

    hadoop version
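
Before moving on to the pseudo-distributed setup, you can verify the standalone installation with the example jar that ships with Hadoop. A minimal smoke test, assuming the 2.7.2 layout above; the grep example copies the config files into a local input directory and counts matches of a regex:

    cd /usr/local/hadoop
    mkdir input
    cp etc/hadoop/*.xml input
    #Run the bundled grep example on the copied config files
    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
    cat output/*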

Hadoop pseudo-distributed configuration

  • Modify the configuration file core-site.xml

    cd /usr/local/hadoop
    vim ./etc/hadoop/core-site.xml
    #Modify it as follows
    <configuration>
            <property>
                 <name>hadoop.tmp.dir</name>
                 <value>file:/usr/local/hadoop/tmp</value>
                 <description>Abase for other temporary directories.</description>
            </property>
            <property>
                 <name>fs.defaultFS</name>
                 <value>hdfs://localhost:9000</value>
            </property>
    </configuration>
  • Modify the configuration file hdfs-site.xml

    cd /usr/local/hadoop
    vim ./etc/hadoop/hdfs-site.xml
    #Modify it as follows
    <configuration>
            <property>
                 <name>dfs.replication</name>
                 <value>1</value>
            </property>
            <property>
                 <name>dfs.namenode.name.dir</name>
                 <value>file:/usr/local/hadoop/tmp/dfs/name</value>
            </property>
            <property>
                 <name>dfs.datanode.data.dir</name>
                 <value>file:/usr/local/hadoop/tmp/dfs/data</value>
            </property>
    </configuration>
  • Modify the configuration file hadoop-env.sh

    cd /usr/local/hadoop
    vim ./etc/hadoop/hadoop-env.sh
    #Change export JAVA_HOME=${JAVA_HOME} to:
    export JAVA_HOME=/usr/lib/jvm/jdk
  • Format the NameNode

    hdfs namenode -format

  • Run

    start-dfs.sh

  • Test

    jps

    You should see processes like the following (a quick HDFS smoke test follows this list):

    5939 Jps
    5636 DataNode
    5493 NameNode
    5814 SecondaryNameNode
  • View in a browser

    Enter the following address in the browser: localhost:50070
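
With the daemons up, a quick way to confirm that HDFS itself works is to create a home directory and copy a file into it. A short check, where yourusername is the same placeholder as in the chown step:

    #Create an HDFS home directory for your user and copy a config file into it
    hdfs dfs -mkdir -p /user/yourusername
    hdfs dfs -put /usr/local/hadoop/etc/hadoop/core-site.xml /user/yourusername/
    hdfs dfs -ls /user/yourusername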

Configure YARN

  • Modify the configuration file mapred-site.xml

    cd /usr/local/hadoop
    cp ./etc/hadoop/mapred-site.xml.template ./etc/hadoop/mapred-site.xml
    vim ./etc/hadoop/mapred-site.xml
    #Change to the following configuration
    <configuration>
            <property>
                 <name>mapreduce.framework.name</name>
                 <value>yarn</value>
            </property>
    </configuration>
  • Modify the configuration file yarn-site.xml

    cd /usr/local/hadoop
    vim ./etc/hadoop/yarn-site.xml
    #Change to the following configuration
    <configuration>
            <property>
                 <name>yarn.nodemanager.aux-services</name>
                 <value>mapreduce_shuffle</value>
            </property>
    </configuration>
  • Write a startup script

    #!/bin/bash
    #Start Hadoop
    start-dfs.sh
    #Start YARN
    start-yarn.sh
    #Start the history server so task history is visible in the web UI
    mr-jobhistory-daemon.sh start historyserver
  • Write a stop script

    #!/bin/bash
    #Stop the history server
    mr-jobhistory-daemon.sh stop historyserver
    #Stop YARN
    stop-yarn.sh
    #Stop Hadoop
    stop-dfs.sh
  • View the running status of tasks through the web interface (a sample job submission follows this list)

    Enter the address in the browser: localhost:8088
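
To see a job appear in the YARN web UI, submit the bundled pi estimator once HDFS and YARN are running. A hedged example, reusing the 2.7.2 example jar from the smoke test above:

    #Submit the pi estimator (2 map tasks, 10 samples each) to YARN
    hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 2 10
    #While it runs, the job should be visible at localhost:8088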

Install Spark

  • Download Spark (using spark-2.0.0-bin-hadoop2.7.tgz as an example)

  • Extract the downloaded Spark archive

    sudo tar -zxf spark-2.0.0-bin-hadoop2.7.tgz -C /usr/local

  • Rename

    cd /usr/local
    sudo mv spark-2.0.0-bin-hadoop2.7 spark
  • Add environment variables

    sudo vim /etc/profile
    #Add the following at the end
    export SPARK_HOME=/usr/local/spark
    export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
  • Modify permissions (again substituting your login name)

    cd /usr/local
    sudo chown -R yourusername:yourusername ./spark
  • Copy configuration file

    cd /usr/local/spark
    cp ./conf/spark-env.sh.template ./conf/spark-env.sh
  • Modify configuration file

    cd /usr/local/spark
    vim ./conf/spark-env.sh
    #Add the following line
    export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
    export JAVA_HOME=/usr/lib/jvm/jdk
  • Run a simple example

    /usr/local/spark/bin/run-example SparkPi 2>&1 | grep "Pi is roughly"

  • Start Spark (see the cluster connection sketch at the end of this section)

    /usr/local/spark/sbin/start-all.sh

  • Scripts

    Start Hadoop and Spark

    #!/bin/bash
    #Start Hadoop and YARN
    start-dfs.sh
    start-yarn.sh
    #Start the history server
    mr-jobhistory-daemon.sh start historyserver
    #Start Spark
    /usr/local/spark/sbin/start-all.sh

    Stop Hadoop and Spark

    #!/bin/bash
    #Stop Spark
    /usr/local/spark/sbin/stop-all.sh
    #Stop the history server
    mr-jobhistory-daemon.sh stop historyserver
    #Stop Hadoop and YARN
    stop-yarn.sh
    stop-dfs.sh
  • View in a browser

    Enter the address in the browser: localhost:8080
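
Once start-all.sh has brought up the standalone master and worker, you can attach a shell or submit the bundled SparkPi example to the cluster instead of running it locally. A sketch, assuming the master listens on the default port 7077 of this machine; the web UI at localhost:8080 shows the exact master URL, and the examples jar name below matches the 2.0.0 download:

    #Interactive shell against the standalone cluster
    spark-shell --master spark://localhost:7077
    #Or submit SparkPi as a batch job
    spark-submit --master spark://localhost:7077 \
      --class org.apache.spark.examples.SparkPi \
      /usr/local/spark/examples/jars/spark-examples_2.11-2.0.0.jar 10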