Keep up with the pace of big data: quickly build a Spark cluster

Time: 2021-04-21

Spark is a general-purpose parallel computing framework similar to Hadoop MapReduce, open-sourced by UC Berkeley's AMPLab. Spark's distributed computing is based on the MapReduce model, so it retains the advantages of Hadoop MapReduce while being better suited to iterative algorithms such as those used in data mining and machine learning. This article describes in detail how to deploy a Spark cluster on Shurenyun, a cloud platform that uses Mesos for cluster resource scheduling.

Spark supports three distributed deployment modes: standalone, Spark on YARN, and Spark on Mesos. Spark on Mesos is adopted by many companies and is also the mode recommended by the Spark project, since Spark has supported Mesos from the very beginning of its development; as a result, Spark currently runs more flexibly and naturally on Mesos than on YARN. Because Shurenyun schedules its cluster resources through Mesos, it has a natural advantage for deploying a Spark cluster.

Next, let's walk through deploying a Spark cluster on Shurenyun.

Step 1: Build the Docker image

First, we need to build a Spark Docker image in a Docker environment and push it to an accessible Docker registry.

1. Write the following configuration files

(1) mesos-site.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
        
      <property>
        <name>mesos.hdfs.namenode.cpus</name>
        <value>0.25</value>
      </property>
    
      <property>
        <name>mesos.hdfs.datanode.cpus</name>
        <value>0.25</value>
      </property>
        
      <property>
        <name>mesos.hdfs.journalnode.cpus</name>
        <value>0.25</value>
      </property>
        
      <property>
        <name>mesos.hdfs.executor.cpus</name>
        <value>0.1</value>
      </property>
        
      <property>
        <name>mesos.hdfs.data.dir</name>
        <description>The primary data directory in HDFS</description>
        <value>/var/lib/hdfs/data</value>
      </property>
    
      <property>
        <name>mesos.hdfs.framework.mnt.path</name>
        <value>/opt/mesosphere</value>
        <description>This is the default for all DCOS installs</description>
      </property>
    
      <property>
        <name>mesos.hdfs.state.zk</name>
        <value>master.mesos:2181</value>
        <description>See the Mesos DNS config file for explanation for this</description>
      </property>
    
      <property>
        <name>mesos.master.uri</name>
        <value>zk://master.mesos:2181/mesos</value>
        <description>See the Mesos DNS config file for explanation for this</description>
      </property>
    
      <property>
        <name>mesos.hdfs.zkfc.ha.zookeeper.quorum</name>
        <value>master.mesos:2181</value>
        <description>See the Mesos DNS config file for explanation for this</description>
      </property>
    
      <property>
        <name>mesos.hdfs.mesosdns</name>
          <value>true</value>
        <description>All DCOS installs come with mesos DNS to maintain static configurations</description>
      </property>
    
      <property>
        <name>mesos.hdfs.native-hadoop-binaries</name>
        <value>true</value>
        <description>DCOS comes with pre-distributed HDFS binaries in a single-tenant environment</description>
      </property>
    
      <property>
        <name>mesos.native.library</name>
        <value>/opt/mesosphere/lib/libmesos.so</value>
      </property>
      
      <property>
        <name>mesos.hdfs.ld-library-path</name>
        <value>/opt/mesosphere/lib</value>
      </property>
    </configuration>

(2) hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
      </property>
    
      <property>
        <name>dfs.nameservice.id</name>
        <value>hdfs</value>
      </property>
    
      <property>
        <name>dfs.nameservices</name>
        <value>hdfs</value>
      </property>
    
      <property>
        <name>dfs.ha.namenodes.hdfs</name>
        <value>nn1,nn2</value>
      </property>
    
      <property>
        <name>dfs.namenode.rpc-address.hdfs.nn1</name>
        <value>namenode1.hdfs.mesos:50071</value>
      </property>
    
      <property>
        <name>dfs.namenode.http-address.hdfs.nn1</name>
        <value>namenode1.hdfs.mesos:50070</value>
      </property>
    
      <property>
        <name>dfs.namenode.rpc-address.hdfs.nn2</name>
        <value>namenode2.hdfs.mesos:50071</value>
      </property>
    
      <property>
        <name>dfs.namenode.http-address.hdfs.nn2</name>
        <value>namenode2.hdfs.mesos:50070</value>
      </property>
    
      <property>
        <name>dfs.client.failover.proxy.provider.hdfs</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
    </configuration>

Note: when connecting to an existing HDFS, dfs.namenode.http-address.hdfs.nn1 needs to be configured with the address of the HDFS NameNode;

(3) core-site.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://hdfs</value>
      </property>
      <property>
        <name>hadoop.proxyuser.hue.hosts</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.hue.groups</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.httpfs.hosts</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.httpfs.groups</name>
        <value>*</value>
      </property>
    </configuration>

(4) spark-env.sh

    export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")
    export MASTER=mesos://zk://${ZOOKEEPER_ADDRESS}/mesos
    export SPARK_HOME=/opt/spark/dist
    export SPARK_LOCAL_IP=`ifconfig eth0 | awk '/inet addr/{print substr($2,6)}'`
    export SPARK_LOCAL_HOSTNAME=`ifconfig eth0 | awk '/inet addr/{print substr($2,6)}'`
    export LIBPROCESS_IP=`ifconfig eth0 | awk '/inet addr/{print substr($2,6)}'`

Note 1: Shurenyun installs ZooKeeper on the master nodes, so MASTER needs to be configured with the ZooKeeper address of your Shurenyun cluster;
Note 2: SPARK_LOCAL_IP, SPARK_LOCAL_HOSTNAME, and LIBPROCESS_IP all take the host IP as their value. If the host's network interface is not eth0, change the interface name here accordingly;

(5) spark-default.conf

    spark.mesos.coarse=true
    spark.mesos.executor.home /opt/spark/dist
    spark.mesos.executor.docker.image your.registry.site/spark:1.5.0-hadoop2.6.0

Here, spark.mesos.executor.docker.image needs to be set to the address of the Spark image in your registry; Shurenyun has already pushed this image to the test registry index.shurenyun.com.
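
For reference, the same Mesos-related settings can also be supplied programmatically when a driver program builds its SparkContext instead of relying on spark-default.conf. Below is only a minimal Scala sketch; the master URL and image name are placeholders that you would replace with your own values:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: programmatic equivalent of the spark-default.conf entries above.
    // The ZooKeeper address and image name below are placeholders, not fixed values.
    val conf = new SparkConf()
      .setAppName("spark-on-mesos-demo")
      .setMaster("mesos://zk://master.mesos:2181/mesos")
      .set("spark.mesos.coarse", "true")
      .set("spark.mesos.executor.home", "/opt/spark/dist")
      .set("spark.mesos.executor.docker.image", "your.registry.site/spark:1.5.0-hadoop2.6.0")
    val sc = new SparkContext(conf)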

2. Write the Dockerfile

    FROM mesosphere/mesos:0.23.0-1.0.ubuntu1404
    
    # Set environment variables.
    ENV DEBIAN_FRONTEND "noninteractive"
    ENV DEBCONF_NONINTERACTIVE_SEEN "true"
    
    # Upgrade package index and install basic commands.
    RUN apt-get update && \
        apt-get install -y openjdk-7-jdk curl
    
    ENV JAVA_HOME /usr/lib/jvm/java-7-openjdk-amd64
    
    ENV MESOS_NATIVE_JAVA_LIBRARY /usr/local/lib/libmesos.so
    
    ADD . /opt/spark/dist
    
    ADD hdfs-site.xml /etc/hadoop/hdfs-site.xml
    ADD core-site.xml /etc/hadoop/core-site.xml
    ADD mesos-site.xml /etc/hadoop/mesos-site.xml
    ADD spark-env.sh /opt/spark/dist/conf/spark-env.sh
    ADD spark-default.conf /opt/spark/dist/conf/spark-default.conf
    
    RUN ln -sf /usr/lib/libmesos.so /usr/lib/libmesos-0.23.1.so
    
    WORKDIR /opt/spark/dist

3. Build and push the Docker image:

    docker build -t your.registry.site/spark:1.5.0-hadoop2.6.0 .
    docker push your.registry.site/spark:1.5.0-hadoop2.6.0

Change your.registry.site to the address of your own image registry; Shurenyun has already pushed this image to the test registry index.shurenyun.com.

Step 2: Create the cluster

See "Create / delete cluster" to create your cluster.

Step 3: Deploy the application

The architecture for deploying Spark on Mesos is as follows:
[Figure: Spark on Mesos deployment architecture]

Here, the cluster manager role is played by Mesos, i.e. the masters of the Shurenyun cluster; the driver program, which distributes the Spark computing tasks, has to be started manually on a node inside the Shurenyun cluster's internal network, which can be a master, a slave, or any internal machine connected to the cluster; the worker nodes are the Mesos slaves, i.e. the slaves of the Shurenyun cluster.
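
To make the driver's role concrete, the following is a minimal sketch of what a standalone driver program could look like in Scala (the object name and job are purely illustrative; in this walkthrough we simply use the interactive spark-shell started inside the container instead):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative driver program: it negotiates executors with the cluster
    // manager (Mesos) and distributes the computation to the worker nodes.
    object SimpleDriver {
      def main(args: Array[String]): Unit = {
        // The mesos://zk://... master URL can be set here via setMaster, or be
        // passed with spark-submit --master from inside the Spark container.
        val conf = new SparkConf().setAppName("simple-driver")
        val sc = new SparkContext(conf)
        try {
          println(sc.parallelize(1 to 100).reduce(_ + _)) // expected: 5050
        } finally {
          sc.stop()
        }
      }
    }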

Log in to the host where the driver program will run and start the Spark container:

    docker run -it --net host -e ZOOKEEPER_ADDRESS=10.3.10.29:2181,10.3.10.63:2181,10.3.10.51:2181 index.shurenyun.com/spark:1.5.0-hadoop2.6.0 bash

Note 1: Spark needs sufficient resources to start; at least 1 CPU and 1 GB of memory are recommended;
Note 2: Spark nodes need to communicate with each other, so host networking is used to avoid the communication failures that port mapping could cause;
Note 3: change the value of ZOOKEEPER_ADDRESS to the master addresses of your own Shurenyun cluster; the port is 2181.

Step 4: Test

Start the Spark shell:

    bin/spark-shell

Run a demo job:

    sc.parallelize(1 to 1000).count()

If the job completes and returns a count of 1000, as shown in the following figure:

[Figure: spark-shell output of the demo job]
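
If you want a slightly larger smoke test than the one-liner above, the lines below can be pasted into the same spark-shell session; this is only a sketch exercising a few more RDD operations (the HDFS path in the comments is hypothetical):

    // Distribute the numbers 1..1000 across the executors and run a few actions.
    val rdd = sc.parallelize(1 to 1000)
    println(rdd.count())                     // expected: 1000
    println(rdd.map(_ * 2).reduce(_ + _))    // expected: 1001000 (= 2 * 500500)
    println(rdd.filter(_ % 2 == 0).count())  // expected: 500
    // If HDFS is connected (see hdfs-site.xml / core-site.xml above), paths can use
    // the "hdfs" nameservice, e.g. sc.textFile("hdfs://hdfs/tmp/sample.txt") --
    // the file name here is only a hypothetical example.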

Congratulations, your Spark cluster is now up and running! If you still find this way of using Spark inconvenient and would like something more intuitive, such as writing and testing Spark code in the browser, you can try Zeppelin to write and run Spark jobs. Shurenyun will share best practices for running Zeppelin on the platform in a later article, so stay tuned!
