Spark is a general-purpose parallel computing framework, open-sourced by UC Berkeley's AMPLab, that is similar to Hadoop MapReduce. Spark's distributed computing, based on the MapReduce model, keeps the advantages of Hadoop MapReduce while being better suited to iterative algorithms such as those used in data mining and machine learning. This article describes in detail how to deploy a Spark cluster on DataMan Cloud (Shurenyun), which uses Mesos for cluster resource scheduling.
Spark supports three distributed deployment modes: standalone, Spark on YARN, and Spark on Mesos. Spark on Mesos is the mode adopted by many companies and the one officially recommended by Spark, not least because Spark has supported Mesos since the very beginning of its development; as a result, Spark currently runs more flexibly and naturally on Mesos than on YARN. Since DataMan Cloud schedules its cluster resources through Mesos, it has a natural advantage for deploying a Spark cluster.
Next, let's walk through deploying a Spark cluster on DataMan Cloud.
Step 1: Build the image
First, we need to build a Docker image for Spark in a Docker environment and push it to an accessible Docker image registry.
1. Write the following configuration files
(1) mesos-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mesos.hdfs.namenode.cpus</name>
<value>0.25</value>
</property>
<property>
<name>mesos.hdfs.datanode.cpus</name>
<value>0.25</value>
</property>
<property>
<name>mesos.hdfs.journalnode.cpus</name>
<value>0.25</value>
</property>
<property>
<name>mesos.hdfs.executor.cpus</name>
<value>0.1</value>
</property>
<property>
<name>mesos.hdfs.data.dir</name>
<description>The primary data directory in HDFS</description>
<value>/var/lib/hdfs/data</value>
</property>
<property>
<name>mesos.hdfs.framework.mnt.path</name>
<value>/opt/mesosphere</value>
<description>This is the default for all DCOS installs</description>
</property>
<property>
<name>mesos.hdfs.state.zk</name>
<value>master.mesos:2181</value>
<description>See the Mesos DNS config file for explanation for this</description>
</property>
<property>
<name>mesos.master.uri</name>
<value>zk://master.mesos:2181/mesos</value>
<description>See the Mesos DNS config file for explanation for this</description>
</property>
<property>
<name>mesos.hdfs.zkfc.ha.zookeeper.quorum</name>
<value>master.mesos:2181</value>
<description>See the Mesos DNS config file for explanation for this</description>
</property>
<property>
<name>mesos.hdfs.mesosdns</name>
<value>true</value>
<description>All DCOS installs come with mesos DNS to maintain static configurations</description>
</property>
<property>
<name>mesos.hdfs.native-hadoop-binaries</name>
<value>true</value>
<description>DCOS comes with pre-distributed HDFS binaries in a single-tenant environment</description>
</property>
<property>
<name>mesos.native.library</name>
<value>/opt/mesosphere/lib/libmesos.so</value>
</property>
<property>
<name>mesos.hdfs.ld-library-path</name>
<value>/opt/mesosphere/lib</value>
</property>
</configuration>
(2) hdfs-site.xml
<configuration>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.nameservice.id</name>
<value>hdfs</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>hdfs</value>
</property>
<property>
<name>dfs.ha.namenodes.hdfs</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hdfs.nn1</name>
<value>namenode1.hdfs.mesos:50071</value>
</property>
<property>
<name>dfs.namenode.http-address.hdfs.nn1</name>
<value>namenode1.hdfs.mesos:50070</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hdfs.nn2</name>
<value>namenode2.hdfs.mesos:50071</value>
</property>
<property>
<name>dfs.namenode.http-address.hdfs.nn2</name>
<value>namenode2.hdfs.mesos:50070</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.hdfs</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
</configuration>
Note: when connecting to an existing HDFS, dfs.namenode.http-address.hdfs.nn1 needs to be configured as the address of your HDFS NameNode.
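For example, a minimal sketch assuming an existing NameNode reachable at namenode.example.com (a hypothetical host name; substitute your own) on the default HTTP port:
<property>
<name>dfs.namenode.http-address.hdfs.nn1</name>
<value>namenode.example.com:50070</value>
</property>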
(3) core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hdfs</value>
</property>
<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.groups</name>
<value>*</value>
</property>
</configuration>
(4) spark-env.sh
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")
export MASTER=mesos://zk://${ZOOKEEPER_ADDRESS}/mesos
export SPARK_HOME=/opt/spark/dist
export SPARK_LOCAL_IP=`ifconfig eth0 | awk '/inet addr/{print substr($2,6)}'`
export SPARK_LOCAL_HOSTNAME=`ifconfig eth0 | awk '/inet addr/{print substr($2,6)}'`
export LIBPROCESS_IP=`ifconfig eth0 | awk '/inet addr/{print substr($2,6)}'`
Note 1: DataMan Cloud installs ZooKeeper on the master nodes, so MASTER must be configured with the ZooKeeper address of your DataMan Cloud cluster (it is passed in through ZOOKEEPER_ADDRESS when the container is started).
Note 2: SPARK_LOCAL_IP, SPARK_LOCAL_HOSTNAME, and LIBPROCESS_IP all take the host IP as their value. If the host's network interface is not eth0, change the interface name here accordingly, as in the sketch below.
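For example, a minimal sketch assuming the host's interface is named ens3 (a hypothetical name; substitute your actual interface):
# Hypothetical: the host uses ens3 instead of eth0
export SPARK_LOCAL_IP=`ifconfig ens3 | awk '/inet addr/{print substr($2,6)}'`
export SPARK_LOCAL_HOSTNAME=`ifconfig ens3 | awk '/inet addr/{print substr($2,6)}'`
export LIBPROCESS_IP=`ifconfig ens3 | awk '/inet addr/{print substr($2,6)}'`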
(5) spark-default.conf
spark.mesos.coarse=true
spark.mesos.executor.home /opt/spark/dist
spark.mesos.executor.docker.image your.registry.site/spark:1.5.0-hadoop2.6.0
Here, spark.mesos.executor.docker.image needs to be configured with the address of the Spark image in your image registry; DataMan Cloud has already pushed the image to the test registry index.shurenyun.com.
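For example, to use the image already pushed to the test registry, the line would read:
spark.mesos.executor.docker.image index.shurenyun.com/spark:1.5.0-hadoop2.6.0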
2. Write the Dockerfile
FROM mesosphere/mesos:0.23.0-1.0.ubuntu1404
# Set environment variables.
ENV DEBIAN_FRONTEND "noninteractive"
ENV DEBCONF_NONINTERACTIVE_SEEN "true"
# Upgrade package index and install basic commands.
RUN apt-get update && \
apt-get install -y openjdk-7-jdk curl
ENV JAVA_HOME /usr/lib/jvm/java-7-openjdk-amd64
ENV MESOS_NATIVE_JAVA_LIBRARY /usr/local/lib/libmesos.so
ADD . /opt/spark/dist
ADD hdfs-site.xml /etc/hadoop/hdfs-site.xml
ADD core-site.xml /etc/hadoop/core-site.xml
ADD mesos-site.xml /etc/hadoop/mesos-site.xml
ADD spark-env.sh /opt/spark/dist/conf/spark-env.sh
ADD spark-default.conf /opt/spark/dist/conf/spark-default.conf
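# Expose libmesos under the versioned file name (libmesos-0.23.1.so) expected at runtime by linking it to the bundled library.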
RUN ln -sf /usr/lib/libmesos.so /usr/lib/libmesos-0.23.1.so
WORKDIR /opt/spark/dist
3. Build and push the Docker image:
docker build -t your.registry.site/spark:1.5.0-hadoop2.6.0 .
docker push your.registry.site/spark:1.5.0-hadoop2.6.0
Change your.registry.site to the address of your own image registry; DataMan Cloud has already pushed the image to the test registry index.shurenyun.com.
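A minimal sketch of the expected build context, assuming the Spark 1.5.0 binary distribution built for Hadoop 2.6.0 has been unpacked locally (the directory name below is an assumption): since the Dockerfile's ADD . /opt/spark/dist copies the whole context into the image, the Dockerfile and the five configuration files from step 1 should sit at the top of the unpacked Spark distribution.
cd spark-1.5.0-bin-hadoop2.6    # hypothetical path to the unpacked Spark distribution
ls
# Dockerfile  core-site.xml  hdfs-site.xml  mesos-site.xml  spark-default.conf  spark-env.sh
# bin/  conf/  lib/  sbin/  ...
docker build -t your.registry.site/spark:1.5.0-hadoop2.6.0 .
docker push your.registry.site/spark:1.5.0-hadoop2.6.0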
Step 2: Create the cluster
See "Create / Delete Cluster" to create your cluster.
Step 3: Deploy the application
The architecture for deploying Spark on Mesos is as follows:
Here, the cluster manager role is played by Mesos, i.e. the master of the DataMan Cloud cluster; the driver program, which distributes the Spark computing tasks, has to be started manually on a node inside the DataMan Cloud cluster's internal network, which can be a master, a slave, or any machine connected to the same internal network as the cluster; the worker nodes are the Mesos slaves, i.e. the slaves of the DataMan Cloud cluster.
Log in to the host where the driver program will run and start the Spark container:
docker run -it --net host -e ZOOKEEPER_ADDRESS=10.3.10.29:2181,10.3.10.63:2181,10.3.10.51:2181 index.shurenyun.com/spark:1.5.0-hadoop2.6.0 bash
Note 1: Spark needs sufficient resources to start; at least 1 CPU and 1 GB of memory are recommended.
Note 2: Spark nodes need to communicate with each other, so host networking is used to avoid the communication failures that port mapping could cause.
Note 3: change the value of ZOOKEEPER_ADDRESS to the master addresses of your DataMan Cloud cluster; the port is 2181.
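Besides the interactive shell used in the next step, jobs can also be submitted from inside this container with spark-submit. A minimal sketch, assuming the SparkPi example class and the examples jar that ships with the Spark 1.5.0 binary distribution (the jar file name may differ in your build); the Mesos master address uses the same ZOOKEEPER_ADDRESS passed to the container:
# Sketch: submit the bundled SparkPi example in coarse-grained Mesos mode (jar path is an assumption)
bin/spark-submit \
--master mesos://zk://${ZOOKEEPER_ADDRESS}/mesos \
--class org.apache.spark.examples.SparkPi \
lib/spark-examples-1.5.0-hadoop2.6.0.jar 100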
Step 4: Test
Start the Spark shell:
bin/spark-shell
Run a demo:
sc.parallelize(1 to 1000).count
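For a slightly fuller smoke test, a minimal sketch of a map and reduce over the same range (the expected result is 1001000, i.e. twice the sum of 1 through 1000):
val rdd = sc.parallelize(1 to 1000)  // distribute the numbers 1..1000 across the executors
rdd.map(_ * 2).reduce(_ + _)         // double each element and sum: 2 * 500500 = 1001000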
If the count completes successfully and returns 1000, as shown in the figure below:
Congratulations, your Spark cluster is now up and running! If you still find this way of using Spark inconvenient and would like something more intuitive, such as writing and testing Spark code in the browser, you can try Zeppelin to write and run Spark jobs. DataMan Cloud will follow up with a best-practice guide to running Zeppelin on DataMan Cloud; stay tuned!