Build a Spark cluster? I didn't expect you to be like this, k8s

Time: 2021-03-04

In this example, you will use Kubernetes and Docker to create a functional Apache Spark cluster.

You will use Spark standalone mode to deploy a Spark master service and a set of Spark workers.

Readers already familiar with this material can skip straight to the tl;dr section.

Source code

The Docker images are largely based on https://github.com/mattf/dock…. The source code is hosted at https://github.com/kubernetes…

Step 0: Preparation

This example assumes:

You have a Kubernetes cluster installed and running.
The kubectl command-line tool is installed somewhere in your path.
The spark-master Service, once running, is automatically discoverable via kube-dns under the 'spark-master' hostname.

More details can be found in the Dockerfiles in the source code.

Step 1: Create a namespace

$ kubectl create -f examples/spark/namespace-spark-cluster.yaml
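The manifest itself is tiny. Here is a sketch of what examples/spark/namespace-spark-cluster.yaml likely contains, based on the upstream Kubernetes example (note the name label, which is what appears in the listing below):

# Sketch of examples/spark/namespace-spark-cluster.yaml (upstream layout assumed)
apiVersion: v1
kind: Namespace
metadata:
  name: "spark-cluster"
  labels:
    name: "spark-cluster"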

Now list all the namespaces:

$ kubectl get namespaces
NAME          LABELS             STATUS
default       <none>             Active
spark-cluster name=spark-cluster Active

To make the kubectl client use the spark-cluster namespace, we define a context and switch to it:

$ kubectl config set-context spark --namespace=spark-cluster --cluster=${CLUSTER_NAME} --user=${USER_NAME}
$ kubectl config use-context spark

You can find the cluster name and user name in the Kubernetes configuration file ~/.kube/config.
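For illustration, after the two commands above the relevant fragment of ~/.kube/config might look like this (the cluster and user names here are placeholders):

# Illustrative ~/.kube/config fragment; my-cluster and my-user are placeholders
contexts:
- context:
    cluster: my-cluster
    namespace: spark-cluster
    user: my-user
  name: spark
current-context: spark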

Step 2: Start your master service

The Master service is the master of the Spark cluster.
Use the examples/spark/spark-master-controller.yaml file to create a replication controller that runs the Spark master service.

$ kubectl create -f examples/spark/spark-master-controller.yaml
replicationcontroller "spark-master-controller" created

Then, use the examples/spark/spark-master-service.yaml file to create a logical service endpoint that Spark workers can use to access the master pod:

$ kubectl create -f examples/spark/spark-master-service.yaml
service "spark-master" created

Then you can create a service for the Spark master WebUI:

$ kubectl create -f examples/spark/spark-webui.yaml
service "spark-webui" created

Check that the master is running and reachable

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          8m

Check the logs to see the status of the master. (Use the pod name from the output of the previous command.)

$ kubectl logs spark-master-controller-5u0q5
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.5.1-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-spark-master-controller-g0oao.out
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip spark-master --port 7077 --webui-port 8080
========================================
15/10/27 21:25:05 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/27 21:25:05 INFO SecurityManager: Changing view acls to: root
15/10/27 21:25:05 INFO SecurityManager: Changing modify acls to: root
15/10/27 21:25:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/10/27 21:25:06 INFO Slf4jLogger: Slf4jLogger started
15/10/27 21:25:06 INFO Remoting: Starting remoting
15/10/27 21:25:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
15/10/27 21:25:06 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
15/10/27 21:25:07 INFO Master: Starting Spark master at spark://spark-master:7077
15/10/27 21:25:07 INFO Master: Running Spark version 1.5.1
15/10/27 21:25:07 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/10/27 21:25:07 INFO MasterWebUI: Started MasterWebUI at http://spark-master:8080
15/10/27 21:25:07 INFO Utils: Successfully started service on port 6066.
15/10/27 21:25:07 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE

After confirming that the master is running normally, you can use the Kubernetes cluster proxy to access the Spark WebUI:

kubectl proxy --port=8001

At this point, you can access the UI at http://localhost:8001/api/v1/….

Step 3: Start the Spark workers

Spark workers play a very important role in a Spark cluster: they provide execution resources and data caching for your programs.

The Spark workers require the master service to be running.

Use the examples/spark/spark-worker-controller.yaml file to create a replication controller that manages the worker pods.

$ kubectl create -f examples/spark/spark-worker-controller.yaml
replicationcontroller "spark-worker-controller" created

Check that the workers are running

If you launched the Spark WebUI, the workers should appear in the UI as they become ready. (This may take some time, since the images have to be pulled and the pods started.) You can also query their status as follows:

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          25m
spark-worker-controller-e8otp   1/1       Running   0          6m
spark-worker-controller-fiivl   1/1       Running   0          6m
spark-worker-controller-ytc7o   1/1       Running   0          6m
$ kubectl logs spark-master-controller-5u0q5
[...]
15/10/26 18:20:14 INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.2.7:46195 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.3.8:39926 with 2 cores, 6.3 GB RAM

If the kubectl proxy from the previous section is still running, you should also see the workers in the UI. Note: the UI contains hyperlinks to the worker WebUIs, but these links do not work properly (they try to connect to cluster IPs, which Kubernetes does not automatically proxy).

Step 4: Start the Zeppelin UI to launch jobs on the Spark cluster

The Zeppelin UI pod can be used to launch jobs into the Spark cluster, either through a web notebook or through the traditional Spark command line. See the Zeppelin and Spark architecture documentation for more details.

$ kubectl create -f examples/spark/zeppelin-controller.yaml
replicationcontroller "zeppelin-controller" created

Zeppelin needs the master service to be running.

Check that Zeppelin is running

$ kubectl get pods -l component=zeppelin
NAME                        READY     STATUS    RESTARTS   AGE
zeppelin-controller-ja09s   1/1       Running   0          53s

Step 5: Do something with the cluster

Now you have two choices: you can access the Spark cluster through the graphical interface, or you can stay on the CLI.

Quick use of pyspark

Use kubectl exec to connect to the Zeppelin driver and run a pipeline.

$ kubectl exec zeppelin-controller-ja09s -it pyspark
Python 2.7.9 (default, Mar  1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/
Using Python version 2.7.9 (default, Mar  1 2015 12:57:24)
SparkContext available as sc, HiveContext available as sqlContext.
>>> sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()
939193

Congratulations, you have counted the number of words in all Shakespeare’s plays.

Dazzle your eyes with the GUI!

Using the Zeppelin pod created earlier, set up port forwarding for the WebUI:

$ kubectl port-forward zeppelin-controller-ja09s 8080:8080

This command forwards requests hitting port 8080 on localhost to port 8080 in the container. Then you can visit Zeppelin at https://localhost:8080/.

Create a new notebook. Enter:

%pyspark
print sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()

Conclusion

You have now created services and replication controllers for the Spark master, the Spark workers, and the Spark driver. You can take this example to the next step and start using the Apache Spark cluster you just created; see the Spark documentation for more information.

tl;dr

kubectl create -f examples/spark

After setup:

kubectl get pods # Make sure everything is running
kubectl proxy --port=8001 # Start an application proxy, if you want to see the Spark Master WebUI
kubectl get pods -lcomponent=zeppelin # Get the driver pod to interact with.

At this point, the master UI can be visited at http://localhost:8001/api/v1/….

You can interact with the Spark cluster using the traditional spark-shell / spark-submit / pyspark command lines via kubectl exec, or, if you want to interact with Zeppelin:

kubectl port-forward zeppelin-controller-abc123 8080:8080 &

Then visit http://localhost:8080/.

Known issues with Spark

This setup provides a Spark configuration restricted to the cluster network, meaning the Spark master is only reachable as a cluster Service. If you need to submit jobs with an external client other than Zeppelin or spark-submit run inside the Zeppelin pod, you will need to provide a way for those clients to reach the service defined in examples/spark/spark-master-service.yaml. See the Services documentation for more information.
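One possible (purely illustrative, not part of this example) way to do that is an additional Service of type LoadBalancer selecting the same master pods; be aware this exposes an unauthenticated Spark master:

# Hypothetical external Service for the Spark master; not part of the example
kind: Service
apiVersion: v1
metadata:
  name: spark-master-external
spec:
  type: LoadBalancer
  ports:
    - port: 7077
      targetPort: 7077
  selector:
    component: spark-master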

Known issues with Zeppelin

● The Zeppelin pod is very large, so pulling its image may take a while, depending on your network conditions. The size of the Zeppelin pod is something we are trying to reduce; see issue #17231 for details.
● The first time a pipeline is run in Zeppelin, it may take a long time (about a minute); Zeppelin seems to need quite a while to load.
● On GKE, kubectl port-forward may not stay stable over long periods. If you see Zeppelin become disconnected, the port-forward has probably failed and needs to be restarted. See #12179 for details.

This article was translated by Speed Cloud. If you reprint it, please credit Speed Cloud.
Link to the original text:
https://github.com/kubernetes…