Integrating Argo Workflow and Spark on Kubernetes

Time: 2020-10-26


In my first article, I talked about Argo CD. This one covers Argo Workflow, another project from the Argo family, Spark on Kubernetes, and how we can make the two work together.

Argo Workflow

Argo Workflow is a cloud-native workflow engine in which we can orchestrate jobs as sequences of tasks, with each step in the workflow running as a container. In the workflow definition we can use a DAG to capture the dependencies between tasks, as sketched below. A replacement for Airflow? Maybe! If you are looking for a Kubernetes-native product, I am sure Argo Workflow won't let you down.
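To make the DAG idea concrete, here is a minimal sketch of a workflow in which task C starts only after tasks A and B have finished. The template and task names are illustrative, not from the original article:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-example-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: A
        template: echo
        arguments: {parameters: [{name: message, value: A}]}
      - name: B
        template: echo
        arguments: {parameters: [{name: message, value: B}]}
      - name: C
        # C depends on both A and B
        dependencies: [A, B]
        template: echo
        arguments: {parameters: [{name: message, value: C}]}
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.12
      command: [echo, "{{inputs.parameters.message}}"]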

Deploy Argo Workflow to k8s (create a namespace for Argo Workflow)

1. Install Helm:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

2. Add the Argo repo to Helm and install it:

helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
helm install argo-wf argo/argo -n argo-wf -f values.yaml
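
The install above assumes that the argo-wf namespace already exists. If it does not, create it first, or (with Helm 3.2+) let Helm create it for you:

kubectl create namespace argo-wf
# or, equivalently:
helm install argo-wf argo/argo -n argo-wf --create-namespace -f values.yaml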

In values.yaml, you can enable ingress (if an ingress controller is available in the cluster):

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: nginx
    kubernetes.io/tls-acme: "true"
  hosts:
   - argo.example.com

Argo Workflow is now deployed and running!
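
To check that everything came up, and to reach the UI without an ingress, something like the following should work. The release name and namespace match the install above; the exact service name is an assumption and may differ between chart versions:

kubectl get pods -n argo-wf
# Port-forward the Argo server UI locally (service name may vary by chart version)
kubectl port-forward svc/argo-wf-server 2746:2746 -n argo-wf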


Workflow automation in Argo is driven by YAML templates. Argo provides a wealth of documentation and related examples. If you are looking for automation, we can even submit workflows through the REST API, as sketched below. I won't go into too much detail here because the documentation is detailed and well explained. Let's move on to the next topic.
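For example, assuming the DAG sketch from earlier is saved as dag-example.yaml, it could be submitted either with the argo CLI or as a plain Kubernetes custom resource (the REST API exposed by the Argo server offers the same operation):

# Using the argo CLI
argo submit -n argo-wf dag-example.yaml --watch

# Or directly as a Kubernetes custom resource
kubectl create -f dag-example.yaml -n argo-wf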

Spark on Kubernetes

Starting with Spark 2.3, you can use Kubernetes to run and manage Spark resources. Spark can run on a cluster managed by Kubernetes, using the native Kubernetes scheduler support that has been added to Spark. We can run the Spark driver and executor pods on demand, which means there is no dedicated Spark cluster.
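
One prerequisite worth calling out: in cluster mode the driver pod creates the executor pods itself, so it needs a service account with enough permissions. A minimal sketch, assuming a spark-apps namespace and a spark service account (both names are my choice, picked to match the examples below):

kubectl create namespace spark-apps
kubectl create serviceaccount spark -n spark-apps
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-apps:spark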

There are two ways to run Spark on Kubernetes: using spark-submit and using the Spark operator.

With the spark-submit CLI, you can submit Spark jobs using the various configuration options supported by Kubernetes.

  • spark-submit
    spark-submit delegates job submission to a Spark driver pod on Kubernetes, which then creates the relevant Kubernetes resources by communicating with the Kubernetes API server.


spark-submit is the easiest way to run Spark on Kubernetes. Looking at the snippet below, you will notice two small changes compared with a classic submission: one is the Kubernetes cluster endpoint used as the master URL, the other is the container image that hosts your Spark application.

./bin/spark-submit \
  --master k8s://https://<KUBERNETES_CLUSTER_ENDPOINT> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=aws/spark:2.4.5-SNAPSHOT \
  --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5-SNAPSHOT.jar

Please refer to this link to get all available parameters and properties.

  • Spark operator

The Spark operator provides a native Kubernetes experience for Spark workloads. In addition, you can submit Spark jobs using kubectl and sparkctl; examples can be found here, and a minimal manifest is sketched below.
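
To give a flavour of what the operator consumes, here is a minimal SparkApplication sketch, assuming the operator is installed and reusing the spark-apps namespace and spark service account from above (image, version, and paths are illustrative):

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-apps
spec:
  type: Scala
  mode: cluster
  image: sparkimage
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-2.4.2.jar
  sparkVersion: "2.4.2"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: 512m

Saved as spark-pi.yaml, it can then be submitted with kubectl apply -f spark-pi.yaml or sparkctl create spark-pi.yaml.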


Now, how do we submit Spark jobs from an Argo workflow?

Spark-submit:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: wf-spark-pi
  namespace: argo
spec:
  arguments: {}
  entrypoint: sparkapp
  templates:
  - arguments: {}
    container:
      args:
      - /opt/spark/bin/spark-submit
      - --master
      - k8s://https://kubernetes.default.svc
      - --deploy-mode
      - cluster
      - --conf
      - spark.kubernetes.namespace=spark-apps
      - --conf
      - spark.kubernetes.container.image=sparkimage
      - --conf
      - spark.executor.instances=2
      - --class
      - org.apache.spark.examples.SparkPi
      - local:///opt/spark/examples/jars/spark-examples_2.12-2.4.2.jar
      command:
      - sh
      image: sparkimage
      imagePullPolicy: Always
      name: ""
      resources: {}
    inputs: {}
    metadata: {}
    name: sparkapp
    outputs: {}
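
Assuming the manifest above is saved as wf-spark-pi.yaml, it can be submitted like any other workflow; note that the pod running spark-submit (and hence its service account) must be allowed to create the driver pod in the spark-apps namespace:

argo submit -n argo wf-spark-pi.yaml --watch
# Watch the Spark driver and executors come up in the target namespace
kubectl get pods -n spark-apps -w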

Spark operator:

We can use the Argo Workflow resource template to achieve this, as in the sketch below.
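
A minimal sketch of such a workflow, wrapping a SparkApplication like the one shown earlier in a resource template; the success and failure conditions are based on the operator's status fields, and the names and image are assumptions:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: wf-spark-operator-pi
  namespace: argo
spec:
  entrypoint: sparkapp
  templates:
  - name: sparkapp
    resource:
      action: create
      # Mark this step succeeded/failed based on the SparkApplication status
      successCondition: status.applicationState.state == COMPLETED
      failureCondition: status.applicationState.state == FAILED
      manifest: |
        apiVersion: sparkoperator.k8s.io/v1beta2
        kind: SparkApplication
        metadata:
          generateName: spark-pi-
          namespace: spark-apps
        spec:
          type: Scala
          mode: cluster
          image: sparkimage
          mainClass: org.apache.spark.examples.SparkPi
          mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-2.4.2.jar
          sparkVersion: "2.4.2"
          driver:
            cores: 1
            memory: 512m
            serviceAccount: spark
          executor:
            cores: 1
            instances: 2
            memory: 512m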

For a good way to build a Spark image, see this reference; one common approach is sketched below.
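
One option is the docker-image-tool.sh script shipped with the Spark distribution; the registry and tag below are placeholders:

# From the root of an unpacked Spark distribution
./bin/docker-image-tool.sh -r <your-registry> -t 2.4.5 build
./bin/docker-image-tool.sh -r <your-registry> -t 2.4.5 push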


There is relatively little information out there on integrating Argo Workflow and Spark, which is why I wrote this article; I hope it helps you.

PS: This article is a translation; see the original text.
