Four integration steps for migrating to the Spark Operator and S3



Guest article by Allison Richardet, Lead Software Development Engineer at MasterCard

At MasterCard, the internal cloud team maintains our Kubernetes platform. Our work includes maintaining the Kubernetes clusters that core deployments rely on, and providing tenants with logging, monitoring, and other services, along with a good experience.

One of our tenants, the data warehouse team, used native Apache Spark on YARN with HDFS. They came to our team wanting to move their big data workloads to Kubernetes and go cloud native, which gave us the opportunity to work with Apache Spark on Kubernetes.

So our journey began with the Spark Operator. Migrating to Kubernetes and operators would open cloud native possibilities for the data warehouse team, our internal customer. We had the opportunity to help them take advantage of scalability and cost improvements, and switching to S3 would further those goals.


What is an operator, and why would we, or you, be interested in one? First, an operator extends the Kubernetes API with custom resources. An operator also defines a custom controller that watches its resource types. Combining custom resources with custom controllers produces a declarative API, in which the operator reconciles the difference between the declared state and the actual state of the cluster. In other words, the operator handles the automation related to its resources.

With these benefits, our team was happy to be able to use the Kubernetes operator for Apache Spark to support our tenants. Typically, native Apache Spark uses HDFS. However, moving to the cloud and running Spark on Kubernetes, S3 is a good alternative to HDFS because of its cost advantages and the ability to scale as needed. Interestingly, S3 cannot be used with the Spark Operator out of the box. We referred to the Spark Operator and Hadoop-AWS integration documentation. In addition, we will share the details of the following four steps: image updates, SparkApplication configuration, S3 credentials, and S3 styles. Follow our steps to integrate S3 with your Spark jobs and the Kubernetes operator for Spark.


Like most of the applications we deploy to our Kubernetes clusters, we use a Helm chart. The Helm chart for the Kubernetes operator for Apache Spark can be found here.

Values & Helm Template

We update values.yaml and then run `helm template` to generate the manifests that we deploy to the Kubernetes cluster. We found that having visibility into and control over what will be created is worth the extra step; the templates are stored in git, and our CD tool takes care of the deployment.
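As a sketch (the chart repository URL and release name here are assumptions for illustration), that workflow looks something like:

```shell
# Add the chart repo, then render manifests locally instead of installing
# directly; the rendered output is committed to git and applied by CD.
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm repo update

helm template spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --values values.yaml > manifests.yaml
```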

The default chart values will let you get up and running quickly. Depending on your needs, here are some changes you may want to make:

  • Enable webhook: by default, the mutating admission webhook is not enabled. Enabling it allows customization of the SparkApplication driver and executor pods, including mounting volumes, ConfigMaps, affinity/anti-affinity, and more.
  • Define ingressUrlFormat: optional Ingress for the Spark UI.

See the quick start guide and the default values.yaml for more details and options.
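As a sketch, the overrides described above might look like this in values.yaml (key names can vary slightly between chart versions, and the ingress hostname is a hypothetical example — check the chart's default values.yaml):

```yaml
# values.yaml overrides for the Spark Operator Helm chart.
webhook:
  # Enable the mutating admission webhook so SparkApplication driver and
  # executor pods can be customized (volumes, ConfigMaps, affinity, etc.).
  enable: true

# Optional ingress for the Spark UI; {{$appName}} is substituted per job.
ingressUrlFormat: "{{$appName}}.spark.example.com"
```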


To run a SparkApplication with S3, additional configuration is required, including a custom Docker image. The Hadoop S3A connector is the tool that makes it possible to read from and write to S3.

1. Image update

The Docker image used by your SparkApplication needs two additional jars: hadoop-aws and either aws-java-sdk or aws-java-sdk-bundle. The versions depend on the Spark version and Hadoop profile.

There are a few things to keep in mind in this step:

  • Users and permissions
  • Extra jars

If you use a Spark image as a starting point, refer to its Dockerfile when adding the jars, in order to align users and jar locations correctly.

Let's take a look at the Python Dockerfile. Before any installation tasks are performed, the user is set to root, and afterwards it is reset to ${spark_uid}.

Looking through the base image, you can see that the jars are located in /opt/spark/jars or $SPARK_HOME/jars. Finally, update the permissions on the jars so that they can be used.

Uploading to S3, however, requires a newer version of Hadoop that includes the fs.s3a.path.style.access configuration – we'll discuss this in a later section. At the time of writing, we used Spark Operator version v1beta2-1.2.0-3.0.0, which includes base Spark version 3.0.0. Using the spark-operator/spark-py:v3.0.0-hadoop3 image as a starting point, we added the following jars: hadoop-aws-3.1.0.jar and aws-java-sdk-bundle-1.11.271.jar. It took some experimentation to determine the right combination that would ultimately work.
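A minimal sketch of such an image update, under the assumptions above (the registry prefix and Maven download URLs are our own choices, not prescribed by the operator):

```dockerfile
# Sketch: start from the operator's base Spark image (registry/tag are
# assumptions -- use whatever base image matches your operator version).
FROM gcr.io/spark-operator/spark-py:v3.0.0-hadoop3

# Upstream Spark images run as a non-root user; installation needs root.
ARG spark_uid=185
USER root

# Add the Hadoop AWS connector and the AWS SDK bundle to Spark's jar
# directory (versions per the text above), then fix their permissions.
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.0/hadoop-aws-3.1.0.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar /opt/spark/jars/
RUN chmod 644 /opt/spark/jars/hadoop-aws-3.1.0.jar \
              /opt/spark/jars/aws-java-sdk-bundle-1.11.271.jar

# Reset to the Spark user, as the upstream Dockerfiles do.
USER ${spark_uid}
```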

2. Configuration of sparkapplication

A SparkApplication requires additional configuration to communicate with S3. The minimum configuration required in spec.sparkConf is as follows:

    spark.hadoop.fs.s3a.endpoint: <endpoint>
    spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem

You must also provide credentials to access S3. There are configuration options for this similar to the above; however, they are plain string values and therefore contrary to security best practices.

3. S3 credentials

Instead of providing S3 credentials in the SparkApplication's sparkConf, we create a Kubernetes secret and define environment variables for the driver and executors. The Spark Operator documentation provides several options for using a secret, as well as complete examples for mounting secrets and specifying environment variables.
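For example (the secret and key names here are hypothetical), a secret can be created and its values exposed to the driver and executors as the environment variables the AWS SDK looks for:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials        # hypothetical secret name
type: Opaque
stringData:
  access-key: <access key>
  secret-key: <secret key>
---
# In the SparkApplication spec, reference the secret on both the driver
# and the executors via envSecretKeyRefs.
spec:
  driver:
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: s3-credentials
        key: access-key
      AWS_SECRET_ACCESS_KEY:
        name: s3-credentials
        key: secret-key
  executor:
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: s3-credentials
        key: access-key
      AWS_SECRET_ACCESS_KEY:
        name: s3-credentials
        key: secret-key
```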

Next, because we use environment variables to authenticate to S3, we set the following option in sparkConf:

    spark.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.EnvironmentVariableCredentialsProvider

This is not required; if it is not provided, the credential provider classes will be tried in the following order:

  1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
  2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider
  3. com.amazonaws.auth.InstanceProfileCredentialsProvider

4. S3 styles

In the SparkApplication's sparkConf, there are a few other options to keep in mind. These options depend on your particular S3:

    spark.hadoop.fs.s3a.path.style.access: "true"
    spark.hadoop.fs.s3a.connection.ssl.enabled: "true"
    spark.driver.extraJavaOptions: <additional Java options>

Path-style access – enabling path-style access disables virtual hosting (which is enabled by default). Enabling path-style access removes the need to set up DNS for the default virtual hosting.

Enable SSL – if you are using TLS/SSL, make sure this option is enabled in the SparkApplication's sparkConf.

Additional Java options – depending on your needs.
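Putting steps 2–4 together, a SparkApplication's sparkConf might look like the sketch below (the endpoint and Java options are placeholders for your environment):

```yaml
spec:
  sparkConf:
    # Step 2: minimal S3A configuration
    spark.hadoop.fs.s3a.endpoint: <endpoint>
    spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    # Step 3: read credentials from environment variables
    spark.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.EnvironmentVariableCredentialsProvider
    # Step 4: style options for our particular S3
    spark.hadoop.fs.s3a.path.style.access: "true"
    spark.hadoop.fs.s3a.connection.ssl.enabled: "true"
```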

Using S3

Now that you have everything set up to use S3, you have two options: use S3 for handling dependencies, or upload to S3.

Handling dependencies with S3

The mainApplicationFile and additional dependencies used by the Spark job (including files or jars) can also be stored in and retrieved from S3. They can be defined in the spec.deps field of the SparkApplication along with other dependencies. spark-submit will use the jars and files specified in spec.deps.jars and spec.deps.files, respectively. The format for accessing a dependency in S3 is s3a://bucket/path/to/file.
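A sketch of a SparkApplication pulling its application file and dependencies from S3 (the bucket name and paths are hypothetical):

```yaml
spec:
  mainApplicationFile: s3a://my-bucket/jobs/my-job.py
  deps:
    jars:
      - s3a://my-bucket/jars/extra-library.jar
    files:
      - s3a://my-bucket/config/job.properties
```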

Upload to S3

When uploading to S3, the file location format is s3a://bucket/path/to/destination. The bucket must exist, or the upload fails. The upload also fails if the destination file already exists.


We covered the four steps needed to get the Spark Operator and S3 up and running: image updates, the required options in the SparkApplication's sparkConf, S3 credentials, and additional options depending on your particular S3. Finally, we gave some tips on using S3 for dependencies and uploading to S3.

In the end, we helped our internal customer, the data warehouse team, move their big data workloads from native Apache Spark to Kubernetes. The Spark Operator on Kubernetes has great advantages in the cloud, and we wanted to share our experience with the larger community. We hope this walkthrough of the Spark Operator and S3 integration helps you and/or your team get up and running with the Spark Operator and S3.


CNCF (Cloud Native Computing Foundation) was founded in December 2015 and is a non-profit organization under the Linux Foundation. CNCF is committed to cultivating and maintaining a vendor-neutral open source ecosystem to promote cloud native technology, and to democratizing state-of-the-art patterns so that these innovations are available to everyone.