You can enable GPUs in a Kubernetes cluster in 3 simple steps, without entering commands manually!

Time: 2021-09-25

As major enterprises around the world adopt Kubernetes at scale, we see Kubernetes entering a new stage of development. On the one hand, Kubernetes is being adopted by edge workloads, providing value beyond the data center. On the other hand, Kubernetes is driving the performance of machine learning (ML) and high-quality, high-speed data analytics.

Much of what we know today about applying Kubernetes to machine learning stems from a feature introduced in Kubernetes 1.10, when graphics processing units (GPUs) became a schedulable resource; this feature is now in beta. Individually, each of these is an exciting development in Kubernetes. What’s even more exciting is that Kubernetes makes it possible to adopt GPUs in both the data center and at the edge. In the data center, GPUs are how you train ML models; those trained models are then migrated to edge Kubernetes clusters, where they serve as ML inference engines that deliver data analysis as close to where the data is collected as possible.

Earlier, Kubernetes provided distributed applications with pools of CPU and RAM resources. If we can have CPU and RAM pools, why not a GPU pool? Of course we can, but not all servers have GPUs. So how do we equip our Kubernetes servers with GPUs?

In this article, I’ll explain a simple way to start using GPUs in a Kubernetes cluster. In future articles, we will also push GPUs to the edge and show you how to do that. To really simplify the steps, I will use the Rancher UI to walk through the process of enabling GPUs. The Rancher UI is just a client of Rancher’s RESTful APIs; you can use other API clients in GitOps, DevOps, and other automation solutions, such as Golang, Python, and Terraform. However, we will not explore those in depth in this article.
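
As a small illustration, here is a minimal Python sketch of such an API client, using the requests library to list the clusters Rancher manages over its v3 REST API (the same API the Rancher UI talks to). The server URL and token are placeholders you would substitute with your own:

```python
import requests

RANCHER_URL = "https://rancher.example.com"  # placeholder for your Rancher server
API_TOKEN = "token-xxxxx:yyyyyyyy"           # placeholder for a Rancher API token

# Ask Rancher's v3 REST API for the clusters it manages.
resp = requests.get(
    f"{RANCHER_URL}/v3/clusters",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
resp.raise_for_status()

# Each cluster record carries its id, display name, and current state.
for cluster in resp.json()["data"]:
    print(cluster["id"], cluster["name"], cluster["state"])
```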

In essence, the steps are very simple:

  • Build the infrastructure for the Kubernetes cluster
  • Install Kubernetes
  • Install the GPU Operator from Helm

Getting up and running with Rancher and available GPU resources

Rancher is a multi-cluster management solution and the glue that holds the above steps together. You can find a pure NVIDIA solution for simplifying GPU management in NVIDIA’s blog, along with some important information about how the GPU Operator differs from building the GPU driver stack without the operator:

https://developer.nvidia.com/blog/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes/

Preparation

The following is the bill of materials (BOM) required to get GPUs up and running in Rancher:

  1. Rancher
  2. GPU Operator (https://nvidia.github.io/gpu-operator)
  3. Infrastructure – we will use GPU nodes on AWS

The official documentation has a dedicated chapter on installing Rancher with high availability, so we will assume that you already have Rancher installed:

https://docs.rancher.cn/docs/rancher2/installation/k8s-install/_index/

Process steps

Installing a Kubernetes cluster with GPUs

After installing Rancher, we will first build and configure a Kubernetes cluster (you can use any cluster with NVIDIA GPUs).

From the global context, we select Add Cluster.

In the “hosts from cloud service providers” section, select Amazon EC2.

We do this through node drivers – a set of preconfigured infrastructure templates, some of which have GPU resources.

Note that there are three node pools: one for the master, one for standard worker nodes, and one for workers with GPUs. The GPU template is based on the p3.2xlarge machine type and uses the Ubuntu 18.04 Amazon Machine Image, or AMI (ami-0ac80df6eff0e70b5). Of course, these choices will vary according to each infrastructure provider’s offerings and the needs of the enterprise. In addition, we leave the Kubernetes options in the “Add Cluster” form at their default values.

Setting up the GPU Operator

Now we will set up a catalog in Rancher using the GPU Operator repository (https://nvidia.github.io/gpu-operator). (There are other solutions for exposing GPUs, including using the Linux for Tegra [L4T] Linux distribution or the device plugin.) At the time of writing, the GPU Operator had been tested and validated with the NVIDIA Tesla driver 440.

Using the Rancher global context menu, we select the cluster we want to install into:

Then we use the Tools menu to view the catalog list.

Click the Add Catalog button, give it a name, and add the URL: https://nvidia.github.io/gpu-operator

We choose Helm v3 and cluster scope, then click Create to add the catalog to Rancher. When automating, we can make this step part of the cluster build. Depending on enterprise policy, we can add this catalog to every cluster, even those without a GPU node or node pool yet. This step gives us access to the GPU Operator chart, which we will install next.
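
As a rough sketch of what that automation might look like, the following Python snippet creates the same cluster-scoped catalog through Rancher’s v3 API. The /v3/clustercatalogs endpoint and the field names (helmVersion in particular) are best-effort assumptions based on Rancher 2.x and may differ in your version:

```python
import requests

RANCHER_URL = "https://rancher.example.com"  # placeholder
API_TOKEN = "token-xxxxx:yyyyyyyy"           # placeholder
CLUSTER_ID = "c-xxxxx"                       # placeholder for the target cluster

# Assumed Rancher 2.x API: POST /v3/clustercatalogs adds a cluster-scoped
# Helm catalog. Field names below are best-effort assumptions.
resp = requests.post(
    f"{RANCHER_URL}/v3/clustercatalogs",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "name": "nvidia-gpu-operator",
        "url": "https://nvidia.github.io/gpu-operator",
        "clusterId": CLUSTER_ID,
        "helmVersion": "helm_v3",  # assumed value for the Helm v3 option
    },
)
resp.raise_for_status()
print("Created catalog:", resp.json().get("id"))
```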

Now we use the Rancher context menu in the upper left corner to enter the cluster’s System project, which is where we will add the GPU Operator functionality.

In the System project, select Apps:

Then click the Launch button at the top right.

We can search for “NVIDIA” or scroll down to the catalog we just created.

Click the GPU Operator app, then click Launch at the bottom of the page.

In this case, all of the default values should be fine. As before, we can add this step to our automation through the Rancher APIs.
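
Sketching that in Python, launching the chart through the Rancher API might look roughly like this. The project-scoped apps endpoint, the externalId format, and the chart version are all assumptions drawn from Rancher 2.x conventions rather than verified values:

```python
import requests

RANCHER_URL = "https://rancher.example.com"  # placeholder
API_TOKEN = "token-xxxxx:yyyyyyyy"           # placeholder
PROJECT_ID = "c-xxxxx:p-xxxxx"               # the cluster's System project (placeholder)
CHART_VERSION = "1.1.7"                      # hypothetical; use the version your catalog shows

# Assumed Rancher 2.x project-scoped API: POST .../apps launches a catalog
# app -- the API equivalent of clicking Launch in the UI.
resp = requests.post(
    f"{RANCHER_URL}/v3/project/{PROJECT_ID}/apps",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "name": "gpu-operator",
        "targetNamespace": "gpu-operator",
        "externalId": (
            "catalog://?catalog=nvidia-gpu-operator"
            f"&template=gpu-operator&version={CHART_VERSION}"
        ),
        "answers": {},  # empty answers keep the chart's default values
    },
)
resp.raise_for_status()
print("Launched app:", resp.json().get("id"))
```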

Using the GPU

Now that the GPUs are accessible, we can deploy a GPU-enabled workload. We can also verify that the installation was successful by viewing the Cluster -> Nodes page in Rancher, where we can see that the GPU Operator has installed Node Feature Discovery (NFD) and labeled our nodes for GPU use.
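
To make this concrete, here is a minimal sketch using the official kubernetes Python client: it prints each node’s nvidia.com/gpu capacity, then deploys a simple pod that requests one GPU through a resource limit. The CUDA vector-add sample image is commonly used by NVIDIA for this kind of smoke test, but treat the exact image tag as a placeholder:

```python
from kubernetes import client, config

config.load_kube_config()  # uses your current kubeconfig context
v1 = client.CoreV1Api()

# 1. Verify the installation: the GPU Operator's device plugin advertises
#    GPU capacity on each node that NFD identified as having a GPU.
for node in v1.list_node().items:
    gpus = node.status.capacity.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: nvidia.com/gpu={gpus}")

# 2. Deploy a GPU-enabled workload: a pod that requests one GPU, which
#    forces it to be scheduled onto a GPU node.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="cuda-vectoradd"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda-vectoradd",
                image="nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2",
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # request one GPU
                ),
            )
        ],
    ),
)
v1.create_namespaced_pod(namespace="default", body=pod)
```

If everything is wired up correctly, the pod should be scheduled onto the p3.2xlarge node and run to completion.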

Summary

The reason Kubernetes and GPUs can work together in such a simple way comes down to three important components:

  1. NVIDIA’s GPU Operator
  2. Node Feature Discovery (NFD) from the Kubernetes SIG of the same name
  3. Rancher’s cluster deployment and catalog app integration

You are welcome to try this out by following the tutorial, and please stay tuned: in later tutorials, we will take GPUs all the way to the edge.