As major enterprises around the world adopt Kubernetes more widely, we see Kubernetes entering a new stage of development. On the one hand, Kubernetes is being adopted for edge workloads and delivering value beyond the data center. On the other hand, Kubernetes is driving machine learning (ML) and high-quality, high-speed data analytics.
The case for applying Kubernetes to machine learning largely stems from a feature introduced in Kubernetes 1.10, when graphics processing units (GPUs) became a schedulable resource; that feature is now in beta. Taken individually, each of these is an exciting development in Kubernetes. What is even more exciting is using Kubernetes to adopt GPUs both in the data center and at the edge. In the data center, GPUs are a way to train ML models; those trained models can then be migrated to edge Kubernetes clusters as ML inference engines, providing data analysis as close as possible to where the data is collected.
Kubernetes has long provided pools of CPU and RAM resources for distributed applications. If we can have CPU and RAM pools, why not a GPU pool? Of course we can, but not all servers have GPUs. So how do we equip our servers with GPUs in Kubernetes?
In this article, I'll explain a simple way to start using GPUs in Kubernetes clusters. In future articles, we will also push GPUs to the edge and show you how to do that. To keep the steps really simple, I will use the Rancher UI to walk through enabling GPUs. The Rancher UI is just a client of Rancher's REST API; you can use other API clients in GitOps, DevOps, and other automation solutions, such as Golang, Python, and Terraform. However, we will not explore those in depth here.
In essence, the steps are very simple:
- Build the infrastructure for the Kubernetes cluster
- Install Kubernetes
- Install the GPU operator from Helm
- Get up and running with Rancher and the available GPU resources
Rancher is a multi-cluster management solution and is the glue that holds these steps together. You can find a pure NVIDIA solution for simplifying GPU management on NVIDIA's blog, along with some important information about how the GPU operator differs from building the GPU driver stack without the operator.
Prerequisites
The following is the bill of materials (BOM) needed to get GPUs up and running in Rancher:
The official documentation has a dedicated chapter on installing Rancher for high availability, so from here on we assume that Rancher is already installed:
Installing a Kubernetes cluster with GPUs
With Rancher installed, we will first build and configure a Kubernetes cluster (you can use any cluster with NVIDIA GPUs).
From the Global context, we select Add Cluster.
In the "Hosts from cloud service providers" section, select Amazon EC2.
We do this through node drivers: a set of pre-configured infrastructure templates, some of which have GPU resources.
Note that there are three node pools: one for the masters, one for standard worker nodes, and one for workers with GPUs. The GPU template is based on the p3.2xlarge instance type and uses the Ubuntu 18.04 Amazon Machine Image, or AMI (ami-0ac80df6eff0e70b5). Of course, these choices will vary according to the needs of each infrastructure provider and enterprise. We also leave the Kubernetes options in the "Add Cluster" form at their default values.
Setting up the GPU operator
Now we will set up a catalog in Rancher using the GPU operator repository (https://nvidia.github.io/gpu-operator). (There are other solutions for exposing GPUs, including using the Linux for Tegra [L4T] Linux distribution or a device plugin.) At the time of writing, the GPU operator has been tested and validated with the NVIDIA Tesla Driver 440.
Using the Rancher Global context menu, we select the cluster we want to install into:
Then we use the Tools menu to view the catalog list.
Click the Add Catalog button, give it a name, and add the URL: https://nvidia.github.io/gpu-operator
We choose Helm v3 and Cluster scope, then click Create to add the catalog to Rancher. When using automation, we can make this step part of the cluster build. Depending on enterprise policy, we could add this catalog to every cluster, even those without GPU nodes or node pools. This step gives us access to the GPU operator chart, which we will install next.
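Outside the Rancher UI, the same repository can be registered directly with a Helm v3 client; the repository URL comes from the catalog step above, while the repo alias `nvidia` is our own choice, not a required name:

```shell
# Register the GPU operator chart repository (the alias "nvidia" is arbitrary)
helm repo add nvidia https://nvidia.github.io/gpu-operator

# Refresh the local chart index and confirm the chart is visible
helm repo update
helm search repo gpu-operator
```

These commands assume a working Helm v3 client with network access to the repository.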
Now we use the Rancher context menu in the upper left to enter the cluster's "System" project, which is where we will add the GPU operator functionality.
In the System project, select Apps:
Then click the Launch button at the top right.
We can search for “NVIDIA” or scroll down to the catalog we just created.
Click the GPU operator app, then click Launch at the bottom of the page.
In this case, all the default values should be fine. As before, we can add this step to our automation through the Rancher API.
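For automation, the catalog launch above maps onto a plain Helm v3 install of the same chart; the release name and target namespace below are our own assumptions, not values mandated by the chart:

```shell
# Install the GPU operator chart with its default values
# (the release name "gpu-operator" and the namespace are our own choices)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Watch the operator components come up
kubectl get pods --namespace gpu-operator --watch
```

This assumes the chart repository was added under the alias `nvidia`, as in the catalog step, and that `kubectl` is pointed at the target cluster.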
Now that the GPUs are accessible, we can deploy GPU-capable workloads. We can also verify that the installation succeeded by viewing the Cluster -> Nodes page in Rancher: the GPU operator has installed Node Feature Discovery (NFD) and labeled our nodes for GPU use.
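The same verification can be done from the command line. The `nvidia.com/gpu` resource name is what the operator's device plugin advertises; the pod name and CUDA image tag in this smoke test are our own assumptions and may need adjusting:

```shell
# Write a minimal CUDA smoke-test pod manifest.
# The pod name and image tag are our own choices.
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # request one whole GPU from the scheduler
EOF

# Against the live cluster:
#   kubectl get nodes --show-labels | grep -i nvidia   # NFD labels on GPU nodes
#   kubectl apply -f gpu-test-pod.yaml
#   kubectl logs gpu-test                              # should print the nvidia-smi table
```

If the pod schedules onto a GPU node and its logs show the device table, the whole stack (driver, container runtime hook, device plugin) is working.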
Kubernetes can run with GPUs this simply thanks to three important pieces:
- NVIDIA’s GPU operator
- Node Feature Discovery (NFD) from the Kubernetes SIG of the same name
- Rancher's cluster deployment and catalog app integration
You are welcome to follow along with this tutorial, and stay tuned: in later tutorials, we will take the GPU all the way to the edge.