OpenShift on NVIDIA GPU Accelerated Clusters · Red Hat OpenShift is a security-centric and enterprise-grade hardened Kubernetes platform for deploying and managing Kubernetes clusters

OPENSHIFT ON NVIDIA GPUACCELERATED CLUSTERS

DU-09608-001_v02 | March 2020

Installation Guide

www.nvidia.comOpenShift on NVIDIA GPU Accelerated Clusters DU-09608-001_v02 | ii

TABLE OF CONTENTS

Chapter 1. Introduction.........................................................................................1Chapter 2. Requirements.......................................................................................2Chapter 3. Installation.......................................................................................... 5Chapter 4. GPU Support........................................................................................ 7

4.1. Installing Helm v3........................................................................................ 74.2. Installing GPU Operator................................................................................. 74.3. Running nvidia-smi for the Pod and via the Scheduler............................................114.4. Cleanup................................................................................................... 11

Chapter 5. References.........................................................................................12

www.nvidia.comOpenShift on NVIDIA GPU Accelerated Clusters DU-09608-001_v02 | 1

Chapter 1.INTRODUCTION

Kubernetes is an open-source platform for automating the deployment, scaling, andmanaging of containerized applications.

Red Hat OpenShift is a security-centric and enterprise-grade hardened Kubernetesplatform for deploying and managing Kubernetes clusters at scale, developed andsupported by Red Hat.

Kubernetes via Red Hat OpenShift 4.1/4.2/4.3 includes enhancements to Kubernetes sousers can easily configure and use GPU resources for accelerating workloads such asdeep learning.


Chapter 2.REQUIREMENTS

‣ Red Hat Subscription

Installing and running OpenShift requires a Red Hat account and additionalsubscriptions.

‣ Internet Access

To perform subscription management, including legally entitling your software fromRed Hat, your systems must have direct internet access to install the cluster.

‣ Hardware

System Type DescriptionMinimum Numberof Systems

RecommendedSpecs

Deployment system System for executingthe deployment.

BYO Any Mac OS or Linuxsystem with 300MB ofdisk space

Load balancer The load balancerservices allowexternal access to theOpenShift ContainerPlatform cluster anddistributes the workacross various nodes ofthe cluster.

BYO or 1 Xeon Gold 5118 (12cores, 16.5MB cache,2.3Ghz/3.2Ghz) /32GB RAM(distributed) / Single40/50GbE NIC / 800GBSAS SSD

Bootstrap Because each machinein the cluster requiresinformation aboutthe cluster when itis provisioned, theOpenShift ContainerPlatform uses atemporary bootstrapmachine during initialconfiguration toprovide the requiredinformation to the

1 (Temporary)

Requirements


System Type DescriptionMinimum Numberof Systems

RecommendedSpecs

permanent controlplane deployedon the Masters.After the clustermachines initialize,the bootstrap machineis destroyed and canbe reallocated.

Master (Control plane) The master nodesrun services that arerequired to control theKubernetes cluster. InOpenShift ContainerPlatform, the mastermachines are thecontrol plane.

3 Dual Xeon Gold5118 (12 cores,16.5MB cache,2.3Ghz/3.2Ghz) /32GB RAM(distributed) / Single40/50GbE NIC / 800GBSAS SSD

Worker (Compute) The worker nodesare where the actualworkloads requestedby Kubernetes usersrun and are managed.The worker nodesadvertise theircapacity and thescheduler, which ispart of the masterservices, determineson which nodes tostart containers andPods.

2+ OEM NGC-Ready T4or V100 Systems withdual 50 GbE NICs,DGX-1, or DGX-2

‣ Network Connectivity

All systems require access to DHCP and DNS servers.

‣ HTTP server

An HTTP server that is accessible from the deployment system is necessary fordeploying the systems.

‣ PXE or iPXE server

A PXE or iPXE server is needed for deploying the systems. Note, Red Hat alsoprovides instructions for deployment without PXE but this method requiresadditional steps to set up the systems.

‣ RHCOS

The compressed metal BIOS, UEFI, kernel, and initramfs files from the Red Hatportal are needed for deploying on the systems. These should be located on yourHTTP server.

https://docs.nvidia.com/ngc/ngc-ready-systems/index.html

https://docs.nvidia.com/ngc/ngc-ready-systems/index.html

Requirements


‣ (Optional) NFS Storage

For production clusters, access to NFS is necessary for setting up a persistent volume(PV). 100 GiB minimum.


Chapter 3.INSTALLATION

The following image illustrates the process of creating the compute machines.

The following is a high-level overview of the installation process.

1. Configure networking (DHCP and system to system communication). 2. Provision two layer-4 load balancers and DNS. 3. Download the installer to the deployment system. 4. Create the installation configuration file. 5. Generate Ignition files for the systems (bootstrap, masters and workers) and upload

to HTTP server.

The bootstrap ignition file is only good for 24 hrs due to the certificates within.

6. Configure systems and images for PXE boot. 7. Create the bootstrap, master and compute systems. 8. Create the cluster by starting the installation process cluster. 9. Remove the bootstrap machine from the load balancer. 10. Log into the cluster, export Kubernetes credentials for safe keeping, approve

pending CSRs (certificate signing requests) if necessary and confirm all operatorscome online.

Installation


11. Configure image registry storage.

‣ For non-production servers, an empty directory can be used.‣ For production servers, a persistent volume (PV) is necessary (100GiB

minimum, NFS). 12. The installation is complete when all components report as available and the

Kubernetes API is communicating with PODS.

The detailed installation instructions are maintained by Red Hat on their documentationwebsite.

https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/

https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/


Chapter 4.GPU SUPPORT

This section assumes that an OCP cluster is up and running with a GPU worker node.

4.1. Installing Helm v3

To install Helm on OpenShift, please follow the official documentation from Red Hat.

4.2. Installing GPU Operator

This section describes how to install GPU Operator via Helm v3.

1. Add a cluster-wide entitlement via a Kubernetes secret.

This is required to download packages used to build the driver container.

These instructions assume you downloaded an entitlement encoded in base64 fromaccess.redhat.com or extracted it from an existing node.In the following commands, the entitlement certificate is copied to nvidia.pem, butit can be copied to any accessible location.

$ cp 2417243652391403362.pem nvidia.pem

$ curl -O https://raw.githubusercontent.com/openshift-psap/blog-artifacts/master/how-to-use-entitled-builds-with-ubi/0003-cluster-wide-machineconfigs.yaml.template

$ sed "s/BASE64_ENCODED_PEM_FILE/$(base64 -w0 nvidia.pem)/g" 0003-cluster-wide-machineconfigs.yaml.template > 0003-cluster-wide-machineconfigs.yaml

$ ./oc create -f 0003-cluster-wide-machineconfigs.yaml

$ ./oc get machineconfig | grep entitlement

// Output50-entitlement-key-pem 2.2.0 10h50-entitlement-pem 2.2.0 10h

https://docs.openshift.com/container-platform/4.3/cli_reference/helm_cli/getting-started-with-helm-on-openshift-container-platform.html

GPU Support


2. Validate the cluster-wide entitlement with a test pod that queries a Red Hatsubscription repo for the kernel-devel package.

$ cat << EOF >> mypod.yaml

//OutputapiVersion: v1kind: Podmetadata: name: cluster-entitled-build-podspec: containers: - name: cluster-entitled-build image: registry.access.redhat.com/ubi8:latest command: [ "/bin/sh", "-c", "dnf search kernel-devel --showduplicates" ] restartPolicy: NeverEOF

$ ./oc create -f mypod.yaml $ ./oc get pods -n default

$ ./oc logs cluster-entitled-build-pod -n default

//OutputUpdating Subscription Management repositories.Unable to read consumer identitySubscription Manager is operating in container mode.Red Hat Enterprise Linux 8 for x86_64 - AppStre 15 MB/s | 14 MB 00:00 Red Hat Enterprise Linux 8 for x86_64 - BaseOS 15 MB/s | 13 MB 00:00 Red Hat Universal Base Image 8 (RPMs) - BaseOS 493 kB/s | 760 kB 00:01 Red Hat Universal Base Image 8 (RPMs) - AppStre 2.0 MB/s | 3.1 MB 00:01 Red Hat Universal Base Image 8 (RPMs) - CodeRea 12 kB/s | 9.1 kB 00:00 ====================== Name Exactly Matched: kernel-devel ======================kernel-devel-4.18.0-80.1.2.el8_0.x86_64 : Development package for building : kernel modules to match the kernelkernel-devel-4.18.0-80.el8.x86_64 : Development package for building kernel : modules to match the kernelkernel-devel-4.18.0-80.4.2.el8_0.x86_64 : Development package for building : kernel modules to match the kernelkernel-devel-4.18.0-80.7.1.el8_0.x86_64 : Development package for building : kernel modules to match the kernelkernel-devel-4.18.0-80.11.1.el8_0.x86_64 : Development package for building : kernel modules to match the kernelkernel-devel-4.18.0-147.el8.x86_64 : Development package for building kernel : modules to match the kernelkernel-devel-4.18.0-80.11.2.el8_0.x86_64 : Development package for building : kernel modules to match the kernelkernel-devel-4.18.0-80.7.2.el8_0.x86_64 : Development package for building : kernel modules to match the kernelkernel-devel-4.18.0-147.0.3.el8_1.x86_64 : Development package for building : kernel modules to match the kernelkernel-devel-4.18.0-147.0.2.el8_1.x86_64 : Development package for building : kernel modules to match the kernel

GPU Support


kernel-devel-4.18.0-147.3.1.el8_1.x86_64 : Development package for building : kernel modules to match the kernel

3. Add and update the NVIDIA Helm repository.

$ helm repo add nvidia https://nvidia.github.io/gpu-operator

$ helm repo update

4. Install the GPU operator.

$ $ helm install --devel nvidia/gpu-operator --set platform.openshift=true,operator.defaultRuntime=crio --wait --generate-name

5. View the NFD pods.

$ ./oc get pods -n node-feature-discovery

//OutputNAME READY STATUS RESTARTS AGEnfd-master-8c7wk 1/1 Running 0 130mnfd-master-d94w9 1/1 Running 0 130mnfd-master-rfr8k 1/1 Running 1 130mnfd-worker-5p5fp 1/1 Running 2 130mnfd-worker-jzn8p 1/1 Running 2 130m

6. View the GPU device is discovered on the GPU node.

$ oc describe node ip-10-0-137-45.us-west-1.compute.internal | egrep 'Roles|pci'

//OutputRoles: worker feature.node.kubernetes.io/pci-10de.present=true feature.node.kubernetes.io/pci-1d0f.present=true

7. View the GPU operator namespace resources.

$ ./oc get pods -n gpu-operator

//OutputNAME READY STATUS RESTARTS AGEspecial-resource-operator-5db5bc4f5c-vld29 1/1 Running 0 154m

$ ./oc get all -n gpu-operator-resources

//OutputNAME READY STATUS RESTARTS AGEpod/nvidia-container-toolkit-daemonset-sgr7h 1/1 Running 0 160mpod/nvidia-dcgm-exporter-twjx4 2/2 Running 0 153mpod/nvidia-device-plugin-daemonset-6tbfv 1/1 Running 0 156mpod/nvidia-device-plugin-validation 0/1 Completed 0 156mpod/nvidia-driver-daemonset-m7mwk 1/1 Running 0 160mpod/nvidia-driver-validation 0/1 Completed 0 160m

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE

GPU Support


service/nvidia-dcgm-exporter ClusterIP 172.30.64.207 <none> 9400/TCP 153m

NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGEdaemonset.apps/nvidia-container-toolkit-daemonset 1 1 1 1 1 feature.node.kubernetes.io/pci-10de.present=true 160mdaemonset.apps/nvidia-dcgm-exporter 1 1 1 1 1 feature.node.kubernetes.io/pci-10de.present=true 153mdaemonset.apps/nvidia-device-plugin-daemonset 1 1 1 1 1 feature.node.kubernetes.io/pci-10de.present=true 156mdaemonset.apps/nvidia-driver-daemonset 1 1 1 1 1 feature.node.kubernetes.io/kernel-version.full=4.18.0-147.3.1.el8_1.x86_64,feature.node.kubernetes.io/pci-10de.present=true 160m

8. Verify that the GPU Operator installation completed successfully.

The GPU Operator validates the stack through the nvidia-device-plugin-validation pod and nvidia-driver-validation pod. If both completedsuccessfully, the stack works as expected.

$ oc logs nvidia-driver-validation -n gpu-operator-resources | tail

//Outputmake[1]: Leaving directory '/usr/local/cuda-10.2/cuda-samples/Samples/warpAggregatedAtomicsCG'make: Target 'all' not remade because of errors.> Using CUDA Device [0]: Tesla T4> GPU Device has SM 7.5 compute capability[Vector addition of 50000 elements]Copy input data from the host memory to the CUDA deviceCUDA kernel launch with 196 blocks of 256 threadsCopy output data from the CUDA device to the host memoryTest PASSEDDone

$ oc logs nvidia-device-plugin-validation -n gpu-operator-resources | tail

//Outputmake[1]: Leaving directory '/usr/local/cuda-10.2/cuda-samples/Samples/warpAggregatedAtomicsCG'make: Target 'all' not remade because of errors.> Using CUDA Device [0]: Tesla T4> GPU Device has SM 7.5 compute capability[Vector addition of 50000 elements]Copy input data from the host memory to the CUDA deviceCUDA kernel launch with 196 blocks of 256 threadsCopy output data from the CUDA device to the host memoryTest PASSEDDone

GPU Support


4.3. Running nvidia-smi for the Pod and via theScheduler

To view GPU utilization, run nvidia-smi from a pod in the GPU operator daemonset.

$ oc project gpu-operator-resources

$ oc get pods

$ oc exec -it nvidia-driver-daemonset-m7mwk nvidia-smi

4.4. Cleanup

This section describes how to clean up (remove) the GPU Operator in case it is no longerneeded.

1. Uninstall the GPU operator.a) Get the GPU Operator instance id.

$ helm ls | grep gpu-operator | awk ‘{print $1}’

//Outputgpu-operator-1-1583173374

b) Delete the GPU operator instance obtained from the previous step.

Example

$ helm delete gpu-operator-1-1583173374

//Output release “gpu-operator-1-1583173374” uninstalled

c) (Perform this step only if using OpenShift 4.3) Delete the OpenShift 4.3 specialresources.

$ oc delete crd specialresources.sro.openshift.io

//Output customresourcedefinition.apiextensions.k8s.io “specialresources.sro.openshift.io” deleted

2. (Optional) If deploying via AWS, you can scale down the GPU-enabled machineSetto ‘0’ replicas.

If reloading the operator only, reboot the node instead of scaling it down.

$ oc scale machineset devel-z8p97-gpu-us-west-1b --replicas=0 -n openshift-machine-api

//Outputmachineset.machine.openshift.io/devel-z8p97-gpu-us-west-1b scaled


Chapter 5.REFERENCES

‣ OpenShift 4.1 Bare Metal Installation Documentation‣ OpenShift 4.2 Bare Metal Installation Documentation‣ NVIDIA blog on OpenShift Installation (for reference only; the steps are for an older

version of the GPU Operator)‣ OpenShift 4.1 Architectural Overview‣ OpenShift 4.2 Architectural Overview‣ NVIDIA GPU Operator

https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html-single/installing/index#installing-on-bare-metal

https://access.redhat.com/documentation/en-us/openshift_container_platform/4.2/html-single/installing/index#installing-on-bare-metal

https://devblogs.nvidia.com/gpu-support-ai-workloads-openshift4/

https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/architecture/index

https://access.redhat.com/documentation/en-us/openshift_container_platform/4.2/html/architecture/index

https://github.com/NVIDIA/gpu-operator

Documents

OpenShift on NVIDIA GPU Accelerated Clusters · Red Hat OpenShift is a security-centric and enterprise-grade hardened Kubernetes platform for deploying and managing Kubernetes clusters