AWS EKS And NVIDIA Device Plugin
2020-11-12
Failed to initialize NVML: could not load NVML library.
If you see this error when running NVIDIA's k8s-device-plugin on your AWS EKS Kubernetes cluster, here's a straightforward fix you can try.
The Right AMI
By default, even if you launch a worker group with GPU instances (for example g4dn.xlarge, the cheapest one), you are using the default EKS AMIs.
They are great for normal worker nodes within your AWS EKS cluster, but they don't bring along the right configurations to make use of those GPUs.
You can find a list of GPU-ready AMIs under "Amazon EKS optimized accelerated Amazon Linux AMIs", and an explanation in the corresponding section of the EKS User Guide.
In the top part of that page, there's a list of AMIs grouped by Kubernetes version. Navigate to the version that matches your k8s cluster and check the "x86 accelerated" entry. The AMI ID you're looking for is shown below the "Value" field of the linked page. It should look like "ami-(some numbers and characters here)".
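If you manage the cluster with Terraform, the same lookup can be sketched as a data source instead of copying the ID by hand. This is a minimal sketch, assuming the AWS provider's aws_ssm_parameter data source and the public SSM parameter path under which EKS publishes its recommended GPU AMI. The Kubernetes version 1.18 and the names eks_gpu_ami and gpu_ami_id are placeholders chosen for illustration, so adjust them to your cluster.

```hcl
# Assumption: the cluster runs Kubernetes 1.18; change the version segment
# of the path to match your own cluster.
data "aws_ssm_parameter" "eks_gpu_ami" {
  name = "/aws/service/eks/optimized-ami/1.18/amazon-linux-2-gpu/recommended/image_id"
}

# Hypothetical local name, used only to make the value easy to reference.
locals {
  gpu_ami_id = data.aws_ssm_parameter.eks_gpu_ami.value
}
```

The value is resolved for the region your AWS provider is configured with, which matters because AMI IDs differ between regions.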
Specifying The AMI ID
Now that you know which accelerated AMI is right for your EKS cluster, you'll need to configure your worker group to use that AMI.
If you are using Terraform and the "eks" module, you can give it a try by adding an ami_id: "ami-...", line to your worker group config block.
Once you apply the configs, the k8s-device-plugin DaemonSet pods should be able to make the GPUs of the instances usable, thanks to the right AMI.
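For orientation, a worker group entry with that line might look roughly like the fragment below. It is only a sketch, assuming the community terraform-aws-modules/eks/aws module and its older worker_groups list interface. Apart from ami_id, the keys and values shown (name, instance_type, asg_desired_capacity) are illustrative assumptions that you should check against the module version you use.

```hcl
module "eks" {
  # Assumption: the "eks" module is the community terraform-aws-modules one.
  source = "terraform-aws-modules/eks/aws"

  # ... your existing cluster arguments stay unchanged ...

  worker_groups = [
    {
      name                 = "gpu-workers" # hypothetical group name
      instance_type        = "g4dn.xlarge"
      ami_id               = "ami-..."     # the accelerated AMI ID you looked up
      asg_desired_capacity = 1             # illustrative value
    }
  ]
}
```

Changing the AMI only affects instances launched after the change, so nodes that are already running keep the old image until they are replaced.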
If you apply the changes but don't see that result, make sure that the Auto Scaling group has had time to start new instances. If you're impatient (and know what you're doing), you can speed up the process by terminating the old instances manually.
I hope that this article, and knowing about the need to use non-default AMIs, will help you get a bit closer to using GPUs in your AWS EKS cluster!