Running VMs with GPU Passthrough

This section demonstrates how to deploy virtual machines (VMs) with GPU passthrough using Cozystack. First, we'll deploy the GPU Operator to configure the worker node for GPU passthrough. Then we'll deploy a KubeVirt VM that requests a GPU.

To provision GPUs for passthrough, the GPU Operator deploys the following components by default:

  • VFIO Manager, which binds the vfio-pci driver to all GPUs on the node.
  • Sandbox Device Plugin, which discovers the passthrough GPUs and advertises them to kubelet.
  • Sandbox Validator, which validates the other operands.
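
All three components run as DaemonSets in the cozy-gpu-operator namespace (the same namespace used for verification in step 4 below). Once the operator is installed, you can list them with:

kubectl get daemonsets -n cozy-gpu-operator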

Prerequisites

  • A Cozystack cluster with at least one GPU-enabled node.
  • kubectl installed and cluster access credentials configured.

1. Install the GPU Operator

Follow these steps:

  1. Label the worker node explicitly for GPU passthrough workloads (a sanity check for this label is shown after this list):

    kubectl label node <node-name> --overwrite nvidia.com/gpu.workload.config=vm-passthrough
    
  2. Enable the GPU Operator bundle in your Cozystack configuration:

    kubectl edit -n cozy-system configmap cozystack-config
    
  3. Add gpu-operator to the list of bundle-enabled packages:

    bundle-enable: gpu-operator
    

    Saving the change deploys the GPU Operator together with its operands, listed at the top of this section.

  4. Ensure that all pods are in the Running state and that all checks performed by the sandbox-validator component succeed:

    kubectl get pods -n cozy-gpu-operator
    

    Example output (your pod names may vary):

    NAME                                            READY   STATUS    RESTARTS   AGE
    ...
    nvidia-sandbox-device-plugin-daemonset-4mxsc    1/1     Running   0          40s
    nvidia-sandbox-validator-vxj7t                  1/1     Running   0          40s
    nvidia-vfio-manager-thfwf                       1/1     Running   0          78s
    
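
As a quick sanity check for steps 1 and 3, you can confirm that the node label and the bundle setting were applied; both commands below reuse the exact names from the steps above:

kubectl get nodes -l nvidia.com/gpu.workload.config=vm-passthrough
kubectl get configmap -n cozy-system cozystack-config -o yaml | grep bundle-enable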

To verify the GPU binding, access the node (for example, using kubectl debug node or kubectl node-shell -x) and run:

lspci -nnk -d 10de:

The vfio-manager pod binds all GPUs on the node to the vfio-pci driver. Example output:

3b:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:2236] (rev a1)
       Subsystem: NVIDIA Corporation Device [10de:1482]
       Kernel driver in use: vfio-pci
86:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:2236] (rev a1)
       Subsystem: NVIDIA Corporation Device [10de:1482]
       Kernel driver in use: vfio-pci

The sandbox-device-plugin will discover and advertise these resources to kubelet. In this example, the node shows two A10 GPUs as available resources:

kubectl describe node <node-name>

Example output:

...
Capacity:
  ...
  nvidia.com/GA102GL_A10:         2
  ...
Allocatable:
  ...
  nvidia.com/GA102GL_A10:         2
...
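
Alternatively, you can query the allocatable count for the resource directly; the dots in the resource name must be escaped for JSONPath:

kubectl get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/GA102GL_A10}'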

2. Update the KubeVirt Custom Resource

Next, we will update the KubeVirt Custom Resource, as documented in the KubeVirt user guide, so that the passthrough GPUs are permitted and can be requested by a KubeVirt VM.

Adjust the pciVendorSelector and resourceName values to match your specific GPU model. Setting externalResourceProvider: true indicates that this resource is provided by an external device plugin, in this case the sandbox-device-plugin deployed by the GPU Operator.

kubectl edit kubevirt -n cozy-kubevirt

Example configuration:

  ...
  spec:
    configuration:
      permittedHostDevices:
        pciHostDevices:
        - externalResourceProvider: true
          pciVendorSelector: 10DE:2236
          resourceName: nvidia.com/GA102GL_A10
  ...
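
If you prefer a non-interactive change, the same configuration can be applied with a merge patch. The sketch below assumes the KubeVirt custom resource is named kubevirt; note that a merge patch replaces the whole pciHostDevices list:

kubectl patch kubevirt kubevirt -n cozy-kubevirt --type merge -p '
{
  "spec": {
    "configuration": {
      "permittedHostDevices": {
        "pciHostDevices": [
          {
            "externalResourceProvider": true,
            "pciVendorSelector": "10DE:2236",
            "resourceName": "nvidia.com/GA102GL_A10"
          }
        ]
      }
    }
  }
}'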

3. Create a Virtual Machine

We are now ready to create a VM.

  1. Create a sample virtual machine using the following VirtualMachine manifest, which requests the nvidia.com/GA102GL_A10 resource.

    vmi-gpu.yaml:

    ---
    apiVersion: apps.cozystack.io/v1alpha1
    appVersion: '*'
    kind: VirtualMachine
    metadata:
      name: gpu
      namespace: tenant-example
    spec:
      running: true
      instanceProfile: ubuntu
      instanceType: u1.medium
      systemDisk:
        image: ubuntu
        storage: 5Gi
        storageClass: replicated
      gpus:
      - name: nvidia.com/GA102GL_A10
      cloudInit: |
        #cloud-config
        password: ubuntu
        chpasswd: { expire: False }    
    
    kubectl apply -f vmi-gpu.yaml
    

    Example output:

    virtualmachines.apps.cozystack.io/gpu created
    
  2. Verify the VM status:

    kubectl get vmi
    
    NAME                       AGE   PHASE     IP             NODENAME        READY
    virtual-machine-gpu        73m   Running   10.244.3.191   luc-csxhk-002   True
    
  3. Log in to the VM and confirm that it has access to the GPU:

    virtctl console virtual-machine-gpu
    

    Example output:

    Successfully connected to virtual-machine-gpu console. The escape sequence is ^]
    
    virtual-machine-gpu login: ubuntu
    Password:
    
    ubuntu@virtual-machine-gpu:~$ lspci -nnk -d 10de:
    08:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1)
            Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482]
            Kernel driver in use: nvidia
            Kernel modules: nvidiafb, nvidia_drm, nvidia
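
Note that the lspci output above assumes the NVIDIA driver is already installed inside the guest OS (on Ubuntu, for example, via a nvidia-driver-* package from the distribution; exact package names vary by release). Once the driver is loaded, you can confirm that the GPU is usable by running the following inside the guest, which should report the passed-through A10 along with the driver and CUDA versions:

nvidia-smi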