GPU Resource Naming #2255

Open
Tracked by #278
zvonkok opened this issue Jan 22, 2025 · 19 comments
Comments

@zvonkok
Member

zvonkok commented Jan 22, 2025

For the Kata BM use case, we have VFIO devices advertised as nvidia.com/pgpu: 1. We cannot use nvidia.com/gpu: 1 for the peer-pods use case, since that name is reserved for GPUs used by traditional container runtimes and would clash in a cluster that mixes nodes running GPUs without Kata/peer-pods and nodes running GPUs with Kata/peer-pods.

We need to come up with a new naming scheme that we use for peer-pods.

In the bare-metal use case we also have, e.g., the SKU name exposed in the cluster: nvidia.com/GH100_H800: 8.

@zvonkok
Member Author

zvonkok commented Jan 22, 2025

Since an admin has a curated list of instance types they want to expose, and peer-pods are heavily tied to the instance type, we could expose:

nvidia.com/<instance-type-a>-gpu: 1
nvidia.com/<instance-type-b>-gpu: 1

If we do not care about the GPU type and just need any instance type, we need a common name; peer-pods could then allocate any GPU instance.

CSP GPU == cgpu?

nvidia.com/cgpu: 1

This way we would have distinct names for:

Traditional containers: nvidia.com/gpu: 1, or, if we need a specific type, nvidia.com/mig-1g.10gb.count: 1
Bare-metal Kata: nvidia.com/pgpu: 1, or, if we need a specific type, nvidia.com/GH100_H800: 1
Peer-pods Kata: nvidia.com/cgpu: 1, or, if we need a specific type, nvidia.com/<instance-type>: 1
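As a minimal sketch (assuming the cgpu name were adopted; the image and runtime class are the ones used elsewhere in this thread, the pod name is illustrative), a workload that does not care about the GPU type would simply request:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  runtimeClassName: kata-remote
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/cgpu: 1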

@stevenhorsman @jensfr @bpradipt

@mythi
Contributor

mythi commented Jan 24, 2025

We need to come up with a new naming scheme that we use for peer-pods.

A bit off-topic, but how are these resources advertised on a node and mapped to a podvm? A device plugin with CDI devices that are podvm-specific?

@zvonkok
Member Author

zvonkok commented Jan 24, 2025

Since we're in peer-pod land, I will answer this question in this context.

I talked to @bpradipt, who told me that an admin or operator will usually have a curated list of VM instance types that can be used in a specific cluster.

We can create NFD rules or a device plugin to expose this list as an extended resource (a device plugin is unnecessary, though, since it can only add env variables or mounts to the container; we cannot add annotations depending on the request). Since we added CDI support in the kata-agent, what we can now do is the following:

The pod requests nvidia.com/cgpu: 1, which means we do not care which GPU instance we get; use one from the list, and use the mutating webhook to add the annotation:

"cdi.k8s.io/peer-pod": "nvidia.com/gpu=0"

The kata-agent will read this annotation, and the corresponding CDI device will be injected.
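For illustration, a minimal sketch of what such a CDI spec inside the podvm could look like (the paths and the single library mount are placeholders; the real spec is generated by nvidia-ctk and contains many more edits):

cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
  # Device "0" is what the annotation value nvidia.com/gpu=0 resolves to
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
        - path: /dev/nvidiactl
containerEdits:
  mounts:
    # Placeholder: real specs mount the full set of driver libraries and binaries
    - hostPath: /usr/lib64/libcuda.so.1
      containerPath: /usr/lib64/libcuda.so.1
      options: ["ro", "nosuid", "nodev", "bind"]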

If we need multiple GPUs, the devices are listed in the annotation value (annotation keys must be unique, so multiple devices are comma-separated):

"cdi.k8s.io/peer-pod": "nvidia.com/gpu=0,nvidia.com/gpu=1"

For the instance type, we have another annotation, which is not related to CDI, but it obviously needs to select a GPU instance. If we use a CPU instance type but have added the CDI annotations, the kata-agent will fail, since we cannot create the CDI specs for GPUs, and it will time out.
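On the NFD option mentioned above, a hedged sketch of a NodeFeatureRule that advertises the proposed resource as an extended resource (this assumes an NFD version whose NodeFeatureRule supports extendedResources; the match is a placeholder an admin would replace with a condition identifying peer-pods-capable workers):

apiVersion: nfd.k8s.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: peer-pods-gpu
spec:
  rules:
    - name: "advertise peer-pods GPU capacity"
      # Placeholder match (always true on Linux nodes); substitute a real condition.
      matchFeatures:
        - feature: kernel.version
          matchExpressions:
            major: {op: Exists}
      extendedResources:
        nvidia.com/cgpu: "1"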

@inatatsu

The kata-agent will read this annotation, and the corresponding CDI device will be injected.

@zvonkok In my understanding, Cloud API Adaptor, which sits between the container runtime for the remote hypervisor and the kata-agent (and resides outside of a pod VM), currently handles the GPU resource request annotations to determine an appropriate instance type. Do you suggest that the kata-agent can handle this annotation by using a CDI spec, inside of a pod VM?

@bpradipt
Member

Currently we have the following mechanism for using GPUs with peer-pods:

The user provides the following pod manifest (same as regular Kata or runc, except the runtimeClass changes):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-kata   
spec:
  runtimeClassName: kata-remote
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        "nvidia.com/gpu": 1

The webhook mutates the pod manifest into something like this (note the removal of resources and the addition of annotations):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-kata   
  annotations:
      io.katacontainers.config.hypervisor.default_gpus: "1"
spec:
  runtimeClassName: kata-remote
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"    

Then CAA determines a suitable GPU instance type from the pre-configured instance type list, creates the VM, and runs the pod.
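For context, a rough sketch of that pre-configured list for the Azure provider (the ConfigMap name and keys reflect my understanding of the peer-pods configuration and should be treated as assumptions that may differ per provider and version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: peer-pods-cm
data:
  # Default instance size, plus the curated list CAA may pick from
  AZURE_INSTANCE_SIZE: "Standard_D4as_v5"
  AZURE_INSTANCE_SIZES: "Standard_D4as_v5,Standard_NC4as_T4_v3,Standard_NC64as_T4_v3"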

An alternative mechanism is to simply use a pod manifest specific to peer-pods, like the following (note the machine_type annotation selecting a specific GPU instance type):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-kata   
  annotations:
      io.katacontainers.config.hypervisor.machine_type: Standard_NC4as_T4_v3
spec:
  runtimeClassName: kata-remote
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"    

Now with CDI, we can start with the most basic implementation, like the manifest below:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-kata   
  annotations:
      io.katacontainers.config.hypervisor.default_gpus: "1"
      cdi.k8s.io/gpu: "nvidia.com/pgpu=1"
spec:
  runtimeClassName: kata-remote
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"    

There are two places where we can add the CDI annotation: either in the webhook or in CAA.
If we do it in the webhook, it's simple, but we won't be able to automatically add a suitable annotation based on the number of GPUs available in a specific instance, as that info is not available to the webhook. IOW, if the original manifest is the following:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-kata   
  annotations:
      io.katacontainers.config.hypervisor.machine_type: Standard_NC64as_T4_v3
spec:
  runtimeClassName: kata-remote
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"    

I think we would want the actual manifest to carry the proper CDI annotation indicating the number of pGPUs. That's not possible with the webhook today. CAA already has this info, so it should be able to modify the OCI spec to add it.

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-kata   
  annotations:
      io.katacontainers.config.hypervisor.machine_type: Standard_NC64as_T4_v3
      cdi.k8s.io/gpu: "nvidia.com/pgpu=4"
spec:
  runtimeClassName: kata-remote
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"    

Does this make sense?

@snir911
Contributor

snir911 commented Jan 30, 2025

IIUC, eventually we'll need some sort of translation between the instance size and a matching CDI annotation (type), no? That cannot be done ATM in the webhook, AFAIU.

Having said that, starting with attaching a default CDI annotation in the webhook/CAA according to the GPU request looks like a good option to me (assuming I understand the workflow right).

@bpradipt
Member

IIUC, eventually we'll need some sort of translation between the instance size and a matching CDI annotation (type), no?

Yes, that's my understanding.

Having said that, starting with attaching a default CDI annotation in the webhook/CAA according to the GPU request looks like a good option to me (assuming I understand the workflow right).

Is there anything needed on the pod VM side, or is the CDI annotation in the spec enough?

@snir911
Contributor

snir911 commented Jan 30, 2025

Is there anything needed on the pod VM side, or is the CDI annotation in the spec enough?

AFAIU the agent's CDI-related bits are all in place; the podvm just needs to have the CDI specification in place, and that's it (I've been experimenting with the injection in the CAA and it worked).

@snir911
Contributor

snir911 commented Jan 30, 2025

Actually, adding the CDI annotation in the webhook (or manually) will fail ATM, as the (Go) shim cannot add the specified CDI device (should it simply pass the annotation through and do nothing else when it's a remote hypervisor? IDK).

@mythi
Contributor

mythi commented Jan 31, 2025

Actually, adding the CDI annotation in the webhook (or manually) will fail ATM, as the (Go) shim cannot add the specified CDI device (should it simply pass the annotation through and do nothing else when it's a remote hypervisor? IDK).

I believe the idea is that the kata-agent knows about the CDI devices and writes the config.json edits inside the guest. I'm not sure if that's necessary in the peer-pods case, where there are no node device resources to be mapped into guest device resources.

Would peer-pods simply work if the config.json is prepared on the host before sending it to the kata-agent?

@snir911
Contributor

snir911 commented Feb 5, 2025

I'm not sure I follow; however, yes, node device resources are not relevant in the peer-pods case.

FWIW, the following works for me. Does that make sense? We may do something like that as a midterm solution. cc @inatatsu, @zvonkok

@zvonkok
Member Author

zvonkok commented Feb 6, 2025

@mythi It is necessary. You need to mount the libs/binaries into the container; this is what the hook did in the past, and now it needs to be done in the VM.

@snir911 Yes, that is the way to go; you just need to make sure you create the CDI spec upon VM startup, after loading the driver.
You can use nvidia-ctk to do that in one go:

nvidia-ctk -d system create-device-nodes --control-devices --load-kernel-modules

@bpradipt The naming in the pod YAML may need to be different from nvidia.com/gpu, and the CDI annotation needs to be nvidia.com/gpu, not nvidia.com/pgpu. Take a look at the snippet @snir911 posted; this is the way to go.

nvidia.com/gpu is reserved for traditional container runtimes, so we do not want to confuse users.

After the CDI spec is created, we need to start nvidia-persistenced in the VM. Once we have this working properly, we can think about metrics.

@zvonkok
Member Author

zvonkok commented Feb 6, 2025

If we use the CDI annotation, I think the

io.katacontainers.config.hypervisor.default_gpus: "1"

annotation is not needed?

resources:
  limits:
    nvidia.com/cgpu: 2

This would translate into a CSP instance with 2 GPUs, or we can use an instance with 4 GPUs if the user is willing, with the requested devices listed in the CDI annotation (annotation keys must be unique, so multiple devices are comma-separated in one value):

cdi.k8s.io/peer-pods: nvidia.com/gpu=0,nvidia.com/gpu=1

This is then used by the kata-agent inside the VM to inject the two devices into the container.

If we have an instance with two GPUs and the user requested two GPUs then we can use

cdi.k8s.io/peer-pods: nvidia.com/gpu=all
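Putting the pieces together, a hedged sketch of the fully mutated pod under this proposal (the instance type and the annotation key are illustrative, and whether to enumerate devices or use nvidia.com/gpu=all is exactly the open question above):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-kata
  annotations:
      io.katacontainers.config.hypervisor.machine_type: Standard_NC64as_T4_v3
      cdi.k8s.io/peer-pods: "nvidia.com/gpu=0,nvidia.com/gpu=1"
spec:
  runtimeClassName: kata-remote
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"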

@snir911
Contributor

snir911 commented Feb 6, 2025

I've used nvidia-ctk in a oneshot service to create the CDI spec. The odd thing is that when I listed the devices in the VM with nvidia-ctk, it suggested only:

nvidia.com/gpu=0
nvidia.com/gpu=<device id?>
nvidia.com/gpu=all

Hence I assumed nvidia.com/gpu=all will cover all cases.

@mythi
Contributor

mythi commented Feb 7, 2025

@mythi It is necessary. You need to mount the libs/binaries into the container; this is what the hook did in the past, and now it needs to be done in the VM.

OK. My thinking was just that, since this is all static for peer-pod podvms, perhaps the config.json edits could also be done on the host side, using an NRI device-injector-like setup that uses CDI devices based on the known instance types.

resources:
  limits:
    nvidia.com/cgpu: 2

IMO, this is not consistent with the non-GPU peer-pods deployments where the instance type selection defines the available resources.

@zvonkok
Member Author

zvonkok commented Feb 11, 2025

What do you mean by "host"? The CDI specs are dependent on various factors in the VM that cannot be deduced upfront.

@zvonkok
Member Author

zvonkok commented Feb 11, 2025

IMO, this is not consistent with the non-GPU peer-pods deployments where the instance type selection defines the available resources.

The non-GPU peer-pods deployments are selecting an instance type? Does that mean requests/limits in peer-pods do not mean anything?

@mythi
Contributor

mythi commented Feb 11, 2025

What do you mean by "host"? The CDI specs are dependent on various factors in the VM that cannot be deduced upfront.

The host is the worker node, according to the peer-pods docs.

IMO, this is not consistent with the non-GPU peer-pods deployments where the instance type selection defines the available resources.

The non-GPU peer-pods deployments are selecting an instance type? Does that mean requests/limits in peer-pods do not mean anything?

Looks like I had some old info. Anyway, the instance type is fundamentally selected based on known annotations for vcpu/memory and some heuristics. There's a webhook that helps to set those annotations and removes any requests/limits. With GPUs added, it'd probably be simplest for the user to just go with the instance type selection directly and not rely on any fake resources that do not exist on any host/node.
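For reference, a hedged sketch of what a non-GPU peer pod looks like after the webhook has stripped requests/limits and set the sizing annotations (the annotation names are Kata's hypervisor annotations; the values, pod name, and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
      io.katacontainers.config.hypervisor.default_vcpus: "4"
      io.katacontainers.config.hypervisor.default_memory: "8192"
spec:
  runtimeClassName: kata-remote
  containers:
  - name: app
    image: registry.example.com/app:latest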

@zvonkok
Member Author

zvonkok commented Feb 13, 2025

You cannot handle the CDI spec on the host side, i.e., as part of host-side config.json edits. You need to generate the CDI spec inside the VM.
