Interconnecting GPUs in Managed Service for Kubernetes® clusters using InfiniBand™
To accelerate the ML, AI, and high-performance computing (HPC) workloads that you run on GPUs in your Managed Service for Kubernetes clusters, you can interconnect the GPUs using InfiniBand, a high-throughput, low-latency networking standard. For more details about InfiniBand in Nebius AI Cloud, see the Compute documentation.
In this article, you will learn how to set up InfiniBand in a Managed Kubernetes cluster.
How to enable InfiniBand for a node group
Note
Create Managed Service for Kubernetes node groups in the same project as the cluster.
Warning
Nodes with GPUs that you create in the web console do not have GPU drivers and other components installed. After creating the nodes, install the drivers and components by using the Nebius AI Cloud CLI.
In the node group creation form (Managed Kubernetes® → your cluster → Node groups → Create node group), under Computing resources:
- Select With GPU.
- Select a platform and a preset compatible with GPU clusters. The compatible platforms and presets are:

| Platform | Presets | Region |
|---|---|---|
| NVIDIA® H100 NVLink with Intel Sapphire Rapids (`gpu-h100-sxm`) | `8gpu-128vcpu-1600gb` | `eu-north1` |
| NVIDIA® H200 NVLink with Intel Sapphire Rapids (`gpu-h200-sxm`) | `8gpu-128vcpu-1600gb` | `eu-north1`, `eu-west1` |
- Select a GPU cluster or create one. If the field is inactive, make sure you have selected a compatible platform and preset.
To set up InfiniBand by using the Nebius AI Cloud CLI:

- Depending on your project’s region, select an InfiniBand fabric and save it to an environment variable:

```bash
export INFINIBAND_FABRIC=fabric-<2|3|4|5>
```
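For example, the assignment is a single export; `fabric-2` below is only an illustrative choice, as the fabrics available to you depend on your project’s region:

```bash
# Illustrative only: substitute the fabric available in your project's region.
export INFINIBAND_FABRIC=fabric-2
```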
- Create a GPU cluster and save its ID:

```bash
export NB_GPU_CLUSTER_ID=$(nebius compute gpu-cluster create \
  --name gpu-cluster-name \
  --infiniband-fabric $INFINIBAND_FABRIC \
  --format json \
  | jq -r ".metadata.id")
```
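If you want to verify the result before proceeding, a quick sanity check is to read the ID back and fetch the cluster it points to; the `get` verb here is an assumption based on the CLI’s usual resource commands:

```bash
# Print the saved ID and fetch the cluster it points to
# (assumes the CLI supports `gpu-cluster get` with --id).
echo $NB_GPU_CLUSTER_ID
nebius compute gpu-cluster get --id $NB_GPU_CLUSTER_ID
```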
- Create a node group with GPUs and specify the GPU cluster ID in its parameters by using the `nebius mk8s node-group create` command:

```bash
nebius mk8s node-group create \
  --template-resources-platform gpu-h100-sxm \
  --template-resources-preset 8gpu-128vcpu-1600gb \
  --template-gpu-cluster-id $NB_GPU_CLUSTER_ID \
  --template-gpu-settings-drivers-preset cuda12 \
  ...
```
- In `--template-gpu-cluster-id`, specify the GPU cluster ID.
- In `--template-resources-platform`, specify a platform with GPUs; in `--template-resources-preset`, specify a compatible preset (number of GPUs and vCPUs, RAM size). For the compatible platforms and presets, see the table above.
- In `--template-gpu-settings-drivers-preset`, specify `cuda12` to use a boot disk image that contains drivers and other components for GPUs. Without this image, you will need to install the drivers and components manually. For more details, see GPU drivers and other components.
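Once the node group is up, you can confirm from Kubernetes that the nodes expose their GPUs and that the InfiniBand devices are visible. This is a minimal sketch: `<pod-name>` is a placeholder for any pod running on a GPU node, and `ibv_devinfo` requires the rdma-core tools in the container image:

```bash
# List nodes with their allocatable GPU count.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# From a pod on a GPU node, list the visible InfiniBand devices.
kubectl exec -it <pod-name> -- ibv_devinfo
```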
Example: NCCL tests
To test InfiniBand performance in a Managed Service for Kubernetes cluster, you can run the NVIDIA NCCL tests. For instructions, see our tutorial.
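As a rough illustration of what such a run looks like (the tutorial covers the actual deployment in the cluster), the `all_reduce_perf` binary from the nccl-tests suite sweeps message sizes across the GPUs of a node:

```bash
# Sweep all-reduce from 512 MB to 8 GB, doubling the size each step (-f 2),
# across 8 GPUs; NCCL uses the InfiniBand transport when RDMA devices are available.
./all_reduce_perf -b 512M -e 8G -f 2 -g 8
```

For a multi-node run over InfiniBand, the same binary is typically launched with MPI, one process per node.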
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.