NCCL test on AKS NDmv4 VM

This write-up aims to replicate the blog Deploy NDm_v4 (A100) Kubernetes Cluster by Cormac Garvey. The original blog assumes you already have an existing ACR.

All of the following commands run on your local laptop, except for the NCCL Docker container creation step, which needs to run on an NDmv4 VM.

Log in to your Azure account

az login
az account set -s YourSubscription

Add the AKS preview extension and enable InfiniBand support

az extension add --name aks-preview
az feature register --name AKSInfinibandSupport --namespace Microsoft.ContainerService
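
Feature registration can take a few minutes. Optionally, before creating the cluster, you can check that the flag shows as Registered and refresh the resource provider, for example:

az feature show --namespace Microsoft.ContainerService --name AKSInfinibandSupport --query properties.state -o tsv
az provider register --namespace Microsoft.ContainerService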

Define environment variables

export AKS_RG='JZ-AKS'
export LOCATION='southcentralus'
export NODE_RG='JZ-AKSnode'
export AKS_NAME='JZ-akscluster'
export AGENT_POOL_NAME='jzpool' # lowercase letters and numbers only
export ACR_NAME='jzacr2' # lowercase letters and numbers only
export NDMv4_POOL_NAME='jzndmv4' # lowercase letters and numbers only

Create a resource group

az group create --resource-group $AKS_RG --location $LOCATION

Create Azure Container Registry (ACR)

az acr create --resource-group $AKS_RG --name $ACR_NAME --sku Standard

Without this step, the following create-AKS-cluster command with --attach-acr will fail.

Create the NCCL container (this step needs to be done on an NDmv4 VM, not your local environment)

Log in to ACR

az login
az account set -s YourSubscription
az acr login -n $ACR_NAME # az acr login -n jzacr2; DO NOT use the full "loginServer" name: "jzacr2.azurecr.io"

Create the first file, nccl-tests.sh, and make it executable with chmod +x nccl-tests.sh

#!/bin/bash

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/mpi

Create the second file, ndv4-topo.xml

<system version="1">
  <cpu numaid="0" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:01.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="0001:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0101:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0002:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0102:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
  <cpu numaid="1" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:02.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="0003:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0103:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0004:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0104:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
  <cpu numaid="2" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
      <pci busid="ffff:ff:03.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="000b:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0105:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="000c:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0106:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
  <cpu numaid="3" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:04.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="000d:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0107:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="000e:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0108:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
</system>

Create the third file, Dockerfile

ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:23.03-py3

FROM ${FROM_IMAGE_NAME}

RUN apt-get update && apt-get install -y \
      build-essential \
      infiniband-diags \
      openssh-server \
      kmod
COPY nccl-tests.sh .
RUN ./nccl-tests.sh
COPY ndv4-topo.xml .

Put the above three files in the same directory, then build the image and push it to ACR.

docker build -t jzacr2.azurecr.io/pytorch_nccl_tests_2303 .
docker push jzacr2.azurecr.io/pytorch_nccl_tests_2303:latest
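
The commands above hard-code the jzacr2 registry. If your ACR name is different, one option is to look up the login server from the $ACR_NAME variable defined earlier (a sketch, not required):

ACR_LOGIN_SERVER=$(az acr show -n $ACR_NAME --query loginServer -o tsv)
docker build -t ${ACR_LOGIN_SERVER}/pytorch_nccl_tests_2303:latest .
docker push ${ACR_LOGIN_SERVER}/pytorch_nccl_tests_2303:latest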

Create the AKS cluster (now back on your local laptop)

az aks create \
      -g $AKS_RG \
      --node-resource-group $NODE_RG \
      -n $AKS_NAME \
      --enable-managed-identity \
      --node-count 2 \
      --generate-ssh-keys \
      -l $LOCATION \
      --node-vm-size Standard_D2s_v3 \
      --nodepool-name $AGENT_POOL_NAME \
      --os-sku Ubuntu \
      --attach-acr $ACR_NAME
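
Cluster creation takes several minutes. Before adding the GPU pool, you can optionally confirm it finished, for example:

az aks show -g $AKS_RG -n $AKS_NAME --query provisioningState -o tsv # should print Succeeded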

Add a node pool

az aks nodepool add --resource-group $AKS_RG --cluster-name $AKS_NAME --name $NDMv4_POOL_NAME --node-count 1 --node-vm-size Standard_ND96amsr_A100_v4 --node-osdisk-size 128 --os-sku Ubuntu --tags SkipGPUDriverInstallation=true
or
az aks nodepool add --resource-group $AKS_RG --cluster-name $AKS_NAME --name $NDMv4_POOL_NAME --node-count 1 --node-vm-size Standard_ND96amsr_A100_v4 --node-osdisk-size 128 --os-sku Ubuntu --tags SkipGPUDriverInstall=true

Note: it still needs to be verified which tag name is correct. The blog uses the second one (SkipGPUDriverInstall); I tested the first one (SkipGPUDriverInstallation) and it worked.
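
To see which tag actually landed on the pool, you can query it afterwards, for example:

az aks nodepool show --resource-group $AKS_RG --cluster-name $AKS_NAME --name $NDMv4_POOL_NAME --query tags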

Save the credentials to your local config file

$ az aks get-credentials --overwrite-existing --resource-group $AKS_RG --name $AKS_NAME
Merged "JZ-akscluster" as current context in /home/jingchao/.kube/config

Check the created nodes

$ kubectl get nodes
NAME                              STATUS   ROLES   AGE    VERSION
aks-jzndmv4-29195301-vmss000000   Ready    agent   135m   v1.26.6
aks-jzpool-33093035-vmss000000    Ready    agent   153m   v1.26.6
aks-jzpool-33093035-vmss000001    Ready    agent   153m   v1.26.6

Install GPU and network drivers

Save the following script as driver.sh, then execute it with bash driver.sh

#!/bin/bash

# Create the namespace for the NVIDIA operators
kubectl get namespace nvidia-operator 2>/dev/null || kubectl create namespace nvidia-operator

# Install node feature discovery
helm upgrade -i --wait \
  -n nvidia-operator node-feature-discovery node-feature-discovery \
  --repo https://kubernetes-sigs.github.io/node-feature-discovery/charts \
  --set-json master.nodeSelector='{"kubernetes.azure.com/mode": "system"}' \
  --set-json worker.nodeSelector='{"kubernetes.azure.com/accelerator": "nvidia"}' \
  --set-json worker.config.sources.pci.deviceClassWhitelist='["02","03","0200","0207"]' \
  --set-json worker.config.sources.pci.deviceLabelFields='["vendor"]'

# Install the network-operator
helm upgrade -i --wait \
  -n nvidia-operator network-operator network-operator \
  --repo https://helm.ngc.nvidia.com/nvidia \
  --set deployCR=true \
  --set nfd.enabled=false \
  --set ofedDriver.deploy=true \
  --set rdmaSharedDevicePlugin.deploy=false \
  --set secondaryNetwork.deploy=true \
  --set secondaryNetwork.ipamPlugin.deploy=true \
  --set secondaryNetwork.ipoib.deploy=true \
  --set secondaryNetwork.multus.deploy=true \
  --set sriovDevicePlugin.deploy=true \
  --set-json sriovDevicePlugin.resources='[{"name":"mlnxnics","linkTypes": ["infiniband"], "vendors":["15b3"]}]'
# Note: use --set ofedDriver.version="<MOFED VERSION>"
#       to install a specific MOFED version
#
# Install the gpu-operator
helm upgrade -i --wait \
  -n nvidia-operator gpu-operator gpu-operator \
  --repo https://helm.ngc.nvidia.com/nvidia \
  --set nfd.enabled=false \
  --set driver.enabled=true \
  --set driver.version="525.60.13" \
  --set driver.rdma.enabled=true \
  --set toolkit.enabled=true

# Apply the hostdev-net configuration for Infiniband
cat <<EOF | kubectl apply -f -
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "mlnxnics"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "100.127.0.0/16",
      "exclude": [],
      "log_file" : "/var/log/whereabouts.log",
      "log_level" : "info"
    }
EOF
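
The MOFED and GPU driver installation by the operators can take a while. Before checking the node labels, you may want to confirm that all operator pods have come up, for example:

kubectl get pods -n nvidia-operator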

Verify the drivers are installed
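
The command below uses $NDmv4_AKS_node, which is the name of the NDmv4 node shown by kubectl get nodes (e.g. aks-jzndmv4-29195301-vmss000000). One way to set it, assuming the node name contains the pool name:

export NDmv4_AKS_node=$(kubectl get nodes -o name | grep $NDMv4_POOL_NAME | head -n 1 | cut -d/ -f2)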

$ kubectl describe node $NDmv4_AKS_node | grep -e "nvidia.com/mlnxnics" -e "nvidia.com/gpu"
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=0
                    nvidia.com/gpu.count=8
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.mig-manager=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=Virtual-Machine
                    nvidia.com/gpu.memory=81920
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
                    nvidia.com/gpu.replicas=1
                    nvidia.com/gpu-driver-upgrade-enabled: true
  nvidia.com/gpu:       8
  nvidia.com/mlnxnics:  8
  nvidia.com/gpu:       8
  nvidia.com/mlnxnics:  8
  nvidia.com/gpu       0           0
  nvidia.com/mlnxnics  0           0

Install Volcano Kubernetes scheduler

$ kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/release-1.7/installer/volcano-development.yaml
$ kubectl get all -n volcano-system
NAME                                      READY   STATUS      RESTARTS   AGE
pod/volcano-admission-7b864f5d49-x8bv9    1/1     Running     0          129m
pod/volcano-admission-init-pb7nr          0/1     Completed   0          129m
pod/volcano-controllers-5d784c876-hxmdz   1/1     Running     0          129m
pod/volcano-scheduler-65fb9b4dd-5pmhm     1/1     Running     0          129m

NAME                                TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE 
service/volcano-admission-service   ClusterIP   10.0.104.73   <none>        443/TCP    129m
service/volcano-scheduler-service   ClusterIP   10.0.8.41     <none>        8080/TCP   129m

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           129m
deployment.apps/volcano-controllers   1/1     1            1           129m
deployment.apps/volcano-scheduler     1/1     1            1           129m

NAME                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/volcano-admission-7b864f5d49    1         1         1       129m
replicaset.apps/volcano-controllers-5d784c876   1         1         1       129m
replicaset.apps/volcano-scheduler-65fb9b4dd     1         1         1       129m

NAME                               COMPLETIONS   DURATION   AGE
job.batch/volcano-admission-init   1/1           8s         129m

Scale GPU nodes to 2

az aks nodepool scale --resource-group $AKS_RG --cluster-name $AKS_NAME --name $NDMv4_POOL_NAME --node-count 2
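
After scaling, confirm the second NDmv4 node has joined and is Ready before submitting the job, for example:

kubectl get nodes | grep $NDMv4_POOL_NAME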

Create a Kubernetes service account with view permissions (used by the MPI master pod to wait for the worker pods)

kubectl create serviceaccount -n default mpi-worker-view
kubectl create rolebinding default-view --namespace default --serviceaccount default:mpi-worker-view --clusterrole view
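
Optionally, verify the binding lets the service account list pods (the launcher's init container relies on this), for example:

kubectl auth can-i list pods --namespace default --as system:serviceaccount:default:mpi-worker-view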

Create the NCCL job

Create the NCCL job file job.yaml with the content below:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: nccl-allreduce-job1
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 1
      name: mpimaster
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          initContainers:
            - command:
                - /bin/bash
                - -c
                - |
                  until [[ "$(kubectl get pod -l volcano.sh/job-name=nccl-allreduce-job1,volcano.sh/task-spec=mpiworker -o json | jq '.items | length')" != 0 ]]; do
                    echo "Waiting for MPI worker pods..."
                    sleep 3
                  done
                  echo "Waiting for MPI worker pods to be ready..."
                  kubectl wait pod -l volcano.sh/job-name=nccl-allreduce-job1,volcano.sh/task-spec=mpiworker --for=condition=Ready --timeout=600s
              image: mcr.microsoft.com/oss/kubernetes/kubectl:v1.26.3
              name: wait-for-workers
          serviceAccount: mpi-worker-view
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  MPI_HOST=$(cat /etc/volcano/mpiworker.host | tr "\n" ",")
                  mkdir -p /var/run/sshd; /usr/sbin/sshd
                  echo "HOSTS: $MPI_HOST"
                  mpirun --allow-run-as-root \
                  -np 16 -npernode 8 \
                  --bind-to numa --map-by ppr:8:node \
                  -hostfile /etc/volcano/mpiworker.host \
                  -x NCCL_DEBUG=info \
                  -x UCX_TLS=tcp \
                  -x NCCL_TOPO_FILE=/workspace/ndv4-topo.xml \
                  -x UCX_NET_DEVICES=eth0 \
                  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
                  -x NCCL_SOCKET_IFNAME=eth0 \
                  -mca coll_hcoll_enable 0 \
                  /workspace/nccl-tests/build/all_reduce_perf -b 8 -f 2 -g 1 -e 8G -c 1 \
                  | tee /home/re
              image: jzacr2.azurecr.io/pytorch_nccl_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
              name: mpimaster
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /workspace
              resources:
                requests:
                  cpu: 1
          restartPolicy: OnFailure
    - replicas: 2
      name: mpiworker
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: hostdev-net,hostdev-net,hostdev-net,hostdev-net,hostdev-net,hostdev-net,hostdev-net,hostdev-net
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: jzacr2.azurecr.io/pytorch_nccl_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
              name: mpiworker
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /workspace
              resources:
                requests:
                  nvidia.com/gpu: 8
                  nvidia.com/mlnxnics: 8
                limits:
                  nvidia.com/gpu: 8
                  nvidia.com/mlnxnics: 8
              volumeMounts:
              - mountPath: /dev/shm
                name: shm
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 0
          volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 8Gi
---

Note: there are two occurrences of jzacr2.azurecr.io/pytorch_nccl_tests_2303:latest in the above manifest; this is the NCCL container you pushed to your ACR. Edit them to point at your own image before proceeding.
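
If you prefer not to edit the file by hand, a quick substitution using the $ACR_NAME variable (assuming GNU sed) could look like:

sed -i "s|jzacr2.azurecr.io|${ACR_NAME}.azurecr.io|g" job.yaml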

Submit the NCCL job

$ kubectl apply -f job.yaml 
job.batch.volcano.sh/nccl-allreduce-job1 created

Get the pod names

$ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
nccl-allreduce-job1-mpimaster-0   1/1     Running   0          16s
nccl-allreduce-job1-mpiworker-0   1/1     Running   0          16s
nccl-allreduce-job1-mpiworker-1   1/1     Running   0          16s

Check the NCCL test output

$ kubectl logs -f nccl-allreduce-job1-mpimaster-0
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
nccl-allreduce-job1-mpiworker-1:57:214 [7] NCCL INFO comm 0x55c13a6640d0 rank 15 nranks 16 cudaDev 7 busId e00000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:51:174 [3] NCCL INFO comm 0x5586358b6c10 rank 11 nranks 16 cudaDev 3 busId 400000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:49:169 [1] NCCL INFO comm 0x5590048f9910 rank 9 nranks 16 cudaDev 1 busId 200000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:53:231 [5] NCCL INFO comm 0x564e5f765c40 rank 13 nranks 16 cudaDev 5 busId c00000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:54:194 [6] NCCL INFO comm 0x564950a5b020 rank 14 nranks 16 cudaDev 6 busId d00000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:50:212 [2] NCCL INFO comm 0x555b01ca9170 rank 10 nranks 16 cudaDev 2 busId 300000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:48:168 [0] NCCL INFO comm 0x55a22905c240 rank 8 nranks 16 cudaDev 0 busId 100000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:52:197 [4] NCCL INFO comm 0x55567f894360 rank 12 nranks 16 cudaDev 4 busId b00000 commId 0x46cbea2567b59372 - Init COMPLETE
           8             2     float     sum      -1    37.30    0.00    0.00      0    34.44    0.00    0.00      0
          16             4     float     sum      -1    36.03    0.00    0.00      0    33.94    0.00    0.00      0
          32             8     float     sum      -1    36.50    0.00    0.00      0    33.57    0.00    0.00      0
          64            16     float     sum      -1    36.33    0.00    0.00      0    33.99    0.00    0.00      0
         128            32     float     sum      -1    37.62    0.00    0.01      0    34.42    0.00    0.01      0
         256            64     float     sum      -1    38.28    0.01    0.01      0    34.77    0.01    0.01      0
         512           128     float     sum      -1    38.20    0.01    0.03      0    35.15    0.01    0.03      0
        1024           256     float     sum      -1    40.92    0.03    0.05      0    37.37    0.03    0.05      0
        2048           512     float     sum      -1    42.87    0.05    0.09      0    39.49    0.05    0.10      0
        4096          1024     float     sum      -1    41.82    0.10    0.18      0    40.85    0.10    0.19      0
        8192          2048     float     sum      -1    46.31    0.18    0.33      0    42.78    0.19    0.36      0
       16384          4096     float     sum      -1    58.10    0.28    0.53      0    55.03    0.30    0.56      0
       32768          8192     float     sum      -1    58.73    0.56    1.05      0    56.11    0.58    1.09      0
       65536         16384     float     sum      -1    60.01    1.09    2.05      0    59.40    1.10    2.07      0
      131072         32768     float     sum      -1    63.71    2.06    3.86      0    63.33    2.07    3.88      0
      262144         65536     float     sum      -1    68.25    3.84    7.20      0    68.67    3.82    7.16      0
      524288        131072     float     sum      -1    80.23    6.54   12.25      0    79.70    6.58   12.33      0
     1048576        262144     float     sum      -1    96.39   10.88   20.40      0    96.73   10.84   20.33      0
     2097152        524288     float     sum      -1    128.6   16.31   30.59      0    127.8   16.41   30.77      0
     4194304       1048576     float     sum      -1    148.1   28.32   53.11      0    146.5   28.62   53.67      0
     8388608       2097152     float     sum      -1    211.1   39.74   74.51      0    207.8   40.37   75.70      0
    16777216       4194304     float     sum      -1    333.4   50.32   94.35      0    330.8   50.72   95.10      0
    33554432       8388608     float     sum      -1    615.6   54.51  102.21      0    626.3   53.58  100.45      0
    67108864      16777216     float     sum      -1    932.6   71.96  134.92      0    929.6   72.19  135.36      0
   134217728      33554432     float     sum      -1   1672.7   80.24  150.45      0   1676.3   80.07  150.13      0
   268435456      67108864     float     sum      -1   3013.5   89.08  167.02      0   3004.6   89.34  167.52      0
   536870912     134217728     float     sum      -1   5702.0   94.15  176.54      0   5705.8   94.09  176.42      0
  1073741824     268435456     float     sum      -1    11063   97.05  181.98      0    11089   96.83  181.56      0
  2147483648     536870912     float     sum      -1    21637   99.25  186.10      0    21673   99.09  185.79      0
  4294967296    1073741824     float     sum      -1    42758  100.45  188.34      0    42779  100.40  188.25      0
  8589934592    2147483648     float     sum      -1    85129  100.90  189.20      0    85091  100.95  189.28      0
nccl-allreduce-job1-mpiworker-1:51:51 [3] NCCL INFO comm 0x5586358b6c10 rank 11 nranks 16 cudaDev 3 busId 400000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:51:51 [3] NCCL INFO comm 0x563b9a846840 rank 3 nranks 16 cudaDev 3 busId 400000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-1:57:57 [7] NCCL INFO comm 0x55c13a6640d0 rank 15 nranks 16 cudaDev 7 busId e00000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-1:53:53 [5] NCCL INFO comm 0x564e5f765c40 rank 13 nranks 16 cudaDev 5 busId c00000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:50:50 [2] NCCL INFO comm 0x55ce61480260 rank 2 nranks 16 cudaDev 2 busId 300000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:52:52 [4] NCCL INFO comm 0x5632e283bb30 rank 4 nranks 16 cudaDev 4 busId b00000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:48:48 [0] NCCL INFO comm 0x55d407b24020 rank 0 nranks 16 cudaDev 0 busId 100000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-1:50:50 [2] NCCL INFO comm 0x555b01ca9170 rank 10 nranks 16 cudaDev 2 busId 300000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:55:55 [6] NCCL INFO comm 0x55dc04852d60 rank 6 nranks 16 cudaDev 6 busId d00000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:49:49 [1] NCCL INFO comm 0x555ead805480 rank 1 nranks 16 cudaDev 1 busId 200000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-1:48:48 [0] NCCL INFO comm 0x55a22905c240 rank 8 nranks 16 cudaDev 0 busId 100000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:56:56 [7] NCCL INFO comm 0x556f8d65b050 rank 7 nranks 16 cudaDev 7 busId e00000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-1:49:49 [1] NCCL INFO comm 0x5590048f9910 rank 9 nranks 16 cudaDev 1 busId 200000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-1:54:54 [6] NCCL INFO comm 0x564950a5b020 rank 14 nranks 16 cudaDev 6 busId d00000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:53:53 [5] NCCL INFO comm 0x556afbcbdc10 rank 5 nranks 16 cudaDev 5 busId c00000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 57.347
#
nccl-allreduce-job1-mpiworker-1:52:52 [4] NCCL INFO comm 0x55567f894360 rank 12 nranks 16 cudaDev 4 busId b00000 - Destroy COMPLETE

If you see ~189 GB/s bus bandwidth at the largest message sizes, you are done with this exercise.
