<h1 id="hpc-network-technologies">HPC Network Technologies</h1>
<p>Posted 2023-09-14 by Jingchao Zhang (jingczhang@microsoft.com) at <a href="https://jingchaozhang.github.io/HPC%20network%20technologies">jingchaozhang.github.io</a></p>
<p>In the rapidly evolving landscape of High-Performance Computing (HPC) and Artificial Intelligence (AI), understanding the nuances between various networking protocols and libraries is crucial for performance optimization and system design. This blog aims to demystify key technologies such as RDMA, RoCE, TCP/IP, IPoIB, InfiniBand, NCCL, and MPI by categorizing them according to the OSI model layers at which they operate. By doing so, we provide a structured framework that aids in grasping how these technologies interact and complement one another in real-world applications. Whether you are an enterprise architect, a developer, or a researcher looking to harness the full potential of HPC and AI, this comprehensive guide will serve as a valuable reference point.</p>
<h2 id="categorized-by-osi-layers">Categorized by OSI layers</h2>
<table>
<thead>
<tr>
<th>Layer</th>
<th>RDMA</th>
<th>RoCE</th>
<th>TCP/IP</th>
<th>IPoIB</th>
<th>IB</th>
<th>NCCL</th>
<th>MPI</th>
</tr>
</thead>
<tbody>
<tr>
<td>Physical (1)</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>Yes</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Data Link (2)</td>
<td>-</td>
<td>Yes</td>
<td>-</td>
<td>-</td>
<td>Yes</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Network (3)</td>
<td>-</td>
<td>-</td>
<td>Yes</td>
<td>Yes</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Transport (4)</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Application (7)</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>Yes</td>
<td>Yes</td>
</tr>
</tbody>
</table>
<h2 id="categorized-by-function">Categorized by function</h2>
<table>
<thead>
<tr>
<th>Term</th>
<th>Full Form</th>
<th>Layer</th>
<th>Description</th>
<th>Use-Cases</th>
<th>Compatibility/Co-existence</th>
</tr>
</thead>
<tbody>
<tr>
<td>RDMA</td>
<td>Remote Direct Memory Access</td>
<td>Transport</td>
<td>Direct memory access from one computer into another without involving either’s OS</td>
<td>HPC, Data Transfer</td>
<td>Can be used over IB or Ethernet (RoCE)</td>
</tr>
<tr>
<td>RoCE</td>
<td>RDMA over Converged Ethernet</td>
<td>Data Link / Transport</td>
<td>An extension of RDMA, it allows RDMA to run over Ethernet networks</td>
<td>Data Center, Cloud Networking</td>
<td>Ethernet-based</td>
</tr>
<tr>
<td>TCP/IP</td>
<td>Transmission Control Protocol/IP</td>
<td>Network / Transport</td>
<td>The standard protocol suite of the Internet, providing reliable, ordered delivery over packet-switched IP networks</td>
<td>General Internet, Web Services</td>
<td>Most Networks</td>
</tr>
<tr>
<td>IPoIB</td>
<td>IP over InfiniBand</td>
<td>Network</td>
<td>Allows the transmission of IP traffic over InfiniBand, making it compatible with existing IP-based applications</td>
<td>IP Services on IB network</td>
<td>Shares InfiniBand port</td>
</tr>
<tr>
<td>IB</td>
<td>InfiniBand</td>
<td>Physical / Data Link</td>
<td>High-throughput, low-latency networking stack, commonly used in HPC</td>
<td>HPC, Data Centers</td>
<td>Usually a dedicated port</td>
</tr>
<tr>
<td>NCCL</td>
<td>NVIDIA Collective Communications Library</td>
<td>Application</td>
<td>Optimized primitives library for collective communications in multi-GPU environments</td>
<td>Deep Learning, AI Training</td>
<td>Can work over IB, RoCE, or even TCP/IP</td>
</tr>
<tr>
<td>MPI</td>
<td>Message Passing Interface</td>
<td>Application</td>
<td>A standardized and portable API used for parallel computing, operates over various kinds of networks</td>
<td>High-Performance Computing, parallelized applications</td>
<td>Can work over IB, RoCE, TCP/IP, and more</td>
</tr>
</tbody>
</table>
<ul>
<li>
<p><strong>RDMA</strong>: This is the foundation for zero-copy networking, offering lower latency and higher bandwidth than conventional kernel-mediated networking such as TCP/IP.</p>
</li>
<li>
<p><strong>RoCE</strong>: It’s RDMA adapted for Ethernet. It’s useful in modern data center applications where you might not have InfiniBand but still want low latency.</p>
</li>
<li>
<p><strong>TCP/IP</strong>: This is the most commonly used protocol stack and is generally slower and more resource-intensive than RDMA or RoCE.</p>
</li>
<li>
<p><strong>IPoIB</strong>: It’s a way to map IP over InfiniBand so that you can run IP-based applications without modification. It’s generally slower than native InfiniBand but offers compatibility.</p>
</li>
<li>
<p><strong>InfiniBand (IB)</strong>: This is a high-performance network protocol that uses high-throughput and low-latency networking technologies. Typically, InfiniBand will have its own dedicated port, but it can share a port if running IPoIB.</p>
</li>
<li>
<p><strong>NCCL</strong>: This is a library for collective communication that’s particularly useful in multi-GPU setups for machine learning. It’s protocol agnostic to an extent and can work over InfiniBand, RoCE, or even TCP/IP if necessary.</p>
</li>
<li>
<p><strong>MPI (Message Passing Interface)</strong>: MPI is an application-layer API that allows for high-performance communication between nodes in a parallel computing environment. Unlike the other technologies listed, which are more focused on networking layers, MPI operates at the application layer and can be used on top of multiple kinds of networking technologies including InfiniBand, RoCE, and TCP/IP.</p>
</li>
</ul>
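<p>Because NCCL sits at the application layer, the transport it rides on can be steered at run time through environment variables (variable names per NVIDIA’s NCCL documentation). A minimal sketch, assuming a host whose Ethernet interface is named <code>eth0</code> (a hypothetical name; adjust to your NIC):</p>

```shell
# Force NCCL onto plain TCP/IP sockets, e.g. on a host without RDMA NICs
export NCCL_IB_DISABLE=1        # 1 = skip the InfiniBand/RoCE transport
export NCCL_SOCKET_IFNAME=eth0  # hypothetical interface name; adjust to your NIC
export NCCL_DEBUG=INFO          # log which transport NCCL actually selected

# To prefer InfiniBand/RoCE instead, leave NCCL_IB_DISABLE unset (default 0)
echo "IB disabled: $NCCL_IB_DISABLE, socket ifname: $NCCL_SOCKET_IFNAME"
```

<p>On an IB-equipped cluster you would normally leave these unset and let NCCL pick the fastest available path.</p>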
<h2 id="network-interfaces">Network interfaces</h2>
<p>Understanding the network interfaces used by InfiniBand, RoCE, and TCP/IP is crucial for their effective deployment and operation. Below is a brief explanation:</p>
<h3 id="infiniband-ib">InfiniBand (IB)</h3>
<ul>
<li><strong>Network Interface</strong>: InfiniBand Host Channel Adapter (HCA)</li>
<li><strong>Details</strong>: InfiniBand uses its own specialized network interfaces known as <a href="https://www.google.com/search?sca_esv=565545338&rlz=1C1CHBF_enUS1013US1013&sxsrf=AM9HkKnaY_cAe3tG3Uf77OinaP3Wgu8Qxg:1694748959609&q=InfiniBand+Host+Channel+Adapter+(HCA)&tbm=isch&source=lnms&sa=X&ved=2ahUKEwif4uHt16uBAxU2gGoFHdIsBv4Q0pQJegQIChAB&biw=2048&bih=995&dpr=1.25">HCAs</a>. These are different from standard Ethernet NICs (Network Interface Cards). HCAs are designed to provide low-latency and high-throughput communication.</li>
</ul>
<h3 id="rdma-over-converged-ethernet-roce">RDMA over Converged Ethernet (RoCE)</h3>
<ul>
<li><strong>Network Interface</strong>: Converged Network Adapter (CNA) or RDMA-enabled NIC</li>
<li><strong>Details</strong>: RoCE often uses <a href="https://www.google.com/search?q=Converged+Network+Adapter+(CNA)&tbm=isch&ved=2ahUKEwiXgdnu16uBAxUHAWIAHRX5CLcQ2-cCegQIABAA&oq=Converged+Network+Adapter+(CNA)&gs_lcp=CgNpbWcQAzIFCAAQgAQ6BAgjECdQ0gRY0gRgxAdoAHAAeACAAW2IAcsBkgEDMS4xmAEAoAEBqgELZ3dzLXdpei1pbWfAAQE&sclient=img&ei=IdEDZdfsIYeCiLMPlfKjuAs&bih=995&biw=2048&rlz=1C1CHBF_enUS1013US1013">Converged Network Adapters</a> (CNAs) that support both RDMA and traditional Ethernet communications. These adapters can also be RDMA-enabled NICs specifically optimized for RDMA over Ethernet.</li>
</ul>
<h3 id="tcpip">TCP/IP</h3>
<ul>
<li><strong>Network Interface</strong>: Ethernet Network Interface Card (NIC)</li>
<li><strong>Details</strong>: The standard network interface for TCP/IP-based communication is an <a href="https://www.bing.com/images/search?q=Ethernet+NIC&form=HDRSC4&first=1">Ethernet NIC</a>. These are ubiquitous and come in various speeds like Gigabit Ethernet, 10 Gigabit Ethernet, etc.</li>
</ul>
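<p>A quick way to see which of these interfaces a Linux host actually has is to look at the sysfs trees the kernel populates: <code>/sys/class/net</code> lists every netdev (Ethernet NICs, IPoIB interfaces, loopback), while <code>/sys/class/infiniband</code> lists RDMA devices (IB HCAs or RoCE-capable NICs). A minimal sketch:</p>

```shell
# Netdevs: Ethernet NICs, IPoIB interfaces, loopback, etc.
netdevs=$(ls /sys/class/net 2>/dev/null)
# RDMA devices: InfiniBand HCAs or RoCE-capable NICs (empty on plain-Ethernet hosts)
rdma_devs=$(ls /sys/class/infiniband 2>/dev/null)
echo "netdevs: ${netdevs:-none}"
echo "rdma devices: ${rdma_devs:-none}"
# With the OFED tools installed, ibstat and ibv_devinfo report per-port details,
# e.g.: ibv_devinfo | grep -e hca_id -e link_layer
```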
<p>It’s important to note that each of these network interfaces is optimized for the particular protocol stack it is designed to support. While you can run different protocols over the same physical infrastructure (for example, RoCE and TCP/IP over Ethernet), the network interface card must support those protocols for them to operate efficiently.</p>
<h1 id="nccl-test-on-aks-ndmv4-vm">NCCL Test on AKS NDmv4 VM</h1>
<p>Posted 2023-09-11 at <a href="https://jingchaozhang.github.io/NCCL%20test%20on%20AKS%20NDmV4%20VM">jingchaozhang.github.io</a></p>
<p>This write-up aims to replicate the blog <a href="https://techcommunity.microsoft.com/t5/azure-high-performance-computing/deploy-ndm-v4-a100-kubernetes-cluster/ba-p/3838871">Deploy NDm_v4 (A100) Kubernetes Cluster</a> by <a href="https://techcommunity.microsoft.com/t5/user/viewprofilepage/user-id/364170">Cormac Garvey</a>. The original blog assumes you already have an existing ACR.</p>
<p>All of the following commands run on your local laptop, except the NCCL Docker container creation step, which must run on an NDmv4 VM.</p>
<h2 id="login-to-your-az-account">Login to your az account</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az login
az account <span class="nb">set</span> <span class="nt">-s</span> YourSubscription
</code></pre></div></div>
<h2 id="add-aks-extension-and-enable-ib">Add AKS extension and enable IB</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az extension add <span class="nt">--name</span> aks-preview
az feature register <span class="nt">--name</span> AKSInfinibandSupport <span class="nt">--namespace</span> Microsoft.ContainerService
</code></pre></div></div>
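<p>Feature registration is not instantaneous. A hedged sketch of checking its state before proceeding (requires the <code>az</code> CLI and an active login; the query path is standard <code>az feature show</code> output):</p>

```shell
if command -v az >/dev/null 2>&1; then
  # Reports NotRegistered / Registering / Registered
  state=$(az feature show --name AKSInfinibandSupport \
            --namespace Microsoft.ContainerService \
            --query properties.state -o tsv 2>/dev/null || echo unknown)
else
  state="az-not-installed"
fi
echo "AKSInfinibandSupport state: $state"
# Once the state is Registered, propagate it to the provider:
# az provider register --namespace Microsoft.ContainerService
```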
<h2 id="define-environment-variables">Define environment variables</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">AKS_RG</span><span class="o">=</span><span class="s1">'JZ-AKS'</span>
<span class="nb">export </span><span class="nv">LOCATION</span><span class="o">=</span><span class="s1">'southcentralus'</span>
<span class="nb">export </span><span class="nv">NODE_RG</span><span class="o">=</span><span class="s1">'JZ-AKSnode'</span>
<span class="nb">export </span><span class="nv">AKS_NAME</span><span class="o">=</span><span class="s1">'JZ-akscluster'</span>
<span class="nb">export </span><span class="nv">AGENT_POOL_NAME</span><span class="o">=</span><span class="s1">'jzpool'</span> <span class="c">#lower case letter and number only</span>
<span class="nb">export </span><span class="nv">ACR_NAME</span><span class="o">=</span><span class="s1">'jzacr2'</span> <span class="c">#lower case letter and number only</span>
<span class="nb">export </span><span class="nv">NDMv4_POOL_NAME</span><span class="o">=</span><span class="s1">'jzndmv4'</span> <span class="c">#lower case letter and number only</span>
</code></pre></div></div>
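<p>The comments above note that the ACR and pool names must contain lowercase letters and numbers only. A small pre-flight check for that constraint (values copied from the variables above):</p>

```shell
export ACR_NAME='jzacr2'
export AGENT_POOL_NAME='jzpool'
export NDMv4_POOL_NAME='jzndmv4'

# Flag any name containing a character outside [a-z0-9]
valid=yes
for name in "$ACR_NAME" "$AGENT_POOL_NAME" "$NDMv4_POOL_NAME"; do
  case "$name" in
    *[!a-z0-9]*|'') echo "invalid name: '$name'"; valid=no ;;
    *)              echo "ok: $name" ;;
  esac
done
echo "all names valid: $valid"
```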
<h2 id="create-a-resource-group">Create a resource group</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az group create <span class="nt">--resource-group</span> <span class="nv">$AKS_RG</span> <span class="nt">--location</span> <span class="nv">$LOCATION</span>
</code></pre></div></div>
<h2 id="create-azure-container-registry-acr">Create Azure Container Registry (ACR)</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az acr create <span class="nt">--resource-group</span> <span class="nv">$AKS_RG</span> <span class="nt">--name</span> <span class="nv">$ACR_NAME</span> <span class="nt">--sku</span> Standard
</code></pre></div></div>
<p>Without this step, the <code class="language-plaintext highlighter-rouge">az aks create</code> command below with <code class="language-plaintext highlighter-rouge">--attach-acr</code> will fail.</p>
<h2 id="create-nccl-container-this-step-needs-to-be-done-on-a-ndmv4-vm-not-your-local-environment">Create NCCL container (this step needs to be done on a NDmv4 VM, not your local environment)</h2>
<p>Login to ACR</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az login
az account set -s YourSubscription
az acr login -n $ACR_NAME # az acr login -n jzacr2; DO NOT use the full "loginServer" name: "jzacr2.azurecr.io"
</code></pre></div></div>
<p>Create the first file, <code class="language-plaintext highlighter-rouge">nccl-tests.sh</code>, and make it executable with <code class="language-plaintext highlighter-rouge">chmod +x nccl-tests.sh</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
git clone https://github.com/NVIDIA/nccl-tests.git
<span class="nb">cd </span>nccl-tests
make <span class="nv">MPI</span><span class="o">=</span>1 <span class="nv">MPI_HOME</span><span class="o">=</span>/usr/local/mpi
</code></pre></div></div>
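<p>For reference, the <code>make</code> step above builds the benchmark binaries under <code>nccl-tests/build/</code>. A hedged sketch of invoking the all-reduce benchmark later, inside the running container (flags per the nccl-tests README: <code>-b</code>/<code>-e</code> min/max message size, <code>-f</code> size multiplication factor, <code>-g</code> GPUs per process):</p>

```shell
NCCL_BIN=./nccl-tests/build/all_reduce_perf
if [ -x "$NCCL_BIN" ]; then
  status="running"
  # Sweep message sizes from 8 bytes to 4 GB across 8 GPUs
  "$NCCL_BIN" -b 8 -e 4G -f 2 -g 8
else
  status="not built"
  echo "all_reduce_perf not built yet; run nccl-tests.sh first"
fi
```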
<p>Create the second file, <code class="language-plaintext highlighter-rouge">ndv4-topo.xml</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><system <span class="nv">version</span><span class="o">=</span><span class="s2">"1"</span><span class="o">></span>
<cpu <span class="nv">numaid</span><span class="o">=</span><span class="s2">"0"</span> <span class="nv">affinity</span><span class="o">=</span><span class="s2">"0000ffff,0000ffff"</span> <span class="nb">arch</span><span class="o">=</span><span class="s2">"x86_64"</span> <span class="nv">vendor</span><span class="o">=</span><span class="s2">"AuthenticAMD"</span> <span class="nv">familyid</span><span class="o">=</span><span class="s2">"23"</span> <span class="nv">modelid</span><span class="o">=</span><span class="s2">"49"</span><span class="o">></span>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"ffff:ff:01.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x060400"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span><span class="o">></span>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"0001:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x030200"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"0101:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x020700"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"0002:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x030200"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"0102:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x020700"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
</pci>
</cpu>
<cpu <span class="nv">numaid</span><span class="o">=</span><span class="s2">"1"</span> <span class="nv">affinity</span><span class="o">=</span><span class="s2">"0000ffff,0000ffff"</span> <span class="nb">arch</span><span class="o">=</span><span class="s2">"x86_64"</span> <span class="nv">vendor</span><span class="o">=</span><span class="s2">"AuthenticAMD"</span> <span class="nv">familyid</span><span class="o">=</span><span class="s2">"23"</span> <span class="nv">modelid</span><span class="o">=</span><span class="s2">"49"</span><span class="o">></span>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"ffff:ff:02.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x060400"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span><span class="o">></span>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"0003:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x030200"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"0103:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x020700"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"0004:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x030200"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"0104:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x020700"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
</pci>
</cpu>
<cpu <span class="nv">numaid</span><span class="o">=</span><span class="s2">"2"</span> <span class="nv">affinity</span><span class="o">=</span><span class="s2">"0000ffff,0000ffff"</span> <span class="nb">arch</span><span class="o">=</span><span class="s2">"x86_64"</span> <span class="nv">vendor</span><span class="o">=</span><span class="s2">"AuthenticAMD"</span> <span class="nv">familyid</span><span class="o">=</span><span class="s2">"23"</span> <span class="nv">modelid</span><span class="o">=</span><span class="s2">"49"</span><span class="o">></span>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"ffff:ff:03.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x060400"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span><span class="o">></span>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"000b:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x030200"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"0105:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x020700"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"000c:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x030200"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"0106:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x020700"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
</pci>
</cpu>
<cpu <span class="nv">numaid</span><span class="o">=</span><span class="s2">"3"</span> <span class="nv">affinity</span><span class="o">=</span><span class="s2">"0000ffff,0000ffff"</span> <span class="nb">arch</span><span class="o">=</span><span class="s2">"x86_64"</span> <span class="nv">vendor</span><span class="o">=</span><span class="s2">"AuthenticAMD"</span> <span class="nv">familyid</span><span class="o">=</span><span class="s2">"23"</span> <span class="nv">modelid</span><span class="o">=</span><span class="s2">"49"</span><span class="o">></span>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"ffff:ff:04.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x060400"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span><span class="o">></span>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"000d:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x030200"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"0107:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x020700"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"000e:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x030200"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
<pci <span class="nv">busid</span><span class="o">=</span><span class="s2">"0108:00:00.0"</span> <span class="nv">class</span><span class="o">=</span><span class="s2">"0x020700"</span> <span class="nv">link_speed</span><span class="o">=</span><span class="s2">"16 GT/s"</span> <span class="nv">link_width</span><span class="o">=</span><span class="s2">"16"</span>/>
</pci>
</cpu>
</system>
</code></pre></div></div>
<p>Create the third file, <code class="language-plaintext highlighter-rouge">Dockerfile</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ARG <span class="nv">FROM_IMAGE_NAME</span><span class="o">=</span>nvcr.io/nvidia/pytorch:23.03-py3
FROM <span class="k">${</span><span class="nv">FROM_IMAGE_NAME</span><span class="k">}</span>
RUN apt update
RUN apt-get <span class="nt">-y</span> <span class="nb">install </span>build-essential
RUN apt-get <span class="nt">-y</span> <span class="nb">install </span>infiniband-diags
RUN apt-get <span class="nt">-y</span> <span class="nb">install </span>openssh-server
RUN apt-get <span class="nt">-y</span> <span class="nb">install </span>kmod
COPY nccl-tests.sh <span class="nb">.</span>
RUN ./nccl-tests.sh
COPY ndv4-topo.xml <span class="nb">.</span>
</code></pre></div></div>
<p>Put the above three files in the same directory, then build the image and push it to ACR.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker build <span class="nt">-t</span> jzacr2.azurecr.io/pytorch_nccl_tests_2303 <span class="nb">.</span>
docker push jzacr2.azurecr.io/pytorch_nccl_tests_2303:latest
</code></pre></div></div>
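<p>The registry name is hard-coded above; it can instead be derived from the <code>ACR_NAME</code> variable defined earlier so the two stay in sync. A minimal sketch:</p>

```shell
export ACR_NAME='jzacr2'   # from the environment variables defined earlier
# Azure Container Registry login servers follow the <name>.azurecr.io pattern
IMAGE="${ACR_NAME}.azurecr.io/pytorch_nccl_tests_2303:latest"
echo "$IMAGE"
# docker build -t "$IMAGE" .
# docker push "$IMAGE"
```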
<h2 id="create-aks-cluster-now-back-to-your-local-laptop">Create AKS cluster (now back to your local laptop)</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks create <span class="se">\</span>
<span class="nt">-g</span> <span class="nv">$AKS_RG</span> <span class="se">\</span>
<span class="nt">--node-resource-group</span> <span class="nv">$NODE_RG</span> <span class="se">\</span>
<span class="nt">-n</span> <span class="nv">$AKS_NAME</span> <span class="se">\</span>
<span class="nt">--enable-managed-identity</span> <span class="se">\</span>
<span class="nt">--node-count</span> 2 <span class="se">\</span>
<span class="nt">--generate-ssh-keys</span> <span class="se">\</span>
<span class="nt">-l</span> <span class="nv">$LOCATION</span> <span class="se">\</span>
<span class="nt">--node-vm-size</span> Standard_D2s_v3 <span class="se">\</span>
<span class="nt">--nodepool-name</span> <span class="nv">$AGENT_POOL_NAME</span> <span class="se">\</span>
<span class="nt">--os-sku</span> Ubuntu <span class="se">\</span>
<span class="nt">--attach-acr</span> <span class="nv">$ACR_NAME</span>
</code></pre></div></div>
<h2 id="add-a-node-pool">Add a node pool</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks nodepool add <span class="nt">--resource-group</span> <span class="nv">$AKS_RG</span> <span class="nt">--cluster-name</span> <span class="nv">$AKS_NAME</span> <span class="nt">--name</span> <span class="nv">$NDMv4_POOL_NAME</span> <span class="nt">--node-count</span> 1 <span class="nt">--node-vm-size</span> Standard_ND96amsr_A100_v4 <span class="nt">--node-osdisk-size</span> 128 <span class="nt">--os-sku</span> Ubuntu <span class="nt">--tags</span> <span class="nv">SkipGPUDriverInstallation</span><span class="o">=</span><span class="nb">true
</span>or
az aks nodepool add <span class="nt">--resource-group</span> <span class="nv">$AKS_RG</span> <span class="nt">--cluster-name</span> <span class="nv">$AKS_NAME</span> <span class="nt">--name</span> <span class="nv">$NDMv4_POOL_NAME</span> <span class="nt">--node-count</span> 1 <span class="nt">--node-vm-size</span> Standard_ND96amsr_A100_v4 <span class="nt">--node-osdisk-size</span> 128 <span class="nt">--os-sku</span> Ubuntu <span class="nt">--tags</span> <span class="nv">SkipGPUDriverInstall</span><span class="o">=</span><span class="nb">true</span>
</code></pre></div></div>
<p>Note: I need to verify which tag name is correct. The original blog uses the second one (<code class="language-plaintext highlighter-rouge">SkipGPUDriverInstall</code>); I tested the first one (<code class="language-plaintext highlighter-rouge">SkipGPUDriverInstallation</code>), which worked.</p>
<h2 id="save-the-credentials-to-your-local-config-file">Save the credentials to your local config file</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>az aks get-credentials <span class="nt">--overwrite-existing</span> <span class="nt">--resource-group</span> <span class="nv">$AKS_RG</span> <span class="nt">--name</span> <span class="nv">$AKS_NAME</span>
Merged <span class="s2">"JZ-akscluster"</span> as current context <span class="k">in</span> /home/jingchao/.kube/config
</code></pre></div></div>
<h2 id="check-the-created-nodes">Check the created nodes</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-jzndmv4-29195301-vmss000000 Ready agent 135m v1.26.6
aks-jzpool-33093035-vmss000000 Ready agent 153m v1.26.6
aks-jzpool-33093035-vmss000001 Ready agent 153m v1.26.6
</code></pre></div></div>
<h2 id="install-gpu-and-network-drivers">Install GPU and network drivers</h2>
<p>Save the following script as <code class="language-plaintext highlighter-rouge">driver.sh</code>, and execute it with <code class="language-plaintext highlighter-rouge">bash driver.sh</code>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#! /bin/bash</span>
<span class="c"># Apply required manifests</span>
kubectl get namespace nvidia-operator 2>/dev/null <span class="o">||</span> kubectl create namespace nvidia-operator
<span class="c"># Install node feature discovery</span>
helm upgrade <span class="nt">-i</span> <span class="nt">--wait</span> <span class="se">\</span>
<span class="nt">-n</span> nvidia-operator node-feature-discovery node-feature-discovery <span class="se">\</span>
<span class="nt">--repo</span> https://kubernetes-sigs.github.io/node-feature-discovery/charts <span class="se">\</span>
<span class="nt">--set-json</span> master.nodeSelector<span class="o">=</span><span class="s1">'{"kubernetes.azure.com/mode": "system"}'</span> <span class="se">\</span>
<span class="nt">--set-json</span> worker.nodeSelector<span class="o">=</span><span class="s1">'{"kubernetes.azure.com/accelerator": "nvidia"}'</span> <span class="se">\</span>
<span class="nt">--set-json</span> worker.config.sources.pci.deviceClassWhitelist<span class="o">=</span><span class="s1">'["02","03","0200","0207"]'</span> <span class="se">\</span>
<span class="nt">--set-json</span> worker.config.sources.pci.deviceLabelFields<span class="o">=</span><span class="s1">'["vendor"]'</span>
<span class="c"># Install the network-operator</span>
helm upgrade <span class="nt">-i</span> <span class="nt">--wait</span> <span class="se">\</span>
<span class="nt">-n</span> nvidia-operator network-operator network-operator <span class="se">\</span>
<span class="nt">--repo</span> https://helm.ngc.nvidia.com/nvidia <span class="se">\</span>
<span class="nt">--set</span> <span class="nv">deployCR</span><span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">--set</span> nfd.enabled<span class="o">=</span><span class="nb">false</span> <span class="se">\</span>
<span class="nt">--set</span> ofedDriver.deploy<span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">--set</span> rdmaSharedDevicePlugin.deploy<span class="o">=</span><span class="nb">false</span> <span class="se">\</span>
<span class="nt">--set</span> secondaryNetwork.deploy<span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">--set</span> secondaryNetwork.ipamPlugin.deploy<span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">--set</span> secondaryNetwork.ipoib.deploy<span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">--set</span> secondaryNetwork.multus.deploy<span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">--set</span> sriovDevicePlugin.deploy<span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">--set-json</span> sriovDevicePlugin.resources<span class="o">=</span><span class="s1">'[{"name":"mlnxnics","linkTypes": ["infiniband"], "vendors":["15b3"]}]'</span>
<span class="c"># Note: use --set ofedDriver.version="<MOFED VERSION>"</span>
<span class="c"># to install a specific MOFED version</span>
<span class="c">#</span>
<span class="c"># Install the gpu-operator</span>
helm upgrade <span class="nt">-i</span> <span class="nt">--wait</span> <span class="se">\</span>
<span class="nt">-n</span> nvidia-operator gpu-operator gpu-operator <span class="se">\</span>
<span class="nt">--repo</span> https://helm.ngc.nvidia.com/nvidia <span class="se">\</span>
<span class="nt">--set</span> nfd.enabled<span class="o">=</span><span class="nb">false</span> <span class="se">\</span>
<span class="nt">--set</span> driver.enabled<span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">--set</span> driver.version<span class="o">=</span><span class="s2">"525.60.13"</span> <span class="se">\</span>
<span class="nt">--set</span> driver.rdma.enabled<span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">--set</span> toolkit.enabled<span class="o">=</span><span class="nb">true</span>
<span class="c"># Apply the hostdev-net configuration for Infiniband</span>
<span class="nb">cat</span> <span class="o"><<</span><span class="no">EOF</span><span class="sh"> | kubectl apply -f -
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
name: hostdev-net
spec:
networkNamespace: "default"
resourceName: "mlnxnics"
ipam: |
{
"type": "whereabouts",
"datastore": "kubernetes",
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
},
"range": "100.127.0.0/16",
"exclude": [],
"log_file" : "/var/log/whereabouts.log",
"log_level" : "info"
}
</span><span class="no">EOF
</span></code></pre></div></div>
<h2 id="verify-the-drivers-are-installed">Verify the drivers are installed</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl describe node <span class="nv">$NDmv4_AKS_node</span> | <span class="nb">grep</span> <span class="nt">-e</span> <span class="s2">"nvidia.com/mlnxnics"</span> <span class="nt">-e</span> <span class="s2">"nvidia.com/gpu"</span>
nvidia.com/gpu-driver-upgrade-state<span class="o">=</span>upgrade-done
nvidia.com/gpu.compute.major<span class="o">=</span>8
nvidia.com/gpu.compute.minor<span class="o">=</span>0
nvidia.com/gpu.count<span class="o">=</span>8
nvidia.com/gpu.deploy.container-toolkit<span class="o">=</span><span class="nb">true
</span>nvidia.com/gpu.deploy.dcgm<span class="o">=</span><span class="nb">true
</span>nvidia.com/gpu.deploy.dcgm-exporter<span class="o">=</span><span class="nb">true
</span>nvidia.com/gpu.deploy.device-plugin<span class="o">=</span><span class="nb">true
</span>nvidia.com/gpu.deploy.driver<span class="o">=</span><span class="nb">true
</span>nvidia.com/gpu.deploy.gpu-feature-discovery<span class="o">=</span><span class="nb">true
</span>nvidia.com/gpu.deploy.mig-manager<span class="o">=</span><span class="nb">true
</span>nvidia.com/gpu.deploy.node-status-exporter<span class="o">=</span><span class="nb">true
</span>nvidia.com/gpu.deploy.nvsm<span class="o">=</span>
nvidia.com/gpu.deploy.operator-validator<span class="o">=</span><span class="nb">true
</span>nvidia.com/gpu.family<span class="o">=</span>ampere
nvidia.com/gpu.machine<span class="o">=</span>Virtual-Machine
nvidia.com/gpu.memory<span class="o">=</span>81920
nvidia.com/gpu.present<span class="o">=</span><span class="nb">true
</span>nvidia.com/gpu.product<span class="o">=</span>NVIDIA-A100-SXM4-80GB
nvidia.com/gpu.replicas<span class="o">=</span>1
nvidia.com/gpu-driver-upgrade-enabled: <span class="nb">true
</span>nvidia.com/gpu: 8
nvidia.com/mlnxnics: 8
nvidia.com/gpu: 8
nvidia.com/mlnxnics: 8
nvidia.com/gpu 0 0
nvidia.com/mlnxnics 0 0
</code></pre></div></div>
<h2 id="install-volcano-kubernetes-scheduler">Install Volcano Kubernetes scheduler</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl apply <span class="nt">-f</span> https://raw.githubusercontent.com/volcano-sh/volcano/release-1.7/installer/volcano-development.yaml
<span class="nv">$ </span>kubectl get all <span class="nt">-n</span> volcano-system
NAME READY STATUS RESTARTS AGE
pod/volcano-admission-7b864f5d49-x8bv9 1/1 Running 0 129m
pod/volcano-admission-init-pb7nr 0/1 Completed 0 129m
pod/volcano-controllers-5d784c876-hxmdz 1/1 Running 0 129m
pod/volcano-scheduler-65fb9b4dd-5pmhm 1/1 Running 0 129m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT<span class="o">(</span>S<span class="o">)</span> AGE
service/volcano-admission-service ClusterIP 10.0.104.73 <none> 443/TCP 129m
service/volcano-scheduler-service ClusterIP 10.0.8.41 <none> 8080/TCP 129m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/volcano-admission 1/1 1 1 129m
deployment.apps/volcano-controllers 1/1 1 1 129m
deployment.apps/volcano-scheduler 1/1 1 1 129m
NAME DESIRED CURRENT READY AGE
replicaset.apps/volcano-admission-7b864f5d49 1 1 1 129m
replicaset.apps/volcano-controllers-5d784c876 1 1 1 129m
replicaset.apps/volcano-scheduler-65fb9b4dd 1 1 1 129m
NAME COMPLETIONS DURATION AGE
job.batch/volcano-admission-init 1/1 8s 129m
</code></pre></div></div>
<h2 id="scale-gpu-nodes-to-2">Scale GPU nodes to 2</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az aks nodepool scale <span class="nt">--resource-group</span> <span class="nv">$AKS_RG</span> <span class="nt">--cluster-name</span> <span class="nv">$AKS_NAME</span> <span class="nt">--name</span> <span class="nv">$NDMv4_POOL_NAME</span> <span class="nt">--node-count</span> 2
</code></pre></div></div>
<h2 id="create-a-kubernetes-service-account-to-view-the-output">Create a kubernetes service account to view the output</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl create serviceaccount <span class="nt">-n</span> default mpi-worker-view
kubectl create rolebinding default-view <span class="nt">--namespace</span> default <span class="nt">--serviceaccount</span> default:mpi-worker-view <span class="nt">--clusterrole</span> view
</code></pre></div></div>
<h2 id="create-the-nccl-job">Create the NCCL job</h2>
<p>Create the NCCL job file <code class="language-plaintext highlighter-rouge">job.yaml</code> with content below:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: nccl-allreduce-job1
spec:
minAvailable: 3
schedulerName: volcano
plugins:
ssh: <span class="o">[]</span>
svc: <span class="o">[]</span>
tasks:
- replicas: 1
name: mpimaster
policies:
- event: TaskCompleted
action: CompleteJob
template:
spec:
initContainers:
- <span class="nb">command</span>:
- /bin/bash
- <span class="nt">-c</span>
- |
<span class="k">until</span> <span class="o">[[</span> <span class="s2">"</span><span class="si">$(</span>kubectl get pod <span class="nt">-l</span> volcano.sh/job-name<span class="o">=</span>nccl-allreduce-job1,volcano.sh/task-spec<span class="o">=</span>mpiworker <span class="nt">-o</span> json | jq <span class="s1">'.items | length'</span><span class="si">)</span><span class="s2">"</span> <span class="o">!=</span> 0 <span class="o">]]</span><span class="p">;</span> <span class="k">do
</span><span class="nb">echo</span> <span class="s2">"Waiting for MPI worker pods..."</span>
<span class="nb">sleep </span>3
<span class="k">done
</span><span class="nb">echo</span> <span class="s2">"Waiting for MPI worker pods to be ready..."</span>
kubectl <span class="nb">wait </span>pod <span class="nt">-l</span> volcano.sh/job-name<span class="o">=</span>nccl-allreduce-job1,volcano.sh/task-spec<span class="o">=</span>mpiworker <span class="nt">--for</span><span class="o">=</span><span class="nv">condition</span><span class="o">=</span>Ready <span class="nt">--timeout</span><span class="o">=</span>600s
image: mcr.microsoft.com/oss/kubernetes/kubectl:v1.26.3
name: wait-for-workers
serviceAccount: mpi-worker-view
containers:
- <span class="nb">command</span>:
- /bin/bash
- <span class="nt">-c</span>
- |
<span class="nv">MPI_HOST</span><span class="o">=</span><span class="si">$(</span><span class="nb">cat</span> /etc/volcano/mpiworker.host | <span class="nb">tr</span> <span class="s2">"</span><span class="se">\n</span><span class="s2">"</span> <span class="s2">","</span><span class="si">)</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> /var/run/sshd<span class="p">;</span> /usr/sbin/sshd
<span class="nb">echo</span> <span class="s2">"HOSTS: </span><span class="nv">$MPI_HOST</span><span class="s2">"</span>
mpirun <span class="nt">--allow-run-as-root</span> <span class="se">\</span>
<span class="nt">-np</span> 16 <span class="nt">-npernode</span> 8 <span class="se">\</span>
<span class="nt">--bind-to</span> numa <span class="nt">--map-by</span> ppr:8:node <span class="se">\</span>
<span class="nt">-hostfile</span> /etc/volcano/mpiworker.host <span class="se">\</span>
<span class="nt">-x</span> <span class="nv">NCCL_DEBUG</span><span class="o">=</span>info <span class="se">\</span>
<span class="nt">-x</span> <span class="nv">UCX_TLS</span><span class="o">=</span>tcp <span class="se">\</span>
<span class="nt">-x</span> <span class="nv">NCCL_TOPO_FILE</span><span class="o">=</span>/workspace/ndv4-topo.xml <span class="se">\</span>
<span class="nt">-x</span> <span class="nv">UCX_NET_DEVICES</span><span class="o">=</span>eth0 <span class="se">\</span>
<span class="nt">-x</span> <span class="nv">CUDA_DEVICE_ORDER</span><span class="o">=</span>PCI_BUS_ID <span class="se">\</span>
<span class="nt">-x</span> <span class="nv">NCCL_SOCKET_IFNAME</span><span class="o">=</span>eth0 <span class="se">\</span>
<span class="nt">-mca</span> coll_hcoll_enable 0 <span class="se">\</span>
/workspace/nccl-tests/build/all_reduce_perf <span class="nt">-b</span> 8 <span class="nt">-f</span> 2 <span class="nt">-g</span> 1 <span class="nt">-e</span> 8G <span class="nt">-c</span> 1 <span class="se">\</span>
| <span class="nb">tee</span> /home/re
image: jzacr2.azurecr.io/pytorch_nccl_tests_2303:latest
securityContext:
capabilities:
add: <span class="o">[</span><span class="s2">"IPC_LOCK"</span><span class="o">]</span>
name: mpimaster
ports:
- containerPort: 22
name: mpijob-port
workingDir: /workspace
resources:
requests:
cpu: 1
restartPolicy: OnFailure
- replicas: 2
name: mpiworker
template:
metadata:
annotations:
k8s.v1.cni.cncf.io/networks: hostdev-net,hostdev-net,hostdev-net,hostdev-net,hostdev-net,hostdev-net,hostdev-net,hostdev-net
spec:
containers:
- <span class="nb">command</span>:
- /bin/bash
- <span class="nt">-c</span>
- |
<span class="nb">mkdir</span> <span class="nt">-p</span> /var/run/sshd<span class="p">;</span> /usr/sbin/sshd <span class="nt">-D</span><span class="p">;</span>
image: jzacr2.azurecr.io/pytorch_nccl_tests_2303:latest
securityContext:
capabilities:
add: <span class="o">[</span><span class="s2">"IPC_LOCK"</span><span class="o">]</span>
name: mpiworker
ports:
- containerPort: 22
name: mpijob-port
workingDir: /workspace
resources:
requests:
nvidia.com/gpu: 8
nvidia.com/mlnxnics: 8
limits:
nvidia.com/gpu: 8
nvidia.com/mlnxnics: 8
volumeMounts:
- mountPath: /dev/shm
name: shm
restartPolicy: OnFailure
terminationGracePeriodSeconds: 0
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: 8Gi
<span class="nt">---</span>
</code></pre></div></div>
<p>Note: <code class="language-plaintext highlighter-rouge">jzacr2.azurecr.io/pytorch_nccl_tests_2303:latest</code> appears in both the mpimaster and mpiworker containers in the manifest above; it is the NCCL test container you pushed to your ACR. Replace every occurrence with your own image before proceeding.</p>
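<p>To update every image reference in one pass, a quick substitution can be used. This is a sketch: <code class="language-plaintext highlighter-rouge">MY_ACR</code> is a placeholder for your own ACR login server, and the manifest is assumed to be saved as <code class="language-plaintext highlighter-rouge">job.yaml</code>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Placeholder: set this to your own ACR login server
MY_ACR="myregistry.azurecr.io"
# Rewrite every reference to the example registry in the job manifest
sed -i "s|jzacr2.azurecr.io|${MY_ACR}|g" job.yaml
</code></pre></div></div>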
<h2 id="submit-the-nccl-job">Submit the NCCL job</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl apply <span class="nt">-f</span> job.yaml
job.batch.volcano.sh/nccl-allreduce-job1 created
</code></pre></div></div>
<h2 id="get-the-pod-name">Get the pod name</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl get pods
NAME READY STATUS RESTARTS AGE
nccl-allreduce-job1-mpimaster-0 1/1 Running 0 16s
nccl-allreduce-job1-mpiworker-0 1/1 Running 0 16s
nccl-allreduce-job1-mpiworker-1 1/1 Running 0 16s
</code></pre></div></div>
<h2 id="check-the-nccl-test-output">Check the NCCL test output</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl logs <span class="nt">-f</span> nccl-allreduce-job1-mpimaster-0
<span class="c"># out-of-place in-place</span>
<span class="c"># size count type redop root time algbw busbw #wrong time algbw busbw #wrong</span>
<span class="c"># (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)</span>
nccl-allreduce-job1-mpiworker-1:57:214 <span class="o">[</span>7] NCCL INFO <span class="nb">comm </span>0x55c13a6640d0 rank 15 nranks 16 cudaDev 7 busId e00000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:51:174 <span class="o">[</span>3] NCCL INFO <span class="nb">comm </span>0x5586358b6c10 rank 11 nranks 16 cudaDev 3 busId 400000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:49:169 <span class="o">[</span>1] NCCL INFO <span class="nb">comm </span>0x5590048f9910 rank 9 nranks 16 cudaDev 1 busId 200000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:53:231 <span class="o">[</span>5] NCCL INFO <span class="nb">comm </span>0x564e5f765c40 rank 13 nranks 16 cudaDev 5 busId c00000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:54:194 <span class="o">[</span>6] NCCL INFO <span class="nb">comm </span>0x564950a5b020 rank 14 nranks 16 cudaDev 6 busId d00000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:50:212 <span class="o">[</span>2] NCCL INFO <span class="nb">comm </span>0x555b01ca9170 rank 10 nranks 16 cudaDev 2 busId 300000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:48:168 <span class="o">[</span>0] NCCL INFO <span class="nb">comm </span>0x55a22905c240 rank 8 nranks 16 cudaDev 0 busId 100000 commId 0x46cbea2567b59372 - Init COMPLETE
nccl-allreduce-job1-mpiworker-1:52:197 <span class="o">[</span>4] NCCL INFO <span class="nb">comm </span>0x55567f894360 rank 12 nranks 16 cudaDev 4 busId b00000 commId 0x46cbea2567b59372 - Init COMPLETE
8 2 float <span class="nb">sum</span> <span class="nt">-1</span> 37.30 0.00 0.00 0 34.44 0.00 0.00 0
16 4 float <span class="nb">sum</span> <span class="nt">-1</span> 36.03 0.00 0.00 0 33.94 0.00 0.00 0
32 8 float <span class="nb">sum</span> <span class="nt">-1</span> 36.50 0.00 0.00 0 33.57 0.00 0.00 0
64 16 float <span class="nb">sum</span> <span class="nt">-1</span> 36.33 0.00 0.00 0 33.99 0.00 0.00 0
128 32 float <span class="nb">sum</span> <span class="nt">-1</span> 37.62 0.00 0.01 0 34.42 0.00 0.01 0
256 64 float <span class="nb">sum</span> <span class="nt">-1</span> 38.28 0.01 0.01 0 34.77 0.01 0.01 0
512 128 float <span class="nb">sum</span> <span class="nt">-1</span> 38.20 0.01 0.03 0 35.15 0.01 0.03 0
1024 256 float <span class="nb">sum</span> <span class="nt">-1</span> 40.92 0.03 0.05 0 37.37 0.03 0.05 0
2048 512 float <span class="nb">sum</span> <span class="nt">-1</span> 42.87 0.05 0.09 0 39.49 0.05 0.10 0
4096 1024 float <span class="nb">sum</span> <span class="nt">-1</span> 41.82 0.10 0.18 0 40.85 0.10 0.19 0
8192 2048 float <span class="nb">sum</span> <span class="nt">-1</span> 46.31 0.18 0.33 0 42.78 0.19 0.36 0
16384 4096 float <span class="nb">sum</span> <span class="nt">-1</span> 58.10 0.28 0.53 0 55.03 0.30 0.56 0
32768 8192 float <span class="nb">sum</span> <span class="nt">-1</span> 58.73 0.56 1.05 0 56.11 0.58 1.09 0
65536 16384 float <span class="nb">sum</span> <span class="nt">-1</span> 60.01 1.09 2.05 0 59.40 1.10 2.07 0
131072 32768 float <span class="nb">sum</span> <span class="nt">-1</span> 63.71 2.06 3.86 0 63.33 2.07 3.88 0
262144 65536 float <span class="nb">sum</span> <span class="nt">-1</span> 68.25 3.84 7.20 0 68.67 3.82 7.16 0
524288 131072 float <span class="nb">sum</span> <span class="nt">-1</span> 80.23 6.54 12.25 0 79.70 6.58 12.33 0
1048576 262144 float <span class="nb">sum</span> <span class="nt">-1</span> 96.39 10.88 20.40 0 96.73 10.84 20.33 0
2097152 524288 float <span class="nb">sum</span> <span class="nt">-1</span> 128.6 16.31 30.59 0 127.8 16.41 30.77 0
4194304 1048576 float <span class="nb">sum</span> <span class="nt">-1</span> 148.1 28.32 53.11 0 146.5 28.62 53.67 0
8388608 2097152 float <span class="nb">sum</span> <span class="nt">-1</span> 211.1 39.74 74.51 0 207.8 40.37 75.70 0
16777216 4194304 float <span class="nb">sum</span> <span class="nt">-1</span> 333.4 50.32 94.35 0 330.8 50.72 95.10 0
33554432 8388608 float <span class="nb">sum</span> <span class="nt">-1</span> 615.6 54.51 102.21 0 626.3 53.58 100.45 0
67108864 16777216 float <span class="nb">sum</span> <span class="nt">-1</span> 932.6 71.96 134.92 0 929.6 72.19 135.36 0
134217728 33554432 float <span class="nb">sum</span> <span class="nt">-1</span> 1672.7 80.24 150.45 0 1676.3 80.07 150.13 0
268435456 67108864 float <span class="nb">sum</span> <span class="nt">-1</span> 3013.5 89.08 167.02 0 3004.6 89.34 167.52 0
536870912 134217728 float <span class="nb">sum</span> <span class="nt">-1</span> 5702.0 94.15 176.54 0 5705.8 94.09 176.42 0
1073741824 268435456 float <span class="nb">sum</span> <span class="nt">-1</span> 11063 97.05 181.98 0 11089 96.83 181.56 0
2147483648 536870912 float <span class="nb">sum</span> <span class="nt">-1</span> 21637 99.25 186.10 0 21673 99.09 185.79 0
4294967296 1073741824 float <span class="nb">sum</span> <span class="nt">-1</span> 42758 100.45 188.34 0 42779 100.40 188.25 0
8589934592 2147483648 float <span class="nb">sum</span> <span class="nt">-1</span> 85129 100.90 189.20 0 85091 100.95 189.28 0
nccl-allreduce-job1-mpiworker-1:51:51 <span class="o">[</span>3] NCCL INFO <span class="nb">comm </span>0x5586358b6c10 rank 11 nranks 16 cudaDev 3 busId 400000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:51:51 <span class="o">[</span>3] NCCL INFO <span class="nb">comm </span>0x563b9a846840 rank 3 nranks 16 cudaDev 3 busId 400000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-1:57:57 <span class="o">[</span>7] NCCL INFO <span class="nb">comm </span>0x55c13a6640d0 rank 15 nranks 16 cudaDev 7 busId e00000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-1:53:53 <span class="o">[</span>5] NCCL INFO <span class="nb">comm </span>0x564e5f765c40 rank 13 nranks 16 cudaDev 5 busId c00000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:50:50 <span class="o">[</span>2] NCCL INFO <span class="nb">comm </span>0x55ce61480260 rank 2 nranks 16 cudaDev 2 busId 300000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:52:52 <span class="o">[</span>4] NCCL INFO <span class="nb">comm </span>0x5632e283bb30 rank 4 nranks 16 cudaDev 4 busId b00000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:48:48 <span class="o">[</span>0] NCCL INFO <span class="nb">comm </span>0x55d407b24020 rank 0 nranks 16 cudaDev 0 busId 100000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-1:50:50 <span class="o">[</span>2] NCCL INFO <span class="nb">comm </span>0x555b01ca9170 rank 10 nranks 16 cudaDev 2 busId 300000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:55:55 <span class="o">[</span>6] NCCL INFO <span class="nb">comm </span>0x55dc04852d60 rank 6 nranks 16 cudaDev 6 busId d00000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:49:49 <span class="o">[</span>1] NCCL INFO <span class="nb">comm </span>0x555ead805480 rank 1 nranks 16 cudaDev 1 busId 200000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-1:48:48 <span class="o">[</span>0] NCCL INFO <span class="nb">comm </span>0x55a22905c240 rank 8 nranks 16 cudaDev 0 busId 100000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:56:56 <span class="o">[</span>7] NCCL INFO <span class="nb">comm </span>0x556f8d65b050 rank 7 nranks 16 cudaDev 7 busId e00000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-1:49:49 <span class="o">[</span>1] NCCL INFO <span class="nb">comm </span>0x5590048f9910 rank 9 nranks 16 cudaDev 1 busId 200000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-1:54:54 <span class="o">[</span>6] NCCL INFO <span class="nb">comm </span>0x564950a5b020 rank 14 nranks 16 cudaDev 6 busId d00000 - Destroy COMPLETE
nccl-allreduce-job1-mpiworker-0:53:53 <span class="o">[</span>5] NCCL INFO <span class="nb">comm </span>0x556afbcbdc10 rank 5 nranks 16 cudaDev 5 busId c00000 - Destroy COMPLETE
<span class="c"># Out of bounds values : 0 OK</span>
<span class="c"># Avg bus bandwidth : 57.347</span>
<span class="c">#</span>
nccl-allreduce-job1-mpiworker-1:52:52 <span class="o">[</span>4] NCCL INFO <span class="nb">comm </span>0x55567f894360 rank 12 nranks 16 cudaDev 4 busId b00000 - Destroy COMPLETE
</code></pre></div></div>
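<p>As a sanity check, for all-reduce the reported bus bandwidth (busbw) relates to the algorithm bandwidth (algbw) by the factor 2*(n-1)/n, where n is the number of ranks. With the 16 ranks in this job the factor is 1.875, which reproduces the busbw figures in the final rows above:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># busbw = algbw * 2*(n-1)/n for allreduce; n = 16 ranks in this job
awk 'BEGIN { n = 16; algbw = 100.90; printf "%.2f\n", algbw * 2 * (n - 1) / n }'
# prints 189.19, matching the reported 189.20 GB/s busbw to within rounding
</code></pre></div></div>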
<p>If you see ~189 GB/s output then you are done with this exercise.</p>Jingchao Zhangjingczhang@microsoft.comThis write-up aims to replicate the blog Deploy NDm_v4 (A100) Kubernetes Cluster by Cormac Garvey. The original blog assumes you have an existing ACR.Azhop backbone cost analyses2023-09-02T00:00:00+00:002023-09-02T00:00:00+00:00https://jingchaozhang.github.io/Azhop%20backbone%20cost%20analyses<p>In the rapidly evolving landscape of High-Performance Computing (HPC) and Artificial Intelligence (AI), the quest for optimizing operational cost without compromising performance has become paramount. Microsoft Azure’s HPC On-Demand Platform (AzHOP) serves as an innovative solution that addresses both scale and flexibility needs. However, one area that often warrants scrutiny is the daily cost associated with the backbone infrastructure of AzHOP, which includes critical components such as Management VMs, persistent storage volumes, and more.</p>
<p>We will compare the daily backbone costs associated with different AzHOP configurations. Specifically, we will look at setups with SLURM DB and Azure Active Directory (AAD) enabled. We will explore three different storage options to examine how each impacts the overall cost and performance:</p>
<ul>
<li>4TB Azure Files</li>
<li>4TB Premium Azure NetApp Files</li>
<li>Azure Managed Lustre File System (AMLFS)</li>
</ul>
<p>The objective is to arm decision-makers and technical experts with concrete insights that can guide them in selecting the most cost-effective yet performant backbone infrastructure for their Azure HPC deployments.</p>
<h2 id="azure-files-4tb">Azure Files (4TB)</h2>
<p>The experiment was conducted over the period from September 2nd to September 4th. Cost data for both the starting day, September 2nd, and the concluding day, September 4th, are partial and therefore lower than the figures from September 3rd. In contrast, the data for September 3rd represents a complete 24-hour cycle.</p>
<p><img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-09-02-figures/AF-daily.png" alt="Figure_1" /></p>
<p>Let’s break down the cost for 09/03 in the table below:</p>
<table>
<thead>
<tr>
<th>Date</th>
<th>Service Name</th>
<th>Cost (USD)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sep 03</td>
<td>Virtual Machines</td>
<td>9.96</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Storage</td>
<td>9.64</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Azure Database for MariaDB</td>
<td>2.99</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Virtual Network</td>
<td>0.12</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Azure DNS</td>
<td>0.0063</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Bandwidth</td>
<td>0.0002</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Advanced Threat Protection</td>
<td>0.000001</td>
</tr>
</tbody>
</table>
<p>The primary cost components for running AzHOP include <code class="language-plaintext highlighter-rouge">Virtual Machines</code> and <code class="language-plaintext highlighter-rouge">Azure Files</code>. Specifically, this experiment allocates 4TB for Azure Files. However, this size can be scaled down to 1TB, depending on your storage requirements. An additional cost is associated with <code class="language-plaintext highlighter-rouge">Azure Database for MariaDB</code>, which serves as the database backend for SLURM accounting. If SLURM accounting is not a critical feature for your specific use case, you may opt to disable it to further reduce costs. By minimizing the Azure Files storage to 1TB and foregoing MariaDB, the estimated minimal daily expenditure stands at approximately <strong>$12.5/day</strong>.</p>
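<p>The minimal-cost figure can be reproduced with a quick back-of-envelope calculation from the 09/03 table above: VMs at roughly $9.96/day plus Azure Files scaled from 4TB ($9.64/day) down to 1TB, with MariaDB dropped. This assumes Azure Files pricing scales linearly with provisioned capacity.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># VMs + 1TB Azure Files (4TB cost scaled linearly), no MariaDB
awk 'BEGIN { vm = 9.96; files_4tb = 9.64; printf "%.2f\n", vm + files_4tb / 4 }'
# prints 12.37, i.e. roughly $12.5/day
</code></pre></div></div>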
<h2 id="azure-netapp-files-4tb-premium">Azure Netapp Files (4TB Premium)</h2>
<p>Analogous to the previous experiment, the cost data for both September 2nd and September 4th are partial and not representative of a full 24-hour cycle. In contrast, the data from September 3rd is complete and spans an entire 24-hour period.</p>
<p><img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-09-02-figures/ANF-daily.png" alt="Figure_3" /></p>
<p>Here is a table breakdown for 09/03:</p>
<table>
<thead>
<tr>
<th>Date</th>
<th>Service Name</th>
<th>Cost (USD)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sep 03</td>
<td>Azure NetApp Files</td>
<td>13.47</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Virtual Machines</td>
<td>9.95</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Azure Database for MariaDB</td>
<td>2.99</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Storage</td>
<td>0.21</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Virtual Network</td>
<td>0.12</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Azure DNS</td>
<td>0.0063</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Bandwidth</td>
<td>0.0002</td>
</tr>
<tr>
<td>Sep 03</td>
<td>Advanced Threat Protection</td>
<td>0.0000006</td>
</tr>
</tbody>
</table>
<p>With Azure NetApp Files (ANF), the smallest allowable volume size is 4TB, which alone costs approximately $13.5/day. Even with MariaDB removed, the projected total comes to roughly <strong>$25/day</strong>, about double the cost of the Azure Files setup (1TB, without MariaDB).</p>
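<p>The same back-of-envelope check works for the ANF configuration, summing the main 09/03 line items from the table above (VMs, ANF, residual storage, and virtual network) with MariaDB excluded:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># VMs + 4TB Premium ANF + residual storage + vnet, no MariaDB
awk 'BEGIN { printf "%.2f\n", 9.95 + 13.47 + 0.21 + 0.12 }'
# prints 23.75, consistent with the ~$25/day estimate once remaining services are included
</code></pre></div></div>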
<h2 id="amlfs">AMLFS</h2>
<p>NOTE: If you want to use integrated Azure Blob storage with AMLFS, you must specify it in the Blob integration section when you create the file system. You can’t add an HSM-integrated blob container to an existing file system. Integrating blob storage when you create a file system is optional, but it’s the only way to use Lustre Hierarchical Storage Management (HSM) features. If you don’t want the benefits of Lustre HSM, you can import and export data for the Azure Managed Lustre file system by using client commands directly.</p>
<h3 id="without-blob-integration">Without Blob integration</h3>
<p>Setup</p>
<h3 id="with-blob-integration">With Blob integration</h3>
<h2 id="details-on-amlfs">Details on AMLFS</h2>
<h3 id="determining-network-size">Determining network size</h3>
<p>The size of subnet that you need depends on the size of the file system you create. The following table gives a rough estimate of the minimum subnet size for Azure Managed Lustre file systems of different sizes.</p>
<table>
<thead>
<tr>
<th>Storage capacity</th>
<th>Recommended CIDR prefix value</th>
</tr>
</thead>
<tbody>
<tr>
<td>4 TiB to 16 TiB</td>
<td>/27 or larger</td>
</tr>
<tr>
<td>20 TiB to 40 TiB</td>
<td>/26 or larger</td>
</tr>
<tr>
<td>44 TiB to 92 TiB</td>
<td>/25 or larger</td>
</tr>
<tr>
<td>96 TiB to 196 TiB</td>
<td>/24 or larger</td>
</tr>
<tr>
<td>200 TiB to 400 TiB</td>
<td>/23 or larger</td>
</tr>
</tbody>
</table>
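<p>The sizing table above can be expressed as a small helper for scripting. This is a sketch based only on the thresholds in the table; <code class="language-plaintext highlighter-rouge">amlfs_cidr_prefix</code> is our own name, not an Azure CLI command.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Return the recommended minimum CIDR prefix for a given AMLFS capacity (TiB)
amlfs_cidr_prefix() {
  if   [ "$1" -le 16 ];  then echo "/27"
  elif [ "$1" -le 40 ];  then echo "/26"
  elif [ "$1" -le 92 ];  then echo "/25"
  elif [ "$1" -le 196 ]; then echo "/24"
  else                        echo "/23"
  fi
}
amlfs_cidr_prefix 8   # prints /27, matching the 8 TiB file system used below
</code></pre></div></div>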
<h3 id="steps-to-mount-amlfs-to-azhop">Steps to mount AMLFS to AzHOP</h3>
<ul>
<li>Create the AMLFS resource group in the same region as the AzHOP deployment. AMLFS RG details:</li>
</ul>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Subscription</td>
<td>XXXX</td>
</tr>
<tr>
<td>Resource group</td>
<td>JZ-AMLFS</td>
</tr>
<tr>
<td>Region</td>
<td>South Central US</td>
</tr>
<tr>
<td>Availability zone</td>
<td>1</td>
</tr>
<tr>
<td>File system name</td>
<td>lustre</td>
</tr>
<tr>
<td>Storage capacity</td>
<td>8 TiB</td>
</tr>
<tr>
<td>Throughput per TiB</td>
<td>250 MB/s</td>
</tr>
<tr>
<td>Total Throughput</td>
<td>2000 MB/s</td>
</tr>
<tr>
<td>Virtual network</td>
<td>(New) lustre-vnet</td>
</tr>
<tr>
<td>Subnet</td>
<td>(New) default (10.4.0.0/27)</td>
</tr>
<tr>
<td>Maintenance window</td>
<td>Sunday, 12:00</td>
</tr>
</tbody>
</table>
<ul>
<li>Create AMLFS and AzHOP vnet peering.
<ul>
<li>Select <strong>Allow access to remote virtual network</strong> for both vnets</li>
<li>Select <strong>Allow traffic to remote virtual network</strong> for both vnets</li>
</ul>
</li>
<li>In AzHOP RG, edit <code class="language-plaintext highlighter-rouge">nsg-common</code>.
<ul>
<li>Change Inbound security rule 3100 to Allow</li>
<li>Change Outbound security rule 3100 to Allow</li>
</ul>
</li>
<li><a href="https://learn.microsoft.com/en-us/azure/azure-managed-lustre/client-install?source=recommendations&pivots=centos-7">Install pre-built client software on AzHOP</a></li>
<li><a href="https://learn.microsoft.com/en-us/azure/azure-managed-lustre/connect-clients">Connect clients to an AMLFS</a></li>
</ul>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@scheduler ~]# <span class="nb">mkdir</span> /lustre
<span class="o">[</span>root@scheduler ~]# <span class="nb">sudo </span>mount <span class="nt">-t</span> lustre <span class="nt">-o</span> noatime,flock 10.4.0.4@tcp:/lustrefs /lustre
<span class="o">[</span>root@scheduler ~]# <span class="nb">df</span> <span class="nt">-h</span>
Filesystem Size Used Avail Use% Mounted on
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 3.9G 417M 3.5G 11% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/sda2 30G 3.6G 26G 13% /
/dev/sda1 494M 74M 421M 15% /boot
/dev/sda15 495M 12M 484M 3% /boot/efi
/dev/sdb1 16G 45M 15G 1% /mnt/resource
nfsfilespya6el4wo2vwgx.file.core.windows.net:/nfsfilespya6el4wo2vwgx/nfshome 1.0T 0 1.0T 0% /clusterhome
tmpfs 783M 0 783M 0% /run/user/1000
10.4.0.4@tcp:/lustrefs 8.0T 1.3M 7.6T 1% /lustre
</code></pre></div></div>Jingchao Zhangjingczhang@microsoft.comIn the rapidly evolving landscape of High-Performance Computing (HPC) and Artificial Intelligence (AI), the quest for optimizing operational cost without compromising performance has become paramount. Microsoft Azure’s HPC On-Demand Platform (AzHOP) serves as an innovative solution that addresses both scale and flexibility needs. However, one area that often warrants scrutiny is the daily cost associated with the backbone infrastructure of AzHOP, which includes critical components such as Management VMs, persistent storage volumes, and more.Azhop add anf volume2023-07-31T00:00:00+00:002023-07-31T00:00:00+00:00https://jingchaozhang.github.io/Azhop%20add%20anf%20volume<h3 id="network-topology">Network topology</h3>
<p>When you create AzHOP without NetApp volumes, a subnet for ANF is still created, as shown in the figure below:<br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-2-figures/1.png" alt="Figure_1" /><br />
You can manually add an ANF volume to the existing AzHOP cluster, which changes the network topology as shown below:<br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-2-figures/1_1.png" alt="Figure_1" /></p>
<h3 id="create-azure-netapp-files">Create Azure NetApp Files</h3>
<ol>
<li>Select the ANF service from the Marketplace<br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-2-figures/2.png" alt="Figure_2" /></li>
<li>Create NetApp account<br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-2-figures/3.png" alt="Figure_3" /></li>
<li>Create ANF capacity pool<br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-2-figures/4.png" alt="Figure_4" /><br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-2-figures/5.png" alt="Figure_5" /><br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-2-figures/6.png" alt="Figure_6" /></li>
<li>Create ANF volume<br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-2-figures/7.png" alt="Figure_7" /><br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-2-figures/8.png" alt="Figure_8" /></li>
<li>Find the mounting instructions from the volume page<br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-2-figures/9.png" alt="Figure_9" /><br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-2-figures/10.png" alt="Figure_10" /></li>
</ol>
<h3 id="mount-anf-to-the-ondemand-node">Mount ANF to the ondemand node</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@ondemand ~]# <span class="nb">cd</span> /
<span class="o">[</span>root@ondemand /]# <span class="nb">sudo mkdir </span>NewVolume
<span class="o">[</span>root@ondemand /]# <span class="nb">sudo </span>mount <span class="nt">-t</span> nfs <span class="nt">-o</span> rw,hard,rsize<span class="o">=</span>262144,wsize<span class="o">=</span>262144,vers<span class="o">=</span>3,tcp 10.107.0.36:/NewVolume NewVolume
<span class="o">[</span>root@ondemand /]# <span class="nb">df</span> <span class="nt">-h</span>
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 16G 33M 16G 1% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/sda2 30G 4.3G 25G 15% /
/dev/sda1 494M 77M 418M 16% /boot
/dev/sda15 495M 12M 484M 3% /boot/efi
nfsfiles7nz25whnhti5ox.file.core.windows.net:/nfsfiles7nz25whnhti5ox/nfshome 1.0T 4.8G 1020G 1% /clusterhome
tmpfs 3.2G 0 3.2G 0% /run/user/0
tmpfs 3.2G 0 3.2G 0% /run/user/1000
10.107.0.36:/NewVolume 4.0T 256K 4.0T 1% /NewVolume
</code></pre></div></div>
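<p>To make the mount persist across reboots, you can add a matching <code class="language-plaintext highlighter-rouge">/etc/fstab</code> entry. The sketch below only builds and prints the entry (the export, mount point, and options are assumed to match the manual mount above); on the real node you would append it with <code class="language-plaintext highlighter-rouge">sudo tee -a /etc/fstab</code> and verify with <code class="language-plaintext highlighter-rouge">sudo mount -a</code>:</p>

```shell
# Build an /etc/fstab entry matching the manual mount above.
# Assumption: same ANF export (10.107.0.36:/NewVolume) and NFSv3 options.
entry='10.107.0.36:/NewVolume /NewVolume nfs rw,hard,rsize=262144,wsize=262144,vers=3,tcp 0 0'
echo "$entry"
# On the node (not run here): echo "$entry" | sudo tee -a /etc/fstab && sudo mount -a
```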
<p>You may need to edit folder permissions and add user directories in a shared environment.</p>Jingchao Zhangjingczhang@microsoft.comNetwork topology When create AZHOP without NetApp volumes, a subnet for ANF will still be created as shown in figure below: You can manually add a ANF volume to the existing AZHOP cluster, which will change the network topology as below:Azhop deployment with vnet peering2023-07-31T00:00:00+00:002023-07-31T00:00:00+00:00https://jingchaozhang.github.io/Azhop%20deployment%20with%20vnet%20peering<p><a href="https://techcommunity.microsoft.com/t5/azure-high-performance-computing/az-hop-in-the-azure-marketplace/ba-p/3829838">Az-HOP in the Azure Marketplace</a></p>
<p><img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-figures/1.png" alt="Figure_1" /><br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-figures/2.png" alt="Figure_2" /><br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-figures/3.png" alt="Figure_3" /><br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-figures/4.png" alt="Figure_4" /><br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-31-figures/5.png" alt="Figure_5" /></p>Jingchao Zhangjingczhang@microsoft.comAz-HOP in the Azure MarketplaceAzhop enable ssh access to ondemand node2023-07-28T00:00:00+00:002023-07-28T00:00:00+00:00https://jingchaozhang.github.io/Azhop%20enable%20ssh%20access%20to%20ondemand%20node<p>To enable ssh to the ondemand node, you need to</p>
<ol>
<li>Change the NSG rule to allow inbound traffic on port 22;</li>
<li>Edit <code class="language-plaintext highlighter-rouge">/etc/ssh/sshd_config</code> file on the ondemand node, change <code class="language-plaintext highlighter-rouge">PasswordAuthentication</code> to yes, then restart sshd.</li>
</ol>
<p>Here are the details:</p>
<ol>
<li>Find ‘network settings’ in the ondemand VM resource page. Click on it.<br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-28-figures/1.png" alt="Figure_1" /></li>
<li>On the next page, click on <img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-28-figures/2.png" alt="Figure_2" />, and select Inbound port rule.</li>
<li>Add a NSG rule similar to this one. Feel free to limit ‘Source’ and ‘port ranges’ as needed.<br />
<img src="https://raw.githubusercontent.com/JingchaoZhang/JingchaoZhang.github.io/master/_posts/2023-07-28-figures/3.png" alt="Figure_3" /></li>
<li>Click ‘Add’ to save the changes. Please note the NSG change may take a few minutes to become effective.</li>
<li>From the ondemand node
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>hpcadmin@ondemand ~]<span class="nv">$ </span><span class="nb">sudo </span>vim /etc/ssh/sshd_config
PasswordAuthentication <span class="nb">yes</span>
<span class="o">[</span>hpcadmin@ondemand ~]<span class="nv">$ </span><span class="nb">sudo </span>systemctl restart sshd
</code></pre></div> </div>
</li>
</ol>
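<p>The <code class="language-plaintext highlighter-rouge">sshd_config</code> edit in step 5 can also be done non-interactively with <code class="language-plaintext highlighter-rouge">sed</code>. A sketch, demonstrated on a scratch copy so it is safe to run anywhere; on the ondemand node you would run the same <code class="language-plaintext highlighter-rouge">sed</code> with sudo against <code class="language-plaintext highlighter-rouge">/etc/ssh/sshd_config</code> and then restart sshd:</p>

```shell
# Demo on a scratch copy; on the ondemand node, target /etc/ssh/sshd_config with sudo.
cfg=$(mktemp)
printf 'PasswordAuthentication no\n' > "$cfg"
# Flip the setting (also matches a commented-out default line).
sed -i 's/^#\?PasswordAuthentication .*/PasswordAuthentication yes/' "$cfg"
grep '^PasswordAuthentication' "$cfg"
# Then on the node: sudo systemctl restart sshd
```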
<p>This should allow you to ssh into the ondemand node as any user.</p>Jingchao Zhangjingczhang@microsoft.comTo enable ssh to the ondemand node, you need to Change the NSG rule to allow inbound traffic on port 22; Edit /etc/ssh/sshd_config file on the ondemand node, change PasswordAuthentication to yes, then restart sshd.Azure nccl test on ncv42023-07-14T00:00:00+00:002023-07-14T00:00:00+00:00https://jingchaozhang.github.io/Azure%20NCCL%20Test%20on%20NCv4<p>You can setup a SLURM cluster on Azure using AZHOP. This <a href="https://techcommunity.microsoft.com/t5/azure-high-performance-computing/az-hop-in-the-azure-marketplace/ba-p/3829838">blog</a> has details on how to deploy AZHOP.</p>
<h3 id="cluster-information">Cluster information</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
execute up infinite 256 idle~ execute-[1-256]
nc48v4 up infinite 1 idle~ nc48v4-pg0-2
nc48v4 up infinite 1 mix nc48v4-pg0-1
nc96v4 up infinite 1 idle~ nc96v4-pg0-1
nc96v4 up infinite 1 idle nc96v4-pg0-2
ncrv3 up infinite 2 idle~ ncrv3-[1-2]
ncv3 up infinite 1 idle~ ncv3-2
ncv3 up infinite 1 idle ncv3-1
ndv4<span class="k">*</span> up infinite 1 comp% ndv4-pg0-1
ndv4<span class="k">*</span> up infinite 1 idle% ndv4-pg0-2
</code></pre></div></div>
<p>The image used in all N-series VMs is <code class="language-plaintext highlighter-rouge">microsoft-dsvm:ubuntu-hpc:2004:20.04.2023031501</code>.</p>
<h3 id="this-post-will-compare-nc48v4-and-nc96v4-which-have-2-and-4-80g-a100-gpus-respectively">This post compares NC48v4 and NC96v4, which have 2 and 4 80 GB A100 GPUs, respectively.</h3>
<p><strong>NC48v4</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>scontrol show node nc48v4-pg0-1
<span class="nv">NodeName</span><span class="o">=</span>nc48v4-pg0-1 <span class="nv">Arch</span><span class="o">=</span>x86_64 <span class="nv">CoresPerSocket</span><span class="o">=</span>1
<span class="nv">CPUAlloc</span><span class="o">=</span>0 <span class="nv">CPUTot</span><span class="o">=</span>48 <span class="nv">CPULoad</span><span class="o">=</span>0.00
<span class="nv">AvailableFeatures</span><span class="o">=</span>cloud
<span class="nv">ActiveFeatures</span><span class="o">=</span>cloud
<span class="nv">Gres</span><span class="o">=</span>gpu:2
<span class="nv">NodeAddr</span><span class="o">=</span>nc48v4-pg0-1 <span class="nv">NodeHostName</span><span class="o">=</span>nc48v4-pg0-1 <span class="nv">Version</span><span class="o">=</span>20.11.9
<span class="nv">OS</span><span class="o">=</span>Linux 5.15.0-1034-azure <span class="c">#41~20.04.1-Ubuntu SMP Sat Feb 11 17:02:42 UTC 2023</span>
<span class="nv">RealMemory</span><span class="o">=</span>414515 <span class="nv">AllocMem</span><span class="o">=</span>0 <span class="nv">FreeMem</span><span class="o">=</span>438726 <span class="nv">Sockets</span><span class="o">=</span>48 <span class="nv">Boards</span><span class="o">=</span>1
<span class="nv">State</span><span class="o">=</span>IDLE+CLOUD <span class="nv">ThreadsPerCore</span><span class="o">=</span>1 <span class="nv">TmpDisk</span><span class="o">=</span>0 <span class="nv">Weight</span><span class="o">=</span>1 <span class="nv">Owner</span><span class="o">=</span>N/A <span class="nv">MCS_label</span><span class="o">=</span>N/A
<span class="nv">Partitions</span><span class="o">=</span>nc48v4
<span class="nv">BootTime</span><span class="o">=</span>2023-07-14T18:41:10 <span class="nv">SlurmdStartTime</span><span class="o">=</span>2023-07-14T18:41:11
<span class="nv">CfgTRES</span><span class="o">=</span><span class="nv">cpu</span><span class="o">=</span>48,mem<span class="o">=</span>414515M,billing<span class="o">=</span>48
<span class="nv">AllocTRES</span><span class="o">=</span>
<span class="nv">CapWatts</span><span class="o">=</span>n/a
<span class="nv">CurrentWatts</span><span class="o">=</span>0 <span class="nv">AveWatts</span><span class="o">=</span>0
<span class="nv">ExtSensorsJoules</span><span class="o">=</span>n/s <span class="nv">ExtSensorsWatts</span><span class="o">=</span>0 <span class="nv">ExtSensorsTemp</span><span class="o">=</span>n/s
<span class="nv">Comment</span><span class="o">=(</span>null<span class="o">)</span>
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clusteradmin@nc48v4-pg0-1:~/NCCL_test<span class="nv">$ </span>nvidia-smi
Fri Jul 14 20:10:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|<span class="o">===============================</span>+<span class="o">======================</span>+<span class="o">======================</span>|
| 0 NVIDIA A100 80G... Off | 00000001:00:00.0 Off | 0 |
| N/A 37C P0 53W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... Off | 00000002:00:00.0 Off | 0 |
| N/A 38C P0 54W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|<span class="o">=============================================================================</span>|
| No running processes found |
+-----------------------------------------------------------------------------+
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clusteradmin@nc48v4-pg0-1:~/NCCL_test<span class="nv">$ </span>nvidia-smi topo <span class="nt">-m</span>
GPU0 GPU1 CPU Affinity NUMA Affinity
GPU0 X NV12 0-1 0-1
GPU1 NV12 X 0-1 0-1
Legend:
X <span class="o">=</span> Self
SYS <span class="o">=</span> Connection traversing PCIe as well as the SMP interconnect between NUMA nodes <span class="o">(</span>e.g., QPI/UPI<span class="o">)</span>
NODE <span class="o">=</span> Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB <span class="o">=</span> Connection traversing PCIe as well as a PCIe Host Bridge <span class="o">(</span>typically the CPU<span class="o">)</span>
PXB <span class="o">=</span> Connection traversing multiple PCIe bridges <span class="o">(</span>without traversing the PCIe Host Bridge<span class="o">)</span>
PIX <span class="o">=</span> Connection traversing at most a single PCIe bridge
NV# <span class="o">=</span> Connection traversing a bonded <span class="nb">set </span>of <span class="c"># NVLinks</span>
</code></pre></div></div>
<p><strong>NC96v4</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>scontrol show node nc96v4-pg0-2
<span class="nv">NodeName</span><span class="o">=</span>nc96v4-pg0-2 <span class="nv">Arch</span><span class="o">=</span>x86_64 <span class="nv">CoresPerSocket</span><span class="o">=</span>1
<span class="nv">CPUAlloc</span><span class="o">=</span>0 <span class="nv">CPUTot</span><span class="o">=</span>96 <span class="nv">CPULoad</span><span class="o">=</span>0.00
<span class="nv">AvailableFeatures</span><span class="o">=</span>cloud
<span class="nv">ActiveFeatures</span><span class="o">=</span>cloud
<span class="nv">Gres</span><span class="o">=</span>gpu:4
<span class="nv">NodeAddr</span><span class="o">=</span>nc96v4-pg0-2 <span class="nv">NodeHostName</span><span class="o">=</span>nc96v4-pg0-2 <span class="nv">Version</span><span class="o">=</span>20.11.9
<span class="nv">OS</span><span class="o">=</span>Linux 5.15.0-1034-azure <span class="c">#41~20.04.1-Ubuntu SMP Sat Feb 11 17:02:42 UTC 2023</span>
<span class="nv">RealMemory</span><span class="o">=</span>829030 <span class="nv">AllocMem</span><span class="o">=</span>0 <span class="nv">FreeMem</span><span class="o">=</span>879666 <span class="nv">Sockets</span><span class="o">=</span>96 <span class="nv">Boards</span><span class="o">=</span>1
<span class="nv">State</span><span class="o">=</span>IDLE+CLOUD <span class="nv">ThreadsPerCore</span><span class="o">=</span>1 <span class="nv">TmpDisk</span><span class="o">=</span>0 <span class="nv">Weight</span><span class="o">=</span>1 <span class="nv">Owner</span><span class="o">=</span>N/A <span class="nv">MCS_label</span><span class="o">=</span>N/A
<span class="nv">Partitions</span><span class="o">=</span>nc96v4
<span class="nv">BootTime</span><span class="o">=</span>2023-07-14T04:57:25 <span class="nv">SlurmdStartTime</span><span class="o">=</span>2023-07-14T04:57:28
<span class="nv">CfgTRES</span><span class="o">=</span><span class="nv">cpu</span><span class="o">=</span>96,mem<span class="o">=</span>829030M,billing<span class="o">=</span>96
<span class="nv">AllocTRES</span><span class="o">=</span>
<span class="nv">CapWatts</span><span class="o">=</span>n/a
<span class="nv">CurrentWatts</span><span class="o">=</span>0 <span class="nv">AveWatts</span><span class="o">=</span>0
<span class="nv">ExtSensorsJoules</span><span class="o">=</span>n/s <span class="nv">ExtSensorsWatts</span><span class="o">=</span>0 <span class="nv">ExtSensorsTemp</span><span class="o">=</span>n/s
<span class="nv">Comment</span><span class="o">=(</span>null<span class="o">)</span>
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clusteradmin@nc96v4-pg0-2:~/NCCL_test<span class="nv">$ </span>nvidia-smi
Fri Jul 14 20:13:54 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|<span class="o">===============================</span>+<span class="o">======================</span>+<span class="o">======================</span>|
| 0 NVIDIA A100 80G... Off | 00000001:00:00.0 Off | 0 |
| N/A 38C P0 54W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... Off | 00000002:00:00.0 Off | 0 |
| N/A 38C P0 58W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80G... Off | 00000003:00:00.0 Off | 0 |
| N/A 37C P0 52W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80G... Off | 00000004:00:00.0 Off | 0 |
| N/A 39C P0 55W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|<span class="o">=============================================================================</span>|
| No running processes found |
+-----------------------------------------------------------------------------+
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clusteradmin@nc96v4-pg0-2:~/NCCL_test<span class="nv">$ </span>nvidia-smi topo <span class="nt">-m</span>
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity
GPU0 X NV12 SYS SYS 0 0-3
GPU1 NV12 X SYS SYS 0 0-3
GPU2 SYS SYS X NV12 0 0-3
GPU3 SYS SYS NV12 X 0 0-3
Legend:
X <span class="o">=</span> Self
SYS <span class="o">=</span> Connection traversing PCIe as well as the SMP interconnect between NUMA nodes <span class="o">(</span>e.g., QPI/UPI<span class="o">)</span>
NODE <span class="o">=</span> Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB <span class="o">=</span> Connection traversing PCIe as well as a PCIe Host Bridge <span class="o">(</span>typically the CPU<span class="o">)</span>
PXB <span class="o">=</span> Connection traversing multiple PCIe bridges <span class="o">(</span>without traversing the PCIe Host Bridge<span class="o">)</span>
PIX <span class="o">=</span> Connection traversing at most a single PCIe bridge
NV# <span class="o">=</span> Connection traversing a bonded <span class="nb">set </span>of <span class="c"># NVLinks</span>
</code></pre></div></div>
<h2 id="nccl-benchmark">NCCL benchmark</h2>
<p>There are two preset NCCL environment variables, <code class="language-plaintext highlighter-rouge">NCCL_TOPO_FILE</code> and <code class="language-plaintext highlighter-rouge">NCCL_GRAPH_FILE</code>, in the <code class="language-plaintext highlighter-rouge">/etc/nccl.conf</code> file on the compute VM.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /etc/nccl.conf
<span class="nv">NCCL_TOPO_FILE</span><span class="o">=</span>/opt/microsoft/ncv4/topo.xml
<span class="nv">NCCL_GRAPH_FILE</span><span class="o">=</span>/opt/microsoft/ncv4/graph.xml
</code></pre></div></div>
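<p>NCCL reads this file at initialization, but the same variables can also be exported into a job's environment, for example to experiment with a different topology file. A sketch that exports every <code class="language-plaintext highlighter-rouge">KEY=VALUE</code> assignment in such a file, run here against a scratch copy rather than the real <code class="language-plaintext highlighter-rouge">/etc/nccl.conf</code>:</p>

```shell
# Export every KEY=VALUE assignment from an nccl.conf-style file.
# Scratch copy for illustration; on a compute node you could source /etc/nccl.conf instead.
conf=$(mktemp)
printf 'NCCL_TOPO_FILE=/opt/microsoft/ncv4/topo.xml\nNCCL_GRAPH_FILE=/opt/microsoft/ncv4/graph.xml\n' > "$conf"
set -a          # auto-export every assignment made while sourcing
. "$conf"
set +a
echo "$NCCL_TOPO_FILE"
```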
<p><strong>To run the NCCL tests with SLURM, you need to install PMIx on the compute node, following the instructions <a href="https://github.com/Azure/azurehpc/blob/df46027e0380aee06a292b54ed6a1d90a6f5a1db/experimental/deploy_cycle_slurm_ndv4/scripts/install-pmix.sh">here</a>.</strong></p>
<h3 id="nc96v4">NC96v4</h3>
<h4 id="test-with-both-nccl_topo_file-and-nccl_graph_file-being-set">Test with both <code class="language-plaintext highlighter-rouge">NCCL_TOPO_FILE</code> and <code class="language-plaintext highlighter-rouge">NCCL_GRAPH_FILE</code> being set</h4>
<p>SLURM script</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c">#SBATCH -t 00:20:00</span>
<span class="c">#SBATCH -p nc96v4</span>
<span class="c">#SBATCH -w nc96v4-pg0-2</span>
<span class="c">#SBATCH --ntasks-per-node=4</span>
<span class="c">#SBATCH --cpus-per-task=24</span>
<span class="c">#SBATCH --gpus-per-node=4</span>
<span class="c">#SBATCH --mem=0</span>
<span class="c">#SBATCH -o job.%J.out</span>
<span class="c">#SBATCH --error=job.%J.err</span>
<span class="nv">BASE_DIR</span><span class="o">=</span>/opt
<span class="nv">NCCL_TESTS_EXE</span><span class="o">=</span>all_reduce_perf
<span class="nb">export </span><span class="nv">NCCL_DEBUG</span><span class="o">=</span>INFO
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">source</span> /etc/profile.d/modules.sh
module load mpi/openmpi
<span class="nv">PIN_MASK</span><span class="o">=</span><span class="s1">'0xffffff,0xffffff000000,0xffffff000000000000,0xffffff000000000000000000'</span>
srun <span class="nt">--mpi</span><span class="o">=</span>pmix <span class="nt">--cpu-bind</span><span class="o">=</span>mask_cpu:<span class="nv">$PIN_MASK</span> <span class="nt">--gpus-per-node</span><span class="o">=</span>4 <span class="se">\</span>
<span class="nt">--ntasks-per-node</span><span class="o">=</span>4 <span class="se">\</span>
<span class="k">${</span><span class="nv">BASE_DIR</span><span class="k">}</span>/nccl-tests/build/<span class="nv">$NCCL_TESTS_EXE</span> <span class="nt">-b8</span> <span class="nt">-f</span> 2 <span class="nt">-g</span> 1 <span class="nt">-e</span> 8G <span class="nt">-c</span> 1
</code></pre></div></div>
<p>NCCL results</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># out-of-place in-place</span>
<span class="c"># size count type redop root time algbw busbw #wrong time algbw busbw #wrong</span>
<span class="c"># (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)</span>
nc96v4-pg0-2:89444:89486 <span class="o">[</span>1] NCCL INFO <span class="nb">comm </span>0x55a7dcf2de60 rank 1 nranks 4 cudaDev 1 busId 200000 - Init COMPLETE
nc96v4-pg0-2:89445:89488 <span class="o">[</span>2] NCCL INFO <span class="nb">comm </span>0x559348ec5c50 rank 2 nranks 4 cudaDev 2 busId 300000 - Init COMPLETE
8 2 float <span class="nb">sum</span> <span class="nt">-1</span> 13.14 0.00 0.00 0 13.17 0.00 0.00 0
16 4 float <span class="nb">sum</span> <span class="nt">-1</span> 13.23 0.00 0.00 0 13.43 0.00 0.00 0
32 8 float <span class="nb">sum</span> <span class="nt">-1</span> 13.29 0.00 0.00 0 13.07 0.00 0.00 0
64 16 float <span class="nb">sum</span> <span class="nt">-1</span> 13.37 0.00 0.01 0 13.15 0.00 0.01 0
128 32 float <span class="nb">sum</span> <span class="nt">-1</span> 13.51 0.01 0.01 0 13.31 0.01 0.01 0
256 64 float <span class="nb">sum</span> <span class="nt">-1</span> 13.64 0.02 0.03 0 13.81 0.02 0.03 0
512 128 float <span class="nb">sum</span> <span class="nt">-1</span> 13.72 0.04 0.06 0 13.82 0.04 0.06 0
1024 256 float <span class="nb">sum</span> <span class="nt">-1</span> 15.17 0.07 0.10 0 14.78 0.07 0.10 0
2048 512 float <span class="nb">sum</span> <span class="nt">-1</span> 16.16 0.13 0.19 0 16.05 0.13 0.19 0
4096 1024 float <span class="nb">sum</span> <span class="nt">-1</span> 17.09 0.24 0.36 0 16.90 0.24 0.36 0
8192 2048 float <span class="nb">sum</span> <span class="nt">-1</span> 17.92 0.46 0.69 0 17.50 0.47 0.70 0
16384 4096 float <span class="nb">sum</span> <span class="nt">-1</span> 19.73 0.83 1.25 0 18.89 0.87 1.30 0
32768 8192 float <span class="nb">sum</span> <span class="nt">-1</span> 20.93 1.57 2.35 0 20.83 1.57 2.36 0
65536 16384 float <span class="nb">sum</span> <span class="nt">-1</span> 21.88 3.00 4.49 0 21.50 3.05 4.57 0
131072 32768 float <span class="nb">sum</span> <span class="nt">-1</span> 31.54 4.16 6.23 0 31.33 4.18 6.28 0
262144 65536 float <span class="nb">sum</span> <span class="nt">-1</span> 64.60 4.06 6.09 0 64.06 4.09 6.14 0
524288 131072 float <span class="nb">sum</span> <span class="nt">-1</span> 73.72 7.11 10.67 0 73.69 7.11 10.67 0
1048576 262144 float <span class="nb">sum</span> <span class="nt">-1</span> 93.18 11.25 16.88 0 92.97 11.28 16.92 0
2097152 524288 float <span class="nb">sum</span> <span class="nt">-1</span> 136.8 15.33 23.00 0 136.2 15.40 23.10 0
4194304 1048576 float <span class="nb">sum</span> <span class="nt">-1</span> 225.1 18.63 27.94 0 227.1 18.47 27.71 0
8388608 2097152 float <span class="nb">sum</span> <span class="nt">-1</span> 437.5 19.17 28.76 0 435.1 19.28 28.92 0
16777216 4194304 float <span class="nb">sum</span> <span class="nt">-1</span> 865.0 19.39 29.09 0 872.7 19.22 28.84 0
33554432 8388608 float <span class="nb">sum</span> <span class="nt">-1</span> 1761.5 19.05 28.57 0 1747.0 19.21 28.81 0
67108864 16777216 float <span class="nb">sum</span> <span class="nt">-1</span> 3362.9 19.96 29.93 0 3374.3 19.89 29.83 0
134217728 33554432 float <span class="nb">sum</span> <span class="nt">-1</span> 6646.9 20.19 30.29 0 6668.2 20.13 30.19 0
268435456 67108864 float <span class="nb">sum</span> <span class="nt">-1</span> 13144 20.42 30.63 0 13206 20.33 30.49 0
536870912 134217728 float <span class="nb">sum</span> <span class="nt">-1</span> 26266 20.44 30.66 0 26160 20.52 30.78 0
1073741824 268435456 float <span class="nb">sum</span> <span class="nt">-1</span> 52288 20.53 30.80 0 52474 20.46 30.69 0
2147483648 536870912 float <span class="nb">sum</span> <span class="nt">-1</span> 105840 20.29 30.43 0 104302 20.59 30.88 0
4294967296 1073741824 float <span class="nb">sum</span> <span class="nt">-1</span> 216222 19.86 29.80 0 215370 19.94 29.91 0
8589934592 2147483648 float <span class="nb">sum</span> <span class="nt">-1</span> 459314 18.70 28.05 0 459949 18.68 28.01 0
nc96v4-pg0-2:89444:89444 <span class="o">[</span>1] NCCL INFO <span class="nb">comm </span>0x55a7dcf2de60 rank 1 nranks 4 cudaDev 1 busId 200000 - Destroy COMPLETE
nc96v4-pg0-2:89446:89446 <span class="o">[</span>3] NCCL INFO <span class="nb">comm </span>0x556a7b4d7bf0 rank 3 nranks 4 cudaDev 3 busId 400000 - Destroy COMPLETE
nc96v4-pg0-2:89443:89443 <span class="o">[</span>0] NCCL INFO <span class="nb">comm </span>0x56323adcad80 rank 0 nranks 4 cudaDev 0 busId 100000 - Destroy COMPLETE
nc96v4-pg0-2:89445:89445 <span class="o">[</span>2] NCCL INFO <span class="nb">comm </span>0x559348ec5c50 rank 2 nranks 4 cudaDev 2 busId 300000 - Destroy COMPLETE
<span class="c"># Out of bounds values : 0 OK</span>
<span class="c"># Avg bus bandwidth : 13.7946</span>
</code></pre></div></div>
<h4 id="test-with-only-nccl_topo_file-comment-out-nccl_graph_fileoptmicrosoftncv4graphxml-in-etcncclconf">Test with only <code class="language-plaintext highlighter-rouge">NCCL_TOPO_FILE</code>. Comment out <code class="language-plaintext highlighter-rouge">NCCL_GRAPH_FILE=/opt/microsoft/ncv4/graph.xml</code> in <code class="language-plaintext highlighter-rouge">/etc/nccl.conf</code>.</h4>
<p>NCCL results</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># out-of-place in-place</span>
<span class="c"># size count type redop root time algbw busbw #wrong time algbw busbw #wrong</span>
<span class="c"># (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)</span>
nc96v4-pg0-2:89786:89819 <span class="o">[</span>2] NCCL INFO <span class="nb">comm </span>0x55fb36d03b10 rank 2 nranks 4 cudaDev 2 busId 300000 - Init COMPLETE
8 2 float <span class="nb">sum</span> <span class="nt">-1</span> 13.38 0.00 0.00 0 13.11 0.00 0.00 0
16 4 float <span class="nb">sum</span> <span class="nt">-1</span> 13.26 0.00 0.00 0 13.02 0.00 0.00 0
32 8 float <span class="nb">sum</span> <span class="nt">-1</span> 13.30 0.00 0.00 0 13.18 0.00 0.00 0
64 16 float <span class="nb">sum</span> <span class="nt">-1</span> 13.37 0.00 0.01 0 13.46 0.00 0.01 0
128 32 float <span class="nb">sum</span> <span class="nt">-1</span> 13.43 0.01 0.01 0 13.37 0.01 0.01 0
256 64 float <span class="nb">sum</span> <span class="nt">-1</span> 13.65 0.02 0.03 0 13.36 0.02 0.03 0
512 128 float <span class="nb">sum</span> <span class="nt">-1</span> 13.69 0.04 0.06 0 13.71 0.04 0.06 0
1024 256 float <span class="nb">sum</span> <span class="nt">-1</span> 15.31 0.07 0.10 0 15.12 0.07 0.10 0
2048 512 float <span class="nb">sum</span> <span class="nt">-1</span> 16.19 0.13 0.19 0 15.73 0.13 0.20 0
4096 1024 float <span class="nb">sum</span> <span class="nt">-1</span> 17.17 0.24 0.36 0 16.72 0.25 0.37 0
8192 2048 float <span class="nb">sum</span> <span class="nt">-1</span> 18.11 0.45 0.68 0 17.35 0.47 0.71 0
16384 4096 float <span class="nb">sum</span> <span class="nt">-1</span> 19.63 0.83 1.25 0 19.23 0.85 1.28 0
32768 8192 float <span class="nb">sum</span> <span class="nt">-1</span> 21.48 1.53 2.29 0 20.84 1.57 2.36 0
65536 16384 float <span class="nb">sum</span> <span class="nt">-1</span> 21.87 3.00 4.49 0 21.64 3.03 4.54 0
131072 32768 float <span class="nb">sum</span> <span class="nt">-1</span> 31.87 4.11 6.17 0 31.61 4.15 6.22 0
262144 65536 float <span class="nb">sum</span> <span class="nt">-1</span> 64.36 4.07 6.11 0 64.34 4.07 6.11 0
524288 131072 float <span class="nb">sum</span> <span class="nt">-1</span> 74.00 7.09 10.63 0 73.69 7.11 10.67 0
1048576 262144 float <span class="nb">sum</span> <span class="nt">-1</span> 93.75 11.19 16.78 0 93.54 11.21 16.82 0
2097152 524288 float <span class="nb">sum</span> <span class="nt">-1</span> 137.2 15.29 22.93 0 137.1 15.30 22.95 0
4194304 1048576 float <span class="nb">sum</span> <span class="nt">-1</span> 228.5 18.36 27.54 0 228.3 18.37 27.56 0
8388608 2097152 float <span class="nb">sum</span> <span class="nt">-1</span> 436.9 19.20 28.80 0 435.6 19.26 28.89 0
16777216 4194304 float <span class="nb">sum</span> <span class="nt">-1</span> 866.6 19.36 29.04 0 870.8 19.27 28.90 0
33554432 8388608 float <span class="nb">sum</span> <span class="nt">-1</span> 1731.9 19.37 29.06 0 1736.4 19.32 28.99 0
67108864 16777216 float <span class="nb">sum</span> <span class="nt">-1</span> 3360.6 19.97 29.95 0 3330.4 20.15 30.23 0
134217728 33554432 float <span class="nb">sum</span> <span class="nt">-1</span> 6599.3 20.34 30.51 0 6616.6 20.28 30.43 0
268435456 67108864 float <span class="nb">sum</span> <span class="nt">-1</span> 13043 20.58 30.87 0 13134 20.44 30.66 0
536870912 134217728 float <span class="nb">sum</span> <span class="nt">-1</span> 26168 20.52 30.77 0 26043 20.61 30.92 0
1073741824 268435456 float <span class="nb">sum</span> <span class="nt">-1</span> 51970 20.66 30.99 0 51754 20.75 31.12 0
2147483648 536870912 float <span class="nb">sum</span> <span class="nt">-1</span> 104730 20.50 30.76 0 103974 20.65 30.98 0
4294967296 1073741824 float <span class="nb">sum</span> <span class="nt">-1</span> 214739 20.00 30.00 0 214882 19.99 29.98 0
8589934592 2147483648 float <span class="nb">sum</span> <span class="nt">-1</span> 456716 18.81 28.21 0 457441 18.78 28.17 0
nc96v4-pg0-2:89784:89784 <span class="o">[</span>0] NCCL INFO <span class="nb">comm </span>0x55ed0c7d34a0 rank 0 nranks 4 cudaDev 0 busId 100000 - Destroy COMPLETE
nc96v4-pg0-2:89785:89785 <span class="o">[</span>1] NCCL INFO <span class="nb">comm </span>0x55919eac19b0 rank 1 nranks 4 cudaDev 1 busId 200000 - Destroy COMPLETE
nc96v4-pg0-2:89787:89787 <span class="o">[</span>3] NCCL INFO <span class="nb">comm </span>0x556a167eb1d0 rank 3 nranks 4 cudaDev 3 busId 400000 - Destroy COMPLETE
<span class="c"># Out of bounds values : 0 OK</span>
<span class="c"># Avg bus bandwidth : 13.8362</span>
<span class="c">#</span>
nc96v4-pg0-2:89786:89786 <span class="o">[</span>2] NCCL INFO <span class="nb">comm </span>0x55fb36d03b10 rank 2 nranks 4 cudaDev 2 busId 300000 - Destroy COMPLETE
</code></pre></div></div>
<h3 id="nc96v4-1">NC48v4</h3>
<h4 id="test-with-both-nccl_topo_file-and-nccl_graph_file-being-set-1">Test with both <code class="language-plaintext highlighter-rouge">NCCL_TOPO_FILE</code> and <code class="language-plaintext highlighter-rouge">NCCL_GRAPH_FILE</code> being set</h4>
<p>SLURM script</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c">#SBATCH -t 00:20:00</span>
<span class="c">#SBATCH -p nc48v4</span>
<span class="c">#SBATCH -w nc48v4-pg0-1</span>
<span class="c">#SBATCH --ntasks-per-node=2</span>
<span class="c">#SBATCH --cpus-per-task=24</span>
<span class="c">#SBATCH --gpus-per-node=2</span>
<span class="c">#SBATCH --mem=0</span>
<span class="c">#SBATCH -o job.%J.out</span>
<span class="c">#SBATCH --error=job.%J.err</span>
<span class="nv">BASE_DIR</span><span class="o">=</span>/opt
<span class="nv">NCCL_TESTS_EXE</span><span class="o">=</span>all_reduce_perf
<span class="nb">export </span><span class="nv">NCCL_DEBUG</span><span class="o">=</span>INFO
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">source</span> /etc/profile.d/modules.sh
module load mpi/openmpi
<span class="nv">PIN_MASK</span><span class="o">=</span><span class="s1">'0xffffff,0xffffff000000'</span>
srun <span class="nt">--mpi</span><span class="o">=</span>pmix <span class="nt">--cpu-bind</span><span class="o">=</span>mask_cpu:<span class="nv">$PIN_MASK</span> <span class="nt">--gpus-per-node</span><span class="o">=</span>2 <span class="se">\</span>
<span class="nt">--ntasks-per-node</span><span class="o">=</span>2 <span class="se">\</span>
<span class="k">${</span><span class="nv">BASE_DIR</span><span class="k">}</span>/nccl-tests/build/<span class="nv">$NCCL_TESTS_EXE</span> <span class="nt">-b8</span> <span class="nt">-f</span> 2 <span class="nt">-g</span> 1 <span class="nt">-e</span> 8G <span class="nt">-c</span> 1
</code></pre></div></div>
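<p>The <code class="language-plaintext highlighter-rouge">PIN_MASK</code> above binds the two ranks to the two 24-core blocks of the VM: bit <em>i</em> of each mask set means CPU <em>i</em> is allowed for that rank. As a quick sketch, the two mask values can be reproduced like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># First rank: cores 0-23; second rank: the same 24-bit block shifted up by 24 bits
python3 -c 'print(hex(2**24-1)); print(hex((2**24-1)*2**24))'
# 0xffffff
# 0xffffff000000
</code></pre></div></div>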
<p>NCCL results</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># out-of-place in-place</span>
<span class="c"># size count type redop root time algbw busbw #wrong time algbw busbw #wrong</span>
<span class="c"># (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)</span>
8 2 float <span class="nb">sum</span> <span class="nt">-1</span> 8.71 0.00 0.00 0 8.66 0.00 0.00 0
16 4 float <span class="nb">sum</span> <span class="nt">-1</span> 8.76 0.00 0.00 0 8.71 0.00 0.00 0
32 8 float <span class="nb">sum</span> <span class="nt">-1</span> 8.77 0.00 0.00 0 8.77 0.00 0.00 0
64 16 float <span class="nb">sum</span> <span class="nt">-1</span> 8.86 0.01 0.01 0 8.75 0.01 0.01 0
128 32 float <span class="nb">sum</span> <span class="nt">-1</span> 8.91 0.01 0.01 0 8.69 0.01 0.01 0
256 64 float <span class="nb">sum</span> <span class="nt">-1</span> 8.86 0.03 0.03 0 8.71 0.03 0.03 0
512 128 float <span class="nb">sum</span> <span class="nt">-1</span> 9.06 0.06 0.06 0 8.73 0.06 0.06 0
1024 256 float <span class="nb">sum</span> <span class="nt">-1</span> 9.55 0.11 0.11 0 9.22 0.11 0.11 0
2048 512 float <span class="nb">sum</span> <span class="nt">-1</span> 9.37 0.22 0.22 0 9.26 0.22 0.22 0
4096 1024 float <span class="nb">sum</span> <span class="nt">-1</span> 9.57 0.43 0.43 0 9.28 0.44 0.44 0
8192 2048 float <span class="nb">sum</span> <span class="nt">-1</span> 10.32 0.79 0.79 0 10.12 0.81 0.81 0
16384 4096 float <span class="nb">sum</span> <span class="nt">-1</span> 11.13 1.47 1.47 0 10.80 1.52 1.52 0
32768 8192 float <span class="nb">sum</span> <span class="nt">-1</span> 11.21 2.92 2.92 0 10.97 2.99 2.99 0
65536 16384 float <span class="nb">sum</span> <span class="nt">-1</span> 13.35 4.91 4.91 0 12.80 5.12 5.12 0
131072 32768 float <span class="nb">sum</span> <span class="nt">-1</span> 30.14 4.35 4.35 0 30.03 4.36 4.36 0
262144 65536 float <span class="nb">sum</span> <span class="nt">-1</span> 32.36 8.10 8.10 0 32.42 8.09 8.09 0
524288 131072 float <span class="nb">sum</span> <span class="nt">-1</span> 37.54 13.97 13.97 0 37.00 14.17 14.17 0
1048576 262144 float <span class="nb">sum</span> <span class="nt">-1</span> 47.17 22.23 22.23 0 46.85 22.38 22.38 0
2097152 524288 float <span class="nb">sum</span> <span class="nt">-1</span> 67.65 31.00 31.00 0 66.91 31.34 31.34 0
4194304 1048576 float <span class="nb">sum</span> <span class="nt">-1</span> 102.7 40.83 40.83 0 101.9 41.18 41.18 0
8388608 2097152 float <span class="nb">sum</span> <span class="nt">-1</span> 170.6 49.17 49.17 0 170.5 49.21 49.21 0
16777216 4194304 float <span class="nb">sum</span> <span class="nt">-1</span> 307.9 54.49 54.49 0 305.5 54.92 54.92 0
33554432 8388608 float <span class="nb">sum</span> <span class="nt">-1</span> 599.0 56.01 56.01 0 592.3 56.65 56.65 0
67108864 16777216 float <span class="nb">sum</span> <span class="nt">-1</span> 1185.5 56.61 56.61 0 1171.0 57.31 57.31 0
134217728 33554432 float <span class="nb">sum</span> <span class="nt">-1</span> 2344.4 57.25 57.25 0 2326.3 57.69 57.69 0
268435456 67108864 float <span class="nb">sum</span> <span class="nt">-1</span> 4681.4 57.34 57.34 0 4637.0 57.89 57.89 0
536870912 134217728 float <span class="nb">sum</span> <span class="nt">-1</span> 9346.6 57.44 57.44 0 9257.0 58.00 58.00 0
1073741824 268435456 float <span class="nb">sum</span> <span class="nt">-1</span> 18693 57.44 57.44 0 18532 57.94 57.94 0
2147483648 536870912 float <span class="nb">sum</span> <span class="nt">-1</span> 37361 57.48 57.48 0 37038 57.98 57.98 0
4294967296 1073741824 float <span class="nb">sum</span> <span class="nt">-1</span> 74642 57.54 57.54 0 74055 58.00 58.00 0
8589934592 2147483648 float <span class="nb">sum</span> <span class="nt">-1</span> 149286 57.54 57.54 0 147984 58.05 58.05 0
nc48v4-pg0-1:74653:74653 <span class="o">[</span>1] NCCL INFO <span class="nb">comm </span>0x563243d4a050 rank 1 nranks 2 cudaDev 1 busId 200000 - Destroy COMPLETE
nc48v4-pg0-1:74652:74652 <span class="o">[</span>0] NCCL INFO <span class="nb">comm </span>0x558b00385b60 rank 0 nranks 2 cudaDev 0 busId 100000 - Destroy COMPLETE
<span class="c"># Out of bounds values : 0 OK</span>
<span class="c"># Avg bus bandwidth : 24.2941</span>
<span class="c">#</span>
</code></pre></div></div>
<h4 id="test-with-only-nccl_topo_file-comment-out-nccl_graph_fileoptmicrosoftncv4graphxml-in-etcncclconf-1">Test with only <code class="language-plaintext highlighter-rouge">NCCL_TOPO_FILE</code>. Comment out <code class="language-plaintext highlighter-rouge">NCCL_GRAPH_FILE=/opt/microsoft/ncv4/graph.xml</code> in <code class="language-plaintext highlighter-rouge">/etc/nccl.conf</code>.</h4>
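<p>One way to comment the line out without editing by hand is a small <code class="language-plaintext highlighter-rouge">sed</code> one-liner. The sketch below runs against a throwaway copy (the file contents shown are assumptions for the demo); once verified, the same <code class="language-plaintext highlighter-rouge">sed</code> command can be applied to <code class="language-plaintext highlighter-rouge">/etc/nccl.conf</code> with <code class="language-plaintext highlighter-rouge">sudo</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Build a demo copy of the config (paths here are placeholders)
printf 'NCCL_TOPO_FILE=/opt/microsoft/ncv4/topo.xml\nNCCL_GRAPH_FILE=/opt/microsoft/ncv4/graph.xml\n' > /tmp/nccl.conf.demo
# Prefix the NCCL_GRAPH_FILE line with '#' to disable it
sed -i 's|^NCCL_GRAPH_FILE=|#NCCL_GRAPH_FILE=|' /tmp/nccl.conf.demo
cat /tmp/nccl.conf.demo
</code></pre></div></div>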
<p>NCCL results</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># out-of-place in-place</span>
<span class="c"># size count type redop root time algbw busbw #wrong time algbw busbw #wrong</span>
<span class="c"># (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)</span>
8 2 float <span class="nb">sum</span> <span class="nt">-1</span> 8.74 0.00 0.00 0 8.59 0.00 0.00 0
16 4 float <span class="nb">sum</span> <span class="nt">-1</span> 8.49 0.00 0.00 0 8.43 0.00 0.00 0
32 8 float <span class="nb">sum</span> <span class="nt">-1</span> 8.56 0.00 0.00 0 8.50 0.00 0.00 0
64 16 float <span class="nb">sum</span> <span class="nt">-1</span> 8.66 0.01 0.01 0 8.52 0.01 0.01 0
128 32 float <span class="nb">sum</span> <span class="nt">-1</span> 8.62 0.01 0.01 0 8.43 0.02 0.02 0
256 64 float <span class="nb">sum</span> <span class="nt">-1</span> 8.70 0.03 0.03 0 8.48 0.03 0.03 0
512 128 float <span class="nb">sum</span> <span class="nt">-1</span> 8.61 0.06 0.06 0 8.51 0.06 0.06 0
1024 256 float <span class="nb">sum</span> <span class="nt">-1</span> 9.22 0.11 0.11 0 8.87 0.12 0.12 0
2048 512 float <span class="nb">sum</span> <span class="nt">-1</span> 9.27 0.22 0.22 0 9.94 0.21 0.21 0
4096 1024 float <span class="nb">sum</span> <span class="nt">-1</span> 9.48 0.43 0.43 0 9.21 0.44 0.44 0
8192 2048 float <span class="nb">sum</span> <span class="nt">-1</span> 10.49 0.78 0.78 0 10.00 0.82 0.82 0
16384 4096 float <span class="nb">sum</span> <span class="nt">-1</span> 11.07 1.48 1.48 0 10.86 1.51 1.51 0
32768 8192 float <span class="nb">sum</span> <span class="nt">-1</span> 11.26 2.91 2.91 0 11.82 2.77 2.77 0
65536 16384 float <span class="nb">sum</span> <span class="nt">-1</span> 11.54 5.68 5.68 0 11.53 5.68 5.68 0
131072 32768 float <span class="nb">sum</span> <span class="nt">-1</span> 12.16 10.78 10.78 0 11.90 11.02 11.02 0
262144 65536 float <span class="nb">sum</span> <span class="nt">-1</span> 14.07 18.64 18.64 0 13.74 19.08 19.08 0
524288 131072 float <span class="nb">sum</span> <span class="nt">-1</span> 17.20 30.48 30.48 0 17.21 30.47 30.47 0
1048576 262144 float <span class="nb">sum</span> <span class="nt">-1</span> 33.18 31.60 31.60 0 32.99 31.78 31.78 0
2097152 524288 float <span class="nb">sum</span> <span class="nt">-1</span> 40.34 51.99 51.99 0 40.17 52.21 52.21 0
4194304 1048576 float <span class="nb">sum</span> <span class="nt">-1</span> 50.02 83.85 83.85 0 49.50 84.73 84.73 0
8388608 2097152 float <span class="nb">sum</span> <span class="nt">-1</span> 79.09 106.07 106.07 0 77.09 108.82 108.82 0
16777216 4194304 float <span class="nb">sum</span> <span class="nt">-1</span> 117.2 143.12 143.12 0 115.9 144.81 144.81 0
33554432 8388608 float <span class="nb">sum</span> <span class="nt">-1</span> 209.2 160.39 160.39 0 208.4 161.03 161.03 0
67108864 16777216 float <span class="nb">sum</span> <span class="nt">-1</span> 374.7 179.11 179.11 0 374.1 179.37 179.37 0
134217728 33554432 float <span class="nb">sum</span> <span class="nt">-1</span> 724.9 185.16 185.16 0 724.0 185.38 185.38 0
268435456 67108864 float <span class="nb">sum</span> <span class="nt">-1</span> 1393.9 192.58 192.58 0 1394.1 192.55 192.55 0
536870912 134217728 float <span class="nb">sum</span> <span class="nt">-1</span> 2718.0 197.53 197.53 0 2722.0 197.24 197.24 0
1073741824 268435456 float <span class="nb">sum</span> <span class="nt">-1</span> 5196.2 206.64 206.64 0 5206.0 206.25 206.25 0
2147483648 536870912 float <span class="nb">sum</span> <span class="nt">-1</span> 9985.7 215.06 215.06 0 9954.4 215.73 215.73 0
4294967296 1073741824 float <span class="nb">sum</span> <span class="nt">-1</span> 19344 222.03 222.03 0 19362 221.82 221.82 0
8589934592 2147483648 float <span class="nb">sum</span> <span class="nt">-1</span> 38177 225.00 225.00 0 38158 225.11 225.11 0
nc48v4-pg0-1:74928:74928 <span class="o">[</span>1] NCCL INFO <span class="nb">comm </span>0x561a508c4cf0 rank 1 nranks 2 cudaDev 1 busId 200000 - Destroy COMPLETE
nc48v4-pg0-1:74927:74927 <span class="o">[</span>0] NCCL INFO <span class="nb">comm </span>0x562aaaf43cc0 rank 0 nranks 2 cudaDev 0 busId 100000 - Destroy COMPLETE
<span class="c"># Out of bounds values : 0 OK</span>
<span class="c"># Avg bus bandwidth : 73.4005</span>
<span class="c">#</span>
</code></pre></div></div>
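<p>The speedup can be checked directly from the largest-message numbers: the earlier run with both files set peaked at 58.05 GB/s in-place bus bandwidth at 8 GB, versus 225.11 GB/s here, which works out to roughly 3.9x:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Ratio of in-place bus bandwidth at the 8 GB message size
python3 -c 'print(round(225.11/58.05, 2))'
# 3.88
</code></pre></div></div>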
<h3 id="conclusion">Conclusion</h3>
<p>With <code class="language-plaintext highlighter-rouge">NCCL_GRAPH_FILE</code> set, NC96v4 shows no NCCL performance difference. On NC48v4, however, disabling <code class="language-plaintext highlighter-rouge">NCCL_GRAPH_FILE</code> roughly quadruples the peak NCCL all_reduce bus bandwidth (from about 58 GB/s to about 225 GB/s at large message sizes).</p>Jingchao Zhangjingczhang@microsoft.comYou can setup a SLURM cluster on Azure using AZHOP. This blog has details on how to deploy AZHOP.Disk io test2023-06-30T00:00:00+00:002023-06-30T00:00:00+00:00https://jingchaozhang.github.io/Disk%20IO%20test<h3 id="disk-details">Disk Details</h3>
<table>
<thead>
<tr>
<th>Disk name</th>
<th>Storage type</th>
<th>Size (GiB)</th>
<th>Max IOPS</th>
<th>Max throughput (MBps)</th>
<th>Encryption</th>
<th>Host caching</th>
</tr>
</thead>
<tbody>
<tr>
<td>V100_OsDisk_1_9d9847279b55436b92dba603cd7bd366</td>
<td>Premium SSD LRS</td>
<td>1024</td>
<td>5000</td>
<td>200</td>
<td>SSE with PMK</td>
<td>Read/Write</td>
</tr>
</tbody>
</table>
<h3 id="file-system-overview">File system overview</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>base<span class="o">)</span> jingchao@V100:~<span class="nv">$ </span><span class="nb">df</span> <span class="nt">-h</span>
Filesystem Size Used Avail Use% Mounted on
/dev/root 993G 509G 484G 52% /
devtmpfs 221G 0 221G 0% /dev
tmpfs 221G 0 221G 0% /dev/shm
tmpfs 45G 1.5M 45G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 221G 0 221G 0% /sys/fs/cgroup
/dev/loop0 64M 64M 0 100% /snap/core20/1822
/dev/sda15 105M 6.1M 99M 6% /boot/efi
/dev/loop1 64M 64M 0 100% /snap/core20/1950
/dev/loop2 92M 92M 0 100% /snap/lxd/24061
/dev/loop3 54M 54M 0 100% /snap/snapd/19457
/dev/loop4 50M 50M 0 100% /snap/snapd/18357
/dev/sdb1 2.9T 192M 2.9T 1% /mnt
</code></pre></div></div>
<p>There are several tests and tools you can use on Ubuntu to help determine the bottleneck in your download speed.</p>
<ol>
<li><strong>Network Speed Tests:</strong> Tools like <code class="language-plaintext highlighter-rouge">speedtest-cli</code> can be used to test your network speed. Install it using pip:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install speedtest-cli
</code></pre></div> </div>
<p>Then, simply run <code class="language-plaintext highlighter-rouge">speedtest-cli</code> in your terminal. This will give you an indication of your download and upload speeds.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>base<span class="o">)</span> jingchao@V100:~<span class="nv">$ </span>speedtest-cli <span class="nt">--bytes</span>
Retrieving speedtest.net configuration...
Testing from Microsoft Corporation <span class="o">(</span>20.225.51.252<span class="o">)</span>...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by United Cooperative Services – DA3 <span class="o">(</span>Burleson, TX<span class="o">)</span> <span class="o">[</span>589.57 km]: 15.037 ms
Testing download speed................................................................................
Download: 218.03 Mbyte/s
Testing upload speed......................................................................................................
Upload: 166.72 Mbyte/s
</code></pre></div> </div>
</li>
<li><strong>Disk Speed Tests:</strong> Tools like <code class="language-plaintext highlighter-rouge">dd</code> and <code class="language-plaintext highlighter-rouge">hdparm</code> can be used to test the speed of your disk. Here is a simple command using <code class="language-plaintext highlighter-rouge">dd</code> to test your write speed:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
</code></pre></div> </div>
<p>This will create a file named “tempfile” in your current directory and measure how long it takes to write it. The <code class="language-plaintext highlighter-rouge">bs</code> parameter is the block size (1M in this case), and <code class="language-plaintext highlighter-rouge">count</code> is the number of blocks. So this command writes a 1GB file. You can adjust these values to test with different file sizes.</p>
<p><code class="language-plaintext highlighter-rouge">/dev/zero</code> is a special file in Unix-like operating systems that produces as many null bytes (bytes with a value of zero) as are read from it.</p>
<p>If you want to write random data instead of null bytes, you can use <code class="language-plaintext highlighter-rouge">/dev/urandom</code> as the input file. <code class="language-plaintext highlighter-rouge">/dev/urandom</code> is another special file that produces random bytes when read. Here’s how you would modify the <code class="language-plaintext highlighter-rouge">dd</code> command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">dd </span><span class="k">if</span><span class="o">=</span>/dev/urandom <span class="nv">of</span><span class="o">=</span>tempfile <span class="nv">bs</span><span class="o">=</span>1M <span class="nv">count</span><span class="o">=</span>1024 <span class="nv">conv</span><span class="o">=</span>fdatasync,notrunc
</code></pre></div> </div>
<p>Keep in mind that writing random data will generally be slower than writing null bytes, because generating random data requires some computational effort. As a result, this test might give a lower write speed, but it might also be a more accurate representation of real-world disk write performance. After you’ve done the test, don’t forget to remove the tempfile with <code class="language-plaintext highlighter-rouge">rm tempfile</code>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>base<span class="o">)</span> jingchao@V100:~<span class="nv">$ </span><span class="nb">dd </span><span class="k">if</span><span class="o">=</span>/dev/zero <span class="nv">of</span><span class="o">=</span>tempfile <span class="nv">bs</span><span class="o">=</span>1M <span class="nv">count</span><span class="o">=</span>1024 <span class="nv">conv</span><span class="o">=</span>fdatasync,notrunc
1024+0 records <span class="k">in
</span>1024+0 records out
1073741824 bytes <span class="o">(</span>1.1 GB, 1.0 GiB<span class="o">)</span> copied, 5.38708 s, 199 MB/s
<span class="o">(</span>base<span class="o">)</span> jingchao@V100:~<span class="nv">$ </span>
</code></pre></div> </div>
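<p>The rate <code class="language-plaintext highlighter-rouge">dd</code> reports can be cross-checked from the raw numbers above: 1073741824 bytes over 5.38708 seconds is about 199 MB/s (decimal megabytes, as <code class="language-plaintext highlighter-rouge">dd</code> reports):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># bytes / seconds / 1e6 = decimal MB/s
python3 -c 'print(f"{1073741824/5.38708/1e6:.0f} MB/s")'
# 199 MB/s
</code></pre></div> </div>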
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>base<span class="o">)</span> jingchao@V100:~<span class="nv">$ </span><span class="nb">dd </span><span class="k">if</span><span class="o">=</span>/dev/urandom <span class="nv">of</span><span class="o">=</span>tempfile <span class="nv">bs</span><span class="o">=</span>1M <span class="nv">count</span><span class="o">=</span>1024 <span class="nv">conv</span><span class="o">=</span>fdatasync,notrunc
1024+0 records <span class="k">in
</span>1024+0 records out
1073741824 bytes <span class="o">(</span>1.1 GB, 1.0 GiB<span class="o">)</span> copied, 8.58883 s, 125 MB/s
<span class="o">(</span>base<span class="o">)</span> jingchao@V100:~<span class="nv">$ </span>
</code></pre></div> </div>
<p>After the test, don’t forget to remove the tempfile using:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rm tempfile
</code></pre></div> </div>
<p>Alternatively, you can use <code class="language-plaintext highlighter-rouge">hdparm</code> to test your read speed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo hdparm -Tt /dev/sda
</code></pre></div> </div>
<p>Replace <code class="language-plaintext highlighter-rouge">/dev/sda</code> with the path to your SSD. This will give you buffered and cached read speeds.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>base<span class="o">)</span> jingchao@V100:~<span class="nv">$ </span><span class="nb">sudo </span>hdparm <span class="nt">-Tt</span> /dev/sda
/dev/sda:
Timing cached reads: 20018 MB <span class="k">in </span>1.98 seconds <span class="o">=</span> 10092.45 MB/sec
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Timing buffered disk reads: 746 MB <span class="k">in </span>3.01 seconds <span class="o">=</span> 247.88 MB/sec
<span class="o">(</span>base<span class="o">)</span> jingchao@V100:~<span class="nv">$ </span>
</code></pre></div> </div>
</li>
<li>
<p><strong>CPU and Memory Utilization:</strong> Tools like <code class="language-plaintext highlighter-rouge">top</code>, <code class="language-plaintext highlighter-rouge">htop</code>, or <code class="language-plaintext highlighter-rouge">vmstat</code> can be used to monitor CPU and memory usage. Simply run <code class="language-plaintext highlighter-rouge">top</code> or <code class="language-plaintext highlighter-rouge">htop</code> in your terminal to see a live view of resource usage. The <code class="language-plaintext highlighter-rouge">vmstat</code> command can also be useful for monitoring system performance.</p>
</li>
<li><strong>Network Monitoring:</strong> Tools like <code class="language-plaintext highlighter-rouge">nethogs</code> can be used to monitor network usage by process. Install it using:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt-get install nethogs
</code></pre></div> </div>
<p>Then run <code class="language-plaintext highlighter-rouge">sudo nethogs</code> to see a live view of network usage. This can help you see if other processes are using a lot of bandwidth.</p>
</li>
<li><strong>HuggingFace Specifics:</strong> If you’re using HuggingFace’s <code class="language-plaintext highlighter-rouge">datasets</code> library, you could use Python’s built-in <code class="language-plaintext highlighter-rouge">cProfile</code> module to profile your script and see where the most time is being spent. This could help identify if the bottleneck is in the download, the disk write, the decompression, or some other part of the process.</li>
</ol>
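<p>As a minimal sketch of the <code class="language-plaintext highlighter-rouge">cProfile</code> approach, the snippet below profiles an inline stand-in workload; for a real run you would pass your actual download script, e.g. <code class="language-plaintext highlighter-rouge">python3 -m cProfile -s cumtime your_script.py</code> (the script name is hypothetical):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Profile a stand-in workload, sorting functions by cumulative time
python3 -c 'import cProfile; cProfile.run("sum(i*i for i in range(10**5))", sort="cumtime")'
</code></pre></div></div>
<p>The lines at the top of the sorted output are where most of the time goes; for a dataset download, heavy time in socket reads points at the network, while heavy time in file writes or decompression points at the disk or CPU.</p>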
<p>Remember, these tests will give you raw numbers, and it’s the interpretation of these numbers that will help you identify the bottleneck. For example, if your network speed is very high but your disk write speed is low, the disk might be the bottleneck. On the other hand, if both your network and disk speeds are high, but you’re seeing high CPU usage, the bottleneck might be the CPU.</p>Jingchao Zhangjingczhang@microsoft.comDisk DetailsAzhop e2e deployment2023-04-28T00:00:00+00:002023-04-28T00:00:00+00:00https://jingchaozhang.github.io/AZHOP%20E2E%20deployment<p>Azure HPC On-Demand Platform (az-hop) is a tool that provides an end-to-end deployment mechanism for a base HPC infrastructure on Azure. It uses industry standard tools like Terraform, Ansible and Packer to provision and configure a complete HPC cluster solution that is ready for users to run applications. It also includes features such as an HPC OnDemand Portal, an Active Directory, a Job Scheduler, dynamic resources provisioning and autoscaling, a Jumpbox, and various storage options <a href="https://azure.github.io/az-hop/">ref</a>.</p>
<p>Clone the repo</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone <span class="nt">--recursive</span> https://github.com/Azure/az-hop.git
</code></pre></div></div>
<p>Create the <code class="language-plaintext highlighter-rouge">config.yml</code> file</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">---</span>
project_name: az-hop
location: eastus
resource_group: JZ-azhop_v2
use_existing_rg: <span class="nb">false
</span>tags:
<span class="nb">env</span>: dev
project: azhop
log_analytics:
create: <span class="nb">false
</span>monitoring:
install_agent: <span class="nb">false
</span>alerting:
enabled: <span class="nb">false
</span>admin_email: email@email.com
local_volume_threshold: 80
anf:
create: <span class="nb">false
</span>homefs_size_tb: 4
homefs_service_level: Standard
dual_protocol: <span class="nb">false</span> <span class="c"># true to enable SMB support. false by default</span>
alert_threshold: 80 <span class="c"># alert when ANF volume reaches this threshold</span>
azurefiles:
create: <span class="nb">true
</span>size_gb: 1024
mounts:
home: <span class="c"># This home name can't be changed</span>
<span class="nb">type</span>: azurefiles <span class="c"># anf or azurefiles, default to anf. One of the two should be defined in order to mount the home directory</span>
mountpoint: /anfhome <span class="c"># /sharedhome for example</span>
server: <span class="s1">''</span> <span class="c"># Specify an existing NFS server name or IP, when using the ANF built in use ''</span>
<span class="nb">export</span>: <span class="s1">''</span> <span class="c"># Specify an existing NFS export directory, when using the ANF built in use ''</span>
options: <span class="s1">'vers=4,minorversion=1,sec=sys'</span> <span class="c">#'' # Specify the mount options. Default to rw,hard,rsize=262144,wsize=262144,vers=3,tcp,_netdev</span>
admin_user: hpcadmin
network:
create_nsg: <span class="nb">true
</span>vnet:
name: hpcvnet <span class="c"># Optional - default to hpcvnet</span>
address_space: <span class="s2">"10.101.0.0/23"</span>
subnets: <span class="c"># all subnets are optionals</span>
frontend:
name: frontend
address_prefixes: <span class="s2">"10.101.0.0/29"</span>
create: <span class="nb">true</span> <span class="c"># create the subnet if true. default to true when not specified, default to false if using an existing VNET when not specified</span>
admin:
name: admin
address_prefixes: <span class="s2">"10.101.0.16/28"</span>
create: <span class="nb">true
</span>ad:
name: ad
address_prefixes: <span class="s2">"10.101.0.8/29"</span>
create: <span class="nb">true
</span>netapp:
name: netapp
address_prefixes: <span class="s2">"10.101.0.32/28"</span>
create: <span class="nb">true
</span>compute:
name: compute
address_prefixes: <span class="s2">"10.101.1.0/24"</span>
create: <span class="nb">true
</span>locked_down_network:
enforce: <span class="nb">false
</span>public_ip: <span class="nb">true</span> <span class="c"># Enable public IP creation for Jumpbox, OnDemand and create images. Default to true</span>
linux_base_image: <span class="s2">"OpenLogic:CentOS:7_9-gen2:latest"</span>
windows_base_image: <span class="s2">"MicrosoftWindowsServer:WindowsServer:2019-Datacenter-smalldisk:latest"</span> <span class="c"># publisher:offer:sku:version or image_id</span>
deployer:
vm_size: Standard_B2ms
ad:
vm_size: Standard_B2ms
ondemand:
vm_size: Standard_D4s_v5
generate_certificate: <span class="nb">true</span> <span class="c"># Generate an SSL certificate for the OnDemand portal. Default to true</span>
grafana:
vm_size: Standard_B2ms
guacamole:
vm_size: Standard_B2ms
scheduler:
vm_size: Standard_B2ms
cyclecloud:
vm_size: Standard_B2ms
<span class="nb">users</span>:
- <span class="o">{</span> name: hpcuser, uid: 10001 <span class="o">}</span>
- <span class="o">{</span> name: adminuser, uid: 10002, <span class="nb">groups</span>: <span class="o">[</span>5001, 5002] <span class="o">}</span>
- <span class="o">{</span> name: john.john, uid: 10003 <span class="o">}</span>
usergroups:
- name: Domain Users <span class="c"># All users will be added to this one by default</span>
gid: 5000
- name: az-hop-admins
gid: 5001
description: <span class="s2">"For users with azhop admin privileges"</span>
- name: az-hop-localadmins
gid: 5002
description: <span class="s2">"For users with sudo right or local admin right on nodes"</span>
cvmfs_eessi:
enabled: <span class="nb">false
</span>queue_manager: slurm
slurm:
accounting_enabled: <span class="nb">false
</span>slurm_version: 20.11.9
enroot:
enroot_version: 3.4.1
database:
user: sqladmin
bastion:
create: <span class="nb">false
</span>vpn_gateway:
create: <span class="nb">false
</span>authentication:
httpd_auth: basic <span class="c"># oidc or basic</span>
autoscale:
idle_timeout: 1800 <span class="c"># Idle time in seconds before shutting down VMs - default to 1800 like in CycleCloud</span>
queues:
- name: execute
vm_size: Standard_F2s_v2
max_core_count: 20
image: azhpc:azhop-compute:ubuntu-2004:latest
spot: <span class="nb">false
</span>ColocateNodes: <span class="nb">false
</span>enable_remote_winviz: <span class="nb">false</span> <span class="c"># Set to true to enable windows remote visualization</span>
remoteviz:
- name: winviz <span class="c"># This name is fixed and can't be changed</span>
vm_size: Standard_NV12s_v3 <span class="c"># Standard_NV8as_v4 Only NVsv3 and NVsV4 are supported</span>
max_core_count: 48
image: <span class="s2">"MicrosoftWindowsDesktop:Windows-10:21h1-pron:latest"</span>
ColocateNodes: <span class="nb">false
</span>spot: <span class="nb">false
</span>EnableAcceleratedNetworking: <span class="nb">false
</span>applications:
bc_codeserver:
enabled: <span class="nb">true
</span>bc_jupyter:
enabled: <span class="nb">true
</span>bc_amlsdk:
enabled: <span class="nb">false
</span>bc_rstudio:
enabled: <span class="nb">true
</span>bc_ansys_workbench:
enabled: <span class="nb">false
</span>bc_vmd:
enabled: <span class="nb">false
</span>bc_paraview:
enabled: <span class="nb">false
</span>bc_vizer:
enabled: <span class="nb">false</span>
</code></pre></div></div>
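<p>Before running the build, it can be worth confirming that <code class="language-plaintext highlighter-rouge">config.yml</code> is syntactically valid YAML, since an indentation slip will only surface later in the deployment. A quick sketch, assuming the PyYAML package is available:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Fails with a parse error and a non-zero exit code if the YAML is malformed
python3 -c 'import yaml; yaml.safe_load(open("config.yml")); print("config.yml parses OK")'
</code></pre></div></div>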
<p>Install dependencies</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo</span> ./toolset/scripts/install.sh
</code></pre></div></div>
<p>Build the backbone using bicep</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./build.sh <span class="nt">-a</span> apply <span class="nt">-l</span> bicep
</code></pre></div></div>
<p>Find the deployer VM IP address in the Azure portal, then connect to the deployer VM</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh <span class="nt">-i</span> hpcadmin_id_rsa hpcadmin@20.231.50.26
The authenticity of host <span class="s1">'20.231.50.26 (20.231.50.26)'</span> can<span class="s1">'t be established.
ECDSA key fingerprint is SHA256:lPf4I4nZmZ7hzuxif9RZOVdMmGC6zMvSylTE79Tapwk.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '</span>20.231.50.26<span class="s1">' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-1036-azure x86_64)
</span></code></pre></div></div>
<p>Monitor the Ansible installation</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hpcadmin@deployer:~<span class="nv">$ </span><span class="nb">sudo</span> <span class="nt">-i</span>
root@deployer:~# <span class="nb">cd</span> /var/log/
root@deployer:/var/log# <span class="nb">ls
</span>apt azure chrony cloud-init.log dmesg journal landscape private ubuntu-advantage.log waagent.log
auth.log btmp cloud-init-output.log dist-upgrade dpkg.log kern.log lastlog syslog unattended-upgrades wtmp
root@deployer:/var/log# <span class="nb">tail</span> <span class="nt">-f</span> cloud-init-output.log
</code></pre></div></div>
<p>The end of a successful deployment:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PLAY RECAP <span class="k">*********************************************************************</span>
ccportal : <span class="nv">ok</span><span class="o">=</span>3 <span class="nv">changed</span><span class="o">=</span>2 <span class="nv">unreachable</span><span class="o">=</span>0 <span class="nv">failed</span><span class="o">=</span>0 <span class="nv">skipped</span><span class="o">=</span>1 <span class="nv">rescued</span><span class="o">=</span>0 <span class="nv">ignored</span><span class="o">=</span>0
grafana : <span class="nv">ok</span><span class="o">=</span>3 <span class="nv">changed</span><span class="o">=</span>2 <span class="nv">unreachable</span><span class="o">=</span>0 <span class="nv">failed</span><span class="o">=</span>0 <span class="nv">skipped</span><span class="o">=</span>1 <span class="nv">rescued</span><span class="o">=</span>0 <span class="nv">ignored</span><span class="o">=</span>0
ondemand : <span class="nv">ok</span><span class="o">=</span>3 <span class="nv">changed</span><span class="o">=</span>2 <span class="nv">unreachable</span><span class="o">=</span>0 <span class="nv">failed</span><span class="o">=</span>0 <span class="nv">skipped</span><span class="o">=</span>1 <span class="nv">rescued</span><span class="o">=</span>0 <span class="nv">ignored</span><span class="o">=</span>0
scheduler : <span class="nv">ok</span><span class="o">=</span>3 <span class="nv">changed</span><span class="o">=</span>2 <span class="nv">unreachable</span><span class="o">=</span>0 <span class="nv">failed</span><span class="o">=</span>0 <span class="nv">skipped</span><span class="o">=</span>1 <span class="nv">rescued</span><span class="o">=</span>0 <span class="nv">ignored</span><span class="o">=</span>0
Saturday 29 April 2023 03:58:01 +0000 <span class="o">(</span>0:00:01.488<span class="o">)</span> 0:00:03.239 <span class="k">********</span>
<span class="o">===============================================================================</span>
chrony <span class="nt">------------------------------------------------------------------</span> 3.06s
include_role <span class="nt">------------------------------------------------------------</span> 0.11s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
total <span class="nt">-------------------------------------------------------------------</span> 3.17s
Command succeeded!
Cloud-init v. 23.1.1-0ubuntu0~20.04.1 running <span class="s1">'modules:final'</span> at Sat, 29 Apr 2023 03:16:47 +0000. Up 27.90 seconds.
Cloud-init v. 23.1.1-0ubuntu0~20.04.1 finished at Sat, 29 Apr 2023 03:58:01 +0000. Datasource DataSourceAzure <span class="o">[</span><span class="nv">seed</span><span class="o">=</span>/dev/sr0]. Up 2501.89 seconds
</code></pre></div></div>
<p>Get the FQDN and the username/password:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@deployer:/az-hop# <span class="nb">cd</span> /az-hop/
root@deployer:/az-hop# <span class="nb">pwd</span>
/az-hop
root@deployer:/az-hop# <span class="nb">grep </span>ondemand_fqdn playbooks/group_vars/all.yml
ondemand_fqdn: ondemandmxsmmtrkr6ehsx.eastus.cloudapp.azure.com
root@deployer:/az-hop# ./bin/get_secret john.john
<span class="nv">j5hTzIyBXqVExNpB35pJGVra5sM</span><span class="o">=</span>
</code></pre></div></div>Jingchao Zhangjingczhang@microsoft.comAzure HPC On-Demand Platform (az-hop) is a tool that provides an end-to-end deployment mechanism for a base HPC infrastructure on Azure. It uses industry standard tools like Terraform, Ansible and Packer to provision and configure a complete HPC cluster solution that is ready for users to run applications. It also includes features such as an HPC OnDemand Portal, an Active Directory, a Job Scheduler, dynamic resources provisioning and autoscaling, a Jumpbox, and various storage options ref.A100 white paper2023-04-24T00:00:00+00:002023-04-24T00:00:00+00:00https://jingchaozhang.github.io/A100%20white%20paper<ul>
<li>Third generation Tensor Cores
<ul>
<li>Tensor Cores are specialized high-performance compute cores that perform mixed-precision matrix multiply and accumulate calculations in a single operation.</li>
<li>Supported data types:
<ul>
<li>INT4</li>
<li>binary</li>
<li>TensorFloat32 (TF32)</li>
<li>IEEE Compliant FP64</li>
<li>BFloat16 (BF16) (BF16/FP32 mixed-precision Tensor Core operations perform at the same speed as FP16/FP32 mixed-precision Tensor Core operations)</li>
</ul>
</li>
</ul>
</li>
<li>TF32
<ul>
<li>The new TF32 operations run <strong>10X faster</strong> than the FP32 FMA operations available with the previous generation data center GPU</li>
<li>TF32 combines the range of FP32 with the precision of FP16</li>
<li>Compared to FP32 on V100, TF32 on A100 provides over <strong>6X speedup</strong> for training the BERT-Large model</li>
<li>TF32 is the default mode for TensorFlow, PyTorch and MXNet, starting with NGC Deep Learning Container 20.06 Release</li>
</ul>
</li>
<li>Fine-grained Structured Sparsity
<ul>
<li>fine-grained structured sparsity and the 2:4 pattern</li>
<li>balanced workload distribution and even utilization of compute resources</li>
<li>structured sparse matrices can be efficiently compressed</li>
<li>With fine-grained structured sparsity, <em>INT8</em> Tensor Core operations on A100 offer <strong>20X more performance</strong> than on V100, and <em>FP16</em> Tensor Core operations are <strong>5X faster</strong> than on V100</li>
</ul>
</li>
<li>Multi-instance GPU (MIG)
<ul>
<li>spatial partitioning</li>
<li>each GPU instance has its own memory, cache, and streaming multiprocessor (<strong>isolated GPU memory and physical GPU resources</strong>)</li>
</ul>
</li>
<li>NVIDIA® NVSwitch
<ul>
<li>Six second-generation NVSwitches</li>
<li>GPU to GPU communication to peak at <strong>600 GB/s</strong></li>
<li>If all GPUs are communicating with each other, the total amount of data transferred peaks at <strong>4.8 TB/s</strong> for both directions.</li>
</ul>
</li>
</ul>
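<p>The TF32 format described above keeps FP32’s sign bit and 8-bit exponent (FP32’s range) but only a 10-bit mantissa (FP16’s precision). As a rough illustration only (this sketch truncates the discarded mantissa bits, whereas the hardware rounds), the format can be simulated in plain Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import struct

def to_tf32(x):
    # Reinterpret x as an IEEE-754 binary32 bit pattern, then drop the
    # low 13 mantissa bits: sign(1) + exponent(8) + mantissa(10) remain.
    bits = struct.unpack("=I", struct.pack("=f", x))[0]
    bits -= bits % 0x2000          # zero the 13 discarded mantissa bits
    return struct.unpack("=f", struct.pack("=I", bits))[0]

print(to_tf32(1.5))   # 1.5 fits in 10 mantissa bits, so it is unchanged
print(to_tf32(0.1))   # slightly below 0.1: the low mantissa bits are lost
</code></pre></div></div>
<p>Because the exponent field is untouched, any value representable in FP32 stays in range; only the fine mantissa bits beyond FP16-level precision are dropped.</p>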
<table>
<thead>
<tr>
<th>Specs</th>
<th>1st Generation</th>
<th>2nd Generation</th>
<th>3rd Generation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of GPUs with direct connection / node</td>
<td>Up to 8</td>
<td>Up to 8</td>
<td>Up to 8</td>
</tr>
<tr>
<td>NVSwitch GPU-to-GPU bandwidth</td>
<td>300GB/s</td>
<td>600GB/s</td>
<td>900GB/s</td>
</tr>
<tr>
<td>Total aggregate bandwidth</td>
<td>2.4TB/s</td>
<td>4.8TB/s</td>
<td>7.2TB/s</td>
</tr>
<tr>
<td>Supported NVIDIA Architectures</td>
<td>Volta</td>
<td>Ampere</td>
<td>Hopper</td>
</tr>
</tbody>
</table>
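<p>The NVSwitch table is internally consistent: in every generation, the total aggregate bandwidth is simply 8 GPUs per node times the per-GPU GPU-to-GPU rate. A quick sanity check:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># 8 GPUs per node, each at the generation's GPU-to-GPU NVLink rate (GB/s)
per_gpu = {"1st gen (Volta)": 300, "2nd gen (Ampere)": 600, "3rd gen (Hopper)": 900}
for gen, gb_s in per_gpu.items():
    print(f"{gen}: 8 x {gb_s} GB/s = {8 * gb_s / 1000} TB/s")
# 2nd gen (Ampere): 8 x 600 GB/s = 4.8 TB/s
</code></pre></div></div>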
<ul>
<li>NVIDIA NVLink®
<ul>
<li>Third-generation NVLink</li>
<li>Each A100 GPU uses <strong>twelve NVLink interconnects</strong> to communicate with all six NVSwitches (two links from each GPU to each switch)</li>
</ul>
</li>
</ul>
<table>
<thead>
<tr>
<th>Specs</th>
<th>2nd Generation</th>
<th>3rd Generation</th>
<th>4th Generation</th>
</tr>
</thead>
<tbody>
<tr>
<td>NVLink bandwidth per GPU</td>
<td>300GB/s</td>
<td>600GB/s</td>
<td>900GB/s</td>
</tr>
<tr>
<td>Maximum Number of Links per GPU</td>
<td>6</td>
<td>12</td>
<td>18</td>
</tr>
<tr>
<td>Supported NVIDIA Architectures</td>
<td>Volta</td>
<td>Ampere</td>
<td>Hopper</td>
</tr>
</tbody>
</table>
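<p>The per-GPU figures in the NVLink table follow from a per-link rate that has held steady across these generations: 50 GB/s per link (both directions combined), so per-GPU bandwidth scales with the link count. A quick check of the table:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Per-GPU bandwidth = links per GPU x 50 GB/s per link (bidirectional)
links = {"2nd gen (Volta)": 6, "3rd gen (Ampere)": 12, "4th gen (Hopper)": 18}
for gen, n in links.items():
    print(f"{gen}: {n} links x 50 GB/s = {n * 50} GB/s")
</code></pre></div></div>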
<ul>
<li>Mellanox ConnectX-6 HDR
<ul>
<li>200 Gb/s per port (4 data lanes at 50 Gb/s each, for 200 Gb/s total)</li>
<li><strong>8</strong> single-port Mellanox ConnectX-6 <strong>200Gb/s HDR</strong> InfiniBand ports (also configurable as 200Gb/s Ethernet ports) providing <strong>3.2 Tb/s</strong> of peak bandwidth</li>
<li>DGX A100 incorporates a one-to-one relationship between the IO cards and the GPUs, which means each GPU can communicate directly with external sources without blocking other GPUs’ access to the network.</li>
<li>DGX A100 includes an <strong>additional dual-port</strong> ConnectX-6 card that can be used for high-speed connection to <strong>external storage</strong></li>
</ul>
</li>
<li>PCIe Gen4
<ul>
<li>NVIDIA A100 GPUs are connected to the PCI switch infrastructure over x16 PCI Express Gen 4 (PCIe Gen4) buses that provide 31.5 GB/s each for a total of 252 GB/s</li>
<li>These are the links that provide access to the <strong>Mellanox ConnectX-6, the NVMe storage, and the CPUs</strong>.</li>
</ul>
</li>
</ul>Jingchao Zhangjingczhang@microsoft.comThird generation Tensor Cores Tensor Cores are specialized high-performance compute cores that perform mixed-precision matrix multiply and accumulate calculations in a single operation. Supported data types: INT4 binary TensorFloat32 (TF32) IEEE Compliant FP64 BFloat16 (BF16) (BF16/FP32 mixed-precision Tensor Core operations perform at the same speed as FP16/FP32 mixed-precision Tensor Core operations)