28 Feb 2026
GPU Stress Tests Degraded InfiniBand Performance
The Problem
HPC/AI @Microsoft
Introduction: Azure’s H100 GPU VMs (Standard_ND96isr_H100_v5) come equipped with 8× 400 Gb/s NDR InfiniBand, 3...
The Problem: Every Node Needs the Same Data. When you scale a GPU cluster beyond a single node, you immediately hit a...
Introduction: In a previous post, I ran Qwen2.5-72B inference on Azure H100 nodes and showed how NVLink’s 900 GB/s ba...
Introduction: When serving a large language model across multiple GPUs, the choice of parallelism strategy directly d...
Profiling AI/ML models on single/multi-GPUs using AzureHPC images