15 Mar 2026
NCCL Ring vs Tree for Multi-Node LLM Fine-Tuning
TL;DR — We benchmarked NCCL Ring, Tree, and Default allreduce algorithms across 1–8 nodes (8–64 H100 GPUs) on Azure N...
HPC/AI @Microsoft
Introduction A single thermally throttled GPU — one out of sixteen — can cut your distributed training throughput by...
Introduction In my previous post, I showed that Mixtral 8x7B (MoE) requires 56× more effective interconnect bandwidt...
Introduction In my previous post, I showed that InfiniBand delivers 27–57× higher multi-node throughput than Etherne...
The Problem