Impact of GPU Thermal Throttling on LLM Training

Updated: March 03, 2026

Introduction

A single thermally throttled GPU — one out of sixteen — can cut your distributed training throughput by 5×. Not because it crashes. Not because it throws an error. It just runs slower, and FSDP’s synchronous gradient reduction forces every other GPU to wait.

In this post, I show exactly how this plays out on an Azure ND H100 v5 cluster. I stress-test 10 nodes with dcgmproftester, identify the 2 nodes with thermal issues, then run Qwen 2.5-7B FSDP fine-tuning across three configurations: 2 healthy nodes, 1 healthy + 1 thermal, and 2 thermal nodes. Per-step throughput logging reveals the progressive degradation as GPUs heat up during training.

Test Environment

Component	Detail
VM SKU	Standard_ND96isr_H100_v5
GPUs per node	8× NVIDIA H100 80 GB HBM3
Inter-node	8× 400 Gb/s NDR InfiniBand (ConnectX-7)
Cluster	10 nodes provisioned
Container	`nvcr.io/nvidia/pytorch:24.12-py3` (PyTorch 2.6, NCCL 2.23.4)
Shared FS	Azure Managed Lustre (`/lustre`)

Step 1: Detecting Thermal Issues with dcgmproftester

Before running any benchmarks, I stress-test every node in the cluster to identify hardware problems. The approach is simple:

Run dcgmproftester13 (target 1004, the DCGM tensor pipe activity field, which keeps the tensor cores under sustained load) for 120 seconds on all nodes simultaneously
Monitor nvidia-smi for thermal slowdown flags during the test
Parse results to find any failing nodes

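# Stress test parameters
DURATION=120   # seconds of sustained load
TARGET=1004    # DCGM profiling field ID to exercise
CURRENT_DATE=$(date +"%Y-%m-%d.%Hh%Mm%Ss")
LOGDIR=$(pwd)
# Per-node log file; the name pattern is inferred from the result files shown below
LOGPATH="$LOGDIR/thermal_results.$(hostname).$TARGET.$DURATION.log"

# Launch GPU stress test in the background
dcgmproftester13 --no-dcgm-validation --max-processes 0 \
    -t "$TARGET" -d "$DURATION" >> "$LOGPATH" 2>&1 &

# Check for thermal throttling while the test runs
nvidia-smi -q -d PERFORMANCE | grep -i slowdown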

The HW Slowdown and HW Thermal Slowdown fields in nvidia-smi change from Not Active to Active when a GPU is being throttled due to temperature. I run this check in a loop during the stress test and flag any node where throttling occurs.
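The polling loop described above could look roughly like the sketch below. It is not the exact script behind these results: the function names has_hw_slowdown and monitor_slowdown, the 5-second polling interval, and the PASSED/FAILED message wording are assumptions for illustration. One detail to watch is that "Not Active" contains the word "Active", so the match is anchored on the colon.

```shell
# Return 0 if the `nvidia-smi -q -d PERFORMANCE` report on stdin shows an
# active HW Slowdown or HW Thermal Slowdown on any GPU.
has_hw_slowdown() {
    grep -Eq 'HW (Thermal )?Slowdown[[:space:]]*:[[:space:]]*Active'
}

# Poll nvidia-smi until the stress test duration has elapsed.
# Arguments: duration in seconds, optional polling interval (assumed default: 5).
monitor_slowdown() {
    local duration=$1 interval=${2:-5} elapsed=0
    while [ "$elapsed" -lt "$duration" ]; do
        if nvidia-smi -q -d PERFORMANCE | has_hw_slowdown; then
            # Hypothetical message format, loosely modeled on the result logs
            echo "[RESULT] $(hostname) | FAILED due to thermal throttling"
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "[RESULT] $(hostname) | PASSED"
}
```

Called as monitor_slowdown "$DURATION" right after the stress test is launched in the background, the function returns non-zero as soon as any GPU on the node reports an active hardware slowdown.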

Results: 2 out of 10 Nodes Failed

Out of 10 nodes, 8 passed and 2 failed with thermal throttling:

thermal_results.vmssB3GCE6.1004.120.log:
  [RESULT] vmssB3GCE6 | FAILED due to thermal throttling: GPU11 (HW Thermal Slowdown)

thermal_results.vmssTSERJB.1004.120.log:
  [RESULT] vmssTSERJB | FAILED due to thermal throttling: GPU5 (HW Thermal Slowdown)

Each bad node had only 1 GPU out of 8 with a thermal problem. This is a subtle failure mode — the node doesn’t crash, NHC (Node Health Check) may not catch it in a quick scan, and the job will launch successfully. The only symptom is degraded performance that gets worse over time.

Step 2: Measuring the Impact

To quantify the performance cost, I ran Qwen 2.5-7B FSDP fine-tuning (2 nodes × 8 GPUs = 16 GPUs) with per-step timing across three configurations:

Scenario	Nodes
2 Healthy	vmss72VTKQ + vmssNTAV11
1 Healthy + 1 Thermal	vmss72VTKQ + vmssB3GCE6
2 Thermal	vmssB3GCE6 + vmssTSERJB

Training configuration:

Model: Qwen 2.5-7B (7.6B parameters)
FSDP with FULL_SHARD, BF16 mixed precision
Sequence length: 2048, micro batch size: 2
50 benchmark steps after 5 warmup steps
IB enabled (NCCL_IB_DISABLE=0)

The key modification from a standard benchmark: I log timing for every individual step rather than reporting cumulative averages, which would hide the progressive degradation pattern.

Results

Per-Step Throughput

[Figure: Thermal Throttling Impact on Training Performance]

The degradation follows a clear three-phase pattern on the thermal nodes:

Phase	Steps	Throughput	What’s Happening
Cold start	1-2	~95K tok/s	GPUs are still cool from being idle; the best throughput of the run, though already ~1.4× below the 131K tok/s healthy baseline
Initial throttle	3-30	~38-42K tok/s	Bad GPU hits thermal limit, clock frequency drops, all other GPUs wait inside NCCL collectives
Severe throttle	31-50	~25-26K tok/s	Sustained load drives temperature higher, deeper clock reduction

Summary Table

Metric	2 Healthy	1 Healthy + 1 Thermal	2 Thermal
Step 1 throughput	131K tok/s	96K tok/s	93K tok/s
First 5 steps avg	131K tok/s	55K tok/s	48K tok/s
Last 5 steps avg	131K tok/s	25K tok/s	26K tok/s
Final slowdown	1.0×	5.1×	5.1×
Step time	~500 ms	~2,560 ms	~2,550 ms
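The step times and throughputs in the table can be cross-checked with a little arithmetic. The sketch below assumes the micro batch size of 2 is per GPU and that there is no gradient accumulation (neither is stated explicitly above); the helper names tokens_per_step and throughput_tok_s are made up for this example.

```shell
# Tokens processed per training step across the whole job.
# Arguments: number of GPUs, micro batch size per GPU, sequence length.
tokens_per_step() {
    echo $(( $1 * $2 * $3 ))
}

# Throughput in tok/s from tokens per step and step time in milliseconds.
throughput_tok_s() {
    awk -v t="$1" -v ms="$2" 'BEGIN { printf "%.0f\n", t * 1000 / ms }'
}

tokens=$(tokens_per_step 16 2 2048)   # 65536 tokens per step
throughput_tok_s "$tokens" 500        # healthy step time:   131072 tok/s
throughput_tok_s "$tokens" 2560       # throttled step time:  25600 tok/s
```

Under those assumptions the numbers line up with the table: about 131K tok/s at ~500 ms per step and about 25-26K tok/s at ~2,560 ms per step.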

Why 1 Bad Node = 2 Bad Nodes

The most striking finding: 1 healthy + 1 thermal performs identically to 2 thermal nodes at steady state. This happens because FSDP’s synchronous collectives create a hard dependency on the slowest participant:

With FULL_SHARD, every step all-gathers sharded parameters and reduce-scatters gradients across all 16 GPUs
Each of these NCCL calls is a collective operation, so every GPU must participate before any of them can finish
The 1 throttled GPU (out of 16) arrives late to every collective
All 15 healthy GPUs sit idle waiting for the straggler
Adding a second bad node doesn’t make things meaningfully worse because the bottleneck was already established
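The dependency described above can be captured in a toy model: with a synchronous collective, the step time is the maximum of the per-GPU times, not their average. The sketch below is a deliberate simplification that ignores communication cost and compute/communication overlap, and the function name sync_step_time_ms is hypothetical.

```shell
# Step time under a synchronous collective: every GPU waits for the slowest one.
# Arguments: one per-GPU time in milliseconds per argument.
sync_step_time_ms() {
    printf '%s\n' "$@" | awk 'NR == 1 || $1 > max { max = $1 } END { print max }'
}

sync_step_time_ms 500 500 500 500     # all healthy:     prints 500
sync_step_time_ms 500 500 500 2560    # one straggler:   prints 2560
sync_step_time_ms 500 500 2550 2560   # two stragglers:  prints 2560
```

The second and third calls give the same answer, which is the whole point: once one participant is slow, a second slow participant adds almost nothing.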

This is why detecting and excluding even a single bad node matters. The damage is not proportional to the fraction of bad GPUs (1/16 = 6%). It’s catastrophic to the entire job.

Why This Is Easy to Miss

Thermal throttling is insidious because:

No errors in logs. The job runs to completion. Training loss decreases normally. Nothing looks wrong except wall-clock time.
Warmup hides it. The first few steps look fine because GPUs haven’t heated up yet. If your benchmark is short or only measures the first few iterations, you’ll miss it entirely.
Cumulative averages dilute it. Reporting average throughput over the full run shows ~33K tok/s for the mixed scenario — a 4× slowdown. But the actual final steady-state is 25K tok/s — a 5.1× slowdown masked by the fast early steps.
Standard health checks miss it. NHC and basic nvidia-smi checks show 8 GPUs present, all with memory allocated, driver loaded. You need a sustained stress test to trigger the thermal condition.
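To see how a run-level average dilutes the steady state, it helps to compute both from the same per-step log. The sketch below assumes a hypothetical log file with one throughput value per line, in step order, and equal tokens per step, so that the run-level throughput (total tokens divided by total time) is the harmonic mean of the per-step values. The function name summarize_steps and the output keys are made up for this example.

```shell
# Summarize a per-step throughput log (one numeric value per line, step order).
# Arguments: log file, optional window size (assumed default: 5 steps).
summarize_steps() {
    awk -v n="${2:-5}" '
        $1 + 0 > 0 { v[++k] = $1; inv += 1 / $1 }   # keep positive values only
        END {
            if (k == 0) exit 1                      # nothing to summarize
            if (k < n) n = k
            for (i = 1; i <= n; i++) first += v[i]
            for (i = k - n + 1; i <= k; i++) last += v[i]
            printf "first_%d_avg=%.1f\n", n, first / n
            printf "last_%d_avg=%.1f\n", n, last / n
            # Total tokens / total time with equal tokens per step
            printf "run_avg=%.1f\n", k / inv
        }
    ' "$1"
}
```

Feeding it the rough phase values from the throughput table (2 steps at 95, 28 steps at 40, 20 steps at 25.5, all in K tok/s) gives a run average of about 33.2 against a last-5 average of 25.5, which approximately reproduces the gap between the ~33K cumulative figure and the ~25K steady state.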

Recommendations

Run thermal stress tests before any benchmark or training job. Use dcgmproftester with target 1004 for at least 120 seconds. Monitor nvidia-smi -q -d PERFORMANCE for HW Thermal Slowdown during the test.
Exclude failing nodes automatically. Maintain a hostfile of healthy nodes and use it for all distributed jobs. Even one bad node out of 100 can throttle your entire training run, because every synchronous step waits for it.
Log per-step timing, not just averages. Cumulative averages hide progressive degradation. Per-step timing reveals whether your cluster is thermally stable or slowly deteriorating.
Monitor GPU temperature in production. Tools like DCGM, Prometheus exporters, or Azure Monitor can alert when GPU temperatures approach throttling thresholds (~83°C for H100; nvidia-smi -q -d TEMPERATURE shows the temperature limits your driver reports for each GPU).
Report failures to your cloud provider. Thermal issues are typically hardware problems (failed fans, degraded thermal interface material) that require physical intervention.
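As one way to automate the exclusion step, the sketch below filters a list of candidate hosts against the [RESULT] lines produced by the stress test. It assumes the FAILED line format shown in the results section and a hostfile with one hostname per line; the function name healthy_hosts and the file names in the usage note are hypothetical.

```shell
# Print every host from the hostfile that has no FAILED [RESULT] line.
# Arguments: combined results log, file listing all candidate hosts.
healthy_hosts() {
    awk '
        FILENAME == ARGV[1] {
            # Result lines look like: [RESULT] <host> | FAILED due to ...
            if ($1 == "[RESULT]" && $0 ~ /FAILED/) bad[$2] = 1
            next
        }
        NF && !($1 in bad)    # hostfile pass: keep non-empty, non-failed hosts
    ' "$1" "$2"
}
```

For example, after concatenating the per-node logs with cat thermal_results.*.log > all_results.log, running healthy_hosts all_results.log all_hosts.txt > healthy_hosts.txt leaves only the nodes that passed, ready to hand to the job launcher.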

Key Takeaways

1 bad GPU out of 16 causes a 5.1× slowdown for the entire distributed training job
The degradation is progressive — starts at ~1.4× and worsens to 5.1× as GPUs heat up
1 bad node performs the same as 2 bad nodes because FSDP’s synchronous collectives wait for the slowest GPU
Standard health checks won’t catch it — you need sustained stress testing
Always log per-step timing to catch thermal drift in long-running jobs

This is a personal blog. Opinions and recommendations are my own, not Microsoft’s.
