H100 white paper

H100 is NVIDIA’s 9th-generation data center GPU. For today’s mainstream AI and HPC models, H100 with InfiniBand interconnect delivers up to 30 times the performance of A100.

Specs                  H100                          A100 (80GB)                   V100
Transistor Count       80B                           54.2B                         21.1B
TDP                    700W                          400W                          300W/350W
Manufacturing Process  TSMC 4N                       TSMC 7N                       TSMC 12nm FFN
Form Factor            SXM5                          SXM4                          SXM2/SXM3
Architecture           Hopper                        Ampere                        Volta
FP32 CUDA Cores        16896                         6912                          5120
Tensor Cores           528                           432                           640
Boost Clock            1.78 GHz                      1.41 GHz                      1.53 GHz
Memory Clock           4.8 Gbps HBM3                 3.2 Gbps HBM2e                1.75 Gbps HBM2
Memory Bus Width       5120-bit                      5120-bit                      4096-bit
Memory Bandwidth       3 TB/s                        2 TB/s                        0.9 TB/s
GPU Memory Capacity    80 GB                         80 GB                         16/32 GB
FP32 Vector            60 TFLOPS                     19.5 TFLOPS                   15.7 TFLOPS
FP64 Vector            30 TFLOPS                     9.7 TFLOPS                    7.8 TFLOPS
INT8 Tensor            2000 TOPS                     624 TOPS                      N/A
FP16 Tensor            1000 TFLOPS                   312 TFLOPS                    125 TFLOPS
TF32 Tensor            500 TFLOPS                    156 TFLOPS                    N/A
FP64 Tensor            60 TFLOPS                     19.5 TFLOPS                   N/A
Interconnect           NVLink4, 18 links (900 GB/s)  NVLink3, 12 links (600 GB/s)  NVLink2, 6 links (300 GB/s)
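
As a quick sanity check, the memory bandwidth row follows directly from the memory clock and bus width in the table (a back-of-the-envelope derivation using only the numbers above):

```latex
% Peak bandwidth = per-pin data rate x bus width / 8 bits per byte
\[
\mathrm{BW}_{\mathrm{H100}} = \frac{4.8\ \mathrm{Gbps} \times 5120\ \mathrm{bits}}{8\ \mathrm{bits/byte}} = 3072\ \mathrm{GB/s} \approx 3\ \mathrm{TB/s}
\]
\[
\mathrm{BW}_{\mathrm{A100}} = \frac{3.2 \times 5120}{8} = 2048\ \mathrm{GB/s}, \qquad
\mathrm{BW}_{\mathrm{V100}} = \frac{1.75 \times 4096}{8} = 896\ \mathrm{GB/s}
\]
```
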
  • Fourth-generation Tensor Cores
    • Up to 6x faster chip-to-chip compared to A100, combining the per-SM speedup, additional SM count, and higher clocks
  • 3x faster IEEE FP64 and FP32 processing rates chip-to-chip compared to A100
  • New Thread Block Cluster feature, adding another level to the programming hierarchy, which now includes Threads, Thread Blocks, Thread Block Clusters, and Grids (see the CUDA sketch after this list)
  • Transformer Engine
    • a combination of software and custom Hopper Tensor Core technology
    • intelligently manages and dynamically chooses between FP8 and 16-bit calculations (see the FP8 round-trip after this list)
    • up to 9x faster AI training and up to 30x faster AI inference on large language models compared to the prior-generation A100
  • HBM3
    • nearly a 2x bandwidth increase over the previous generation A100
    • 3 TB/sec of memory bandwidth
  • 50 MB L2 cache
  • Second-generation Multi-Instance GPU (MIG) technology
    • 3x more compute capacity and nearly 2x more memory bandwidth per GPU Instance compared to A100
    • Confidential Computing capability with MIG-level Trusted Execution Environments (TEE)
    • Up to seven individual GPU Instances are supported, each with dedicated NVDEC and NVJPG units
  • Fourth-generation NVIDIA NVLink
    • 900 GB/sec total bandwidth for multi-GPU IO, operating at 7x the bandwidth of PCIe Gen 5 (see the peer-to-peer sketch after this list)
  • Third-generation NVSwitch
    • NVSwitches residing both inside and outside of nodes to connect multiple GPUs in servers, clusters, and data center environments
  • NVLink Switch System
    • new second-level NVLink Switches based on third-gen NVSwitch technology
    • enables up to 32 nodes or 256 GPUs to be connected over NVLink in a 2:1-tapered fat-tree topology
  • PCIe Gen 5
    • 128 GB/sec total bandwidth (64 GB/sec in each direction), versus 64 GB/sec total (32 GB/sec in each direction) for PCIe Gen 4 (derivation after this list)
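
The Thread Block Cluster level mentioned above is exposed directly in CUDA 12 on sm_90. Below is a minimal sketch, not from the white paper: the kernel body, launch sizes, and compile line are illustrative assumptions. It shows the new scope between block and grid, where blocks in the same cluster can synchronize:

```cuda
// Compile with (assumed): nvcc -arch=sm_90 clusters.cu
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// __cluster_dims__(2, 1, 1) statically groups every 2 thread blocks into one
// cluster; blocks within a cluster can cooperate and synchronize.
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel()
{
    cg::cluster_group cluster = cg::this_cluster();

    // Barrier across all thread blocks in the cluster: the new
    // synchronization scope between block level and grid level.
    cluster.sync();

    if (threadIdx.x == 0)
        printf("block %u has cluster rank %u\n",
               blockIdx.x, (unsigned)cluster.block_rank());
}

int main()
{
    // Grid of 8 blocks -> 4 clusters of 2 blocks each.
    cluster_kernel<<<8, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
```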
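
For a feel of the FP8 numerics the Transformer Engine chooses between, here is a small host-side round-trip using the raw conversion intrinsics from CUDA's cuda_fp8.h header (CUDA 11.8+). This is not the Transformer Engine API, and the input values are arbitrary; the point is just that the E4M3 FP8 format has coarse precision and saturates at 448:

```cuda
// Host-only demo; compile with (assumed): nvcc fp8_demo.cu
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_fp8.h>

int main()
{
    float values[] = {0.1234f, 1.5f, 448.0f, 1000.0f};

    for (float v : values) {
        // Quantize float -> FP8 (E4M3), saturating to the finite range.
        __nv_fp8_storage_t q =
            __nv_cvt_float_to_fp8(v, __NV_SATFINITE, __NV_E4M3);

        // Dequantize FP8 -> half -> float to inspect the rounding error.
        __half h = __half(__nv_cvt_fp8_to_halfraw(q, __NV_E4M3));
        float back = __half2float(h);

        printf("%10.4f -> fp8(E4M3) -> %10.4f\n", v, back);
    }
    return 0;
}
```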
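
NVLink bandwidth is consumed by ordinary CUDA peer-to-peer operations; nothing NVLink-specific appears in the code. A minimal sketch, assuming devices 0 and 1 are NVLink-connected, using only standard CUDA runtime calls:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) {
        printf("P2P not available between devices 0 and 1\n");
        return 1;
    }

    const size_t bytes = 256u << 20;  // 256 MiB test buffer (arbitrary size)

    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // let device 0 address device 1
    cudaMalloc(&src, bytes);

    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    // Direct device-to-device copy: on NVLink-connected GPUs this moves
    // over the 900 GB/s fabric instead of staging through host memory.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();

    printf("copied %zu MiB from device 0 to device 1\n", bytes >> 20);

    cudaSetDevice(0);
    cudaFree(src);
    cudaSetDevice(1);
    cudaFree(dst);
    return 0;
}
```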
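
The PCIe figures can be cross-checked the same way (a back-of-the-envelope sketch; assumes a x16 link and the 128b/130b encoding used by PCIe Gen 4 and Gen 5, which rounds to the marketing numbers above):

```latex
\[
\mathrm{BW}_{\mathrm{Gen5},\,\times 16} =
\frac{32\ \mathrm{GT/s} \times 16\ \mathrm{lanes} \times \frac{128}{130}}{8\ \mathrm{bits/byte}}
\approx 63\ \mathrm{GB/s\ per\ direction}
\]
\[
\mathrm{BW}_{\mathrm{Gen4},\,\times 16} =
\frac{16 \times 16 \times \frac{128}{130}}{8}
\approx 31.5\ \mathrm{GB/s\ per\ direction}
\]
```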
