GPU utilization check for a multinode Slurm job
Get a snapshot of GPU stats without DCGM.
GPU query command to get card utilization, temperature, fan speed, power draw, memory usage, etc.:
nvidia-smi --format=csv --query-gpu=power.draw,utilization.gpu,fan.speed,temperature.gpu,memory.used,memory.free
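To log these stats over time instead of taking a one-off snapshot, nvidia-smi can repeat the query on an interval (-l) and append to a file (-f). A minimal sketch; the 5-second interval and the gpu_stats.csv filename are arbitrary example values:

nvidia-smi --format=csv --query-gpu=timestamp,power.draw,utilization.gpu,fan.speed,temperature.gpu,memory.used,memory.free -l 5 -f gpu_stats.csv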
For a complete list of query options:
nvidia-smi --help-query-gpu
SSH into each node of the job and check utilization
NODES=$(scontrol show hostname "$(squeue -j JOBID --noheader -o %N)")
for ssh_host in $NODES
do
    echo "$ssh_host"
    ssh -q "$ssh_host" "nvidia-smi --format=csv --query-gpu=utilization.gpu,utilization.memory"
done
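The same loop can be wrapped as a small script that takes the job ID as an argument instead of hard-coding it. A sketch under those assumptions; the script name check_gpu_util.sh and the chosen query fields are arbitrary:

#!/usr/bin/env bash
# Poll GPU utilization once on every node of a running Slurm job.
# Usage: ./check_gpu_util.sh <jobid>
set -euo pipefail

JOBID="${1:?usage: $0 <jobid>}"
QUERY="utilization.gpu,utilization.memory,memory.used,power.draw,temperature.gpu"

# squeue -o %N prints the compressed node list (e.g. node[01-04]);
# scontrol expands it to one hostname per line.
for ssh_host in $(scontrol show hostname "$(squeue -j "$JOBID" --noheader -o %N)")
do
    echo "== $ssh_host =="
    ssh -q "$ssh_host" "nvidia-smi --format=csv --query-gpu=$QUERY"
done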