Running NVIDIA Modulus in Singularity on GPUs

Get the Docker image from the NVIDIA website (registration and login are required). The download is modulus_image_v21.06.tar.gz (5.7 GB). Build a Singularity sandbox image from it.

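# Build a writable sandbox directory named "modulus" from the Docker archive
ml load singularity/3.7.4
singularity build --sandbox modulus docker-archive://modulus_image_v21.06.tar.gz

# Request an interactive session with one A100 GPU
srun --pty -N 1 -n 1 --mem=80G --partition=gpu --gpus=a100:1 --time=108:00:00 --mpi=pmi2 bash
cd /home/jingchao.zhang/red/modulus/sif

# Open a shell in the sandbox with GPU support (--nv); the project directory is mounted at /mnt
ml load singularity/3.7.4 cuda/11.4.3
singularity shell --nv --writable --bind /home/jingchao.zhang/red/modulus:/mnt modulus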

From within the container shell

cd /mnt/simple_cubic/sif/
python simple_cubic.py

Run in batch mode

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=500GB
#SBATCH --partition=gpu
#SBATCH --gpus=a100:1
#SBATCH --time=72:00:00
#SBATCH --output=job.%J.out
#SBATCH --error=job.%J.err

ml load singularity/3.7.4 cuda/11.4.3
cd /home/jingchao.zhang/red/modulus/simple_cubic/sif
# Launch through srun so that the --mpi=pmi2 flag is passed to the job step
srun --mpi=pmi2 singularity exec --nv --writable --bind .:/mnt /home/jingchao.zhang/red/modulus/sif/modulus python -u /mnt/simple_cubic.py

CUDA out-of-memory (OOM) error when running on multiple GPUs

2022-03-02 00:26:00.281061: I tensorflow/stream_executor/cuda/cuda_driver.cc:745] failed to allocate 76.74G (82399395840 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-03-02 00:26:00.284245: I tensorflow/stream_executor/cuda/cuda_driver.cc:745] failed to allocate 69.07G (74159456256 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-03-02 00:26:00.287373: I tensorflow/stream_executor/cuda/cuda_driver.cc:745] failed to allocate 62.16G (66743508992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-03-02 00:26:00.290562: I tensorflow/stream_executor/cuda/cuda_driver.cc:745] failed to allocate 55.94G (60069154816 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-03-02 00:26:00.293741: I tensorflow/stream_executor/cuda/cuda_driver.cc:745] failed to allocate 50.35G (54062239744 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-03-02 00:26:00.296883: I tensorflow/stream_executor/cuda/cuda_driver.cc:745] failed to allocate 45.31G (48656015360 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
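
To check quickly whether a job hit this error, the job's stderr file can be scanned for the message. The following is a minimal sketch, not part of Modulus: oom_summary is a hypothetical helper name, and it assumes the log lines have the same "failed to allocate ...G" wording as the output above.

```shell
# Hypothetical helper: summarize CUDA OOM messages in a job's stderr file.
# Prints the number of OOM lines, then each attempted allocation size in order.
oom_summary() {
    local log="$1"
    local count
    # grep -c prints 0 (and returns non-zero) when nothing matches; only the count is used.
    count=$(grep -c 'CUDA_ERROR_OUT_OF_MEMORY' "$log")
    echo "oom_lines=$count"
    # Extract the sizes TensorFlow tried to allocate, e.g. "76.74G".
    grep -o 'failed to allocate [0-9.]*G' "$log" | awk '{ print $4 }'
}
```

For example, oom_summary job.12345.err, where 12345 is a placeholder job ID matching the job.%J.err pattern used in the submission scripts.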

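To fix the memory issue, edit the Modulus source code inside the container. By default, each TensorFlow process tries to reserve nearly all of the memory on every GPU it can see, so several processes on one node compete for the same memory. The two added lines let memory grow on demand and pin each Horovod process to the GPU matching its local rank.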

singularity shell --nv --writable --bind .:/mnt /home/jingchao.zhang/red/modulus/sif/modulus
vim /usr/local/lib/python3.8/dist-packages/modulus-21.6-py3.8.egg/modulus/solver.py
# Add the following two lines after line 224, "config = tf.ConfigProto()":
      config.gpu_options.allow_growth = True
      config.gpu_options.visible_device_list = str(hvd.local_rank())
# Save the file and exit the container
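
The same two-line edit can also be applied without opening an editor. The sketch below is one way to do it, under stated assumptions: patch_solver is a hypothetical helper name, GNU sed is available in the container, and the target file contains a line reading exactly "config = tf.ConfigProto()". It reuses that line's indentation for the inserted lines and does nothing if the patch is already present.

```shell
# Hypothetical helper: insert the two gpu_options lines after "config = tf.ConfigProto()".
# Assumes GNU sed (for -i without a suffix and \n in the replacement text).
patch_solver() {
    local file="$1"
    # Skip if the patch is already there, so the function can be re-run safely.
    if grep -qF 'config.gpu_options.allow_growth = True' "$file"; then
        echo "already patched: $file"
        return 0
    fi
    # Fail loudly if the anchor line is missing (for example, a different Modulus version).
    if ! grep -qF 'config = tf.ConfigProto()' "$file"; then
        echo "anchor line not found in $file" >&2
        return 1
    fi
    # & is the matched anchor line; \1 repeats its leading whitespace on the new lines.
    sed -i -E 's/^([[:space:]]*)config = tf\.ConfigProto\(\)[[:space:]]*$/&\n\1config.gpu_options.allow_growth = True\n\1config.gpu_options.visible_device_list = str(hvd.local_rank())/' "$file"
    echo "patched: $file"
}
```

Inside the writable container shell this would be called as patch_solver /usr/local/lib/python3.8/dist-packages/modulus-21.6-py3.8.egg/modulus/solver.py. Keeping a backup copy of solver.py first is a sensible precaution.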

Sample submission file for 8 GPUs on a single node

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=500GB
#SBATCH --partition=gpu
#SBATCH --gpus=a100:8
#SBATCH --reservation=monai
#SBATCH --time=72:00:00
#SBATCH --output=job.%J.out
#SBATCH --error=job.%J.err

ml load singularity/3.7.4 cuda/11.4.3

srun --mpi=pmi2 singularity exec --nv --writable --bind .:/mnt /home/jingchao.zhang/red/modulus/sif/modulus horovodrun -np 8 python -u /mnt/simple_cubic_multiGPU.py
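
A mismatch between the number of GPUs granted to the job and the -np value passed to horovodrun is easy to miss. The sketch below is one possible guard for the submission script. It assumes Slurm exports CUDA_VISIBLE_DEVICES as a comma-separated list for the allocated GPUs, which depends on the site configuration; count_visible_gpus and check_gpu_count are hypothetical helper names, not part of Modulus or Horovod.

```shell
# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES.
# Assumes a comma-separated list such as "0,1,2,3"; prints 0 if unset or empty.
count_visible_gpus() {
    local list="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$list" ]; then
        echo 0
        return 0
    fi
    echo "$list" | awk -F',' '{ print NF }'
}

# Hypothetical helper: succeed only if the visible GPU count equals the expected process count.
check_gpu_count() {
    local expected="$1"
    local found
    found=$(count_visible_gpus)
    if [ "$found" -ne "$expected" ]; then
        echo "expected $expected GPUs, found $found" >&2
        return 1
    fi
    echo "GPU count OK: $found"
}
```

In the 8-GPU script, calling check_gpu_count 8 || exit 1 before the srun line would stop the job early on a mismatch.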

After the fix, all eight GPUs are busy (80% to 100% utilization), each using under 20,000 MiB of its 81,251 MiB of memory

Wed Mar  2 01:01:36 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   42C    P0   158W / 400W |  18749MiB / 81251MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   40C    P0   163W / 400W |  19901MiB / 81251MiB |     80%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   42C    P0   157W / 400W |  19901MiB / 81251MiB |     98%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   42C    P0   165W / 400W |  19901MiB / 81251MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   55C    P0   173W / 400W |  19901MiB / 81251MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   55C    P0   178W / 400W |  19901MiB / 81251MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   53C    P0   143W / 400W |  19901MiB / 81251MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   56C    P0   194W / 400W |  18749MiB / 81251MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
