Running GPU jobs on Pod

Status: Yes

To run GPU jobs, the Slurm job files are slightly different. A basic example is below (it runs a simple CUDA sample program, matrixMul):

#!/bin/bash
##  Asking for 1 node, and 6 cores - the next line asks for 1 GPU
#SBATCH -N 1 --ntasks-per-node=6 --partition=gpu
#SBATCH --gres=gpu:1

cd $SLURM_SUBMIT_DIR

/bin/hostname
./matrixMul
sleep 30

There are a few key things to keep in mind. The first two SBATCH lines request one GPU on one node. If you leave '--partition=gpu' out of the script, you can include it on the submit line instead, e.g. sbatch -p gpu myfile.job
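For example, if the script above is saved as myfile.job (a placeholder name), submitting it and checking on it looks like this:

sbatch -p gpu myfile.job     # submit to the gpu partition
squeue -u $USER              # check the job's status in the queue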

Pod has only one type of GPU (NVIDIA V100/40GB), but on Braid2 there are several types of GPUs, so you'll want to specify the exact GPU you want. The choices are:

  • k80 (there are 18 K80 GPUs)
  • a40 (there are 8 A40/24GB GPUs)
  • a100 (there are 24 A100/40GB GPUs)
  • a100-80 (there are 4 A100/80GB GPUs)

So the 'gres' line is slightly different on Braid2; you add in the type of GPU:

#SBATCH --gres=gpu:a100:1
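For reference, a complete Braid2 job script might look like the sketch below. The core count, the partition name (assumed here to be gpu, as on Pod), and the program name ./matrixMul are placeholders to adjust for your own job.

#!/bin/bash
##  1 node, 6 cores, and one A100 GPU (partition name assumed to be gpu, as on Pod)
#SBATCH -N 1 --ntasks-per-node=6 --partition=gpu
#SBATCH --gres=gpu:a100:1

cd $SLURM_SUBMIT_DIR

/bin/hostname
./matrixMul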

Note that there is a development node where you can test out your code before submitting it to the gpu queue. It's called pod-gpu, and the various CUDA versions are installed in /usr/local/ - just run

ssh pod-gpu

once you're logged in to Pod. It has a T4 GPU and a V100; you can select which one you want to use/develop on by setting the environment variable

CUDA_VISIBLE_DEVICES, e.g.

export CUDA_VISIBLE_DEVICES=1

for the T4 (or set it to 0 for the V100).
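A quick development session on pod-gpu might look like the sketch below. The /usr/local/cuda path and the matrixMul.cu source file are placeholders; use whichever CUDA install under /usr/local/ and whichever source file you actually need.

ssh pod-gpu

# put a CUDA toolkit on your PATH (path is an assumption; pick the install under /usr/local/ you want)
export PATH=/usr/local/cuda/bin:$PATH

# build your code (matrixMul.cu is a placeholder)
nvcc -o matrixMul matrixMul.cu

# run on the T4 (device 1); set to 0 for the V100
export CUDA_VISIBLE_DEVICES=1
./matrixMul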

Note that jobs running for more than an hour or so may be killed without warning - the pod-gpu and braid2-gpu nodes are for development, not for long-running jobs, which should go through the queue!