Running GPU jobs on Pod

To run GPU jobs, the Slurm job files are slightly different.  A basic example, which simply queries the GPU card, is below.

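#!/bin/bash
# Request one node in the gpu partition, one task per node, and one GPU.
#SBATCH -N 1 --partition=gpu --ntasks-per-node=1
#SBATCH --gres=gpu:1

# Query the GPU card from inside the allocation.
srun --gres=gpu:1 /usr/bin/nvidia-smi
sleep 120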


There are a few key things to keep in mind.  The two SBATCH lines together request one GPU on one node.   If you leave '--partition=gpu' out of the file, you can give the partition on the submit line instead, e.g. sbatch -p gpu myfile.job
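
The submit-line variant can be sketched as follows. This is only an illustration: it writes the same example job to myfile.job (the name used above) without the partition option, then passes the partition to sbatch. The check for sbatch is an addition so that the snippet does nothing harmful on a machine without Slurm.

```shell
# Write the example job file, leaving out the partition option.
cat > myfile.job <<'EOF'
#!/bin/bash
#SBATCH -N 1 --ntasks-per-node=1
#SBATCH --gres=gpu:1

srun --gres=gpu:1 /usr/bin/nvidia-smi
sleep 120
EOF

# Choose the partition at submit time instead.
if command -v sbatch >/dev/null 2>&1; then
    sbatch -p gpu myfile.job
else
    echo "sbatch not found; run this on a Pod login node"
fi
```

Keeping the partition out of the file means the same job file can be pointed at a different partition without editing it.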

Then, to run the actual GPU job, put 'srun --gres=gpu:1' in front of the command.  This ensures that you get exclusive access to the GPU that the system has reserved for you.
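
One way to see which GPU a job step was actually given is to print what the step can see. The sketch below assumes that Slurm on this cluster sets CUDA_VISIBLE_DEVICES for GPU allocations, which is common but is not confirmed here for Pod; show_gpu is a hypothetical helper name.

```shell
# Hypothetical helper: report the GPU(s) visible to the current process.
show_gpu() {
    # Slurm usually sets this variable for GPU allocations (assumption for Pod).
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
    # List the visible devices if the NVIDIA tool is available.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L || echo "nvidia-smi could not query a GPU"
    fi
}

show_gpu
```

To use it in a job, save these lines as a small script and launch that script with 'srun --gres=gpu:1' in front, the same way as any other GPU command.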

Note that there is a development node where you can test your code before submitting it to the gpu queue.   It is called pod-gpu, and the various CUDA versions are installed in /usr/local/.  To reach it, just

ssh pod-gpu

once you're logged in to Pod.
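
Once on pod-gpu, picking one of the CUDA installs might look like the sketch below. The directory naming (/usr/local/cuda*) and the bin and lib64 layout follow the usual NVIDIA toolkit convention and are assumptions about this machine; use_cuda is a hypothetical helper, not something provided on Pod.

```shell
# List whatever CUDA directories exist (prints nothing if there are none).
ls -d /usr/local/cuda* 2>/dev/null || true

# Hypothetical helper: put one toolkit's bin and lib64 first on the search paths.
use_cuda() {
    local root="$1"
    export PATH="$root/bin:$PATH"
    export LD_LIBRARY_PATH="$root/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

After calling use_cuda with one of the listed directories, 'nvcc --version' should report that toolkit's version if the assumed layout holds.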