Running GPU jobs on Pod

To run GPU jobs, the Slurm batch files are slightly different. A basic example that simply queries the GPU card is below:

#!/bin/bash
# Request one node and one task on the gpu partition, with one GPU
#SBATCH -N 1 --partition=gpu --ntasks-per-node=1
#SBATCH --gres=gpu:1

# Run from the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

/bin/hostname                            # report which node the job landed on
srun --gres=gpu:1 /usr/bin/nvidia-smi    # query the GPU card
sleep 120

There are a few key things to keep in mind. The first two #SBATCH lines request one GPU on one node. If you leave out '--partition=gpu', you can include it on the submit line instead, e.g. sbatch -p gpu myfile.job
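For example, assuming the script above is saved as myfile.job (the filename is just an illustration), submitting the job and checking on it might look like this:

sbatch -p gpu myfile.job
squeue -u $USER    # confirm the job is pending or running on the gpu partition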

Then, to run the actual GPU part of the job, you need to put 'srun --gres=gpu:1' in front of the command. This ensures that you get exclusive access to the GPU the system has reserved for you.
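The same pattern applies to your own GPU programs. As a sketch, if my_gpu_program were a hypothetical GPU binary you had compiled yourself (it is not something provided on Pod), it would be launched the same way inside the batch script:

srun --gres=gpu:1 ./my_gpu_program    # prefix any GPU command with srun --gres=gpu:1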