TensorFlow on Pod

Status: Yes

To use TensorFlow on Pod, it's recommended to let conda solve your environment for you. So first, install Anaconda (if you haven't already) from https://www.anaconda.com/download/#linux .
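
A typical install from the downloaded installer looks roughly like this (the filename is a placeholder, so substitute whichever version you actually downloaded, and let the installer initialize conda in your shell when it asks):

bash Anaconda3-<version>-Linux-x86_64.sh
source ~/.bashrc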

One-time setup:

Once it's installed, you'll need to request an interactive GPU session so that conda sees that you've got GPUs available (I'm not positive this is necessary, but it worked for me). To do that, issue:

srun -N 1 -n 1 -p gpu --time=1:00:00  --pty bash -i

That will give you an interactive shell on a GPU node. You can verify that by typing

hostname

and it should return one of node111, node112, or node113 (and maybe soon node114).
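
If you also want to confirm that the GPUs themselves are visible from your session (optional, and depending on how the cluster constrains devices you may need to add --gres=gpu:1 to the srun line above), nvidia-smi will list the GPUs and the driver version:

nvidia-smi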

For compiling, you can use the pod-gpu node...

ssh pod-gpu

and use the CUDA installations in /usr/local/, or use the interactive node. Either way, you'll need to set a couple of environment variables depending on which CUDA version you want to use:

export PATH=/usr/local/cuda-10.0/bin:$PATH 
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH

That sets you up for CUDA 10.0. You can see the available CUDA versions by running ls /usr/local on a GPU node.
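
For example, to see what's installed and then point at a different version (cuda-9.2 below is just an illustration, so use whichever directory the listing actually shows):

ls -d /usr/local/cuda-*
export PATH=/usr/local/cuda-9.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64:$LD_LIBRARY_PATH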

You'll need to know one other thing. Type

nvcc --version

It will report something like "Cuda compilation tools, release 9.2, V9.2.88"; you need to know the release number (9.2 in this case).

Next, have conda solve your TensorFlow environment for you by issuing:

conda create --name tf_gpu tensorflow-gpu cudatoolkit=9.2

(Later, if you get errors about driver mismatches and the like when running your program, it may be because you specified the wrong cudatoolkit= version.)
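
For example, if nvcc had reported release 10.0 instead, the matching command would be:

conda create --name tf_gpu tensorflow-gpu cudatoolkit=10.0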

That will create an environment named tf_gpu for use with your Python scripts. Note that it will take a while for conda to work its magic. Once it finishes, exit your interactive session by typing 'exit'. That's it.
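
Before exiting, you can optionally sanity-check the new environment from the same GPU session. This assumes the TF 1.x-era API that conda installs for these cudatoolkit versions; it should print True if TensorFlow can see a GPU:

source activate tf_gpu
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"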

To run routine jobs, now that you have your environment:

A job script would look something like...

#!/bin/bash
#SBATCH --nodes=1 --partition=gpu --ntasks-per-node=1
#SBATCH --time=30:00
#SBATCH --gres=gpu:1

cd $SLURM_SUBMIT_DIR

/bin/hostname
source /home/fuz/.bashrc
source activate tf_gpu
srun --gres=gpu:1 python /home/fuz/classify_image.py

The classify_image.py referenced above is available at http://thumper.mrl.ucsb.edu/~fuz/classify_image.py . Note that you'll have to change the path in that file: search for 'CHANGE THE PATH' and you'll see where to modify it. The pic I used is available at http://thumper.mrl.ucsb.edu/~fuz/pic.jpg . You'll also have to create the directory 'imagenet' in your home directory. classify_image.py will download a pre-trained model and then classify the pic.jpg you give it. Note that this is an old example.
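
To actually run it, save the job script above to a file (tf_gpu.job below is just an example name) and submit it with sbatch. By default Slurm writes the job's output to a slurm-<jobid>.out file in the directory you submitted from:

sbatch tf_gpu.job
squeue -u $USER
cat slurm-<jobid>.out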