Nautilus for TensorFlow example

See Fuz's page for general information on setting up and running Nautilus jobs. This page walks through one way to run production TensorFlow jobs on Nautilus, and how to move data back and forth between the cluster and your local systems. You create some persistent storage, then transfer your files to and from Google Drive with rclone. The Ceph system is probably a better model, but until we work that out and write it up, this works pretty well.

First, create some persistent storage as described in the intro document. Mine is called 'pcwvol'. Once you've successfully created yours, you can see it with kubectl get pvc:

kubectl get pvc
NAME                            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
fuzvol                          Bound     pvc-f9434cf7-15bd-11e9-b6d3-0cc47a6be994   2Gi        RWO            rook-block     224d
pcwvol                          Bound     pvc-b65a6c1c-48dd-11e9-86c5-0cc47a6be994   4Gi        RWO            rook-block     159d

Now create a simple pod that connects back to it, in a file called, say, Paul-interactive-pod.yaml. This mounts your volume at /pcwvol (name it what you want!) and keeps the pod running for 600 seconds so you can get in and set things up on the volume.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  namespace: ucsb-csc
spec:
  restartPolicy: Never
  containers:
  - name: fs-container
    image: nginx:1.9.1
    command: ["/bin/bash"]
    args: ["-c", "chmod 777 /pcwvol; sleep 600;"]
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "64Mi"
        cpu: "250m"
    volumeMounts:
    - name: pcwvol
      mountPath: /pcwvol
  volumes:
  - name: pcwvol
    persistentVolumeClaim:
      claimName: pcwvol

Launch this pod (think of it as booting up a system) with kubectl create -f Paul-interactive-pod.yaml. You can check that it's running with

$ kubectl get pods
NAME      READY     STATUS    RESTARTS   AGE
gpu-pod   1/1       Running   0          17s

You can now connect to it with kubectl exec -it gpu-pod -- /bin/bash, which opens a shell in the pod. You can then 'cd /pcwvol' to see your files on the persistent filesystem (there won't be any yet).
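
The pod can take a little while to start, so the exec may fail if you run it too early. A minimal sketch of waiting for the pod first, assuming the pod name gpu-pod from the YAML above; enter_pod is a hypothetical helper name and the 120 second timeout is an arbitrary choice:

```shell
# Sketch: block until the pod is Ready, then open a shell in it.
# enter_pod is a hypothetical helper; gpu-pod is the pod name from the YAML above.
enter_pod() {
  pod="${1:-gpu-pod}"
  # kubectl wait returns non-zero if the pod is not Ready within the timeout,
  # in which case the exec is skipped.
  kubectl wait --for=condition=Ready "pod/$pod" --timeout=120s &&
    kubectl exec -it "$pod" -- /bin/bash
}
```

Run it as 'enter_pod' (or 'enter_pod some-other-pod') right after the 'kubectl create'.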

Create a file in /pcwvol named setup.sh with the following (but change the /pcwvol to whatever your volume is named):

export RCLONE_CONFIG=/pcwvol/.rclone/rclone.conf
export PATH=/pcwvol/bin:$PATH
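
If the image has no editor handy, the file can be written from the shell instead. This is only a sketch: write_setup is a hypothetical helper name, and it also creates the bin and .rclone directories that the two exports point at.

```shell
# Sketch: write setup.sh onto the persistent volume with a heredoc.
# write_setup is a hypothetical helper; pass your own mount point as $1.
write_setup() {
  vol="$1"
  # Create the directories that setup.sh refers to.
  mkdir -p "$vol/bin" "$vol/.rclone"
  # $vol is expanded now; \$PATH is left literal so it expands when sourced.
  cat > "$vol/setup.sh" <<EOF
export RCLONE_CONFIG=$vol/.rclone/rclone.conf
export PATH=$vol/bin:\$PATH
EOF
}
```

Inside the pod, 'write_setup /pcwvol' produces the two lines shown above.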

Exit, and kill the pod with 'kubectl delete pod gpu-pod'. Then restart it (the 'create' command above) and log in again (the 'exec' line). Check in /pcwvol to see that your file is still there. If you type 'history' it will be empty, since the pod has been recreated as a new system, but the persistent storage has kept your files.

Now we'll finish setting up by getting a copy of the 'rclone' program and configuring it to see your Google Drive.

mkdir -p /pcwvol/bin
cd /pcwvol/bin

wget http://admin3.cnsi.ucsb.edu/kube/rclone
chmod +x rclone

. /pcwvol/setup.sh

rclone config

At this point it will ask you a bunch of questions. The important ones: start with 'n' (new remote) and name it, say, 'Google'. Pick Google Drive as the storage type (#11 in the version used here; the number can differ between rclone versions). Give it 'scope 1'. Just press return for all the other questions you don't know about (service_account, etc.). Don't do advanced config. When it asks 'use auto config', say N (since you're on a 'headless' system). It will give you a long URL that you cut and paste into a browser; the browser gives back a code that you enter into 'rclone' (it will be sitting and waiting for it). Once you do that, you're set!

You can test it by, say, listing your top-level directories:

root@gpu-pod:/pcwvol# rclone lsd Google:
          -1 2019-05-28 18:59:34        -1 AWS-accounting-data
          -1 2017-12-14 00:36:16        -1 CI-committee
          -1 2018-02-28 19:42:23        -1 GlobusFiles
          -1 2017-11-27 19:57:17        -1 IntelCompilers
          -1 2019-06-11 17:00:19        -1 KV

I like to make a separate directory at Google that all my Nautilus stuff lives in. I don't want a typo in my script accidentally deleting all my Google files!

rclone mkdir Google:Nautilus

Then, to look at a listing of this directory:

rclone lsl Google:Nautilus

Set up 'rclone' on the computer you run 'kubectl' on too (or just use Google Drive through the browser). Then go ahead and shut down this pod, which was really just for setting everything up: kubectl delete pod gpu-pod.

What we'll do next is a 'job': it boots up a system and runs a job (script) for us. You don't want to just launch 'pods' and leave them running with a long sleep command, since an idle pod ties up hardware (e.g. a GPU) that someone else could be using. A 'job' starts the 'pod' (a system) and, when the script is done, shuts that pod down, freeing up the resources again.

With the volume in place, the workflow is: drop files into the Nautilus folder in your Google Drive, and make the first step in your YAML file a sync of that folder. So now let's try Fuz's example of a TensorFlow job. First, in Google Drive, drop the files you need for TensorFlow, e.g. 'classify_image.py' and some jpg files. Then you can use the following YAML to run TF on several different jpgs. Note that all of your commands are chained together into a single string passed to bash -c, so don't forget the semicolon at the end of each line! The job syncs your Google directory down to the volume, runs the classifications, and then syncs the results back to Google.
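
Keep in mind that 'rclone sync' makes the destination match the source, which includes deleting destination files that are not in the source, while 'rclone copy' only adds or updates files. A minimal sketch for the machine where you have rclone set up, assuming the remote and folder are named Google:Nautilus as above; NAUTILUS_REMOTE and both function names are hypothetical:

```shell
# Sketch only: NAUTILUS_REMOTE and both function names are hypothetical.
NAUTILUS_REMOTE="${NAUTILUS_REMOTE:-Google:Nautilus}"

# Upload a local directory of inputs (scripts, jpgs) to the remote folder.
# 'copy' never deletes anything on the destination.
upload_inputs() {
  rclone copy "$1" "$NAUTILUS_REMOTE"
}

# Show what a sync from the remote into a local directory would change,
# without actually changing anything.
preview_pull() {
  rclone sync --dry-run "$NAUTILUS_REMOTE" "$1"
}
```

For example, 'upload_inputs ./tf' pushes your inputs, and 'preview_pull .' lists what a real sync into the current directory would copy or delete.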

When it's done, be sure to clean up the job, i.e. kubectl delete job gpu-job

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job
  namespace: ucsb-csc
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: fs-container
        image: fuzzrence/tf-gpu2
        command: ["/bin/bash", "-c"]
        args:
          -  echo start;
             . /pcwvol/setup.sh;
             chmod 777 /pcwvol;
             echo 'helloworld' > /pcwvol/test4.txt;
             nvidia-smi >> /pcwvol/test4.txt;
             mkdir -p /pcwvol/tf;
             cd /pcwvol/tf;
             echo 'getting ready to rclone' >> /pcwvol/test4.txt;
             rclone sync Google:Nautilus . >> /pcwvol/test4.txt;
             python ./classify_image.py --image imagenet/basic-cat.jpg > log.basic.cat;
             python ./classify_image.py --image hal.jpg  > log.hal;
             python ./classify_image.py --image dog1.jpg  > dog1.log;
             python ./classify_image.py --image dog2.jpg  > dog2.log;
             python ./classify_image.py --image IMG_6216.jpg > whitney.log;
             echo 'done' >> /pcwvol/test4.txt;
             /bin/cp /pcwvol/test4.txt /pcwvol/tf/test4.txt;
             rclone sync . Google:Nautilus > /pcwvol/sync.log;
             sleep 30;
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "1024Mi"
            cpu: "1000m"
          limits:
            nvidia.com/gpu: 1
            memory: "2048Mi"
            cpu: "1000m"
        volumeMounts:
        - name: pcwvol
          mountPath: /pcwvol
      volumes:
        - name: pcwvol
          persistentVolumeClaim:
            claimName: pcwvol
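
Submitting the job, waiting for it, and cleaning up can be wrapped in one small function. This is a sketch under assumptions: run_gpu_job is a hypothetical helper, the YAML filename is whatever you saved the job as, and the 30 minute timeout is arbitrary. Note that waiting on the 'complete' condition will simply time out if the job fails, in which case the job is left in place so you can look at it.

```shell
# Sketch: submit the job, wait for it to finish, print its output, clean up.
# run_gpu_job is a hypothetical helper; $1 is the path to your job YAML.
run_gpu_job() {
  yaml="$1"
  job="${2:-gpu-job}"
  kubectl create -f "$yaml" || return 1
  # Wait up to 30 minutes for the Job to report the Complete condition.
  if kubectl wait --for=condition=complete "job/$job" --timeout=1800s; then
    kubectl logs "job/$job"
  else
    echo "job $job did not complete in time; inspect it before deleting" >&2
    return 1
  fi
  # Delete the finished job so it does not linger in the namespace.
  kubectl delete job "$job"
}
```

Usage would be something like 'run_gpu_job my-gpu-job.yaml', where my-gpu-job.yaml is a placeholder for your own file name.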
