Nautilus for TensorFlow example
See Fuz' page for general information on how to set up and run Nautilus jobs. This page describes one way to use Nautilus to run production TensorFlow jobs, and how to transfer data back and forth to your local systems. You will create persistent storage for yourself, and then transfer your files to and from Google Drive via rclone. The Ceph system is probably a better model, but until we work that out and write it up, this works pretty well.
First, create some persistent storage as described in the intro document. In my case it's called 'pcwvol'. Once you've done that successfully, you can see it with kubectl get pvc:
$ kubectl get pvc
NAME     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
fuzvol   Bound    pvc-f9434cf7-15bd-11e9-b6d3-0cc47a6be994   2Gi        RWO            rook-block     224d
pcwvol   Bound    pvc-b65a6c1c-48dd-11e9-86c5-0cc47a6be994   4Gi        RWO            rook-block     159d
Now create a simple pod that mounts it, say in a file called Paul-interactive-pod.yaml. This will mount your volume at /pcwvol (name it what you want!) and keep the pod running for 600 seconds so you can get in and set things up on the volume.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  namespace: ucsb-csc
spec:
  restartPolicy: Never
  containers:
  - name: fs-container
    image: nginx:1.9.1
    command: ["/bin/bash"]
    args: ["-c", "chmod 777 /pcwvol; sleep 600;"]
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "64Mi"
        cpu: "250m"
    volumeMounts:
    - name: pcwvol
      mountPath: /pcwvol
  volumes:
  - name: pcwvol
    persistentVolumeClaim:
      claimName: pcwvol
Launch this pod (basically, think booting up a system) with kubectl create -f Paul-interactive-pod.yaml. You can see it's running with
$ kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
gpu-pod   1/1     Running   0          17s
You can now connect to it with kubectl exec -it gpu-pod -- /bin/bash, which will open a shell on it. You can then 'cd /pcwvol' to see your files on the persistent filesystem (there won't be any yet).
Create a file in /pcwvol named setup.sh with the following (but change the /pcwvol to whatever your volume is named):
export RCLONE_CONFIG=/pcwvol/.rclone/rclone.conf
export PATH=/pcwvol/bin:$PATH
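If you'd rather not hand-edit the volume name, the same file can be generated. This is only a sketch: the function name write_setup and its argument are made up for illustration, and it assumes a POSIX shell inside the pod.

```shell
# write_setup is a hypothetical helper, not part of Nautilus or rclone.
# It writes setup.sh into the volume directory given as $1 (default /pcwvol).
write_setup() {
    vol="${1:-/pcwvol}"
    # Unquoted heredoc: $vol is expanded now, while \$PATH is kept literal
    # so it is expanded later, each time setup.sh is sourced.
    cat > "$vol/setup.sh" <<EOF
export RCLONE_CONFIG=$vol/.rclone/rclone.conf
export PATH=$vol/bin:\$PATH
EOF
}
```

Inside the pod you would run something like write_setup /pcwvol and then source the result with . /pcwvol/setup.sh.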
Exit, and kill the pod with 'kubectl delete pod gpu-pod'. Then restart it (the 'create' command above). Log in again (with the 'exec bash' line) and check in /pcwvol that your file is still there. If you type 'history' you won't have any history, since deleting and recreating the pod gives you a brand new system, but the persistent storage has saved your files.
Now we'll finish setting up by getting a copy of the 'rclone' program and configuring it to see your Google Drive.
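mkdir -p /pcwvol/bin    # the bin directory does not exist on a fresh volume
cd /pcwvol/bin
wget http://admin3.cnsi.ucsb.edu/kube/rclone
chmod +x rclone         # wget does not set the executable bit
. /pcwvol/setup.sh
rclone config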
At this point it will ask you a bunch of questions. The important ones: start with 'n' (new remote) and name it, say, 'Google'. Pick #11 (Google Drive; the number may differ in other rclone versions). Give it 'scope 1'. Just press Return for all the other questions you don't know about (service_account, etc.). Don't do advanced config. When it asks 'use auto config', say N (since you're on a 'headless' system). It will give you a long URL that you cut and paste into a browser; the browser gives back a code that you enter into 'rclone' (it will be sitting and waiting for it). Once you do that, you're set!
You can test it with (say to show directories)
root@gpu-pod:/pcwvol# rclone lsd Google:
          -1 2019-05-28 18:59:34        -1 AWS-accounting-data
          -1 2017-12-14 00:36:16        -1 CI-committee
          -1 2018-02-28 19:42:23        -1 GlobusFiles
          -1 2017-11-27 19:57:17        -1 IntelCompilers
          -1 2019-06-11 17:00:19        -1 KV
I like to make a separate directory at Google that all my Nautilus stuff lives in. I don't want a typo in one of my scripts to accidentally delete all my Google files!
rclone mkdir Google:Nautilus
then, to look at a listing of this directory
rclone lsl Google:Nautilus
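One way to guard against that kind of typo is to have scripts check the remote path before syncing. The helper below is a sketch: under_nautilus is a made-up name, and the Google:Nautilus prefix assumes the remote and folder names used on this page.

```shell
# under_nautilus is a hypothetical guard: it succeeds only when the given
# rclone path is Google:Nautilus itself or something below it.
under_nautilus() {
    case "$1" in
        Google:Nautilus|Google:Nautilus/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A script could then run under_nautilus "$dest" && rclone sync . "$dest". Since 'rclone sync' deletes files at the destination that are not in the source, it is also worth trying a new sync command with rclone's --dry-run flag first.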
Set up 'rclone' on your own computer too (the one you're running 'kubectl' on), or just use Google Drive through the browser. Then go ahead and shut down this pod, which was really just for setting everything up: kubectl delete pod gpu-pod
What we'll do next is a 'job': basically it will boot up a system and run a job (script) for us. You don't want to just launch 'pods' and leave them running with a long sleep command, since an idle pod ties up the hardware (e.g. a GPU) so that no one else can use it. Using a 'job' will start the 'pod' (a system) and then, when it's done, shut that pod down, freeing up the resources again.
With the volume in place, the workflow is: drop files into the Nautilus folder in your Google Drive, and have the first step of your YAML file sync them onto the volume. So now let's try Fuz' example of running a TensorFlow job. First, in Google Drive, drop the files you need for TensorFlow, e.g. 'classify_image.py' and some jpg files. Then you can use the YAML below to run TF on several different jpgs. Note that all of your commands are chained together into one shell command line, so don't forget the semicolon at the end of each line! The job syncs your Google directory down to the volume, runs the classifications, and then syncs the results back to Google.
When it's done, be sure to clean up the job, i.e. kubectl delete job gpu-job
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job
  namespace: ucsb-csc
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: fs-container
        image: fuzzrence/tf-gpu2
        command: ["/bin/bash", "-c"]
        args:
        - echo start;
          . /pcwvol/setup.sh;
          chmod 777 /pcwvol;
          echo 'helloworld' > /pcwvol/test4.txt;
          nvidia-smi >> /pcwvol/test4.txt;
          cd /pcwvol/tf;
          echo 'getting ready to rclone' >> /pcwvol/test4.txt;
          rclone sync Google:Nautilus . >> /pcwvol/test4.txt;
          python ./classify_image.py --image imagenet/basic-cat.jpg > log.basic.cat;
          python ./classify_image.py --image hal.jpg > log.hal;
          python ./classify_image.py --image dog1.jpg > dog1.log;
          python ./classify_image.py --image dog2.jpg > dog2.log;
          python ./classify_image.py --image IMG_6216.jpg > whitney.log;
          echo 'done' >> /pcwvol/test4.txt;
          /bin/cp /pcwvol/test4.txt /pcwvol/tf/test4.txt;
          rclone sync . Google:Nautilus > /pcwvol/sync.log;
          sleep 30;
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "1024Mi"
            cpu: "1000m"
          limits:
            nvidia.com/gpu: 1
            memory: "2048Mi"
            cpu: "1000m"
        volumeMounts:
        - name: pcwvol
          mountPath: /pcwvol
      volumes:
      - name: pcwvol
        persistentVolumeClaim:
          claimName: pcwvol
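Once a job like this has been submitted with kubectl create -f, the wait-then-clean-up step can be scripted on your own machine. The following is a sketch, not an official Nautilus procedure: finish_job is a made-up wrapper name and the 600s default timeout is an arbitrary choice, but the kubectl subcommands it calls (wait, logs, delete) are standard.

```shell
# finish_job is a hypothetical wrapper around three standard kubectl calls.
# $1 = job name, $2 = namespace, $3 = timeout (default 600s, an arbitrary choice).
finish_job() {
    job="$1"; ns="$2"; timeout="${3:-600s}"
    # Block until the job reports the Complete condition; give up on timeout.
    kubectl wait --for=condition=complete "job/$job" -n "$ns" --timeout="$timeout" || return 1
    # Show what the container printed, then delete the job and its pod.
    kubectl logs "job/$job" -n "$ns"
    kubectl delete job "$job" -n "$ns"
}
```

For the example on this page that would be finish_job gpu-job ucsb-csc. If the wait fails or times out, the function returns without deleting anything, so the job is still there to inspect.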