epfl-ml4ed/runai-tutorial

RunAI Tutorial

Before following this tutorial, we highly recommend getting familiar with Docker if you are not already.

You do not need to be an expert, but you need to know:

  1. What is a Docker image
  2. What is a Docker container
  3. How to read a Dockerfile

This video might help.

You also need to set up RunAI by following the instructions here.

Disclaimer

This tutorial was written on Windows with WSL 2 (Ubuntu).

If you are on macOS, native Windows, or another Linux distribution and some of the commands are not recognized, you may need to adapt them. For example, 'sudo service docker start' will not work on macOS or in Windows PowerShell (on macOS, you can instead open Docker Desktop and wait for the Docker engine to start).

Remember to use a search engine or a chatbot to help.

Overview

Here are the main steps to run a job on the cluster using RunAI:

  1. Write your scripts (train, eval, preprocess, etc.)
  2. Write and build a Docker image that can run your scripts
  3. Upload your image to EPFL's ic-registry (it will be available on the cloud)
  4. Run the image on the cluster using RunAI

Make sure that your scripts and Docker image work locally before submitting anything to the cluster (think twice, compute once).

Basic docker image

Basic image

In this section, we will see how to build and run a simple Docker image that saves a text file on your local machine using Python.

Below is the Dockerfile:

# Use the minimalistic Python Alpine image for smaller size.
FROM python:3.9-alpine

# Set the working directory in docker
WORKDIR /app

# Create a directory for the data volume
RUN mkdir /data

# Copy the Python script into the container at /app
COPY write_text.py .

# Always use the Python script as the entry point
ENTRYPOINT ["python", "write_text.py"]

# By default, write "hello world" to the file.
CMD ["--text", "hello world"]
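The Dockerfile above copies write_text.py into the image, but the script itself is not shown in this section. Below is a minimal sketch of what such a script could look like, assuming it writes the text to a file under /data. The file name output.txt and the --out-dir flag are hypothetical and only here for illustration; only the --text argument comes from the tutorial.

```python
import argparse
from pathlib import Path


def write_text(text: str, out_dir: str = "/data", filename: str = "output.txt") -> Path:
    """Write `text` to out_dir/filename and return the path of the file."""
    target = Path(out_dir) / filename
    # Create the output directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text + "\n", encoding="utf-8")
    return target


def main(argv=None) -> Path:
    """Parse the command line (or `argv` when given) and write the file."""
    parser = argparse.ArgumentParser(description="Write a line of text to a file.")
    parser.add_argument("--text", default="hello world", help="text to write")
    # Hypothetical extra flag: lets you try the script outside Docker.
    parser.add_argument("--out-dir", default="/data", help="directory to write into")
    args = parser.parse_args(argv)
    path = write_text(args.text, args.out_dir)
    print(f"Wrote {path}")
    return path
```

To use this sketch as the container entry point, end the file with the usual `if __name__ == "__main__": main()` guard. Outside Docker, you can try it with something like `main(["--text", "hi", "--out-dir", "."])`.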

Start Docker (as mentioned before, Mac users can instead start the Docker Desktop app and wait for the Docker engine to start):

sudo service docker start

Build a Docker image with the tag helloworld-image from the current directory (indicated by the . at the end).

docker build -t helloworld-image .

Run the image. This executes the ENTRYPOINT with the default arguments from CMD.

docker run helloworld-image

Nothing is created on our machine: the file is written inside the container's own filesystem, which is separate from the host.

To deal with this, the -v option maps a directory on your local machine (the host) to a directory inside the container.

docker run -v $(pwd):/data helloworld-image
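If you launch containers from a Python script instead of typing the command, the same mapping can be assembled programmatically. This is only a sketch: docker_run_command is a hypothetical helper, not part of Docker or of this repository, and it only builds the argument list without running anything.

```python
import shlex
from pathlib import Path


def docker_run_command(image, host_dir=".", container_dir="/data", script_args=()):
    """Build the argv for `docker run -v <host>:<container> <image> [args...]`."""
    # An absolute host path is the safe choice for a bind mount;
    # this is what $(pwd) provides in the shell command above.
    host_path = Path(host_dir).resolve()
    return ["docker", "run", "-v", f"{host_path}:{container_dir}", image, *script_args]


command = docker_run_command("helloworld-image")
# Print a copy-pastable shell command line.
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would start the container, assuming Docker is installed and running.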

Our Python script also takes an argument: --text

If we specify it when running the container, it overrides CMD (the default value):

docker run -v $(pwd):/data helloworld-image --text="New Hello Word"
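To make the interplay between ENTRYPOINT and CMD concrete, here is a small sketch that models how Docker assembles the final command when both are written in exec form (the JSON-array syntax used in our Dockerfile). The function effective_command is a hypothetical name used for illustration; it models the behaviour and does not call Docker.

```python
def effective_command(entrypoint, cmd, run_args=()):
    """Return the argv the container starts with (exec-form ENTRYPOINT and CMD)."""
    # Arguments given after the image name replace CMD entirely;
    # ENTRYPOINT is always kept.
    tail = list(run_args) if run_args else list(cmd)
    return list(entrypoint) + tail


ENTRYPOINT = ["python", "write_text.py"]
CMD = ["--text", "hello world"]

# docker run helloworld-image
print(effective_command(ENTRYPOINT, CMD))
# -> ['python', 'write_text.py', '--text', 'hello world']

# docker run helloworld-image --text="New Hello Word"
# (the shell removes the quotes, so Docker receives one argument)
print(effective_command(ENTRYPOINT, CMD, ["--text=New Hello Word"]))
# -> ['python', 'write_text.py', '--text=New Hello Word']
```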

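If you want to clean up Docker completely (this removes all stopped containers, unused networks, the build cache, and every image not used by a container):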

docker system prune -a

RunAI with basic docker image

Run the docker image with RunAI

Running image with RunAI

First, let us log in to RunAI:

runai login

You should be prompted with a link to get a password.

If you receive "Fail to get cluster version" or "configmaps is forbidden" warnings, ask another lab member who already has access to RunAI to give you the necessary rights to push to your lab's project (for ML4ED, it's d-vet) on ic-registry. If you skip this step, you will receive the "namespaces is forbidden" error when pushing your image later.

If you receive the error below, make sure that you have properly set up RunAI by following the instructions here.

ERRO[0000] 404 Not Found: {"error":"Realm does not exist","error_description":"For more on this error consult the server log at the debug level."}

Now let us log in to the registry (try with sudo if it does not work):

docker login ic-registry.epfl.ch

Use your Tequila credentials.

Tag your image for ic-registry. Replace d-vet with your lab's project; otherwise, you will not be able to push.

docker tag helloworld-image ic-registry.epfl.ch/d-vet/helloworld-image
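The target name in the tag command follows the pattern registry/project/image. A minimal sketch of that pattern, in case you want to generate the name in a script; registry_reference is a hypothetical helper and the validation is deliberately simplistic.

```python
REGISTRY = "ic-registry.epfl.ch"


def registry_reference(image: str, lab: str, registry: str = REGISTRY) -> str:
    """Return the full name under which `image` is pushed for project `lab`."""
    for part in (image, lab):
        # Keep the check simple: each part must be a single, non-empty path segment.
        if not part or "/" in part:
            raise ValueError(f"invalid name component: {part!r}")
    return f"{registry}/{lab}/{image}"


print(registry_reference("helloworld-image", "d-vet"))
# -> ic-registry.epfl.ch/d-vet/helloworld-image
```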

If you forgot the name of your image:

docker images

Now we can push our image:

docker push ic-registry.epfl.ch/d-vet/helloworld-image

Check the existing RunAI projects:

runai list project

If you receive an access denied error after running the command above, run 'runai config project ml4ed-frej' (replace frej with your Gaspar username) and try again. If the config command itself fails with an access denied error, you may first need to replace your kubeconfig at ~/.kube/config with the recommended version that you can find here (keep a backup of the old file somewhere safe before replacing it!). After replacing the config file, repeat the steps starting from runai login.

Submit your job. After -p put your project name.

runai submit --name hello1 -p ml4ed-frej -i ic-registry.epfl.ch/d-vet/helloworld-image --cpu-limit 1 --gpu 0

How to check the job:

runai describe job hello1 -p ml4ed-frej

Checking the logs:

kubectl logs hello1-0-0 -n runai-ml4ed-frej
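Note how the names in the kubectl command relate to the ones used with runai: the namespace is the project name prefixed with runai-, and the pod is the job name followed by -0-0. The sketch below encodes that pattern. It is inferred only from the examples in this tutorial, so treat it as an assumption and check the real pod name with runai describe job if the command fails. The helper names are hypothetical.

```python
def namespace_for(project: str) -> str:
    """Kubernetes namespace for a RunAI project, as used in this tutorial."""
    return f"runai-{project}"


def first_pod_name(job: str) -> str:
    """Name of the first pod of a job, following the pattern seen above (assumption)."""
    return f"{job}-0-0"


def logs_command(job: str, project: str) -> list:
    """Build the argv for fetching the logs of the job's first pod."""
    return ["kubectl", "logs", first_pod_name(job), "-n", namespace_for(project)]


print(" ".join(logs_command("hello1", "ml4ed-frej")))
# -> kubectl logs hello1-0-0 -n runai-ml4ed-frej
```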

How to list all jobs:

runai list jobs -p ml4ed-frej

How to delete the job:

runai delete job -p ml4ed-frej hello1

How to pass arguments to your script? Separate them from the RunAI options with --

runai submit --name hello1 -p ml4ed-frej -i ic-registry.epfl.ch/d-vet/helloworld-image --cpu-limit 1 --gpu 0 -- --text="hahaha"
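If you submit many jobs, it can help to build the command in a script so that the -- separator always ends up in the right place. A minimal sketch using only the options shown above; runai_submit_command is a hypothetical helper that builds the argument list and does not contact the cluster.

```python
import shlex


def runai_submit_command(name, project, image, script_args=(), cpu_limit=1, gpu=0):
    """Build the argv for `runai submit`, with script arguments after `--`."""
    cmd = [
        "runai", "submit",
        "--name", name,
        "-p", project,
        "-i", image,
        "--cpu-limit", str(cpu_limit),
        "--gpu", str(gpu),
    ]
    if script_args:
        # Everything after "--" is passed to the container, not parsed by runai.
        cmd.append("--")
        cmd.extend(script_args)
    return cmd


command = runai_submit_command(
    "hello1",
    "ml4ed-frej",
    "ic-registry.epfl.ch/d-vet/helloworld-image",
    script_args=["--text=hahaha"],
)
# Print a copy-pastable shell command line.
print(shlex.join(command))
```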

How do we get our file? With Persistent Volumes.

Using PVC to connect your docker image on the cluster to your lab's server

PVC

Check the names of the Persistent Volume Claims (PVCs) your lab has access to:

kubectl get pvc -n runai-ml4ed-frej

Launch the job with the PVC:

runai submit --name hello1 -p ml4ed-frej -i ic-registry.epfl.ch/d-vet/helloworld-image --cpu-limit 1 --gpu 0 --pvc runai-ml4ed-frej-ml4eddata1:/data

It fails.

Why?

Security.

New way of launching a job on RunAI (edit the YAML file with your own IDs):

kubectl create -f runai-job-default.yaml

Contents of runai-job-default.yaml:

apiVersion: run.ai/v2alpha1  # Specifies the version of the Run.ai API this resource is written against.
kind: TrainingWorkload  # Specifies the kind of resource, in this case, a Run.ai Job.
metadata:
  name: hello1  # The name of the job.
  namespace: runai-ml4ed-frej  # The namespace in which the job will be created.
  labels:
    user: frej  # REPLACE
spec:
  image:
    value: ic-registry.epfl.ch/d-vet/helloworld-image  # The Docker image to use for the job.
  name:
    value: hello1  # name prefix of Pod
  arguments:  # Arguments passed to the container, space-separated, if the argument has spaces, use quotes as below.
    value: "--text \"Goodbye World\""
  imagePullPolicy:
    value: Always  # The image pull policy for the job.
  runAsUser:
    value: true
  allowPrivilegeEscalation:  # allow sudo
    value: true
  cpu:
    value: "1"
  cpuLimit:
    value: "1"
  memory:
    value: 256Mi
  memoryLimit:
    value: 512Mi
  gpu:
    value: "0"
  nodePools:
    value: "default" # default is the node type S8 without GPUs
  pvcs:
    items:
      pvc--0:  # First is "pvc--0", second "pvc--1", etc.
        value:
          claimName: runai-ml4ed-frej-ml4eddata1 # REPLACE
          existingPvc: true
          path: /results
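The pvcs section of the manifest uses numbered keys (pvc--0, pvc--1, ...), which is easy to get wrong when adding a second volume by hand. The sketch below generates that mapping from a list of (claim name, mount path) pairs, mirroring the structure of the YAML above. The helper pvc_items and the second claim name in the example are hypothetical.

```python
import json


def pvc_items(mounts):
    """Build the `pvcs.items` mapping from (claim_name, mount_path) pairs."""
    return {
        # Keys are numbered in order: "pvc--0", "pvc--1", ...
        f"pvc--{index}": {
            "value": {"claimName": claim, "existingPvc": True, "path": path}
        }
        for index, (claim, path) in enumerate(mounts)
    }


items = pvc_items([
    ("runai-ml4ed-frej-ml4eddata1", "/results"),
    ("my-second-claim", "/scratch"),  # hypothetical second PVC
])
# JSON is printed only to inspect the structure; the manifest itself is YAML.
print(json.dumps({"pvcs": {"items": items}}, indent=2))
```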

Where is my file, and how can I access it? Check with your lab or with IC to find out where the PVC is mounted.

Specific details for members of the ML4ED lab

ML4ED

For ML4ED (ask me for the password):

ssh root@icvm0018.xaas.epfl.ch

and then it should be in: /mnt/ic1files_epfl_ch_u13722_ic_ml4ed_001_files_nfs

Bonus: on the jumpbox icvm0018.xaas.epfl.ch, our lab server is also mounted.

It is located in /mnt/ic1files_epfl_ch_D-VET
