This repo covers the Kubeflow environment with LABs: Kubeflow GUI, Jupyter Notebooks running on Kubernetes Pods, Kubeflow Pipelines, KALE (Kubeflow Automated PipeLines Engine), KATIB (AutoML: Finding Best Hyperparameter Values), KServe (Model Serving), Training Operators (Distributed Training), Projects, etc. The usage scenarios will be updated over time.
Kubeflow is a powerful tool that runs on Kubernetes (K8s) with containers (process isolation, scaling, distributed and parallel training). Kubeflow can be installed on-premises (WSL2 or MiniKF) or in the cloud (AWS, Azure, GCP; ref: https://www.kubeflow.org/docs/started/installing-kubeflow/).
This repo makes it easy to learn and apply projects on your local machine with MiniKF, VirtualBox 6.1.40, and Vagrant at no cost (minimum: 16 GB RAM, 6 CPU cores, 70-80 GB disk space).
- You should be familiar with:
- Container Technology (Docker). You can learn it from here => Fast-Docker
- Container Orchestration Technology (Kubernetes). You can learn it from here => Fast-Kubernetes
Keywords: Kubeflow, Pipeline, MLOps, AIOps, Distributed Training, Model Serving, ML Containers.
- LAB: Creating LAB Environment (WSL2), Installing Kubeflow with MicroK8s, Juju on Ubuntu 20.04
- LAB: Creating LAB Environment, Installing MiniKF with Vagrant (Preferred for Easy Usage)
- LAB/Project: Kubeflow Pipeline (From Scratch) with Kubeflow SDK (DSL Compiler) and Custom Docker Images (Decision Tree, Logistic Regression, SVM, Naive Bayes, XGBoost)
- LAB/Project: KALE (Kubeflow Automated PipeLines Engine) and KATIB (AutoML: Finding Best Hyperparameter Values)
- LAB/Project: KALE (Kubeflow Automated PipeLines Engine) and KServe (Model Serving) for Model Prediction
- LAB/Project: Distributed Training with TensorFlow (MNIST data)
- Motivation
- What is Kubeflow?
- How Does Kubeflow Work?
- What is a Container (Docker)?
- What is Kubernetes?
- Installing Kubeflow
- Kubeflow Basics
- Kubeflow Jupyter Notebook
- Kubeflow Pipeline
- KALE (Kubeflow Automated PipeLines Engine)
- KATIB (AutoML: Finding Best Hyperparameter Values)
- KServe (Model Serving)
- Training-Operators (Distributed Training)
- Minio (Object Storage) and ROK (Data Management Platform)
- Project 1: Creating ML Pipeline with Custom Docker Images (Decision Tree, Logistic Regression, SVM, Naive Bayes, XGBoost)
- Project 2: KALE (Kubeflow Automated PipeLines Engine) and KATIB (AutoML: Finding Best Hyperparameter Values)
- Project 3: KALE (Kubeflow Automated PipeLines Engine) and KServe (Model Serving) for Model Prediction
- Project 4: Distributed Training with Training Operator
- Other Useful Resources Related to Kubeflow
- References
Why should we use / learn Kubeflow?
- Kubeflow uses containers on Kubernetes to run the steps of Machine Learning and Deep Learning algorithms on computer clusters.
- Kubeflow provides a Machine Learning (ML) data pipeline.
- It saves pipelines, experiments, runs (experiment tracking on Kubeflow), and models (model deployment).
- It provides easy, repeatable, portable deployments on a diverse infrastructure (for example, experimenting on a laptop, then moving to an on-premises cluster or to the cloud).
- Kubeflow supports deploying and managing loosely coupled microservices and scaling them based on demand.
- Kubeflow is a free, open-source platform that runs on-premises or on any cloud (AWS, Google Cloud, Azure, etc.).
- It includes Jupyter Notebooks for developing ML algorithms and a user interface for displaying pipelines.
- "Kubeflow started as an open sourcing of the way Google ran TensorFlow internally, based on a pipeline called TensorFlow Extended. It began as just a simpler way to run TensorFlow jobs on Kubernetes, but has since expanded to be a multi-architecture, multi-cloud framework for running entire machine learning pipelines." (ref: kubeflow.org)
- Kubeflow applied to become a CNCF incubating project; this was announced on 24 October 2022 (ref: opensource.googleblog.com).
- Distributed and parallel training is becoming more important every day because the number of model parameters keeps increasing (especially in deep learning models: billions to trillions of parameters). More parameters tend to give better results, but they also lengthen training and require more computing power. With Kubeflow, Kubernetes, and containers, distributed training can be run across many GPUs. Please have a look at the Training-Operators (Distributed Training) part for details.
- CERN uses Kubeflow and Training operators to speed up 3D-GAN training by running it in parallel on multiple GPUs (single training run: from 2.5 days = 60 hours down to 30 minutes; video/presentation: https://www.youtube.com/watch?v=HuWt1N8NFzU).
- "The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable." (ref: kubeflow.org)
- "Kubeflow has developed into an end-to-end, extendable ML platform, with multiple distinct components to address specific stages of the ML lifecycle: model development (Kubeflow Notebooks), model training (Kubeflow Pipelines and Kubeflow Training Operator), model serving (KServe), and automated machine learning (Katib)" (ref: opensource.googleblog.com).
- Kubeflow is an ML data pipeline application, similar to Airflow, that lets you create ML data pipelines (saving models and artifacts, running them multiple times).
-
Kubeflow runs on the Kubernetes platform with Docker containers.
-
Kubernetes builds a cluster of nodes from many servers and PCs. Kubeflow is a distributed application (~35 pods) running on the Kubernetes platform. When several nodes are connected to the Kubernetes cluster, the Kubeflow pods run on different nodes.
-
Containers hold the Python machine learning (ML) code for each step of the ML pipeline (e.g. data download function, decision tree classifier, logistic regression classifier, evaluation step, etc.).
-
The outputs of one container can be connected to the inputs of other containers. This makes it possible to build a DAG (Directed Acyclic Graph) of containers, with each function running in a separate container.
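The DAG idea can be sketched in plain Python. This is only an illustration, not the Kubeflow Pipelines SDK: each function stands in for one containerized step, the step names and the fake scores are hypothetical, and the standard-library `graphlib` module (Python 3.9+) decides the execution order.

```python
from graphlib import TopologicalSorter


def download_data():
    # Stand-in for a "download data" container; returns fake validation scores.
    return [0.2, 0.4, 0.6, 0.8]


def decision_tree(data):
    # Stand-in for a training container; the "score" is made up for the demo.
    return ("decision_tree", max(data))


def logistic_regression(data):
    return ("logistic_regression", min(data))


def evaluate(result_a, result_b):
    # Pick the model name with the higher score.
    return max(result_a, result_b, key=lambda result: result[1])[0]


# Each step maps to (function, list of upstream steps whose outputs it consumes).
STEPS = {
    "download_data": (download_data, []),
    "decision_tree": (decision_tree, ["download_data"]),
    "logistic_regression": (logistic_regression, ["download_data"]),
    "evaluate": (evaluate, ["decision_tree", "logistic_regression"]),
}


def run_pipeline(steps):
    """Run every step after its upstream steps, wiring outputs to inputs."""
    graph = {name: set(deps) for name, (_, deps) in steps.items()}
    order = list(TopologicalSorter(graph).static_order())
    outputs = {}
    for name in order:
        func, deps = steps[name]
        outputs[name] = func(*(outputs[dep] for dep in deps))
    return order, outputs


if __name__ == "__main__":
    order, outputs = run_pipeline(STEPS)
    print(order)
    print(outputs["evaluate"])
```

In a real pipeline the two training steps have no dependency on each other, so they could run in parallel in separate containers; this sketch runs them one after another for simplicity.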
-
To understand how Kubeflow works in detail, you should learn:
-
- Docker Containers
-
- Kubernetes
-
-
Docker is a tool that reduces the gap between the development and deployment phases of the software development cycle.
-
Docker containers are like VMs but have advantages over them (no separate kernel, only a small app and file system, portable).
- Two features added to the Linux kernel in the 2000s underpin Docker:
- Namespaces: isolate processes.
- Control Groups: isolate and limit resource usage (CPU, memory) for each process.
-
Without Docker containers, each VM consumes about 30% of the resources (memory, CPU).
-
To learn about Docker and Containers, please go to this repo: Fast-Docker
-
"Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available." (Ref: Kubernetes.io)
-
To learn about Kubernetes, please go to this repo: https://github.com/omerbsezer/Fast-Kubernetes
-
How to install Kubeflow on WSL2 with Juju:
-
For more features such as KALE and an easier installation, use Kubeflow with MiniKF below (preferred).
-
Kubeflow with MiniKF: How to install MiniKF with Vagrant and VirtualBox:
- Kubeflow is a distributed ML application that contains the following parts:
- Kubeflow Jupyter Notebook (creating multiple notebook pods)
- Kubeflow Pipelines
- KALE (Kubeflow Automated PipeLines Engine)
- Kubeflow Runs and Experiments (which store all runs and experiments)
- KATIB (AutoML: Finding Best Hyperparameter Values)
- KServe (Model Serving)
- Training-Operators (Distributed Training)
-
Kubeflow creates notebooks using containers and K8s pods.
-
When launching a new notebook, the user can configure:
- which image should be the base image of the notebook pod,
- how many CPU cores and how much RAM the notebook pod should use,
- whether the notebook pod should use a GPU, if the K8s cluster has one,
- how much volume space (workspace volume) the notebook pod should use,
- whether the existing volume space should be shared with other notebook pods,
- whether a persistent volume should be used (PV, PVC with NFS volume),
- which environment variables or secrets should be accessible from the notebook pod,
- which server in the cluster the notebook pod should run on, and alongside which pods (K8s affinity, tolerations).
-
After the notebook is launched, Kubeflow creates the pod, and we can connect to it to open the notebook.
-
In MiniKF, creating a notebook pod automatically triggers the creation of a volume (with the ROK storage class); the user can access the files and even download them.
-
Kubeflow Pipelines is based on Argo Workflows, a container-native workflow engine for Kubernetes.
-
Kubeflow Pipelines consists of (ref: Kubeflow-Book):
- Python SDK: allows you to create and manipulate pipelines and their components using Kubeflow Pipelines domain-specific language.
- DSL compiler: allows you to transform your pipeline defined in Python code into a static configuration reflected in a YAML file.
- Pipeline Service: creates a pipeline run from the static configuration or YAML file.
- Kubernetes Resources: the pipeline service connects to the Kubernetes API in order to define the resources needed to run the pipeline defined in the YAML file.
- Artifact Storage: Kubeflow Pipelines stores metadata and artifacts. Metadata such as experiments, jobs, runs and metrics are stored in a MySQL database. Artifacts such as pipeline packages, large-scale metrics and views are stored in an artifact store such as a MinIO server.
-
Have a look at it:
-
KALE (Kubeflow Automated pipeLines Engine) is a project that aims at simplifying the Data Science experience of deploying Kubeflow Pipelines workflows.
-
Kale provides a simple UI to define Kubeflow Pipelines workflows directly from your JupyterLab interface, without the need to change a single line of code (ref: https://github.com/kubeflow-kale/kale).
-
With KALE, each notebook cell is tagged and a workflow is created by connecting the cells; after compilation, a Kubeflow Pipeline is created and run.
-
KALE helps data scientists run their code on Kubeflow quickly, without creating any containers manually.
-
Have a look at the KALE and KATIB project:
-
Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports Hyperparameter Tuning, Early Stopping and Neural Architecture Search.
-
Katib supports the following search methods (ref: https://github.com/kubeflow/katib):
- Hyperparameter Tuning: Random Search, Grid Search, Bayesian Optimization, TPE, Multivariate TPE, CMA-ES, Sobol's Quasirandom Sequence, HyperBand, Population Based Training.
- Neural Architecture Search: ENAS, DARTS
- Early Stopping: Median Stop
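Two of the methods listed above, random search and median stopping, can be sketched in a few lines of plain Python. This is not the Katib API (Katib runs each trial as a Kubernetes workload); the toy objective, the search space, and the simplified stopping rule are assumptions made for illustration.

```python
import random
import statistics


def random_search(objective, space, n_trials, seed=0):
    """Sample each hyperparameter uniformly and keep the best-scoring set."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def should_stop(current_curve, finished_curves):
    """Simplified median stopping rule for a metric that should be maximized.

    Stop the running trial when its best value so far is below the median of
    the finished trials' running averages at the same step.
    """
    step = len(current_curve)
    if step == 0:
        return False
    averages = [sum(curve[:step]) / step for curve in finished_curves if len(curve) >= step]
    if not averages:
        return False
    return max(current_curve) < statistics.median(averages)


def toy_objective(params):
    # Hypothetical objective: the best learning rate is assumed to be 0.1.
    return -(params["learning_rate"] - 0.1) ** 2


SPACE = {"learning_rate": (0.001, 0.5)}

if __name__ == "__main__":
    print(random_search(toy_objective, SPACE, n_trials=20))
    finished = [[0.5, 0.6, 0.7], [0.6, 0.7, 0.8], [0.4, 0.5, 0.6]]
    print(should_stop([0.3, 0.35], finished))
```

Random search needs no information from earlier trials, so its trials can run fully in parallel; early stopping then frees resources held by trials that are clearly lagging behind.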
-
Have a look at it:
-
KServe enables serverless inferencing on Kubernetes and provides performant, high abstraction interfaces for common machine learning (ML) frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX to solve production model serving use cases (ref: https://github.com/kserve/kserve).
-
Have a look at it:
-
Running distributed and parallel training jobs on Kubernetes with Training-Operators is a great advantage. The user can set the number of worker trainer pods.
-
The Training operator provides Kubernetes custom resources that make it easy to run distributed or non-distributed TensorFlow / PyTorch / Apache MXNet / XGBoost / MPI jobs on Kubernetes (ref: https://github.com/kubeflow/training-operator).
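As a rough sketch of what such a custom resource looks like, the snippet below builds a TFJob manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` also accepts. The job name and image are hypothetical, and the field layout follows the `kubeflow.org/v1` TFJob format as commonly documented; please check it against the operator version installed in your cluster.

```python
import json


def tfjob_manifest(name, image, workers):
    """Build a TFJob custom resource that asks for `workers` worker pods."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    # Number of worker trainer pods chosen by the user.
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                # The training container is conventionally named "tensorflow".
                                {"name": "tensorflow", "image": image}
                            ]
                        }
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, used only for illustration.
    manifest = tfjob_manifest("mnist-distributed", "example.registry/mnist-train:latest", workers=3)
    print(json.dumps(manifest, indent=2))
```

Changing `workers` is the only edit needed to scale the same training image from one pod to many.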
-
Distributed training is becoming more important every day because the number of model parameters keeps increasing (especially in deep learning and deep neural networks). More parameters tend to give better results, but they also lengthen training and require more computing power.
- How is the number of parameters calculated? => https://stackoverflow.com/questions/28232235/how-to-calculate-the-number-of-parameters-of-convolutional-neural-networks
- Parameter counts of common DL models: VGG-16 => 138 million, AlexNet => 62 million, ResNet-152 => 60.3 million.
- OpenAI ChatGPT (GPT-3.5) and GPT-3 have 175 billion parameters (ref: https://www.sciencefocus.com/future-technology/gpt-3/).
- The Chinese tech giant Huawei built a 200-billion-parameter language model called PanGu (ref: https://www.technologyreview.com/2021/12/21/1042835/2021-was-the-year-of-monster-ai-models/).
- Inspur, another Chinese firm, built Yuan 1.0, a 245-billion-parameter model.
- Baidu and Peng Cheng Laboratory, a research institute in Shenzhen, announced PCL-BAIDU Wenxin, a model with 280 billion parameters.
- The Beijing Academy of AI announced Wu Dao 2.0, which has 1.75 trillion parameters.
- South Korean internet search firm Naver announced a model called HyperCLOVA, with 204 billion parameters.
- Microsoft and NVIDIA's Megatron-Turing language model has 530 billion parameters (ref: https://www.technologyreview.com/2021/12/08/1041557/deepmind-language-model-beat-others-25-times-size-gpt-3-megatron/).
- DeepMind built a large language model called Gopher, with 280 billion parameters.
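To make the parameter counts above more concrete, here is a minimal sketch of how the parameters of a single layer are counted. The helper names are hypothetical; the formulas are the standard ones for a 2D convolution (kernel height x kernel width x input channels x filters, plus one bias per filter) and for a fully connected layer (inputs x outputs, plus one bias per output).

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Trainable parameters of one 2D convolution layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def dense_params(in_features, out_features, bias=True):
    """Trainable parameters of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


if __name__ == "__main__":
    # First convolution of VGG-16: 3x3 kernels, 3 input channels (RGB), 64 filters.
    print(conv2d_params(3, 3, 3, 64))       # 1792
    # First fully connected layer of VGG-16: 7*7*512 inputs, 4096 outputs.
    print(dense_params(7 * 7 * 512, 4096))  # 102764544
```

The second number shows that a single fully connected layer accounts for roughly three quarters of VGG-16's 138 million parameters.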
-
CERN uses Kubeflow and Training operators to speed up 3D-GAN training by running it in parallel on multiple GPUs (single training run: from 2.5 days = 60 hours down to 30 minutes):
-
Have a look at it:
-
MinIO is object storage (like AWS S3 or Azure Blob Storage) that also works on-premises. Kubeflow uses MinIO to store its object data. For more information and screenshots of MinIO, please have a look below:
-
ROK is a data management platform on Kubernetes developed by Arrikto. In MiniKF, ROK makes K8s data management easy to use (e.g. automatically managed PVs and storage classes).
- For more information, see: https://www.arrikto.com/rok-data-management-platform/
Project 1: Creating ML Pipeline with Custom Docker Images (Decision Tree, Logistic Regression, SVM, Naive Bayes, XGBoost)
- Have a look at it:
Project 2: KALE (Kubeflow Automated PipeLines Engine) and KATIB (AutoML: Finding Best Hyperparameter Values)
- Have a look at it:
Project 3: KALE (Kubeflow Automated PipeLines Engine) and KServe (Model Serving) for Model Prediction
- Have a look at it:
Project 4: Distributed Training with Training Operator
- Have a look at it:
- https://www.kubeflow.org/
- https://www.kubeflow.org/docs/components/central-dash/overview/
- https://github.com/kubeflow/
- kubeflow.org: (kubeflow documentation) https://v0-7.kubeflow.org/docs/
- opensource.googleblog.com: https://opensource.googleblog.com/2022/10/kubeflow-applies-to-become-a-cncf-incubating-project.html
- kubeflow-pipelines towardsdatascience: https://towardsdatascience.com/kubeflow-pipelines-how-to-build-your-first-kubeflow-pipeline-from-scratch-2424227f7e5
- Kubernetes.io: https://kubernetes.io/docs/concepts/overview/
- docs.docker.com: https://docs.docker.com/get-started/overview/
- Argo Workflows: https://github.com/argoproj/argo-workflows
- Kubeflow-Book: https://www.amazon.com.mx/Kubeflow-Machine-Learning-Lab-Production/dp/1492050121
- KALE: https://github.com/kubeflow-kale/kale
- KATIB: https://github.com/kubeflow/katib
- KALE Tags: https://medium.com/kubeflow/automating-jupyter-notebook-deployments-to-kubeflow-pipelines-with-kale-a4ede38bea1f
- KServe: https://github.com/kserve/kserve
- https://www.technologyreview.com/2021/12/21/1042835/2021-was-the-year-of-monster-ai-models/
- https://www.technologyreview.com/2021/12/08/1041557/deepmind-language-model-beat-others-25-times-size-gpt-3-megatron/
- https://indico.cern.ch/event/924283/contributions/4105328/attachments/2153724/3632143/2020-12-01-Kubeflow-FastML.pdf
- CERN Distributed Training Video: https://www.youtube.com/watch?v=HuWt1N8NFzU