Submarine is a new subproject of Apache Hadoop.
Submarine is a project that allows infra engineers and data scientists to run unmodified TensorFlow or PyTorch programs on YARN or Kubernetes.
Goals of Submarine:
- Lets jobs easily access data/models in HDFS and other storage systems.
- Can launch services to serve TensorFlow/PyTorch models.
- Supports running distributed TensorFlow jobs with simple configs.
- Supports running user-specified Docker images.
- Supports specifying GPUs and other resources.
- Supports launching TensorBoard for training jobs if the user requests it.
- Supports customized DNS names for roles (like tensorboard.$user.$domain:6006)
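The role DNS names follow the `tensorboard.$user.$domain:6006` pattern from the last goal. A minimal sketch of how such a name is composed (the user `alice` and domain `example.com` are placeholder values, not defaults):

```shell
# Compose a role DNS name from the pattern tensorboard.$user.$domain:6006.
# "alice" and "example.com" are hypothetical example values.
user=alice
domain=example.com
tensorboard_host="tensorboard.${user}.${domain}:6006"
echo "${tensorboard_host}"
```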
There is no complete and easy-to-understand example for beginners, and because Submarine supports many open-source infrastructures, deploying each runtime environment is hard for engineers, let alone data scientists.
This repo aims to let users easily deploy container orchestrators (like Hadoop YARN and Kubernetes) via Docker containers, provide a fully distributed deep learning example for each runtime, and offer a step-by-step tutorial for beginners.
- Ubuntu 18.04+
- Docker
- Memory > 8G
A fast and easy way to deploy Submarine on your laptop.
With just a few clicks, you are set up for experimentation and for running a complete Submarine experiment.
mini-submarine includes:
- Standalone Hadoop v2.9.2
- Standalone Zookeeper v3.4.14
- Latest version of Apache Submarine
- TensorFlow example (MNIST handwritten digit)
docker build --tag hello-submarine ./mini-submarine
docker run -it -h submarine-dev --name mini-submarine --net=bridge --privileged -P hello-submarine /bin/bash
docker pull pingsutw/hello-submarine
docker run -it -h submarine-dev --name mini-submarine --net=bridge --privileged -P pingsutw/hello-submarine /bin/bash
pwd # /home/yarn/submarine
. ./venv/bin/activate
# change directory
cd ..
cd tests
# run locally
python run_deepfm.py -conf deepfm.json -task train
python run_deepfm.py -conf deepfm.json -task evaluate
# Model metrics : {'auc': 0.64110434, 'loss': 0.4406755, 'global_step': 12}
# run distributed training
export SUBMARINE_VERSION=0.6.0-SNAPSHOT
export SUBMARINE_HADOOP_VERSION=2.9
export SUBMARINE_JAR=/opt/submarine-dist-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}/submarine-dist-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar
java -cp $(${HADOOP_COMMON_HOME}/bin/hadoop classpath --glob):${SUBMARINE_JAR}:${HADOOP_CONF_PATH} \
org.apache.submarine.client.cli.Cli job run --name deepfm-job-001 \
--framework tensorflow \
--verbose \
--input_path "" \
--num_workers 2 \
--worker_resources memory=2G,vcores=4 \
--num_ps 1 \
--ps_resources memory=2G,vcores=4 \
--worker_launch_cmd "myvenv.zip/venv/bin/python run_deepfm.py -conf=deepfm_distributed.json" \
--ps_launch_cmd "myvenv.zip/venv/bin/python run_deepfm.py -conf=deepfm_distributed.json" \
--insecure \
--conf tony.containers.resources=../submarine/myvenv.zip#archive,${SUBMARINE_JAR},deepfm_distributed.json,run_deepfm.py
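The long `SUBMARINE_JAR` path above is just the two version variables composed into a nested dist directory. A small sketch of how the pieces fit together (the `/opt` layout assumes the mini-submarine image):

```shell
# Compose the Submarine "all" jar path from the two version variables.
export SUBMARINE_VERSION=0.6.0-SNAPSHOT
export SUBMARINE_HADOOP_VERSION=2.9
dist="submarine-dist-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}"
SUBMARINE_JAR="/opt/${dist}/${dist}/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar"
echo "${SUBMARINE_JAR}"
```

Checking the echoed path against the filesystem (`[ -f "$SUBMARINE_JAR" ]`) before launching the job is a quick way to catch version mismatches.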
Deploy all components on K8s, including:
- MySQL Database
- Submarine Server
- Submarine Workbench
- tf-operator
- pytorch-operator
curl -Lo ./kind "https://github.com/kubernetes-sigs/kind/releases/download/v0.7.0/kind-$(uname)-amd64"
chmod +x ./kind
mv ./kind /some-dir-in-your-PATH/kind
kind create cluster --image kindest/node:v1.15.6 --name k8s-submarine
kubectl create namespace submarine
# set submarine as the default namespace
kubectl config set-context --current --namespace=submarine
curl -LO https://storage.googleapis.com/kubernetes-release/release/`curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt`/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
kubectl version --client
curl https://helm.baltorepo.com/organization/signing.asc | sudo apt-key add -
sudo apt-get install apt-transport-https --yes
echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm
helm install submarine ./helm-charts/submarine
kubectl port-forward svc/submarine-server 8080:8080
# open workbench at http://localhost:8080
# Account: admin
# Password: admin
curl -X POST -H "Content-Type: application/json" -d '
{
"meta": {
"name": "tf-mnist-json",
"namespace": "submarine",
"framework": "TensorFlow",
"cmd": "python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150",
"envVars": {
"ENV_1": "ENV1"
}
},
"environment": {
"image": "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0"
},
"spec": {
"Ps": {
"replicas": 1,
"resources": "cpu=1,memory=512M"
},
"Worker": {
"replicas": 1,
"resources": "cpu=1,memory=512M"
}
}
}
' http://127.0.0.1:32080/api/v1/experiment
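Instead of inlining the payload, the experiment spec can be kept in a file and sanity-checked before submission. A sketch using a trimmed-down version of the spec above (the filename `experiment.json` is just an example; only worker fields are kept here):

```shell
# Write a minimal experiment spec to a file. Field names and values
# match the inline example above, reduced to a single Worker role.
cat > experiment.json <<'EOF'
{
  "meta": {
    "name": "tf-mnist-json",
    "namespace": "submarine",
    "framework": "TensorFlow",
    "cmd": "python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150"
  },
  "environment": {
    "image": "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0"
  },
  "spec": {
    "Worker": { "replicas": 1, "resources": "cpu=1,memory=512M" }
  }
}
EOF
# Validate the JSON locally before POSTing it to the server:
python3 -m json.tool experiment.json > /dev/null && echo "spec is valid JSON"
# curl -X POST -H "Content-Type: application/json" \
#   -d @experiment.json http://127.0.0.1:32080/api/v1/experiment
```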
TBD