Adding a Kube Node to an existing cluster
From the dgx-setup repo
Edit the inventory file in the dgx-setup repo to add the new node(s)
- New nodes should be defined with IP addresses under the [all] section and then added to the [node] section
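As a sketch, the new entries might look like this (the node names and IP addresses below are hypothetical, and the exact host variables depend on the dgx-setup inventory conventions):

```ini
[all]
# Hypothetical new nodes; use your real hostnames and addresses
node01 ansible_host=10.0.0.21
node02 ansible_host=10.0.0.22

[node]
node01
node02
```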
Then run the cluster setup playbook, limited to the lambda group:
ansible-playbook -i ./inventory -l lambda ./cluster-setup.yaml
From the deepops repo
- In deepops/config/inventory, add the new nodes to the [all] section with their ip and ansible_host, then add their names to the [kube-node] section
Add to all:vars
deploy_container_engine=False
download_run_once=True
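A sketch of the corresponding inventory changes (the node names and addresses are hypothetical; substitute your own):

```ini
[all]
# Hypothetical new nodes
node01 ip=10.0.0.21 ansible_host=10.0.0.21
node02 ip=10.0.0.22 ansible_host=10.0.0.22

[kube-node]
node01
node02

[all:vars]
deploy_container_engine=False
download_run_once=True
```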
Then, from the deepops repo, run the Kubespray scale playbook:
ansible-playbook -i ./config/inventory -l k8s-cluster ./submodules/kubespray/scale.yml
If successful, check the node status with
kubectl get nodes
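The new node(s) should appear with STATUS Ready. The output resembles the following sketch (node names, ages, and versions are hypothetical; a freshly joined node shows no role until it is labeled):

```text
NAME     STATUS   ROLES    AGE   VERSION
mgmt01   Ready    master   90d   v1.18.9
node01   Ready    <none>   2m    v1.18.9
```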
There may be some issues running the scale.yml playbook. Here are some common issues and resolutions.
Kubespray requires at least Jinja2 2.11. If you hit an error while setting up kubeadm, run the following. Note that this also uninstalls the OpenShift client, so it is reinstalled at the end.
sudo yum remove python-jinja2
sudo pip uninstall ansible Jinja2
sudo pip install --upgrade ansible==2.9.5
sudo pip install --upgrade Jinja2==2.11.1
sudo pip install --upgrade setuptools
sudo pip install --upgrade openshift==0.11.2
See the following: https://github.com/kubernetes-sigs/kubespray/issues/5958
Label each node as a compute node.
kubectl label node NODE_NAME node-role.kubernetes.io/node=
Apply the custom node-type label.
kubectl label node NODE_NAME system_type=lambda
Apply the DCGM exporter label.
kubectl label node NODE_NAME hardware-type=NVIDIAGPU
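When adding several nodes at once, the three labeling commands can be combined into one loop. This is a sketch: the node names in NODES are hypothetical, and the loop only prints the commands as a dry run (remove the echo to actually apply them):

```shell
# Hypothetical new node names; replace with the nodes you just added.
NODES="node01 node02"

# Dry run: print the three label commands for each node.
# Remove 'echo' to run them for real.
for n in $NODES; do
  echo kubectl label node "$n" node-role.kubernetes.io/node=
  echo kubectl label node "$n" system_type=lambda
  echo kubectl label node "$n" hardware-type=NVIDIAGPU
done
```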
These are some steps to validate that the cluster is working.
Validate that the system and monitoring pods are running on the new node
Run this for each new node to ensure all pods are running and not restarting:
kubectl get pods -A -o wide | grep NODE_NAME
Check that the NVIDIA GPUs are visible on each node; nvidia.com/gpu should appear under Capacity and Allocatable in the output of
kubectl describe node NODE_NAME