HDDS-11618. Enable HA modes for OM and SCM #10
base: main
Conversation
@dnskr and/or @adoroszlai do you have time to review the changes?
Also, you may find existing Kubernetes examples (without Helm) useful:
Yes, this is the part I used for all the env variables, together with the official docs, to implement the patch ^^
Thank you @pyttel for this PR. I have a few questions and comments.
- name: ratis-ipc
  port: 9858
- name: ipc
  port: 9859
Can the port numbers in all the services and statefulsets be referenced from the values.yaml file? We would like to avoid hardcoding them.
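A minimal sketch of that idea, using hypothetical values.yaml keys (here under datanode.ports, assuming this hunk is from the datanode service; the key names do not exist in the chart yet):

# values.yaml (hypothetical keys)
datanode:
  ports:
    ratisIpc: 9858
    ipc: 9859

# service template, referencing the values instead of hardcoding
- name: ratis-ipc
  port: {{ .Values.datanode.ports.ratisIpc }}
- name: ipc
  port: {{ .Values.datanode.ports.ipc }}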
Yes definitely! I just proposed this. This was also my personal question ^^
{{- if gt (int .Values.om.replicas) 1 }}
- name: ratis
  port: 9872
{{- end }}
Is this port being exposed for one Ozone Manager (OM) to communicate with another OM when they form a Ratis ring?
My understanding is that the current selector will match this service with all OM pods. Consequently, the messages sent to this service will be forwarded to a random OM pod instead of a specific OM pod.
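For reference, one common way to give Ratis peers stable, pod-specific addresses instead of a load-balanced endpoint is a headless Service. This is only a sketch with an assumed name, not necessarily what this chart should adopt:

apiVersion: v1
kind: Service
metadata:
  name: ozone-om-headless   # assumed name for illustration
spec:
  clusterIP: None           # headless: DNS returns individual pod IPs
  selector:
    app.kubernetes.io/component: om
  ports:
    - name: ratis
      port: 9872
# Peers would then be reachable as <pod-name>.ozone-om-headless.<namespace>.svc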
If the number of OM pods is manually modified by kubectl scale, then this port will perhaps never be exposed. We should think about whether there is a downside to always exposing the port.
As per OM documentation:
This logical name is called serviceId and can be configured in ozone-site.xml. The defined serviceId can be used instead of a single OM host using client interfaces.
Perhaps there should be a different service which maps and groups pods as per the serviceId.
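For illustration, the serviceId wiring from the OM HA docs boils down to configuration like the following. The env-to-config key convention (OZONE-SITE.XML_*) and the concrete node and service names are assumptions borrowed from the upstream Kubernetes examples, not from this chart:

# ConfigMap data, sketch only
OZONE-SITE.XML_ozone.om.service.ids: "omservice"
OZONE-SITE.XML_ozone.om.nodes.omservice: "om0,om1,om2"
OZONE-SITE.XML_ozone.om.address.omservice.om0: "ozone-om-0.ozone-om-headless"
OZONE-SITE.XML_ozone.om.address.omservice.om1: "ozone-om-1.ozone-om-headless"
OZONE-SITE.XML_ozone.om.address.omservice.om2: "ozone-om-2.ozone-om-headless"
# Clients can then address the quorum as "omservice" instead of a single OM host.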
Is this port being exposed for one Ozone Manager (OM) to communicate with another OM when they form a Ratis ring?
Yes, this was my reason.
My understanding is that the current selector will match this service with all OM pods. Consequently, the messages sent to this service will be forwarded to a random OM pod instead of a specific OM pod.
OK, if this is the case, maybe we can remove this exposure. I found and used the following ticket for the port configuration: https://issues.apache.org/jira/browse/HDDS-4677. I'm not really familiar with the architecture. If we can remove it, I will do so :)
If the number of OM pods is manually modified by kubectl scale, then this port will perhaps never be exposed. We should think about whether there is a downside to always exposing the port.
Great point! I hadn't considered that scenario. What could be the downside of exposing the port within an internal Kubernetes network? Perhaps we can use a Helm lookup mechanism to check the current replica count as an alternative, but the simplest approach is to always expose the port.
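A sketch of the lookup idea (illustrative only; the StatefulSet name is assumed, and lookup returns an empty object during helm template or --dry-run, so it cannot be the sole mechanism):

{{- /* fall back to .Values when the StatefulSet does not exist yet */}}
{{- $sts := lookup "apps/v1" "StatefulSet" .Release.Namespace (printf "%s-om" .Release.Name) }}
{{- $replicas := int .Values.om.replicas }}
{{- if $sts }}
{{- $replicas = int $sts.spec.replicas }}
{{- end }}
{{- if gt $replicas 1 }}
- name: ratis
  port: 9872
{{- end }}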
{{- if gt (int .Values.scm.replicas) 1 }}
- name: bootstrap
  image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
  args: ["ozone", "scm", "--bootstrap"]
From the SCM HA docs:
The initialization of the first SCM-HA node is the same as a non-HA SCM:
ozone scm --init
Second and third nodes should be bootstrapped instead of init:
ozone scm --bootstrap
Here, we call init and then bootstrap for every SCM pod (re)start. We would instead have to perform pod-id specific actions. It is not clear from the documentation whether init and bootstrap should only be performed once during the lifetime of the pod or upon every pod restart.
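One possible way to make the action pod-specific is to branch on the StatefulSet ordinal inside the init container. This is only a sketch; it assumes bash is available in the image and that the hostname follows the usual <statefulset>-<ordinal> pattern:

- name: init
  image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
  command: ["/bin/bash", "-c"]
  args:
    - |
      # the pod ordinal is the numeric suffix of the StatefulSet hostname
      if [ "${HOSTNAME##*-}" = "0" ]; then
        ozone scm --init
      else
        ozone scm --bootstrap
      fi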
It appears from the documentation that init on pod-1 needs to complete before bootstrap on pod-x. This would perhaps require changing podManagementPolicy: Parallel to OrderedReady.
From the SCM HA docs:
The initialization of the first SCM-HA node is the same as a non-HA SCM: ozone scm --init. Second and third nodes should be bootstrapped instead of init: ozone scm --bootstrap.
Here, we call init and then bootstrap for every SCM pod (re)start. We would instead have to perform pod-id specific actions. It is not clear from the documentation whether init and bootstrap should only be performed once during the lifetime of the pod or upon every pod restart.

I used the following doc:
Auto-bootstrap
In some environments (e.g. Kubernetes) we need to have a common, unified way to initialize the SCM HA quorum. As a reminder, the standard initialization flow is the following:
On the first, “primordial” node: ozone scm --init
On second/third nodes: ozone scm --bootstrap
This can be improved: the primordial SCM can be configured by setting ozone.scm.primordial.node.id in the config to one of the nodes, e.g. ozone.scm.primordial.node.id = scm1. With this configuration both scm --init and scm --bootstrap can be safely executed on all SCM nodes. Each node will only perform the action applicable to it based on the ozone.scm.primordial.node.id and its own node ID.
Note: SCM still needs to be started after the init/bootstrap process:
ozone scm --init
ozone scm --bootstrap
ozone scm --daemon start
For Docker/Kubernetes, use ozone scm to start it in the foreground.

This can be found in the Auto-bootstrap section here. I understood that we can run these commands on every instance and each node will automatically detect what to do if ozone.scm.primordial.node.id is set correctly.
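Concretely, the auto-bootstrap approach quoted above amounts to something like this. The node id value and the OZONE-SITE.XML_ key convention are assumptions about how the chart wires its configuration:

# ConfigMap data: pin the primordial SCM to the first pod (sketch)
OZONE-SITE.XML_ozone.scm.primordial.node.id: "scm0"
# With this set, the same sequence is safe on every SCM pod:
#   ozone scm --init        # only acts on the primordial node
#   ozone scm --bootstrap   # only acts on the non-primordial nodes
#   ozone scm               # start in the foreground (Docker/Kubernetes)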
It appears from the documentation that init on pod-1 needs to complete before bootstrap on pod-x. This would perhaps require changing podManagementPolicy: Parallel to OrderedReady.
I used Parallel because of the Ratis ring. The problem with the other configuration is that the hostnames of unstarted pods behind the headless service cannot be resolved. So the first SCM node cannot start because it depends on resolving the other ones, which in turn cannot start because they only start after the first node has finished successfully.
However, this is only relevant for the bootstrap process. So, you might be right. We probably only need the bootstrap once in persistent mode. Maybe someone from the Ozone contributors can provide a definitive answer to this question.
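One possible mitigation for the DNS problem described above (a sketch; untested for this chart) is to publish not-ready addresses on the headless service, so that peer hostnames resolve even before the pods become Ready:

apiVersion: v1
kind: Service
metadata:
  name: ozone-scm-headless   # assumed name for illustration
spec:
  clusterIP: None
  publishNotReadyAddresses: true   # create DNS records for pods that are not Ready yet
  selector:
    app.kubernetes.io/component: scm
  ports:
    - name: ratis
      port: 9894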
@@ -31,6 +31,7 @@ metadata:
    app.kubernetes.io/component: scm
spec:
  replicas: {{ .Values.scm.replicas }}
  podManagementPolicy: Parallel
Is a Parallel policy disruptive when we use kubectl scale up/down?
It would be interesting to see if the Ratis rings are disrupted. Perhaps a PreStop hook for graceful shutdown of OM/SCM/Datanodes is required in a new Jira.
I see a different problem: if we use kubectl scale, the configuration generated by the helpers (cluster IDs and so on) is also not correct. Puhhh, no idea how to solve this at the moment...
{{- if gt (int .Values.scm.replicas) 1 }}
- name: ratis
  containerPort: 9894
- name: grpc
  containerPort: 9895
{{- end }}
Since we are exposing ports, we should also have a readiness probe to toggle access to these ports in a separate Jira.
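For illustration, a simple TCP readiness probe on the Ratis port could look like this; the port comes from the hunk above, the thresholds are arbitrary:

readinessProbe:
  tcpSocket:
    port: 9894
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 6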
@@ -31,6 +31,7 @@ metadata:
    app.kubernetes.io/component: om
spec:
  replicas: {{ .Values.om.replicas }}
  podManagementPolicy: Parallel
As per OM documentation:
To convert a non-HA OM to be HA or to add new OM nodes to existing HA OM ring, new OM node(s) need to be bootstrapped.
Shouldn't there be an init container which calls --bootstrap for OM?
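A sketch of what such an init container might look like, mirroring the SCM one; whether ozone om --bootstrap is safe to run on every restart (or needs a prior --init) is exactly the open question here:

{{- if gt (int .Values.om.replicas) 1 }}
- name: bootstrap
  image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
  command: ["/bin/bash", "-c"]
  args:
    - |
      # hypothetical: only non-first pods bootstrap into the existing OM ring
      if [ "${HOSTNAME##*-}" != "0" ]; then
        ozone om --bootstrap
      fi
{{- end }}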
Yes, I think you are right.
At this point, I get stuck in the bootstrap init container. I believe this step is only necessary for new cluster nodes. Thus, the problem reoccurs when we scale (once again, the scaling issue).
@ptlrs Thank you for the great review and participation! 😊
Hello again, I've conducted numerous tests and discovered quite a few insights ^^. It turned out to be more challenging than I initially expected. I've successfully set up Ozone Manager HA with proper leadership transfer, decommissioning, and bootstrap detection. Over the next few days, I'll write a detailed description and push the code. I used some Helm hooks and jobs for this setup. It took some time to configure everything correctly. Currently, I'm focusing on the Storage Container Manager (SCM). The existing solution, which utilizes two init containers, is not very effective because if more than one pod is deleted, the cluster doesn't start up properly due to DNS resolution, pod deployment order, readiness probes, etc. I intend to replicate the approach I used for the Ozone Manager. Currently, I'm facing challenges with the leadership transfer.
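For readers unfamiliar with the mechanism mentioned above, a Helm hook Job is declared with annotations like the following. This is a generic sketch, not one of the actual jobs used in this work:

apiVersion: batch/v1
kind: Job
metadata:
  name: ozone-om-decommission   # illustrative name
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: decommission
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          # placeholder command; the real job would run the OM decommission steps
          command: ["/bin/bash", "-c", "echo 'decommission logic goes here'"]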
So it seems to be an issue with the admin transfer for SCM. This works fine for OM. There seems to be a mismatch between the UUID and the node id. The admin transfer seems to look for peer nodes with
Thanks a lot @pyttel for continuing work on this. You are right, there seems to be a mismatch between OM and SCM in how nodes are identified for
Reported HDDS-11839.
@pyttel Would it be a good idea to split the PR into two separate PRs for OM and SCM cases?
@dnskr Yes, that might be a good idea. So after work I will write the doc and push the changes for OM HA I have made so far. I have used logs to determine whether bootstraps or decommissions are ready. It works fine if you use at least INFO log level, but that's not for production. Is there some other criterion we can use? Or do we need to write some files like
To the point about ArgoCD: I do not see any problems. This should make life easy. You can just deploy the chart and everything is managed by the chart. So if you decrease replicaCount, for example, and make a new chart revision by upgrade (
OM HA ideas
The main features for Helm-managed Ozone Manager in HA mode are based on
This solution seems to be fail-safe and dynamic, and works without the
My ordered testing cases:
What changes were proposed in this pull request?
HDDS-11618. The changes enable HA modes for OM and SCM based on the replica count.
In Kubernetes clusters, redundancy is crucial. However, using more than one instance of OM or SCM results in multiple errors with the current configuration. To address this, the HA configuration described in the official documentation has been integrated into this Helm chart.
The main purpose is to enable Ratis over replica counts and to enable bootstrap for SCM by adding a new init container. Additionally, proper cluster configuration has been introduced. When the replica count is set to 1, a standalone configuration is maintained to ensure backwards compatibility.
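The backwards-compatibility switch follows the same pattern as the hunks shown in the review above, roughly:

{{- if gt (int .Values.om.replicas) 1 }}
# HA-only settings: Ratis ports, serviceId configuration, bootstrap handling
{{- else }}
# a replica count of 1 keeps the existing standalone configuration unchanged
{{- end }}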
Please refer to the OM HA DOC and the SCM HA DOC.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11618
How was this patch tested?
This patch was tested using the GitHub workflow and manual cluster tests on a Rancher Kubernetes cluster. It was evaluated both as a standalone and as an HA deployment. Additionally, it was tested in a fresh Kubernetes cluster and as a dependency chart.