
HDDS-11618. Enable HA modes for OM and SCM #10

Open
wants to merge 5 commits into main

Conversation

@pyttel (Contributor) commented Nov 2, 2024

What changes were proposed in this pull request?

HDDS-11618. The changes enable HA modes for OM and SCM based on the configured replica count.

Please describe your PR in detail:

  • What changes are proposed in the PR? and Why? It would be better if it is written from third person's
    perspective not just for the reviewer.

In Kubernetes clusters, redundancy is crucial. However, using more than one instance of OM or SCM results in multiple errors with the current configuration. To address this, the HA configuration described in the official documentation has been integrated into this Helm chart.

  • Provide as much context and rationale for the pull request as possible. It could be copy-paste from
    the Jira's description if the jira is well defined.

The main purpose is to enable Ratis based on the replica count and to enable bootstrapping for SCM by adding a new init container. Additionally, proper cluster configuration has been introduced. When the replica count is set to 1, a standalone configuration is maintained to ensure backwards compatibility.
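As a rough illustration, the knobs that drive this are the chart's existing om.replicas and scm.replicas values; the comments below are mine, not part of the chart:

# Sketch of the values.yaml settings that switch between standalone and HA mode
om:
  replicas: 3      # > 1 switches the OM templates to Ratis/HA configuration
scm:
  replicas: 3      # > 1 enables SCM HA and the bootstrap init container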

  • If it is complex code, describe the approach used to solve the issue. If possible attach design doc,
    issue investigation, github discussion, etc.

Please refer to the OM HA DOC and SCM HA DOC.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11618

How was this patch tested?

This patch was tested using the GitHub workflow and manual cluster tests on a Rancher Kubernetes cluster. It was evaluated both as a standalone and as an HA deployment. Additionally, it was tested in a plain new Kubernetes cluster and as a dependency chart.

@pyttel (Contributor Author) commented Nov 5, 2024

@dnskr and/or @adoroszlai do you have time to review the changes?

@adoroszlai (Contributor)

Thanks @pyttel for the patch. I don't have time to review it, but will try to find some Ozone developer to do so. Would be nice if @dnskr could review it from Helm perspective.

@adoroszlai (Contributor)

Also, you may find existing Kubernetes examples (without Helm) useful:
https://github.com/apache/ozone/tree/master/hadoop-ozone/dist/src/main/k8s/examples/ozone-ha

@pyttel (Contributor Author) commented Nov 5, 2024

Yes, that is one of the resources I used for all the env variables, along with the official docs, to implement the patch ^^

@ptlrs left a comment

Thank you @pyttel for this PR. I have a few questions and comments.

Comment on lines +31 to +34
- name: ratis-ipc
port: 9858
- name: ipc
port: 9859
@ptlrs commented Nov 7, 2024

Can the port numbers in all the services and statefulsets be referenced from the values.yaml file? We would like to avoid hardcoding them.

Contributor Author

Yes, definitely! This is just my first proposal, and it was also a question I had myself ^^
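A rough sketch of what that could look like (the values keys shown here are made up for illustration and are not part of the current chart):

# values.yaml (hypothetical layout)
ports:
  ratisIpc: 9858
  ipc: 9859

# service template, referencing the values instead of hardcoding
- name: ratis-ipc
  port: {{ .Values.ports.ratisIpc }}
- name: ipc
  port: {{ .Values.ports.ipc }}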

Comment on lines 31 to 34
{{- if gt (int .Values.om.replicas) 1 }}
- name: ratis
port: 9872
{{- end }}

Is this port being exposed for one Ozone Manager(OM) to communicate with another OM when they form a Ratis ring?
My understanding is that the current selector will match this service with all OM pods. Consequently, the messages sent to this service will be forwarded to a random OM pod instead of a specific OM pod.


If the number of OM pods is manually modified by kubectl scale, then this port will perhaps never be exposed. We should consider whether there is a downside to always exposing the port.


As per OM documentation

This logical name is called serviceId and can be configured in the ozone-site.xml
The defined serviceId can be used instead of a single OM host using client interfaces

Perhaps there should be a different service which maps and groups pods as per the serviceId.

Contributor Author

Is this port being exposed for one Ozone Manager(OM) to communicate with another OM when they form a Ratis ring?

Yes, that was my reason.

My understanding is that the current selector will match this service with all OM pods. Consequently, the messages sent to this service will be forwarded to a random OM pod instead of a specific OM pod.

OK, if that is the case, maybe we can remove this exposed port. I found and used the following ticket for the port configuration: https://issues.apache.org/jira/browse/HDDS-4677. I'm not really familiar with the architecture, so if we can remove it, I will do so :)

Contributor Author

If the number of OM pods is manually modified by kubectl scale, then this port will perhaps never be exposed. We should consider whether there is a downside to always exposing the port.

Great point! I hadn't considered that scenario. What could be the downside of exposing the port within an internal Kubernetes network? Perhaps we can use a Helm lookup mechanism to check the current replica count as an alternative, but the simplest approach is to always expose the port.
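For reference, a lookup-based check could look roughly like this. This is only a sketch: the "ozone.fullname" helper name and the fallback to .Values.om.replicas when the StatefulSet does not exist yet are assumptions on my side.

{{- /* Sketch: read the currently deployed OM replica count instead of the configured one */}}
{{- $sts := lookup "apps/v1" "StatefulSet" .Release.Namespace (printf "%s-om" (include "ozone.fullname" .)) }}
{{- $currentReplicas := .Values.om.replicas }}
{{- if $sts }}
{{- $currentReplicas = $sts.spec.replicas }}
{{- end }}
{{- if gt (int $currentReplicas) 1 }}
- name: ratis
  port: 9872
{{- end }}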

{{- if gt (int .Values.scm.replicas) 1 }}
- name: bootstrap
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
args: ["ozone", "scm", "--bootstrap"]

From the SCM HA docs:

The initialization of the first SCM-HA node is the same as a non-HA SCM:
ozone scm --init
Second and third nodes should be bootstrapped instead of init
ozone scm --bootstrap

Here, we call init and then bootstrap for every SCM pod (re)start.
We would instead have to perform pod-id specific actions.

It is not clear from the documentation whether init and bootstrap should only be performed once during the lifetime of the pod or upon every pod restart.


It appears from the documentation that init on pod-1 needs to complete before bootstrap on pod-x. This would perhaps require changing podManagementPolicy: Parallel to OrderedReady.

Contributor Author

From the SCM HA docs:

The initialization of the first SCM-HA node is the same as a non-HA SCM:
ozone scm --init
Second and third nodes should be bootstrapped instead of init
ozone scm --bootstrap

Here, we call init and then bootstrap for every SCM pod (re)start.
We would instead have to perform pod-id specific actions.

It is not clear from the documentation whether init and bootstrap should only be performed once during the lifetime of the pod or upon every pod restart.

I used the following doc:

Auto-bootstrap

In some environments (e.g. Kubernetes) we need to have a common, unified way to initialize SCM HA quorum. As a reminder, the standard initialization flow is the following:

On the first, “primordial” node: ozone scm --init
On second/third nodes: ozone scm --bootstrap

This can be improved: primordial SCM can be configured by setting ozone.scm.primordial.node.id in the config to one of the nodes.

ozone.scm.primordial.node.id scm1

With this configuration both scm --init and scm --bootstrap can be safely executed on all SCM nodes. Each node will only perform the action applicable to it based on the ozone.scm.primordial.node.id and its own node ID.

Note: SCM still needs to be started after the init/bootstrap process.

ozone scm --init
ozone scm --bootstrap
ozone scm --daemon start

For Docker/Kubernetes, use ozone scm to start it in the foreground.

This can be found in the Auto-bootstrap section here. My understanding is that we can run these commands on every instance, and each node will automatically detect what to do if ozone.scm.primordial.node.id is set correctly.
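For illustration, the relevant piece of configuration could be passed the same way the chart passes other ozone-site.xml entries. The node id "scm-0" and the env-var naming below are assumptions on my side; the property itself is from the quoted docs.

# configmap sketch: pin the primordial SCM so --init/--bootstrap are safe on every node
OZONE-SITE.XML_ozone.scm.primordial.node.id: "scm-0"

# then run on every SCM pod; each node only performs the action applicable to it
#   ozone scm --init
#   ozone scm --bootstrap
#   ozone scm            (foreground start for Kubernetes)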


Contributor Author

It appears from the documentation that init on pod-1 needs to complete before bootstrap on pod-x. This would perhaps require changing podManagementPolicy: Parallel to OrderedReady.

I used this because of the Ratis ring. The problem with the other configuration is that the hostnames of not-yet-started nodes behind the headless service cannot be resolved. So the first SCM node cannot start, because it depends on resolving the other ones, which in turn cannot start because they only start once the first node has finished successfully.

However, this is only relevant for the bootstrap process. So you might be right: we probably only need the bootstrap once in persistent mode. Maybe someone from the Ozone contributors can provide a definitive answer to this question.
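One idea that might help with the DNS side of this (not something the current patch does, just a thought): publishing not-ready addresses on the headless service, so that peer hostnames resolve while the pods are still starting or bootstrapping. A sketch with illustrative names:

apiVersion: v1
kind: Service
metadata:
  name: ozone-scm-headless           # illustrative name
spec:
  clusterIP: None
  publishNotReadyAddresses: true     # DNS records exist even before the pods are Ready
  selector:
    app.kubernetes.io/component: scm
  ports:
    - name: ratis
      port: 9894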

@@ -31,6 +31,7 @@ metadata:
app.kubernetes.io/component: scm
spec:
replicas: {{ .Values.scm.replicas }}
podManagementPolicy: Parallel

Is a Parallel policy disruptive when we use kubectl scale up/down?
It would be interesting to see whether the Ratis rings are disrupted. Perhaps a PreStop hook for graceful shutdown of OM/SCM/Datanodes is required, in a new Jira.

Contributor Author

I see a different problem: if we use kubectl scale, the configuration from the helpers (cluster ids and so on) is also no longer correct. Phew, no idea how to solve this at the moment...

Comment on lines +110 to +115
{{- if gt (int .Values.scm.replicas) 1 }}
- name: ratis
containerPort: 9894
- name: grpc
containerPort: 9895
{{- end }}

Since we are exposing ports, we should also have a readiness probe to toggle access to these ports in a separate Jira.
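For illustration, such a probe could be as simple as a TCP check on the exposed port (the timing values here are placeholders):

readinessProbe:
  tcpSocket:
    port: 9894          # SCM Ratis port from the snippet above
  initialDelaySeconds: 30
  periodSeconds: 10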

@@ -31,6 +31,7 @@ metadata:
app.kubernetes.io/component: om
spec:
replicas: {{ .Values.om.replicas }}
podManagementPolicy: Parallel

As per OM documentation

To convert a non-HA OM to be HA or to add new OM nodes to existing HA OM ring, new OM node(s) need to be bootstrapped.

Shouldn't there be an init container which calls --bootstrap for OM?

Contributor Author

Yes, I think you are right.
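A sketch of what such an init container could look like, mirroring the SCM one above (whether ozone om --bootstrap should run on every restart, or only for genuinely new nodes, is exactly the open question):

{{- if gt (int .Values.om.replicas) 1 }}
- name: bootstrap
  image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
  args: ["ozone", "om", "--bootstrap"]
{{- end }}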

Contributor Author

At this point, I get stuck in the bootstrap init container. I believe this step is only necessary for new cluster nodes. Thus, the problem reoccurs when we scale (once again, the scaling issue).

@pyttel (Contributor Author) commented Nov 7, 2024

@ptlrs Thank you for the great review and participation! 😊

@pyttel (Contributor Author) commented Dec 2, 2024

Hello again,

I've conducted numerous tests and discovered quite a few insights ^^. It turned out to be more challenging than I initially expected. I've successfully set up Ozone Manager HA with proper leadership transfer, decommissioning, and bootstrap detection. Over the next few days, I'll write a detailed description and push the code. I used some Helm hooks and jobs for this setup. It took some time to configure everything correctly. Currently, I'm focusing on the Storage Container Manager (SCM). The existing solution, which utilizes two init containers, is not very effective because if more than one pod is deleted, the cluster doesn't start up properly due to DNS resolution, pod deployment order, readiness probes, etc. I intend to replicate the approach I used for the Ozone Manager. Currently, I'm facing challenges with the leadership transfer.

Target test-scm-0 not found in group [cc965017-abb3-4693-a016-fa8fe34be6dc, 1bbfa1c7-69f3-42bd-845e-bf8dc0fc7fe1, b64a194c-2921-4fe3-a92c-08f9a187894e, 78e0f9e0-9f3b-429b-9bbe-db8666ada092, b9e47090-175c-4e32-952b-2df989364483, 7e3b1a64-052c-4b5f-8be2-648226d765c4].

So it seems to be an issue with the admin transfer for SCM; this works fine for OM. There seems to be a mismatch between UUID and node id: the admin transfer appears to look for peer nodes by node id, not UUID. Is this known, or can somebody help me with this? Maybe it is a bug?

@adoroszlai (Contributor)

Thanks a lot @pyttel for continuing work on this. You are right, there seems to be a mismatch between OM and SCM in how nodes are identified for the transfer command.

$ ozone admin scm transfer --new-leader-id scm1
Target scm1 not found in group [891cd056-3cf1-4e52-9cac-eff0af7bfefe, b25f513f-ddf7-4382-a3f0-fe5b209a7a26, ac63c8ae-1f4c-4fa7-9eed-e279d0b573ff].

$ ozone admin om transfer --new-leader-id om1
Transfer leadership successfully to om1.

Reported HDDS-11839.

@dnskr (Contributor) commented Dec 2, 2024

@pyttel Would it be a good idea to split the PR into two separate PRs for OM and SCM cases?
You mentioned that the issue is quite challenging and implementation includes Helm hooks and jobs, so the review process and testing might be difficult. For instance, I'm curious how jobs and Helm hooks will work with continuous delivery tools like ArgoCD.

@pyttel (Contributor Author) commented Dec 3, 2024

@dnskr Yes, that might be a good idea. After work I will write the doc and push the OM HA changes I have made so far. I have used logs to determine whether bootstraps or decommissions have finished; this works fine with at least INFO log level, but that's not suitable for production. Is there some other criterion we can use? Or do we need to write files like .bootstrapped or .decommissioned to detect the states?

Regarding ArgoCD: I do not see any problems. This should make life easy. You can just deploy the chart and everything is managed by the chart. So if you decrease the replica count, for example, and create a new chart revision by upgrading (helm upgrade --install ...), the changes are automatically translated into a pre-upgrade hook to transfer the leader and a post-upgrade hook to decommission the unwanted replicas. But of course, if we face issues we need to solve them :)
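For reviewers unfamiliar with the mechanism, the wiring is just standard Helm hook annotations on the jobs, roughly like this (the job name, hook weight, and the om-0 target below are placeholders, not the chart's actual values):

apiVersion: batch/v1
kind: Job
metadata:
  name: om-transfer-leader            # placeholder name
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: transfer
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          args: ["ozone", "admin", "om", "transfer", "--new-leader-id", "om-0"]

ArgoCD maps Helm hooks onto its own sync phases (pre-upgrade hooks run as PreSync), so in principle this should stay compatible, but that is exactly what needs testing.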

@pyttel (Contributor Author) commented Dec 5, 2024

OM HA ideas

The main feature of the Helm-managed Ozone Manager in HA mode is based on replica count changes from one revision to another. To implement it, I used the three Helm and Kubernetes elements listed below:

  • A changed argument wrapper for the start command: I did not change the start command itself, because there is a lot of magic happening there (environment-to-conf conversion and so on). Instead, as is usual in Helm charts, I wrapped it with logic for the startup of OM instances. The main idea is to compare the currently deployed replica count with the replica count configured in the next Helm chart revision, using the lookup function from Helm 3. If the difference in OM replica count is zero, there is nothing special to do. If it is negative, meaning the number of OM instances will be reduced, decommissioning of those nodes must be triggered (not relevant at this point). If the difference is positive, new OM instances must be bootstrapped. As bootstrap for an instance (= node) should only be done once, a marker file is written to the persistence path configured in values.yaml on successful bootstrap. On every start the file's existence is checked, and only if the file is missing is the bootstrap argument added to the args configured in values.yaml (see the sketch after this list). Currently, the success of the bootstrap is detected by a background process based on the log output; this must be changed before a production release, as it requires a log level of at least INFO. The important part of bootstrapping in the Kubernetes context is when more than one instance is added in a single Helm revision (delta > 1). The bootstrap process checks whether all configs have been updated with the new nodes. Because of ordered pod creation, the pods with an index higher than the one currently bootstrapping have not been created yet and are therefore not resolvable via DNS at that point in time. After a while the bootstrap process is killed by the Kubernetes timeout, the current pod never becomes ready, and no further instances are created: a bootstrapping deadlock. I first tried to resolve this by setting the Parallel policy on the StatefulSet so that all pods are created at the same time, but this caused other problems when two instances bootstrap simultaneously. Since that did not feel right, I went back to normal ordered pod creation. With ordered creation, the bootstrap config for a new pod must not contain the node ids of pods that will be created after the current one, so the bootstrapping must be done one by one with such a config. This is why, while bootstrapping an instance, I override the node ids to contain only the instances already available in Kubernetes at that point in time plus the one currently bootstrapping. This works like a charm and sticks to normal Kubernetes behavior.
  • A Helm post-upgrade job: this is needed for decommissioning. If the delta described above is negative, this job is created as a Helm post-upgrade job. At that point, config reloads of the existing instances are triggered by Kubernetes because of the checksum annotation, which I changed a bit. For each decommissioned node, a dedicated decommissioning job and a temporary service to reach it are created. The PVC of each decommissioned node is mounted into its job, so after decreasing the instance count the old instances' data is still available to the decommissioning job. In each job the decommissioning command is executed. If it succeeds, all data on the PVC is cleared to enable bootstrapping again in the future. Please check _helper.tpl for the different env setups in these phases! One issue I faced while testing concerns the leader: if the leader becomes unavailable and should be decommissioned, there were problems. I therefore decided to introduce a pre-upgrade job that handles the leader transfer to the instance with index 0.
  • A Helm pre-upgrade job to transfer the leader: this is an additional temporary container with a special config (please see _helper.tpl; it took me a while to determine the correct config). It runs before any container is updated and transfers leadership to the instance with index 0.
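To make the first bullet more concrete, the marker-file logic could look roughly like the wrapper below. This is a sketch under assumptions: the marker path, the $omReplicaDelta variable (computed elsewhere via the lookup-based helper), and how the initial, never-bootstrapped instances get their marker are glossed over and not taken from the actual chart.

# Hypothetical excerpt of the OM statefulset start wrapper
command: ["/usr/bin/env", "bash", "-c"]
args:
  - |
    MARKER="{{ .Values.om.persistence.path | default "/data" }}/om.bootstrapped"
    EXTRA_ARGS=""
    {{- if gt $omReplicaDelta 0 }}
    # this revision added OM instances; bootstrap only volumes that never did it before
    if [ ! -f "$MARKER" ]; then
      EXTRA_ARGS="--bootstrap"
    fi
    {{- end }}
    exec ozone om $EXTRA_ARGS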

This solution seems to be fail-safe and dynamic, and it works without the --force argument. I hope this is a good starting point for discussing this challenging task.

My ordered testing cases:

  1. Create an Ozone deployment in persistent mode with OM HA (3 instances)
  2. Update the Helm chart revision with OM replica count 4 (one bootstrap node)
  3. Update the Helm chart revision with OM replica count 7 (multiple bootstrap nodes)
  4. Update the Helm chart revision with OM replica count 7 but a changed config (same replica count)
  5. Transfer the OM leader to the last instance, with id 6 (check that the leader is automatically transferred to the instance with id 0)
  6. Update the Helm chart revision with OM replica count 6 (decommission one node)
  7. Update the Helm chart revision with OM replica count 3 (decommission multiple nodes)
