Investigate and implement a means of migrating from K3s to RKE2 #7627
Replies: 15 comments
-
The trickier migration path that will need to be solved first is RKE -> RKE2, which is currently being planned out.
-
I would like to migrate from K3s to RKE2 and could possibly also write the documentation for it. Is this just snapshotting?
-
No, there are several other hurdles: it isn't really supported to change the cloud provider or CNI on an existing cluster; the packaged components (ingress, coredns, etc.) would need to be removed and their replacements installed, and so on.
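To make that concrete, here is a rough, hypothetical sketch (not an endorsed migration path) of the kind of RKE2 settings that would have to be reconciled with an existing K3s setup. The keys are standard /etc/rancher/rke2/config.yaml options, but the values are placeholders that depend entirely on the cluster in question:

```bash
# Hypothetical illustration only (not an endorsed migration path):
# settings that have to be reconciled when a typical K3s setup
# (Traefik ingress, flannel CNI) meets RKE2's defaults
# (ingress-nginx, canal CNI).
cat <<'EOF' > /etc/rancher/rke2/config.yaml
# Either accept RKE2's packaged ingress or disable it and install your own:
disable:
  - rke2-ingress-nginx
# RKE2 defaults to canal; swapping the CNI on a live cluster is not
# really supported, which is one of the hurdles mentioned above.
cni: canal
EOF
```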
-
Some raw notes from upgrading two k3s clusters to RKE2

I upgraded two k3s clusters earlier this year (while subscribing to updates to this issue...). In total we had about two hours of downtime in production due to issues we had not encountered while testing. Writing this to help others take the plunge 😁

List of some of our steps, findings and tricks. Cheers /Jörgen

Troubleshooting and comparing tokens when RKE2 refused to start on the first RKE2 control-plane node:

cat /var/lib/rancher/rke2/server/token
K10<DIFFERENT-K10-SHA256-HASH-ON-RKE2>::server:<ORIGINAL-SHA256-HASH-TOKEN>

cat /var/lib/rancher/k3s/server/token
K10<ORIGINAL-K10-SHA256-HASH-FROM-K3S>::server:<ORIGINAL-SHA256-HASH-TOKEN>

Deleting rke2 passwd and restarting:

rm /var/lib/rancher/rke2/server/cred/passwd
systemctl restart rke2-server.service
systemctl status rke2-server.service
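Not part of the original notes, but a small sketch of the same check in script form: the K10<...> prefix of the token encodes a hash of the cluster CA, so before deleting the passwd file one can confirm that only that prefix, and not the secret itself, differs between the two files:

```bash
# Sketch only: compare the secret part (after the last ':') of the two
# server tokens. If the secrets match, the difference is just the
# K10<hash> CA-hash prefix, as in the output above.
k3s_token=$(cat /var/lib/rancher/k3s/server/token)
rke2_token=$(cat /var/lib/rancher/rke2/server/token)

if [ "${k3s_token##*:}" = "${rke2_token##*:}" ]; then
  echo "token secrets match; only the CA hash prefix differs"
else
  echo "token secrets differ; do not blindly delete the passwd file"
fi
```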
The first RKE2 control-plane node starts:

/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes
NAME STATUS ROLES AGE VERSION
g8 NotReady <none> 536d v1.21.13+k3s1
vm-ceph-1 NotReady <none> 123d v1.21.13+k3s1
vm-ceph-2 NotReady <none> 123d v1.21.13+k3s1
vm-ceph-3 NotReady <none> 123d v1.21.13+k3s1
vm-ceph-4 NotReady <none> 123d v1.21.13+k3s1
vm-cpl-1 NotReady control-plane,etcd,master 31d v1.21.13+rke2r2 <--- YES
vm-cpl-2 NotReady control-plane,etcd,master 31d v1.21.13+k3s1
vm-cpl-3 NotReady control-plane,etcd,master 32d v1.21.13+k3s1
vm-neo-1 NotReady <none> 129d v1.21.13+k3s1
vm-neo-2 NotReady <none> 129d v1.21.13+k3s1
vm-neo-3 NotReady <none> 129d v1.21.13+k3s1
vm-worker-1 NotReady <none> 31d v1.21.13+k3s1
vm-worker-2 NotReady <none> 31d v1.21.13+k3s1

And now all of a sudden the token files are identical after rm /var/lib/rancher/rke2/server/cred/passwd and restart:

cat /var/lib/rancher/rke2/server/cred/passwd
<ORIGINAL-SHA256-HASH-TOKEN>,node,node,rke2:agent
<ORIGINAL-SHA256-HASH-TOKEN>,server,server,rke2:server

cat /var/lib/rancher/rke2/server/token
K10<ORIGINAL-K10-SHA256-HASH-FROM-K3S>::server:<ORIGINAL-SHA256-HASH-TOKEN>

cat /var/lib/rancher/k3s/server/token
K10<ORIGINAL-K10-SHA256-HASH-FROM-K3S>::server:<ORIGINAL-SHA256-HASH-TOKEN>

Some errors encountered and workarounds

CoreDNS
Deleting old k3s coredns:

kubectl scale -n kube-system deployment coredns --replicas 0
kubectl delete svc -n kube-system kube-dns
kubectl delete pod -n kube-system coredns-574bcc6c46-kkq9t --force

Rancher 2.6 webhook failed everything (since it could not start). Deleted Rancher mutating webhook (reinstalled later):

kubectl delete -n cattle-system mutatingwebhookconfigurations.admissionregistration.k8s.io rancher.cattle.io

Logs showing conflict between old and new metrics server. Deleting apiservice for k3s metrics:

$ kubectl get apiservices.apiregistration.k8s.io
NAME SERVICE AVAILABLE AGE
[...]
v1beta1.metrics.k8s.io kube-system/metrics-server False (ServiceNotFound) 603d
[...]
$ kubectl delete apiservices.apiregistration.k8s.io v1beta1.metrics.k8s.io
apiservice.apiregistration.k8s.io "v1beta1.metrics.k8s.io" deleted
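A follow-up check that is not in the original notes, but may help confirm the cleanup worked; the exact deployment names depend on the RKE2 version and its bundled Helm charts:

```bash
# Follow-up check (not in the original notes): after removing the stale
# k3s objects, confirm that the RKE2-packaged replacements are healthy.
kubectl -n kube-system get deployments | grep -E 'coredns|metrics-server'
kubectl get apiservices.apiregistration.k8s.io v1beta1.metrics.k8s.io
# AVAILABLE should eventually report True once the new metrics-server
# is serving, replacing the ServiceNotFound error shown above.
```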
-
@wthrbtn - thanks so much for sharing this process and experience!! Could you give some background on the business case?
-
Early 2020, I think, the choice was between RKE (v1, with Docker) and k3s. We could have stayed with k3s, but there were voices demanding a solution with commercial support if/when needed. Air-gapped RKE2 is still not the same joy to patch as k3s, for example. We have now learned to trick RKE2 into thinking it has Internet access, instead of feeding it gzip or zstd archives, to avoid stupid pod evictions (node NotReady for 0-2 seconds vs 1-2 minutes).

About 10 of our nodes host a Ceph cluster (using Rook) and provide that storage as both a StorageClass and S3 to other resources in the cluster. We had to get everything up and running really fast, and then we could rebuild nodes one at a time without impacting anything.
-
@wthrbtn - thanks for sharing these details! Helps to understand! ;-) Regarding air-gapped RKE2 - what are you missing when comparing K3s to RKE2?
-
I suspect that they're referring to the delay in importing the airgap images for core pods. K3s has everything in the binary and unpacks the few external components quickly. RKE2 takes longer because it has to import all the images from the tarball before things can come up all the way. Loading things into a local registry is definitely better than using airgap tarballs on every node. I am curious why you switched for support though; as far as I know you can get K3s support from us (SUSE) under the same terms as RKE2.
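To make the registry route concrete, a minimal sketch under assumed names (the registry hostname and port are placeholders, not anyone's actual setup):

```bash
# Placeholder sketch: mirror docker.io (where the rancher/* core images
# live) to an internal registry so nodes pull images instead of
# importing the airgap tarball on every start.
cat <<'EOF' > /etc/rancher/rke2/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://registry.example.internal:5000"
EOF
systemctl restart rke2-server.service   # rke2-agent.service on agent nodes
```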
-
Registry is a good way to go - but maybe also do it like we do in Harvester, where we pre-load the new images before restarting (a sketch follows below). The other request is still "do not re-import already imported images during every RKE2 restart" ;-))
Me too..
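For the pre-load idea mentioned above, a rough sketch of how it could be done manually on a plain RKE2 node; the paths are RKE2's defaults and this is an assumption about the approach, not how Harvester actually implements it:

```bash
# Pre-import the new release's airgap images into RKE2's embedded
# containerd *before* restarting the service, so the node spends less
# time NotReady. Assumes the default containerd socket path and an
# uncompressed image tarball.
IMAGES_TAR=/tmp/rke2-images.linux-amd64.tar   # placeholder path

/var/lib/rancher/rke2/bin/ctr \
  --address /run/k3s/containerd/containerd.sock \
  --namespace k8s.io \
  images import "${IMAGES_TAR}"

# Only then restart/upgrade RKE2:
systemctl restart rke2-server.service
```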
-
Correct. We DO use several registries, but everything still has to be able to recover from a COMPLETE blackout without registries. (Imagine running clusters on a submarine or similar; not that we do, but you can only bring so many backup systems...)

Politics. Someone who does not know Kubernetes heard we ran k3s and googled it. All they remembered was "lightweight, IoT & Edge". So to end the FUD about running a "toy" Kubernetes, I took the plunge even though k3s performed perfectly!!! Happy Holidays to you all!
-
Actually ONE thing I really miss is
-
I've been following the advice from #881 (comment) and it's been very helpful. However, I'm now in a situation where every time I restart my cluster I must remove Is there a way to prevent that?
-
@sdemura I had the same issue after migrating a cluster from K3s to RKE2. I was able to fix it by running
-
Since @Martin-Weiss asked for some background on the business case, I would like to contribute ours. Actually, our background is mostly the same as the one described by @wthrbtn: our migration from K3s to RKE2 is motivated by a mix of political and technical reasons.

Let's start with the political reasons. While we plan to continue using K3s for the majority of our clusters (especially those running on-premise on our customers' systems), parts of our business are actually subject to the German KRITIS regulations. While I don't believe it's strictly necessary to migrate to RKE2 to conform to KRITIS, it surely is easier to reason about, as RKE2 is advertised as suitable for the security requirements of governments. It's simply easier for us to use a Kubernetes distribution that advertises itself as being as secure as possible out of the box and that guarantees CIS Benchmark conformance just by configuring a flag.

But there are quite a few technical reasons for us to prefer RKE2 as well. I think it was around 2020/2021 when we migrated from Docker Swarm to K3s, which was just the perfect alternative for us at the time. We had very little experience with Kubernetes, and RKE2 wasn't even a thing back then. But as our clusters and requirements grew, we saw ourselves replacing more and more of K3s' bundled components. Just off the top of my head:
I understand that K3s can do everything that RKE2 can do, but the default setup of RKE2 is a way better fit for our requirements.
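As an illustration of the "single flag" mentioned above, a minimal sketch; hedged, since the accepted values vary by RKE2 release (e.g. cis-1.6, cis-1.23, or simply cis on recent versions) and CIS mode has additional prerequisites:

```bash
# Turn on RKE2's CIS hardening profile via the standard config file and
# restart. (CIS mode also expects e.g. an etcd user and certain kernel
# parameters to be present; see the RKE2 hardening guide.)
cat <<'EOF' >> /etc/rancher/rke2/config.yaml
profile: "cis"
EOF
systemctl restart rke2-server.service
```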
-
Request from Rancher Federal: provide a means to migrate from K3s to RKE2. This can likely be done without much fuss today with snapshotting.
We will need to test and identify any issues with this, and it will be necessary to document the method to migrate from K3s to RKE2.
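A very rough sketch of what the snapshot route could look like. This is untested speculation based on the existing k3s etcd-snapshot and rke2 cluster-reset commands; validating whether an RKE2 server can actually restore a K3s snapshot is exactly the work this issue asks for:

```bash
# Untested sketch of the snapshot idea; assumes K3s runs with embedded etcd.
k3s etcd-snapshot save --name pre-rke2-migration

# Install RKE2 on the node, then attempt to restore that snapshot.
# The file name below is a placeholder; k3s appends the node name and a
# timestamp to the snapshot name.
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/pre-rke2-migration-<node>-<timestamp>
```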