clusterctl upgrade Timed out waiting for all Machines to exist
#11209
Am I understanding this correctly that "cleaning up so we can re-provision from a clean state" is coming from Docker? To be honest, I'm not sure if we should just increase the timeout and then accept that we can only create Machines after multiple tries. This seems like just ignoring the root cause. I would prefer if we could try to correlate this #11133 (comment) with the time when the number of flakes increased. If there is a correlation I would like to try reverting the concurrency change first. I suspect that it is the root cause of many of our issues. If that's correct, it would be a lot better to fix that root cause than to try to get the flakes green by retrying more often (or giving retries more time). Basically we already know that we encounter this issue frequently: Failed to create control group inotify object: Too many open files. I think this can fail our tests in a lot of places.
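The "Failed to create control group inotify object: Too many open files" error usually points at the host's fs.inotify limits being exhausted when many kind/CAPD node containers run in parallel. As a reference, here is a minimal diagnostic sketch (not part of the test suite; it assumes a Linux host) that prints the two limits involved:

```go
// Prints the host inotify limits that the "Too many open files" error from
// systemd inside the node containers typically runs into.
// Minimal diagnostic sketch, assuming a Linux host; not part of the CAPI e2e suite.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	for _, path := range []string{
		"/proc/sys/fs/inotify/max_user_instances",
		"/proc/sys/fs/inotify/max_user_watches",
	} {
		b, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintf(os.Stderr, "could not read %s: %v\n", path, err)
			continue
		}
		fmt.Printf("%s = %s\n", path, strings.TrimSpace(string(b)))
	}
}
```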
Yes, from what I can tell.
yeah sure, that sounds reasonable to me; will review again and try to pinpoint when this started happening (closed the timeout increase PR)
If we don't have clear data, I would definitely also be fine with rolling back the concurrency change and then checking whether that significantly improves the flake situation.
This particular message appears to predate the concurrency increase. It seems like it was present but quite rare up until 2024-07-31, when it begins to occur more frequently: https://storage.googleapis.com/k8s-triage/index.html?date=2024-08-07&text=Timed%20out%20waiting%20for%20all%20Machines%20to%20exist&job=.*periodic-cluster-api-e2e.*&test=.*clusterctl%20upgrades
So it happened before, but way more often since we increased concurrency?
the trend seems to support it, but maybe not directly? The triage results for each date were shown inline: going back to 2024-08-19, before the concurrency change; moving to 2024-09-12 (a week after the concurrency change was backported to release-1.8); a bit further, 2024-09-20 (at this point the whole lookback window encapsulates tests after the concurrency change); and now today, 2024-09-24.
I'm not sure if there were also some infra migrations that might make the whole thing a bit fuzzy. What do you think about reverting concurrency on main & release-1.8 for a week and then checking if that gives us clear data?
sounds good to me
You'll open the PRs, I approve? :)
/triage accepted
@cahillsf So far the results seem to be pretty clear (even slightly better when release-1-6, which we didn't touch, is excluded).
yes they do indeed! Is there some other follow-up work we can pursue to reintroduce the greater concurrency without these negative impacts, or are we kind of "stuck" with 3 nodes? I guess before introducing more variables, it might be worth it to "refresh" our understanding of the flakes that exist now that we have a cleaner signal, to see if some of them might be actual bugs versus side effects of resource constraints in the test infra. What do you think is the best next step?
I would keep the concurrency as is for now. Let's see what flakes remain and then try to fix them. I'm also not sure if we can do much to increase the concurrency (maybe we can slightly increase it without getting new flakes, but not much). It seems like we're simply hitting limits of the infrastructure when we run too much in parallel (IIRC we're not even close to hitting CPU limits though). In general, we really need pretty stable CI for the next 1-2 months until code freeze. We have a few critical things upcoming that could lead to some flakes that might not be easy to spot (so we need good visibility for new flakes :)).
@sbueringer taking another look at this message specifically, it seems like we have 0 occurrences of it since the concurrency was dropped back down: https://storage.googleapis.com/k8s-triage/index.html?date=2024-09-30&text=Timed%20out%20waiting%20for%20all%20Machines%20to%20exist&job=.*periodic-cluster-api-e2e.*&test=.*clusterctl%20upgrades
Not sure if you want to give it some more time, or if we can just close this issue?
Fine for me to close
cool -- thanks for all your help on this 🙇
/close
@cahillsf: Closing this issue.
/reopen
Seems this has cropped back up: https://storage.googleapis.com/k8s-triage/index.html?date=2024-12-19&job=.*periodic-cluster-api-e2e.*&test=.*clusterctl%20upgrades
/reopen |
@cahillsf: Reopened this issue.
Which jobs are flaking?
periodic-cluster-api-e2e-main
Which tests are flaking?
When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers
specifically looking at this pattern:
36 failures:
Timed out waiting for all Machines to exist
Since when has it been flaking?
for quite some time
Testgrid link
https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main
Reason for failure (if possible)
TLDR: seems like there is an issue in the Docker controller when creating the worker machines, with many logs like:
dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state"
The failures follow different patterns
I think increasing the clusterctl-upgrade/wait-worker-nodes e2e interval is a good first step, as some of the examples below show the retries in creating the DockerMachines appearing to resolve, but the tests time out. The errors don't provide much info, but the issue seems to stem from the docker container runtime RunContainer call (cluster-api/test/infrastructure/docker/internal/docker/manager.go, line 182 in 879617d).
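For context on the pattern in that log line, here is a minimal, self-contained sketch (not the actual CAPD implementation; the helper names, image tag, and timeout are illustrative) of "create the node container with a timeout and, on failure, clean up so the next reconcile can re-provision from a clean state":

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runContainer is a stand-in for CAPD's container creation (a docker run of a
// kindest/node image); the image tag and flags here are illustrative only.
func runContainer(ctx context.Context, name string) error {
	return exec.CommandContext(ctx, "docker", "run", "--detach", "--name", name,
		"kindest/node:v1.31.0").Run()
}

// deleteContainer is a best-effort cleanup, mirroring the
// "cleaning up so we can re-provision from a clean state" log message.
func deleteContainer(name string) {
	_ = exec.Command("docker", "rm", "-f", name).Run()
}

func createWorkerMachine(parent context.Context, name string) error {
	// Illustrative provisioning timeout; if creation does not finish in time,
	// the container is removed and the next reconcile starts from scratch.
	ctx, cancel := context.WithTimeout(parent, 5*time.Minute)
	defer cancel()

	if err := runContainer(ctx, name); err != nil {
		deleteContainer(name)
		return fmt.Errorf("failed to create worker DockerMachine %s: %w", name, err)
	}
	return nil
}

func main() {
	if err := createWorkerMachine(context.Background(), "example-worker"); err != nil {
		fmt.Println(err)
	}
}
```

If the wait interval is shorter than the time a few of these create/cleanup/retry cycles take, the Machines never show up before "Timed out waiting for all Machines to exist" fires, which is presumably why bumping clusterctl-upgrade/wait-worker-nodes only buys time rather than addressing the slowness itself.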
Example failure 1 (expected 2, found 1)
(When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers): prow link
capd controller logs show:
I0831 03:46:04.834929 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-q53g9g" "docker-cluster"="clusterctl-upgrade-workload-q53g9g" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-q53g9g-md-0-lfnm6"} "machine"="clusterctl-upgrade-workload-q53g9g-md-0-5fb876f4bc-cq8k6"
in this case the CP machine is fully initialized, but the MD worker machine status shows:
Example failure 2 (expected 2, found 1)
(When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers): prow link
capd controller logs show:
I0831 04:16:20.583537 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-md-0-sr89b"} "machine"="clusterctl-upgrade-workload-merjuc-md-0-856f7cbb7c-fjfdl"
in this failed test, we also see several failed creations of the control plane machine:
I0831 04:12:46.050065 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-control-plane-76jpw"} "machine"="clusterctl-upgrade-workload-merjuc-control-plane-kw497"
most have the same error; the other CP machine creation failure looks like this:
I0831 04:13:25.895170 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: command \"docker run --detach --tty --privileged --security-opt seccomp=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules:ro --hostname clusterctl-upgrade-workload-merjuc-control-plane-kw497 --network kind --name clusterctl-upgrade-workload-merjuc-control-plane-kw497 --label io.x-k8s.kind.cluster=clusterctl-upgrade-workload-merjuc --label io.x-k8s.kind.role=control-plane --expose 44139 --volume=/var/run/docker.sock:/var/run/docker.sock --publish=127.0.0.1:44139:6443/TCP kindest/node:v1.22.17@sha256:9af784f45a584f6b28bce2af84c494d947a05bd709151466489008f80a9ce9d5\" failed with error: exit status 125, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-control-plane-76jpw"} "machine"="clusterctl-upgrade-workload-merjuc-control-plane-kw497"
these appear to resolve and the provider ID gets set on the CP node:
I0831 04:15:55.173110 1 machine.go:337] controllers/DockerMachine/DockerMachine-controller "msg"="Setting Kubernetes node providerID" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-control-plane-76jpw"} "machine"="clusterctl-upgrade-workload-merjuc-control-plane-kw497"
in this case, the MD DockerMachine is still Waiting for BootstrapData
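As background for the Waiting for BootstrapData status: the DockerMachine reconciler only starts provisioning the node container once the owning Machine has a bootstrap data secret. Below is a simplified sketch of that gate; Spec.Bootstrap.DataSecretName is the real Cluster API Machine field, while the function and reason string around it are illustrative:

```go
package main

import (
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// shouldProvision is an illustrative gate: the node container is not created
// until bootstrap data exists for the owning Machine. For worker Machines this
// typically means waiting for the control plane to be initialized so that a
// join configuration can be generated.
func shouldProvision(machine *clusterv1.Machine) (bool, string) {
	if machine.Spec.Bootstrap.DataSecretName == nil {
		return false, "Waiting for BootstrapData"
	}
	return true, ""
}

func main() {
	m := &clusterv1.Machine{} // no bootstrap data secret set yet
	ok, reason := shouldProvision(m)
	fmt.Println(ok, reason) // false Waiting for BootstrapData
}
```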
the CP DockerMachine transitioned to Ready only 12 seconds before the test timed out.
Example failure 3 (expected 2, found 0)
([It] When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers): prow link
in this case, it seems that the MD is waiting for the control plane (https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1829446304858640384/artifacts/clusters/clusterctl-upgrade-management-hywi70/resources/clusterctl-upgrade/DockerMachine/clusterctl-upgrade-workload-66mpos-md-0-ltqf6.yaml):
looks like all the "Failed to create worker DockerMachine" errors in the CAPD controller logs in this case relate to the control plane: https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1829446304858640384/artifacts/clusters/clusterctl-upgrade-management-hywi70/logs/capd-system/capd-controller-manager/capd-controller-manager-fb4b578f9-rzpq9/manager.log:10:06:52.444
Anything else we need to know?
No response
Label(s) to be applied
/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.