Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve readiness polling on node startup #11878

Merged
merged 6 commits into from
Mar 24, 2025

Conversation

brandond
Copy link
Member

@brandond brandond commented Mar 4, 2025

Proposed Changes

This PR eliminates the scattered creation of ready channels that are passed around via struct fields. Ready channels for all components started by the Executor are now exposed by the Executor, providing a consistent interface that different portions of the codebase can use to wait for components to be available.

  • Add context to agent token validation error
    Makes the error message nice when agents are retrying retrieval of cacerts from cluster
  • Increase log output while waiting for apiserver ready.
    We were never actually logging anything here, as we would always take the result.Error() path whenever the apiserver isn't ready - and log nothing.
  • Move apiserver ready wait into common channel
    Splits server startup into prepare/start phases. Server's agent is now started after server is prepared, but before the datastore and control-plane components are started. This allows us to properly bootstrap the executor before starting server components, and use the executor to provide a shared channel to wait on apiserver readiness.
    This allows us to replace four separate callers of WaitForAPIServerReady with reads from a common ready channel.
  • Move container runtime ready channel into executor
    Move the container runtime ready channel into the executor interface, instead of passing it awkwardly between server and agent config structs. This is for parity with the apiserver ready channel, and also adds validation that user-provided CRI services are up, something that was previously only done for containerd and cri-dockerd.
  • Move container etcd ready channel into executor
    This eliminates the final channel that was being passed around in an internal struct. The ETCD management code passes in a func that can be polled until etcd is ready; the executor is responsible for polling this after etcd is started and closing the etcd ready channel at the correct time.

Types of Changes

tech debt
bugfix

Verification

Check logs

Testing

RKE2 validation:

Linked Issues

User-Facing Change


Further Comments

@brandond brandond requested a review from a team as a code owner March 4, 2025 02:15
Copy link

codecov bot commented Mar 4, 2025

Codecov Report

Attention: Patch coverage is 74.00000% with 65 lines in your changes missing coverage. Please review.

Project coverage is 45.05%. Comparing base (76c5c77) to head (3571f24).
Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
pkg/agent/run.go 48.14% 12 Missing and 2 partials ⚠️
pkg/daemons/executor/embed.go 72.72% 5 Missing and 4 partials ⚠️
pkg/cli/server/server.go 71.42% 5 Missing and 3 partials ⚠️
pkg/agent/cri/cri.go 66.66% 4 Missing and 2 partials ⚠️
pkg/etcd/etcd.go 45.45% 6 Missing ⚠️
pkg/daemons/control/server.go 70.58% 3 Missing and 2 partials ⚠️
pkg/agent/tunnel/tunnel.go 84.61% 4 Missing ⚠️
pkg/cluster/cluster.go 71.42% 4 Missing ⚠️
pkg/cli/agent/agent.go 40.00% 2 Missing and 1 partial ⚠️
pkg/util/api.go 85.71% 2 Missing and 1 partial ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11878      +/-   ##
==========================================
+ Coverage   39.64%   45.05%   +5.41%     
==========================================
  Files         189      188       -1     
  Lines       19091    19068      -23     
==========================================
+ Hits         7568     8592    +1024     
+ Misses      10305     9230    -1075     
- Partials     1218     1246      +28     
Flag Coverage Δ
e2etests 35.93% <67.20%> (+1.35%) ⬆️
inttests 35.40% <67.60%> (?)
unittests 16.85% <4.80%> (+1.69%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@brandond brandond force-pushed the improve-apiserver-wait-log branch 3 times, most recently from 792ebc8 to 40c53cb Compare March 4, 2025 05:34
manuelbuil
manuelbuil previously approved these changes Mar 4, 2025
dereknola
dereknola previously approved these changes Mar 4, 2025
@brandond brandond dismissed stale reviews from dereknola and manuelbuil via 39e7baf March 4, 2025 18:40
@brandond brandond force-pushed the improve-apiserver-wait-log branch 2 times, most recently from 39e7baf to 89fca8c Compare March 4, 2025 18:42
@brandond brandond requested review from dereknola and manuelbuil March 4, 2025 20:08
VestigeJ
VestigeJ previously approved these changes Mar 4, 2025
@brandond brandond force-pushed the improve-apiserver-wait-log branch from 89fca8c to 1210f94 Compare March 5, 2025 06:52
@brandond brandond requested a review from VestigeJ March 5, 2025 06:52
rbrtbnfgl
rbrtbnfgl previously approved these changes Mar 5, 2025
@brandond brandond marked this pull request as draft March 5, 2025 16:03
@brandond
Copy link
Member Author

brandond commented Mar 5, 2025

I want to do a bit more here, moving to draft.

@brandond brandond force-pushed the improve-apiserver-wait-log branch 2 times, most recently from 0b5f039 to 538b42d Compare March 5, 2025 22:16
@brandond brandond changed the title Increase log output while waiting for apiserver ready Improve apiserver readiness polling on node startup Mar 5, 2025
@brandond brandond marked this pull request as ready for review March 5, 2025 22:17
@brandond brandond marked this pull request as draft March 6, 2025 06:01
@brandond brandond force-pushed the improve-apiserver-wait-log branch from 538b42d to bd0f228 Compare March 6, 2025 10:23
@brandond brandond changed the title Improve apiserver readiness polling on node startup Improve readiness polling on node startup Mar 6, 2025
)

const maxMsgSize = 1024 * 1024 * 16

// Connection connects to a CRI socket at the given path.
func Connection(ctx context.Context, address string) (*grpc.ClientConn, error) {
Copy link
Member Author

@brandond brandond Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function was the exact same for both windows and linux, except for the socketPrefix and an alias on the util package import.

@brandond brandond force-pushed the improve-apiserver-wait-log branch from bd0f228 to b77457b Compare March 6, 2025 20:31
@brandond brandond force-pushed the improve-apiserver-wait-log branch from b77457b to 3901464 Compare March 6, 2025 22:30
@brandond brandond force-pushed the improve-apiserver-wait-log branch 3 times, most recently from 980e2fb to e8d4ed3 Compare March 15, 2025 00:30
@brandond brandond marked this pull request as ready for review March 15, 2025 00:41
@brandond brandond force-pushed the improve-apiserver-wait-log branch from e8d4ed3 to 8c45359 Compare March 15, 2025 02:13
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Increases log verbosity but decreases polling frequency to avoid
spamming the console. It usually takes a couple seconds for the
apiserver to come up anyway.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Splits server startup into prepare/start phases. Server's agent is now
started after server is prepared, but before it is started. This allows
us to properly bootstrap the executor before starting server components,
and use the executor to provide a shared channel to wait on apiserver
readiness.

This allows us to replace four separate callers of WaitForAPIServerReady
with reads from a common ready channel.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Move the container runtime ready channel into the executor interface, instead of passing it awkwardly between server and agent config structs

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
This eliminates the final channel that was being passed around in an internal struct. The ETCD management code passes in a func that can be polled until etcd is ready; the executor is responsible for polling this after etcd is started and closing the etcd ready channel at the correct time.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@brandond brandond force-pushed the improve-apiserver-wait-log branch from 8c45359 to 3571f24 Compare March 20, 2025 18:55
@brandond brandond requested review from rbrtbnfgl and a team March 20, 2025 19:32
@brandond brandond merged commit d45006b into k3s-io:master Mar 24, 2025
54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants