Gracefully drain connections when stopping the gateway #1203
Conversation
…public listeners (and exit anyway if it exceeds terminationGracePeriodSeconds)
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff           @@
##             main    #1203      +/-   ##
==========================================
+ Coverage   60.63%   62.31%    +1.67%
==========================================
  Files          24       24
  Lines        2002     1632      -370
==========================================
- Hits         1214     1017      -197
+ Misses        726      553      -173
  Partials       62       62

☔ View full report in Codecov by Sentry.
/retest
Hi @norbjd, thanks again for your work and patience. I like the approach taken by the Contour project (projectcontour/contour#145), as it does not depend on configuring timeouts. Istio has implemented a similar approach with istio/istio#35059. I see some steps to get there, to some extent: a) a documented design like https://github.com/projectcontour/contour/blob/main/design/envoy-shutdown.md. For waiting for active connections, I like the Go approach compared to bash logic, because it seems a bit "dirtier" in bash; a bash approach was also proposed here: projectcontour/contour#2177 (comment). In the meantime we can make things configurable. Thus we could do:
Btw I agree that the envoy image only contains perl, so no nc, socat, curl etc 🤷. However you can expose the admin interface.
I guess we want to notify clients asap (like you have it in the PR).
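To make the Contour-style idea above concrete, here is a minimal Go sketch of a wait-for-active-connections loop, in the spirit of Contour's envoy-shutdown design. It is illustration only: the admin address (`localhost:9901`), the `/stats?filter=downstream_cx_active` query, and the 45-second deadline are assumptions for the sketch, not values taken from Kourier or Contour.

```go
// Sketch only: wait until Envoy reports no active downstream connections
// (or a deadline expires) before letting the pod shut down.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// activeConnections sums the downstream_cx_active gauges reported by the
// Envoy admin /stats endpoint, skipping the admin listener's own connections.
func activeConnections(adminAddr string) (int, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s/stats?filter=downstream_cx_active", adminAddr))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}

	total := 0
	for _, line := range strings.Split(string(body), "\n") {
		// Lines look like "http.ingress_http.downstream_cx_active: 12".
		parts := strings.Split(line, ":")
		if len(parts) != 2 || strings.Contains(parts[0], "admin") {
			continue
		}
		if n, err := strconv.Atoi(strings.TrimSpace(parts[1])); err == nil {
			total += n
		}
	}
	return total, nil
}

func main() {
	const adminAddr = "localhost:9901" // assumed admin address
	deadline := time.Now().Add(45 * time.Second)

	for time.Now().Before(deadline) {
		n, err := activeConnections(adminAddr)
		if err == nil && n == 0 {
			return // no more active connections, safe to exit
		}
		time.Sleep(time.Second)
	}
	// Deadline exceeded: give up and let terminationGracePeriodSeconds take over.
}
```

The idea is the same as Contour's shutdown-manager: keep the pod alive (for example from a preStop hook) until Envoy reports zero active downstream connections, or until a deadline tied to terminationGracePeriodSeconds expires.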
@norbjd gentle ping. Do you have cycles to drive the effort here?
@norbjd hi, please let me know if you have the cycles to work on the PR.
@norbjd gentle ping! |
Woops, really sorry, I've been busy at work these last few weeks. I'll check again right now! Thanks for the detailed answer :)
Yes please :) In general I think that approach is good. Could you please also:
@norbjd gentle ping
…-requests-are-finished
Patch env and terminationGracePeriodSeconds at the same time
- avoids conflicts with other tests
- change gracefulshutdown test to delete all gateway pods
/retest
Thanks for the review, I have fixed the last requested changes ✔️ norbjd/net-kourier@7170cb6...gateway-prestop-hook-wait-until-all-incoming-requests-are-finished
TL;DR:
/lgtm
/approve
Thanks for doing this @norbjd
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: norbjd, ReToCode. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Gracefully drain connections when stopping the gateway (knative-extensions#1203)
* Fix gateway's preStop hook: curl does not exist (anymore?) in envoy image
* Before stopping the gateway, wait until requests are finished on all public listeners (and exit anyway if it exceeds terminationGracePeriodSeconds)
* Drain listeners with appropriate endpoint
* Simpler drain + sleep
* Remove PARENT_SHUTDOWN_TIME_SECONDS and terminationGracePeriodSeconds
* Use a perl script (no need to open the admin HTTP interface!)
* Use bash instead of perl in preStop hook
* Review @skonto comments: use socket address for admin cluster
* [WIP] add graceful shutdown test and tweak CI to just run that test
* [WIP] Fix gracefulshutdown_test.go
* [WIP] try to fix race condition and lint
* [WIP] use initialTimeout + debug
* [WIP] fix gracefulshutdown_test.go logic
* [WIP] refacto and add some comments to clarify
* [WIP] fix lint
* [WIP] reintroduce kind-e2e-upgrade.yaml
* [WIP] add test case when request takes a little longer than the drain time
* [WIP] fix compilation issue
* [WIP] FIx compilation issue (again)
* [WIP] hopefully fix data race
* [WIP] refacto and hopefully fix race condition (use sync.Map)
* [WIP] fix compilation issue
* [WIP] Handle EOF
* [WIP] check gateway pod has been removed + manual debugging
* [WIP] debugging
* [WIP] more debugging
* [WIP] more debugging
* [WIP] increase livenessProbe failure threshold as I'm not sure it should return EOF
* [WIP] remove debugging related stuff
* Revert all unnecessary changes made for testing
* Revert unnecessary change (livenessProbe)
* Scale to 1 replica
* Typo
* Run gracefulshutdown test first (speed up feedback loop)
* Add a comment for terminationGracePeriodSeconds
* Don't update deployment twice: patch env and terminationGracePeriodSeconds at the same time
* Fix bad patch
* Run gracefulshutdown test at the end: avoids conflicts with other tests; change gracefulshutdown test to delete all gateway pods
* Fix gracefulshutdown test
* Fix gracefulshutdown test
* Lint
* [SRVKS-1171] [RELEASE-1.12] Graceful shutdown (#130): includes Gracefully drain connections when stopping the gateway (knative-extensions#1203) with the same commit list as above, plus run hack/update-deps.sh, update openshift files, and update deps. Co-authored-by: norbjd <norbjd@users.noreply.github.com>
Changes
/kind enhancement
This PR supersedes #1200 (click for more details, but TL;DR: today's `preStop` hook doesn't work, because `curl` is not installed in the gateway image).

When the gateway stops, we want to gracefully drain existing connections to prevent in-flight requests getting rejected with 5xx errors. To do so, we just need to drain the listeners using the Envoy admin interface and the `/drain_listeners?graceful` endpoint.

The `/drain_listeners?graceful` endpoint starts the graceful drain process, but returns immediately. The drain process is configured with `--drain-time-s $(DRAIN_TIME_SECONDS) --drain-strategy immediate`, so the drain will be finished after `DRAIN_TIME_SECONDS`. All connections that couldn't finish after `DRAIN_TIME_SECONDS` will be automatically closed, but this can be configured through the environment variable. People might also need to update `terminationGracePeriodSeconds` depending on their use case (default: 30s), as it only makes sense to have `terminationGracePeriodSeconds >= DRAIN_TIME_SECONDS`.
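For readers who want the mechanics spelled out, the Go sketch below shows the sequence the `preStop` hook performs, purely for illustration (in the PR itself this is a small shell snippet in the gateway container, not a Go program). The admin address `localhost:9901` and the `DRAIN_TIME_SECONDS` fallback of 15 are assumptions.

```go
// Sketch only: the drain sequence performed before the gateway receives SIGTERM.
package main

import (
	"net/http"
	"os"
	"strconv"
	"time"
)

func main() {
	adminAddr := "http://localhost:9901" // assumed Envoy admin address

	drainSeconds := 15 // assumed fallback for DRAIN_TIME_SECONDS
	if v, err := strconv.Atoi(os.Getenv("DRAIN_TIME_SECONDS")); err == nil {
		drainSeconds = v
	}

	// Start the graceful drain: Envoy returns immediately, then drains its
	// listeners over --drain-time-s using the configured drain strategy.
	resp, err := http.Post(adminAddr+"/drain_listeners?graceful", "text/plain", nil)
	if err == nil {
		resp.Body.Close()
	}

	// Keep the preStop hook alive until the drain window has elapsed, so the
	// container only receives SIGTERM once in-flight requests had a chance to finish.
	time.Sleep(time.Duration(drainSeconds) * time.Second)
}
```

Because the hook blocks for the full drain window, `terminationGracePeriodSeconds` must be at least `DRAIN_TIME_SECONDS`, matching the constraint described above.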
This change has been tested manually (in addition to the new e2e test), and the logs look like the following. Here, I've stopped the gateway manually (`kubectl delete pod`) at around `2024-02-02T19:17:22`, while there were some in-flight requests. The drain started just after (around `2024-02-02T19:17:23`). Then, we can see that `SIGTERM` was sent 15 seconds later (time for the `preStop` hook to finish, as there is a `sleep` for `DRAIN_TIME_SECONDS` at the end). During those 15 seconds, my in-flight requests were able to finish properly.

Fixes #1118.
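As a rough illustration of what such an e2e check verifies (this is not the actual test code from the PR), a graceful-shutdown test can be sketched as: fire a slow request, delete the gateway pods while it is in flight, and assert the request still completes without a 5xx. The service URL and the pod-deletion helper below are placeholders.

```go
// Sketch only: the shape of a graceful-shutdown e2e check. A real test would
// target a slow Knative service and use client-go to delete the gateway pods.
package gracefulshutdown

import (
	"fmt"
	"net/http"
	"testing"
	"time"
)

// deleteGatewayPods is a placeholder for deleting the gateway pods while a
// request is still in flight.
func deleteGatewayPods(t *testing.T) {
	t.Helper()
}

func TestGracefulShutdown(t *testing.T) {
	const serviceURL = "http://slow-service.example.com" // placeholder URL

	client := http.Client{Timeout: 60 * time.Second}
	errCh := make(chan error, 1)

	go func() {
		// In-flight request that outlives the moment the gateway pods are deleted.
		resp, err := client.Get(serviceURL)
		if err != nil {
			errCh <- err
			return
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 500 {
			errCh <- fmt.Errorf("got %d while the gateway was shutting down", resp.StatusCode)
			return
		}
		errCh <- nil
	}()

	// Give the request time to reach the gateway, then kill the gateway pods.
	time.Sleep(time.Second)
	deleteGatewayPods(t)

	if err := <-errCh; err != nil {
		t.Fatalf("in-flight request was not drained gracefully: %v", err)
	}
}
```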
Release Note