hydra-queue-runner: drop broken connections from pool #1370
Closes #1336
When restarting postgresql, the connections are still reused in `hydra-queue-runner`, causing errors like this and no more builds being processed.

`hydra-evaluator` doesn't have that issue since it crashes right away. We could let it retry indefinitely as well (see below), but I don't want to change too much.

If the DB is still unreachable 10s later, the process will stop with a non-zero exit code because of a missing DB connection. This however isn't such a big deal because it will be immediately restarted afterwards. With the current configuration, Hydra will never give up, but restart (and retry) infinitely. To me that seems reasonable, i.e. to retry DB connections on a long-running process. If this doesn't work out, the monitoring should fire anyways because the queue fills up, but I'm open to discuss that.

Please note that this isn't reproducible with the DB and the queue runner on the same machine when using `services.hydra-dev`, because of the `Requires=` dependency `hydra-queue-runner.service` -> `hydra-init.service` -> `postgresql.service` that causes the queue runner to be restarted on `systemctl restart postgresql`.

Internally, Hydra uses Nix's pool data structure: it basically has N slots (here DB connections), and whenever a new one is requested, an idle slot is provided or a new one is created (when all N slots are in use, the request waits until one is free). The issue in the code here is that whenever an error is encountered, the slot is released, but the same broken connection is reused the next time. By using `Pool::Handle::markBad`, Nix will drop a broken slot. This is now done whenever `pqxx::broken_connection` is caught.

cc @NixOS/infra
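
For reviewers unfamiliar with the pool, here is a rough, self-contained sketch of the idea behind `markBad`. It is a toy model, not Nix's actual `Pool` and not the code in this PR: `Connection`, `runOnce` and `idleCount` are hypothetical names, `std::runtime_error` stands in for `pqxx::broken_connection`, and the slot limit and the blocking wait are left out to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DB connection; this is not pqxx's API.
struct Connection {
    int id = 0;
    bool broken = false;
    void exec() {
        if (broken) throw std::runtime_error("broken connection");
    }
};

// Toy single-threaded pool, loosely modelled on the description above.
// Assumption: no maximum slot count and no waiting for a free slot.
class Pool {
    std::vector<std::unique_ptr<Connection>> idle;
    int nextId = 0;

public:
    class Handle {
        Pool & pool;
        std::unique_ptr<Connection> conn;
        bool bad = false;

    public:
        Handle(Pool & pool, std::unique_ptr<Connection> conn)
            : pool(pool), conn(std::move(conn)) {}
        Handle(Handle && other)
            : pool(other.pool), conn(std::move(other.conn)), bad(other.bad) {}
        Handle(const Handle &) = delete;

        // A healthy connection goes back to the idle list; a bad one is
        // simply destroyed together with the handle.
        ~Handle() {
            if (conn && !bad) pool.idle.push_back(std::move(conn));
        }

        void markBad() { bad = true; }

        Connection * operator->() {
            assert(conn);
            return conn.get();
        }
    };

    // Hands out an idle connection if there is one, else creates a new one.
    Handle get() {
        if (!idle.empty()) {
            auto c = std::move(idle.back());
            idle.pop_back();
            return Handle(*this, std::move(c));
        }
        auto c = std::make_unique<Connection>();
        c->id = nextId++;
        return Handle(*this, std::move(c));
    }

    std::size_t idleCount() const { return idle.size(); }
};

// Runs one unit of work. On failure the connection is marked bad so that
// it is dropped instead of being handed out again.
bool runOnce(Pool & pool, bool simulateOutage) {
    auto conn = pool.get();
    if (simulateOutage) conn->broken = true;
    try {
        conn->exec();
        return true;
    } catch (const std::runtime_error &) {
        conn.markBad();
        return false;
    }
}
```

Without the `markBad()` call in the `catch` block, the broken connection would go back to the idle list and be handed out again on the next `get()`, which matches the behaviour described above.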