Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hydra-queue-runner: drop broken connections from pool #1370

Merged
merged 1 commit into from
Mar 26, 2024

Conversation

Ma27
Copy link
Member

@Ma27 Ma27 commented Mar 15, 2024

Closes #1336

When restarting postgresql, the connections are still reused in hydra-queue-runner causing errors like this

main thread: Lost connection to the database server.
queue monitor: Lost connection to the database server.

and no more builds being processed.

hydra-evaluator doesn't have that issue since it crashes right away. We could let it retry indefinitely as well (see below), but I don't want to change too much.

If the DB is still unreachable 10s later, the process will stop with a non-zero exit code because of a missing DB connection. This however isn't such a big deal because it will be immediately restarted afterwards. With the current configuration, Hydra will never give up, but restart (and retry) infinitely. To me that seems reasonable, i.e. to retry DB connections on a long-running process. If this doesn't work out, the monitoring should fire anyways because the queue fills up, but I'm open to discuss that.

Please note that this isn't reproducible with the DB and the queue runner on the same machine when using services.hydra-dev, because of the Requires= dependency hydra-queue-runner.service -> hydra-init.service -> postgresql.service that causes the queue runner to be restarted on systemctl restart postgresql.

Internally, Hydra uses Nix's pool data structure: it basically has N slots (here DB connections) and whenever a new one is requested, an idle slot is provided or a new one is created (when N slots are active, it'll be waited until one slot is free). The issue in the code here is however that whenever an error is encountered, the slot is released, however the same broken connection will be reused the next time. By using Pool::Handle::markBad, Nix will drop a broken slot. This is now being done when pqxx::broken_connection was caught.

cc @NixOS/infra

Closes NixOS#1336

When restarting postgresql, the connections are still reused in
`hydra-queue-runner` causing errors like this

    main thread: Lost connection to the database server.
    queue monitor: Lost connection to the database server.

and no more builds being processed.

`hydra-evaluator` doesn't have that issue since it crashes right away.
We could let it retry indefinitely as well (see below), but I don't
want to change too much.

If the DB is still unreachable 10s later, the process will stop with a
non-zero exit code because of a missing DB connection. This however
isn't such a big deal because it will be immediately restarted
afterwards. With the current configuration, Hydra will never give up,
but restart (and retry) infinitely. To me that seems reasonable, i.e. to
retry DB connections on a long-running process. If this doesn't work
out, the monitoring should fire anyways because the queue fills up, but
I'm open to discuss that.

Please note that this isn't reproducible with the DB and the queue
runner on the same machine when using `services.hydra-dev`, because of
the `Requires=` dependency `hydra-queue-runner.service` ->
`hydra-init.service` -> `postgresql.service` that causes the queue
runner to be restarted on `systemctl restart postgresql`.

Internally, Hydra uses Nix's pool data structure: it basically has N
slots (here DB connections) and whenever a new one is requested, an idle
slot is provided or a new one is created (when N slots are active, it'll
be waited until one slot is free). The issue in the code here is however
that whenever an error is encountered, the slot is released, however the
same broken connection will be reused the next time. By using
`Pool::Handle::markBad`, Nix will drop a broken slot. This is now being
done when `pqxx::broken_connection` was caught.
@mweinelt
Copy link
Member

Deployed on hydra.nixos.org. We'll see on the next database reboot how well this will work.

@Ericson2314
Copy link
Member

Thanks @Ma27. This makes sense to me.

@Ericson2314 Ericson2314 merged commit d614163 into NixOS:master Mar 26, 2024
1 check passed
@Ma27 Ma27 deleted the reconnect-db branch March 26, 2024 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

hydra-queue-runner doesn't reconnect after a postgresql server restart, staying stuck forever
3 participants