Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple collators started on same machine conflict on prometheus exporter default port 9616 #5689

Closed
2 tasks done
iulianbarbu opened this issue Sep 12, 2024 · 3 comments · Fixed by #5572
Closed
2 tasks done
Assignees
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@iulianbarbu
Copy link
Contributor

iulianbarbu commented Sep 12, 2024

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

Description

I started zombienet natively with cumulus/zombienet/tests/0006_rpc-collator-builds-blocks.toml. The network topology has two collators with minimal relay chain nodes which try to setup prometheus by creating a thin hyper server that responds to /metrics requests to localhost:9616. The hyper server is separate from the similar hyper server for the parachain part of the same collator. Both collators will race towards binding to 9616, resulting in a failure for one of them, which is not necessarily obvious when looking at the logs of the collator which can not bind. The relaychain minimal node's prometheus exporter is somewhat hidden and can be found in the logs if successful when setting it up, or through open ports exploration, but either way, not very well documented and straightfoward to refer to.

To consider for a possible fix (maybe others too)

1. Expose the relay chain prometheus exporter ports through a CLI arg to be able to set it to specific unused ports. This is relevant for:
a. testing with multiple collators on the same machine.
b. node operators that want to set the relaychain prometheus exporter port to something else than 9616.
c. node operators/devs who don't know the relaychain prometheus exporter exists and find it by reading the code (or by reading the logs if its setup succeeds). If this is documented a bit more clearly in public docs this point can be ignored.
(We have support to pass a --prometheus-port flag to the relaychain part of the args received by a polkadot-parachain node already, which makes this point irrelevant).

2. We could also bind to a random unused port, don't act on 1 and let devs/operators rely on code reading and logs. It isn't very DX friendly. I would avoid this considering the first point would fix this the good way and it isn't very complex either. (First point is not relevant and this one doesn't matter anymore).

  1. Emit an error when the prometheus exporter listener binding fails with port in use. This is trivial and can be done here: cumulus/minimal-node: added prometheus metrics for the RPC client #5572. I will strike-through this part once we merge the PR.

Steps to reproduce

This will start only the relay chain minimal node's prometheus exporter of dave collator, while eve exporter fails with no way of telling why (and more problematic if not knowing that it should be started, given it can be found out only if reading the code* for the relay chain rpc interface setup).

zombienet -l text --dir 9 -f --provider native spawn polkadot-sdk/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml

* I am not completely sure if the relaychain rpc interface's prometheus exporter startup is documented clearly outside of code though. (there is a way to pass relaychain args including the prometheus port when starting a polkadot-parachain node)

@iulianbarbu iulianbarbu added I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known. labels Sep 12, 2024
@pepoviola
Copy link
Contributor

Hi @iulianbarbu, I think you can also pass the --prometheus-port flag to the relaychain part of the arguments (I will fix that in the native provider of zombienet). But also printing the error would be good.
Thx!

@iulianbarbu
Copy link
Contributor Author

iulianbarbu commented Sep 12, 2024

Hi @pepoviola ! Thanks for jumping in. Would be great if you can reference this issue on the associated issue/PR on zombienet. I can also take a look/help with it once my bandwidth allows it, if not fixed by then. It is not urgent to be fixed in zombienet either, I've opened this ticket to clarify what's the best way to go around this default 9616 for prometheus exporter of the relay chain rpc interface (when managed through zombienet initially, and to a bigger extent if it applies to other usecases).

Separately, I think I just understood the codepaths related to configuring the prometheus params for the relay chain part. Somehow the relay chain args make their way up to PrometheusParams, where based on another prometheus_port flag the prometheus config can be given a custom port specifically for the relay chain, which is different from the one used by the node for the parachain part. I will update the issue by removing the first point because I think this should be quite obvious in parachain nodes deployments for node devs/operators.

What should remain on this ticket is the the error logging, and opening an issue on zombienet side. Do you agree?

@iulianbarbu
Copy link
Contributor Author

iulianbarbu commented Sep 12, 2024

What should remain on this ticket is the the error logging, and opening an issue on zombienet side. Do you agree?

Seems you already did it. Just the error logging then.

@iulianbarbu iulianbarbu self-assigned this Sep 12, 2024
github-merge-queue bot pushed a commit that referenced this issue Sep 19, 2024
)

# Description

When we start a node with connections to external RPC servers (as a
minimal node), we lack metrics around how many individual calls we're
doing to the remote RPC servers and their duration. This PR adds metrics
that measure durations of each RPC call made by the minimal nodes, and
implicitly how many calls there are.

Closes #5409 
Closes #5689

## Integration

Node operators should be able to track minimal node metrics and decide
appropriate actions according to how the metrics are interpreted/felt.
The added metrics can be observed by curl'ing the prometheus metrics
endpoint for the ~relaychain~ parachain (it was changed based on the
review). The metrics are represented by
~`polkadot_parachain_relay_chain_rpc_interface`~
`relay_chain_rpc_interface` namespace (I realized lining up
`parachain_relay_chain` in the same metric might be confusing :).
Excerpt from the curl:

```
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.001"} 15
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.004"} 23
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.016"} 23
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.064"} 23
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.256"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="1.024"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="4.096"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="16.384"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="65.536"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="+Inf"} 24
relay_chain_rpc_interface_sum{method="chain_getBlockHash",chain="rococo_local_testnet"} 0.11719075
relay_chain_rpc_interface_count{method="chain_getBlockHash",chain="rococo_local_testnet"} 24
```

## Review Notes

The way we measure durations/hits is based on `HistogramVec` struct
which allows us to collect timings for each RPC client method called
from the minimal node., It can be extended to measure the RPCs against
other dimensions too (status codes, response sizes, etc). The timing
measuring is done at the level of the `relay-chain-rpc-interface`, in
the `RelayChainRpcClient` struct's method 'request_tracing'. A single
entry point for all RPC requests done through the
relay-chain-rpc-interface. The requests durations will fall under
exponential buckets described by start `0.001`, factor `4` and count
`9`.

---------

Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants