Multiple collators started on same machine conflict on prometheus exporter default port 9616 #5689

iulianbarbu · 2024-09-12T12:16:33Z

Is there an existing issue?

I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

This is not a support question.

Description of bug

Description

I started zombienet natively with cumulus/zombienet/tests/0006_rpc-collator-builds-blocks.toml. The network topology has two collators with minimal relay chain nodes which try to setup prometheus by creating a thin hyper server that responds to /metrics requests to localhost:9616. The hyper server is separate from the similar hyper server for the parachain part of the same collator. Both collators will race towards binding to 9616, resulting in a failure for one of them, which is not necessarily obvious when looking at the logs of the collator which can not bind. The relaychain minimal node's prometheus exporter is somewhat hidden and can be found in the logs if successful when setting it up, or through open ports exploration, but either way, not very well documented and straightfoward to refer to.

To consider for a possible fix (maybe others too)

1. Expose the relay chain prometheus exporter ports through a CLI arg to be able to set it to specific unused ports. This is relevant for:
a. testing with multiple collators on the same machine.
b. node operators that want to set the relaychain prometheus exporter port to something else than 9616.
c. node operators/devs who don't know the relaychain prometheus exporter exists and find it by reading the code (or by reading the logs if its setup succeeds). If this is documented a bit more clearly in public docs this point can be ignored. (We have support to pass a --prometheus-port flag to the relaychain part of the args received by a polkadot-parachain node already, which makes this point irrelevant).

2. We could also bind to a random unused port, don't act on 1 and let devs/operators rely on code reading and logs. It isn't very DX friendly. I would avoid this considering the first point would fix this the good way and it isn't very complex either. (First point is not relevant and this one doesn't matter anymore).

Emit an error when the prometheus exporter listener binding fails with port in use. This is trivial and can be done here: cumulus/minimal-node: added prometheus metrics for the RPC client #5572. I will strike-through this part once we merge the PR.

Steps to reproduce

This will start only the relay chain minimal node's prometheus exporter of dave collator, while eve exporter fails with no way of telling why (and more problematic if not knowing that it should be started, given it can be found out only if reading the code* for the relay chain rpc interface setup).

zombienet -l text --dir 9 -f --provider native spawn polkadot-sdk/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml

* I am not completely sure if the relaychain rpc interface's prometheus exporter startup is documented clearly outside of code though. (there is a way to pass relaychain args including the prometheus port when starting a polkadot-parachain node)

The text was updated successfully, but these errors were encountered:

pepoviola · 2024-09-12T13:43:29Z

Hi @iulianbarbu, I think you can also pass the --prometheus-port flag to the relaychain part of the arguments (I will fix that in the native provider of zombienet). But also printing the error would be good.
Thx!

iulianbarbu · 2024-09-12T14:10:13Z

Hi @pepoviola ! Thanks for jumping in. Would be great if you can reference this issue on the associated issue/PR on zombienet. I can also take a look/help with it once my bandwidth allows it, if not fixed by then. It is not urgent to be fixed in zombienet either, I've opened this ticket to clarify what's the best way to go around this default 9616 for prometheus exporter of the relay chain rpc interface (when managed through zombienet initially, and to a bigger extent if it applies to other usecases).

Separately, I think I just understood the codepaths related to configuring the prometheus params for the relay chain part. Somehow the relay chain args make their way up to PrometheusParams, where based on another prometheus_port flag the prometheus config can be given a custom port specifically for the relay chain, which is different from the one used by the node for the parachain part. I will update the issue by removing the first point because I think this should be quite obvious in parachain nodes deployments for node devs/operators.

What should remain on this ticket is the the error logging, and opening an issue on zombienet side. Do you agree?

iulianbarbu · 2024-09-12T14:27:03Z

What should remain on this ticket is the the error logging, and opening an issue on zombienet side. Do you agree?

Seems you already did it. Just the error logging then.

) # Description When we start a node with connections to external RPC servers (as a minimal node), we lack metrics around how many individual calls we're doing to the remote RPC servers and their duration. This PR adds metrics that measure durations of each RPC call made by the minimal nodes, and implicitly how many calls there are. Closes #5409 Closes #5689 ## Integration Node operators should be able to track minimal node metrics and decide appropriate actions according to how the metrics are interpreted/felt. The added metrics can be observed by curl'ing the prometheus metrics endpoint for the ~relaychain~ parachain (it was changed based on the review). The metrics are represented by ~`polkadot_parachain_relay_chain_rpc_interface`~ `relay_chain_rpc_interface` namespace (I realized lining up `parachain_relay_chain` in the same metric might be confusing :). Excerpt from the curl: ``` relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.001"} 15 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.004"} 23 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.016"} 23 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.064"} 23 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.256"} 24 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="1.024"} 24 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="4.096"} 24 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="16.384"} 24 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="65.536"} 24 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="+Inf"} 24 relay_chain_rpc_interface_sum{method="chain_getBlockHash",chain="rococo_local_testnet"} 0.11719075 relay_chain_rpc_interface_count{method="chain_getBlockHash",chain="rococo_local_testnet"} 24 ``` ## Review Notes The way we measure durations/hits is based on `HistogramVec` struct which allows us to collect timings for each RPC client method called from the minimal node., It can be extended to measure the RPCs against other dimensions too (status codes, response sizes, etc). The timing measuring is done at the level of the `relay-chain-rpc-interface`, in the `RelayChainRpcClient` struct's method 'request_tracing'. A single entry point for all RPC requests done through the relay-chain-rpc-interface. The requests durations will fall under exponential buckets described by start `0.001`, factor `4` and count `9`. --------- Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>

iulianbarbu added I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known. labels Sep 12, 2024

This was referenced Sep 12, 2024

[native] Ensure uniqueness for prometheus port (relay part) paritytech/zombienet-sdk#258

Open

[native] Ensure uniqueness for prometheus port (relay part) paritytech/zombienet#1868

Open

iulianbarbu mentioned this issue Sep 12, 2024

cumulus/minimal-node: added prometheus metrics for the RPC client #5572

Merged

iulianbarbu self-assigned this Sep 12, 2024

iulianbarbu closed this as completed in #5572 Sep 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple collators started on same machine conflict on prometheus exporter default port 9616 #5689

Multiple collators started on same machine conflict on prometheus exporter default port 9616 #5689

iulianbarbu commented Sep 12, 2024 •

edited

Loading

pepoviola commented Sep 12, 2024

iulianbarbu commented Sep 12, 2024 •

edited

Loading

iulianbarbu commented Sep 12, 2024 •

edited

Loading

Multiple collators started on same machine conflict on prometheus exporter default port 9616 #5689

Multiple collators started on same machine conflict on prometheus exporter default port 9616 #5689

Comments

iulianbarbu commented Sep 12, 2024 • edited Loading

Is there an existing issue?

Experiencing problems? Have you tried our Stack Exchange first?

Description of bug

Description

To consider for a possible fix (maybe others too)

Steps to reproduce

pepoviola commented Sep 12, 2024

iulianbarbu commented Sep 12, 2024 • edited Loading

iulianbarbu commented Sep 12, 2024 • edited Loading

iulianbarbu commented Sep 12, 2024 •

edited

Loading

iulianbarbu commented Sep 12, 2024 •

edited

Loading

iulianbarbu commented Sep 12, 2024 •

edited

Loading