[v1] Remove orphan sites? #1951
Comments
@xe-leon Thank you for the issue! I think I'd support making changes to this, but I am not really clear on what the correct behaviour is. This metric is only really useful if it does report down links during a disruption, so adding a threshold where the time series is deleted after some period could cause a possibly larger issue than the one we are solving here. Adding a way to delete that orphan link is interesting - I will give that some thought.

For now, I am afraid the only mitigation I have for you is restarting the flow-collector container inside of the skupper-service-controller deployment (or just the whole deployment).

Are you able to share details on what may make sense for your use case? i.e. how are you collecting these metrics (hijacking the skupper-prometheus deployment somehow, or simply scraping the skupper service with an external monitoring stack), what conditions does your alerting stack watch for, and about how frequently do you intentionally remove sites?

A quick analysis of what is going on in the code: The flow-collector container initially observes events from the site1 and site2 routers, reporting their links, and those are reflected in the active_links metric. When a remote site is deleted, the flow-collector first observes events from the site1 router(s) announcing the lost connections to the routers in site2, and then eventually (after a 60s timeout) notes that it has stopped receiving events from the routers in site2. The function that updates the active_links metric sets the metric to zero, but never deletes the time series:

skupper/pkg/flow/flow_mem_driver.go Lines 575 to 580 in c4f821e
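To illustrate the difference between zeroing a series and deleting it, here is a minimal sketch in Go. It does not reproduce the code in flow_mem_driver.go; `linkGauge` and its methods are hypothetical stand-ins for a labeled gauge, written with the standard library only. In the Prometheus Go client, the comparable operation on a `GaugeVec` would be `DeleteLabelValues`, but whether and where the flow-collector should call it is exactly the open question here.

```go
package main

import (
	"fmt"
	"sort"
)

// linkGauge is a hypothetical stand-in for a labeled gauge such as
// active_links. Each distinct sourceSite value is one exported time series.
type linkGauge struct {
	series map[string]float64
}

func newLinkGauge() *linkGauge {
	return &linkGauge{series: make(map[string]float64)}
}

// Set records a value for the site. The series stays exported even at zero.
func (g *linkGauge) Set(sourceSite string, v float64) {
	g.series[sourceSite] = v
}

// Delete drops the series entirely, so a scrape no longer sees it.
func (g *linkGauge) Delete(sourceSite string) {
	delete(g.series, sourceSite)
}

// Exported returns the sourceSite values a scrape would see, sorted.
func (g *linkGauge) Exported() []string {
	out := make([]string, 0, len(g.series))
	for site := range g.series {
		out = append(out, site)
	}
	sort.Strings(out)
	return out
}

func main() {
	g := newLinkGauge()
	g.Set("site1", 1)
	g.Set("site2", 1)

	// site2 goes away: zeroing the gauge keeps its series in the output.
	g.Set("site2", 0)
	fmt.Println("after Set(0):", g.Exported())

	// Deleting the series removes site2 from the output.
	g.Delete("site2")
	fmt.Println("after Delete:", g.Exported())
}
```

With only `Set(0)`, an alert on "0 links for this site" keeps matching the removed site indefinitely, which is the behaviour described in this issue.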
Restarting the skupper-service-controller deployment helped, but after that I had a weird number of sites (-1 indirectly).
I had to run revoke-access and issue a new token. After that the number of sites became correct and the stale site disappeared.
simply scraping /federate endpoint from
I have an alert that fires if a site has 0 incoming and 0 outgoing links.
Well, it happens sometimes, I would say a few times per year. But I might also want to remove a site if I deployed it by mistake or changed my mind.
Describe the bug
After a site disappeared without its services being properly unexposed (i.e., I deleted the whole skupper site), I still have active_links metrics with labels from this site (just with a value of 0). This triggers alerts, and I have no idea how to delete this stale site from skupper. The skupper link status command only shows active links, and there are no links from that stale site.
How To Reproduce
Steps to reproduce the behavior:
active_links metric with the neighbor site's label
active_links metric with label sourceSite=<site2uuid>
Expected behavior
A deleted site's labels disappear from the metrics after a threshold, or I have the ability to delete that orphan link.
Environment details