[v1] Remove orphan sites? #1951

Open
xe-leon opened this issue Feb 10, 2025 · 2 comments
Comments


xe-leon commented Feb 10, 2025

Describe the bug
After a site disappeared without its services being properly unexposed (i.e. I deleted the whole Skupper site), I still have active_links metrics carrying labels from that site (just with a value of 0). This triggers alerts, and I have no idea how to remove this stale site from Skupper.

The skupper link status command only shows active links, and there are no links from that stale site.

How To Reproduce
Steps to reproduce the behavior:

  1. Create and connect 2 sites in Skupper
  2. Start collecting metrics from the flow-collector and ensure you have active_links series carrying the neighbor site's label
  3. Delete the second site
  4. You still have an active_links series with the label sourceSite=<site2uuid> (a rough way to check this is sketched after this list)
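
A minimal sketch of that check, assuming the /federate endpoint of skupper-prometheus is reachable (e.g. via a port-forward) and that the label is named sourceSite as above; the address, port, and label handling are placeholders for illustration, not the exact setup from this issue:

// stalecheck.go: scrape the skupper-prometheus /federate endpoint and print
// every active_links series, so a deleted site's leftover series (stuck at 0)
// is easy to spot. URL and port are placeholders; adjust to your environment.
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"

	"github.com/prometheus/common/expfmt"
)

func main() {
	q := url.Values{}
	q.Set("match[]", "active_links")
	// Placeholder address, e.g. after: kubectl port-forward svc/skupper-prometheus 9090
	endpoint := "http://localhost:9090/federate?" + q.Encode()

	resp, err := http.Get(endpoint)
	if err != nil {
		log.Fatalf("scrape failed: %v", err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		log.Fatalf("parse failed: %v", err)
	}

	mf, ok := families["active_links"]
	if !ok {
		log.Fatal("no active_links series in the federate output")
	}
	for _, m := range mf.GetMetric() {
		labels := map[string]string{}
		for _, lp := range m.GetLabel() {
			labels[lp.GetName()] = lp.GetValue()
		}
		// Federated series may be exposed as untyped rather than gauge.
		value := m.GetGauge().GetValue()
		if m.GetUntyped() != nil {
			value = m.GetUntyped().GetValue()
		}
		fmt.Printf("%v = %v\n", labels, value)
	}
}

Any sourceSite value that belongs to a deleted site but still shows up in this output (stuck at 0) is the stale series described here.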

Expected behavior
The stale site's labels disappear from the metrics after some threshold, or I have the ability to delete that orphan link manually.

Environment details

  • Skupper CLI: 1.8.3
  • Skupper Operator (if applicable): 1.8.3
  • Platform: kubernetes
client version                 1.8.3
transport version              quay.io/skupper/skupper-router:2.7.3 (sha256:9e046cda37f8)
controller version             quay.io/skupper/service-controller:1.8.3 (sha256:bd7abfe26655)
config-sync version            quay.io/skupper/config-sync:1.8.3 (sha256:5cb86bebf0a6)
flow-collector version         quay.io/skupper/flow-collector:1.8.3 (sha256:73234bdfcd9f)


c-kruse (Contributor) commented Feb 11, 2025

@xe-leon Thank you for the issue! I think I'd support making changes here, but I'm not really clear on what the correct behaviour is. This metric is only really useful if it does report down links during a disruption, so adding a threshold after which the time series is deleted could cause a bigger problem than the one we are solving here. Adding a way to delete that orphan link is interesting - I will give it some thought. For now, I am afraid the only mitigation I have for you is restarting the flow-collector container inside the skupper-service-controller deployment (or just the whole deployment).

Are you able to share details on what would make sense for your use case? i.e. how are you collecting these metrics (hijacking the skupper-prometheus deployment somehow, or simply scraping the skupper service with an external monitoring stack), what conditions does your alerting stack watch for, and roughly how frequently do you intentionally remove sites?

A quick analysis of what is going on in the code: the flow-collector container initially observes events from the site1 and site2 routers, reporting their links, and those are reflected in the active_links metric. When a remote site is deleted, the flow-collector first observes events from the site1 router(s) announcing the lost connections to the routers in site2, and then eventually (after a 60s timeout) notes that it has stopped receiving events from the routers in site2. The function that updates the active_links metric sets the metric to zero but never deletes the time series:

_, siteNodes := fc.graph()
for _, node := range siteNodes {
	fc.metrics.activeLinks.WithLabelValues(node.ID, "outgoing").Set(float64(len(node.Forward)))
	fc.metrics.activeLinks.WithLabelValues(node.ID, "incoming").Set(float64(len(node.Backward)))
}
I think we'd need to add some less lazy state handling to keep track of the different active_links label sets and delete them when it is right to do so.
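
As an illustration only (this is not the flow-collector's actual code): a minimal, self-contained sketch assuming the metric is backed by a plain prometheus.GaugeVec from client_golang and that the label names are sourceSite and direction as reported above. The key idea is remembering which site IDs were reported on the previous pass so their series can be removed with DeleteLabelValues:

// Package flowmetrics is a standalone sketch, not the real flow-collector.
package flowmetrics

import "github.com/prometheus/client_golang/prometheus"

// activeLinks mirrors the layout of the real metric for illustration; the
// actual label names in the flow-collector may differ.
var activeLinks = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "active_links",
	Help: "Active links per site and direction (illustrative copy).",
}, []string{"sourceSite", "direction"})

// updateActiveLinks sets the gauge for every current site and deletes the
// series of any site that was reported on the previous pass but is now gone.
// links maps siteID -> {outgoing, incoming} counts and prev is the set of
// site IDs from the last pass; both are hypothetical inputs standing in for
// the collector's graph state. The updated set is returned for the next pass.
func updateActiveLinks(links map[string][2]int, prev map[string]struct{}) map[string]struct{} {
	current := make(map[string]struct{}, len(links))
	for id, counts := range links {
		current[id] = struct{}{}
		activeLinks.WithLabelValues(id, "outgoing").Set(float64(counts[0]))
		activeLinks.WithLabelValues(id, "incoming").Set(float64(counts[1]))
	}
	for id := range prev {
		if _, ok := current[id]; !ok {
			// Deleting the series (rather than setting it to 0) is what
			// stops a stale sourceSite label from lingering forever.
			activeLinks.DeleteLabelValues(id, "outgoing")
			activeLinks.DeleteLabelValues(id, "incoming")
		}
	}
	return current
}

Whether that deletion should happen as soon as the routers time out, or only after a longer grace period so down links still show up during a disruption, is the open design question mentioned above.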

xe-leon (Author) commented Feb 11, 2025

Restarting the skupper-service-controller deployment helped, but after that I had a weird number of sites (-1 indirectly):

skupper status
Skupper is enabled for namespace "grafana" with site name "grafana". It is connected to 3 other sites (-1 indirectly). It has 3 exposed services.

I had to revoke-access and issue a new token. After that the number of sites became correct and the stale site disappeared.

how are you collecting these metrics

Simply scraping the /federate endpoint of skupper-prometheus.

what conditions do your alerting stack watch for

I have an alert that fires if a certain site has 0 incoming and 0 outgoing links.

how frequently do you intentionally remove sites?

Well, it happens sometimes; I would say a few times per year. But I might also want to remove a site if I deployed it by mistake or changed my mind.
I think restarting the service-controller and reissuing the token is a workable workaround for now. Thank you for the advice!
