[APM] Create SLOs metrics to track dependencies performance #211481

Open
miloszmarcinkowski opened this issue Feb 17, 2025 · 7 comments
Assignees
Labels
apm Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team technical debt Improvement of the software architecture and operational architecture

Comments

@miloszmarcinkowski
Contributor

Acceptance criteria:

  • Create SLOs for given metrics:
    • latency,
    • success rate,
    • event loop utilization.
  • Each dependency endpoint has its own set of SLO metrics, or a single SLO metric consolidates all endpoints if feasible.
@miloszmarcinkowski miloszmarcinkowski self-assigned this Feb 17, 2025
@botelastic botelastic bot added the needs-team Issues missing a team label label Feb 17, 2025
@miloszmarcinkowski miloszmarcinkowski added the Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team label Feb 17, 2025
@elasticmachine
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Feb 17, 2025
@miloszmarcinkowski
Contributor Author

Three SLO metrics have been created to track the performance of the dependencies endpoints:

  1. Success rate - tracks successful API transactions; target 99%.
  2. Latency - tracks transaction duration; we expect an API call to take no longer than 10000ms; target 95%.
  3. Event loop active below 5000ms - tracks event loop active duration; we expect it to be no longer than 5000ms; target 99%.
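
As a rough illustration of how the three indicators above classify events, here is a minimal sketch. It is not the Kibana SLO configuration: `Transaction`, `Slo`, `sli` and `evaluate` are hypothetical names, the field names are simplified, and the ratio is assumed to be good events over all events. Only the thresholds and targets come from the list above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Transaction:
    # Hypothetical, simplified view of one APM transaction document.
    name: str
    succeeded: bool
    duration_ms: float
    event_loop_active_ms: float


@dataclass(frozen=True)
class Slo:
    name: str
    is_good: Callable[[Transaction], bool]  # classifies one event as good or bad
    target: float  # required share of good events, e.g. 0.99


# Thresholds and targets are taken from the list above.
# Treating the boundary value as "good" (<=) is an assumption.
SLOS = [
    Slo("Success rate", lambda t: t.succeeded, 0.99),
    Slo("Latency", lambda t: t.duration_ms <= 10_000, 0.95),
    Slo(
        "Event loop active below 5000ms",
        lambda t: t.event_loop_active_ms <= 5_000,
        0.99,
    ),
]


def sli(slo: Slo, transactions: Iterable[Transaction]) -> float:
    """Share of good events among all events (1.0 when there is no traffic)."""
    events = list(transactions)
    if not events:
        return 1.0
    good = sum(1 for t in events if slo.is_good(t))
    return good / len(events)


def evaluate(transactions: Iterable[Transaction]) -> dict:
    """Map each SLO name to (observed ratio, whether the target is met)."""
    events = list(transactions)
    result = {}
    for slo in SLOS:
        ratio = sli(slo, events)
        result[slo.name] = (ratio, ratio >= slo.target)
    return result
```

Calling `evaluate` on a batch of transactions for one endpoint returns, per SLO, the observed ratio and whether it meets the target.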

The set of metrics has been configured for the two heaviest dependencies endpoints:

  • /internal/apm/dependencies/top_dependencies,
  • /internal/apm/services/?/dependencies.

The SLOs can be seen in the Overview cluster under the APM UI space. The best way to browse them is grouping by tag:

Image

Note

For comparison, it would be useful to create a separate set of metrics for the version that includes all performance improvements. (It is possible to group by version within an SLO, but that creates a large number of individual SLOs, which hurts readability.)

Note

It would be good to set up alerts once we know the expected targets after implementing the performance improvements.

@crespocarlos
Contributor

crespocarlos commented Feb 19, 2025

Thanks, @miloszmarcinkowski. Does it make sense to create an SLO that looks at all APM transactions instead of focusing only on the dependencies endpoint? I'm asking because it might be helpful to uncover other CPU-intensive transactions that we are not yet aware of.

@miloszmarcinkowski
Contributor Author

@crespocarlos the problem with the approach you mentioned is that we won't be able to narrow down metrics to find transactions that are slowing us down. SLO is a simple good events / bad events ratio that helps us track performance, but isn't very useful for troubleshooting.

We can consider it for tracking overall performance between Kibana releases, but we still want to keep the dependencies SLOs as a performance indicator.

@crespocarlos
Contributor

crespocarlos commented Feb 19, 2025

@miloszmarcinkowski

the problem with the approach you mentioned is that we won't be able to narrow down metrics to find transactions that are slowing us down. SLO is a simple good events / bad events ratio that helps us track performance, but isn't very useful for troubleshooting.

Right. I'm not suggesting this for troubleshooting purposes, but for visibility. My idea was to group the SLO by transaction.name instead of filtering by the dependencies transaction name. It's fine to leave this for the dependencies endpoint. I'll look into better surfacing CPU usage across Kibana endpoints.
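
To make the difference between one filtered SLO and an SLO grouped by transaction.name concrete, here is a small sketch. It does not use any Kibana API: `success_rate_by_transaction` and `overall_success_rate` are hypothetical helpers, and events are assumed to be plain (transaction name, succeeded) pairs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, bool]  # (transaction name, succeeded); an assumed shape


def overall_success_rate(events: Iterable[Event]) -> float:
    """One ratio across every transaction, with no grouping."""
    items: List[Event] = list(events)
    if not items:
        return 1.0
    return sum(1 for _, ok in items if ok) / len(items)


def success_rate_by_transaction(events: Iterable[Event]) -> Dict[str, float]:
    """One ratio per transaction name, as a grouped SLO would report."""
    counts = defaultdict(lambda: [0, 0])  # name -> [good, total]
    for name, ok in events:
        counts[name][1] += 1
        if ok:
            counts[name][0] += 1
    return {name: good / total for name, (good, total) in counts.items()}
```

With 90 successful calls to one endpoint and 5 successes plus 5 failures on another, the overall ratio is 0.95 while the grouped view reports 1.0 and 0.5, so the weaker endpoint becomes visible only when grouping.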

@smith
Contributor

smith commented Feb 19, 2025

Any idea what the merits of the different event loop metrics are? Utilization is in the issue description. I think it's independent of the duration of the transaction. Is active better to use here?

@miloszmarcinkowski
Contributor Author

@smith Initially, I was thinking about using the event loop utilization metric, but after checking I found that its values fluctuate heavily for the same endpoints. Additionally, it turned out that event loop utilization depends on transaction duration: its value is basically the numeric_labels.event_loop_utilization/transaction.duration.us ratio. Therefore, I ended up using the event loop active duration, for lack of other alternatives.
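
A small worked example of why a duration-dependent ratio can fluctuate while the active time stays the same. This is only a sketch under an assumption: utilization is modeled here as event loop active time divided by transaction duration, which is one reading of the ratio described above and not a confirmed definition. `event_loop_utilization` is a hypothetical helper and both inputs are in milliseconds.

```python
def event_loop_utilization(active_ms: float, duration_ms: float) -> float:
    """Assumed model: share of the transaction's duration spent with the event loop active."""
    if duration_ms <= 0:
        raise ValueError("duration_ms must be positive")
    return active_ms / duration_ms


# Same event loop active time, different total durations
# (for example, a slower downstream response on the second request).
fast = event_loop_utilization(active_ms=400.0, duration_ms=500.0)
slow = event_loop_utilization(active_ms=400.0, duration_ms=4_000.0)
```

Under this model the two requests do the same amount of event loop work, yet their utilization values differ by a factor of eight, whereas a threshold on the active duration treats them identically.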

@smith smith added the technical debt Improvement of the software architecture and operational architecture label Feb 21, 2025

4 participants