[APM] Create SLOs metrics to track dependencies performance #211481

miloszmarcinkowski · 2025-02-17T16:15:31Z

Acceptance criteria:

Create SLOs for given metrics:
- latency,
- success rate,
- event loop utilization.
Each dependency endpoint has its own set of SLO metrics, or a single SLO metric consolidates all endpoints if feasible.

elasticmachine · 2025-02-17T16:16:34Z

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

miloszmarcinkowski · 2025-02-19T10:29:21Z

Three SLO metrics have been created to track the performance of the dependencies endpoints:

- Success rate: tracks successful API transactions, target 99%.
- Latency: tracks transaction duration; we expect the API to take no longer than 10000ms, target 95%.
- Event loop active below 5000ms: tracks event loop active duration; we expect it to be no longer than 5000ms, target 99%.
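For illustration, here is a minimal sketch of how the three indicators above could be computed from a batch of transactions. It is not the actual SLO configuration in the cluster. The `Transaction` shape, the `sli` and `evaluate` helpers, and the inclusive thresholds (`<=`) are assumptions; only the thresholds and targets come from the list above. Each indicator is computed as the share of good events among all events.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Transaction:
    # Hypothetical, simplified view of one APM transaction document.
    duration_ms: float
    event_loop_active_ms: float
    success: bool


def sli(transactions: list, is_good: Callable) -> Optional[float]:
    """Percentage of good events among all events, or None if there is no data."""
    if not transactions:
        return None
    good = sum(1 for t in transactions if is_good(t))
    return 100.0 * good / len(transactions)


# name -> (good-event predicate, target in percent), taken from the list above.
SLOS = {
    "success_rate": (lambda t: t.success, 99.0),
    "latency": (lambda t: t.duration_ms <= 10_000, 95.0),
    "event_loop_active": (lambda t: t.event_loop_active_ms <= 5_000, 99.0),
}


def evaluate(transactions: list) -> dict:
    """Compute each indicator and compare it with its target."""
    report = {}
    for name, (is_good, target) in SLOS.items():
        value = sli(transactions, is_good)
        report[name] = {
            "sli": value,
            "target": target,
            "met": value is not None and value >= target,
        }
    return report


samples = [
    Transaction(duration_ms=1_200, event_loop_active_ms=300, success=True),
    Transaction(duration_ms=15_000, event_loop_active_ms=6_000, success=True),
    Transaction(duration_ms=800, event_loop_active_ms=100, success=False),
    Transaction(duration_ms=9_000, event_loop_active_ms=4_000, success=True),
]
report = evaluate(samples)
```

With these four made-up samples, each indicator has three good events out of four, so none of the targets is met.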

The set of metrics has been configured for the two heaviest dependencies endpoints:

- /internal/apm/dependencies/top_dependencies
- /internal/apm/services/?/dependencies

SLOs can be seen in the Overview cluster under the APM UI space; the best way to browse them is to group by tag.

Note

For comparison, it would be useful to create a separate set of metrics for the version that includes all performance improvements. (It is possible to group by version within an SLO, but that creates a large number of individual SLOs, which hurts readability.)

Note

It could be good to set alerts once we know the expected targets after implementing the performance improvements.

crespocarlos · 2025-02-19T10:39:31Z

Thanks, @miloszmarcinkowski. Does it make sense to create an SLO that looks at all APM transactions instead of focusing only on the dependencies endpoint? I'm asking because it might be helpful to uncover other CPU-intensive transactions that we are not yet aware of.

miloszmarcinkowski · 2025-02-19T10:52:30Z

@crespocarlos the problem with the approach you mentioned is that we won't be able to narrow down metrics to find transactions that are slowing us down. SLO is a simple good events / bad events ratio that helps us track performance, but isn't very useful for troubleshooting.

We can consider it for tracking overall performance between Kibana releases, but we still want to keep those for dependencies as a performance indicator.

crespocarlos · 2025-02-19T11:43:32Z

@miloszmarcinkowski

the problem with the approach you mentioned is that we won't be able to narrow down metrics to find transactions that are slowing us down. SLO is a simple good events / bad events ratio that helps us track performance, but isn't very useful for troubleshooting.

Right. I'm not suggesting it for troubleshooting purposes, but for visibility. My idea was to group the SLO by transaction.name instead of filtering by the dependencies transaction name. It's fine to leave this for the dependencies endpoint. I'll look into better surfacing CPU usage across Kibana endpoints.
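To make the grouping idea concrete, here is a rough sketch of a latency indicator computed per transaction.name rather than for a single filtered endpoint. The event dictionaries, the `latency_sli_by_transaction` helper, and the names `endpoint_a` and `endpoint_b` are hypothetical. The 10000ms threshold is reused from the latency SLO described earlier in the thread, expressed in microseconds to match transaction.duration.us.

```python
from collections import defaultdict


def latency_sli_by_transaction(events, threshold_us=10_000_000):
    """Return {transaction.name: percentage of events within the threshold}."""
    counts = defaultdict(lambda: [0, 0])  # name -> [good, total]
    for event in events:
        bucket = counts[event["transaction.name"]]
        bucket[1] += 1
        if event["transaction.duration.us"] <= threshold_us:
            bucket[0] += 1
    return {
        name: 100.0 * good / total
        for name, (good, total) in counts.items()
    }


# Made-up events; names and durations are illustrative only.
events = [
    {"transaction.name": "endpoint_a", "transaction.duration.us": 2_000_000},
    {"transaction.name": "endpoint_a", "transaction.duration.us": 12_000_000},
    {"transaction.name": "endpoint_b", "transaction.duration.us": 500_000},
]
per_name = latency_sli_by_transaction(events)
```

Grouping yields one value per transaction name, which gives visibility across endpoints but also multiplies the number of SLO instances to read through.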

smith · 2025-02-19T19:39:42Z

Any idea what the merits of the different event loop metrics are? Utilization is in the issue description. I think it's independent of the duration of the transaction. Is active better to use here?

miloszmarcinkowski · 2025-02-19T20:07:29Z

@smith Initially, I was thinking about using the event loop utilization metric, but after checking I found that its values fluctuate heavily for the same endpoints. Additionally, it turned out that event loop utilization depends on transaction duration: its value is basically the numeric_labels.event_loop_utilization/transaction.duration.us ratio. Therefore, I ended up using the event loop active duration, for lack of other alternatives.
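A small sketch of why a duration-relative metric can fluctuate for the same endpoint. It assumes utilization is event loop active time divided by total transaction duration. That is one reading of the comment above, not a confirmed definition, and the `utilization` helper and the sample values are hypothetical.

```python
ACTIVE_THRESHOLD_US = 5_000_000  # 5000ms, the SLO threshold from this thread


def utilization(active_us: float, duration_us: float) -> float:
    """Event loop active time as a fraction of transaction duration (assumed definition)."""
    if duration_us <= 0:
        raise ValueError("duration must be positive")
    return active_us / duration_us


# Two requests doing the same amount of event loop work (2s active),
# but the second one spends much longer waiting on I/O.
fast = utilization(active_us=2_000_000, duration_us=4_000_000)
slow = utilization(active_us=2_000_000, duration_us=20_000_000)

# The ratio differs fivefold, while the active-duration check gives
# the same verdict for both requests.
same_verdict = (2_000_000 <= ACTIVE_THRESHOLD_US) == (2_000_000 <= ACTIVE_THRESHOLD_US)
```

Under that assumption, the ratio moves with time spent waiting rather than with CPU work, whereas a threshold on active duration does not.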

miloszmarcinkowski added the apm label Feb 17, 2025

miloszmarcinkowski self-assigned this Feb 17, 2025

botelastic bot added the needs-team Issues missing a team label label Feb 17, 2025

miloszmarcinkowski added the Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team label Feb 17, 2025

botelastic bot removed the needs-team Issues missing a team label label Feb 17, 2025

smith added the technical debt Improvement of the software architecture and operational architecture label Feb 21, 2025
