Instrumentation and monitoring for apps #580

martinothamar · 2024-04-11T06:26:09Z

Description

This is a big draft/braindump issue for problems surrounding monitoring and telemetry across the SDLC (software development lifecycle) for Altinn 3.

Prompted by TEs and users, we need to clarify responsibilities regarding monitoring apps running in environments.
A line can be drawn between what should be the responsibility of the platform owners and the platform users/application developers.

In addition, we need to advise TEs on how to monitor their application.
We need to provide an instrumentation and diagnostics setup that is usable and flexible both for us as library developers
and for app developers, where correlation and contextualization are mostly automatic but still customizable.
Flexibility must also be present in usage of the telemetry.
We use one vendor today, but might want to switch vendors completely or use two different vendors at the same time.

In scope

To deliver good DevEx related to monitoring and operational aspects, we need:

Library abstractions to configure, instrument and ship telemetry
Infrastructure to process, enrich and reliably deliver telemetry to backends
Documentation on
- How to instrument, monitor and alert
- Separation of responsibilities, way of working, incident response, etc.
Process for onboarding
?

Out of scope

?

Additional Information

Who should monitor what

Platform team should monitor infrastructure components such as AKS/k8s and related infrastructure

Monitor the state of security and respond to any CVE (Common Vulnerabilities and Exposures) that requires us to, for example, upgrade the k8s version or patch VMs
LB (load balancers)/reverse proxies - everything from security to CPU/memory and other resource usage, certificate renewal for TLS, etc.
Autoscaling components - we should not run out of compute resources unexpectedly, issues with autoscaling should be detected by platform team
... anything that the platform offers that may affect users, but the app developers don't have knowledge or ownership over

The platform team is primarily a team from Digdir, but works in close cooperation with TEs, especially regarding scaling and security issues.

Service owners should monitor their application, for example

Unexpected rates of HTTP 5xx errors (status code >= 500)
Errors in any background processes from app code
Performance issues caused by app code
Memory or other resource leaks originating from app code
Security issues related to dependencies
- need to make sure app-lib is up to date
- apps need to be rebuilt periodically so that base image dependencies don't become stale

Digdir Team Apps should monitor based on library code, for example

Unexpected rates of HTTP 5xx errors (status code >= 500)
Errors in any background processes from library code
Performance issues caused by library code
Memory or other resource leaks originating from library code
Security issues related to library dependencies
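
To make "unexpected rates" of 5xx errors concrete, one option is to compare the share of 5xx responses in a sliding time window against a threshold. The following is a minimal Python sketch of that idea, not a proposal for the actual alerting setup. The class name, the 5 minute window, the 5% threshold and the minimum request count are all hypothetical values chosen for illustration.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of 5xx responses within a sliding time window.

    All defaults are hypothetical and only serve the illustration.
    """

    def __init__(self, window_seconds=300.0, threshold=0.05, min_requests=20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self._events = deque()  # (timestamp, is_server_error)

    def record(self, timestamp, status_code):
        """Record one finished request and drop events that left the window."""
        self._events.append((timestamp, status_code >= 500))
        self._evict(timestamp)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def error_rate(self, now):
        """Fraction of requests in the window that returned a 5xx status."""
        self._evict(now)
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def should_alert(self, now):
        """Alert only when there is enough traffic to make the rate meaningful."""
        self._evict(now)
        enough_traffic = len(self._events) >= self.min_requests
        return enough_traffic and self.error_rate(now) > self.threshold
```

The minimum request count avoids alerting on a single failed request during quiet periods. In practice this kind of rule would more likely live in the monitoring backend as an alert query than in application code.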

Issues discovered in applications may originate from either app code or library code, and it takes investigation to know which. Library code in particular therefore needs to be well instrumented, and there should be a process in place for efficient collaboration between the teams/incident responders during incidents.
Both Team Apps and the app development teams need to be able to access telemetry through some monitoring and analysis tool such as Azure App Insights or Grafana.

Instrumentation guidance for app developers

It's tempting to just ask TEs what they want to monitor, but usually they don't know, as the primary competence app teams bring is usually less about operational aspects such as monitoring, and more about building good products. A good culture and process for instrumentation and monitoring can significantly improve the operational performance of systems, lead to fewer bugs and deliver better value.

We should put Dev & Ops together and build some deliverables that make it simple for app teams to be good at operating their applications.

Technical design

To ensure flexibility in vendor selection and instrumentation, we should use OpenTelemetry as the standard and library abstraction, as it has great .NET support (all APIs stable). Most of the library infrastructure and abstractions we need are built into the BCL, so we can drop some dependencies and stick to the BCL and the core OpenTelemetry libraries.

Proposed design:

Only System.Diagnostics (builtin/BCL) and OpenTelemetry.(Instrumentation|OTLPExporter) libraries in app-lib
Configure otel through environment variables in TE clusters (e.g. helm chart)
Ship to Azure Monitor through something like OpenTelemetry Collector or Grafana Agent (which wraps otel components)
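
As an illustration of what "configure otel through environment variables" means, here is a minimal Python sketch that resolves three variables defined by the OpenTelemetry specification: `OTEL_SERVICE_NAME`, `OTEL_RESOURCE_ATTRIBUTES` and `OTEL_EXPORTER_OTLP_ENDPOINT`. In app-lib the OpenTelemetry .NET SDK would do this itself, so the code only shows the semantics. The function names and the example values (`my-app`, `otel-collector`, `staging`) are hypothetical, and the fallback endpoint assumes OTLP over gRPC.

```python
import os
from urllib.parse import unquote


def parse_resource_attributes(raw):
    """Parse the comma-separated key=value format of OTEL_RESOURCE_ATTRIBUTES."""
    attrs = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        key, value = pair.split("=", 1)
        attrs[key.strip()] = unquote(value.strip())  # values may be percent-encoded
    return attrs


def load_otel_config(env=None):
    """Resolve exporter endpoint and resource attributes from the environment."""
    env = os.environ if env is None else env
    attrs = parse_resource_attributes(env.get("OTEL_RESOURCE_ATTRIBUTES", ""))
    # OTEL_SERVICE_NAME takes precedence over service.name in the attribute list.
    attrs["service.name"] = (
        env.get("OTEL_SERVICE_NAME") or attrs.get("service.name", "unknown_service")
    )
    return {
        # Assumption: OTLP over gRPC, whose conventional default port is 4317.
        "endpoint": env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
        "resource": attrs,
    }


if __name__ == "__main__":
    # Hypothetical values, e.g. as they might be set from a helm chart.
    example_env = {
        "OTEL_SERVICE_NAME": "my-app",
        "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317",
        "OTEL_RESOURCE_ATTRIBUTES": "deployment.environment=staging",
    }
    print(load_otel_config(example_env))
```

Because everything is read from the environment, the same app image can point at a local collector in localtest and at a cluster-level collector in TE environments without code changes.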

We currently have two mechanisms for telemetry:

Prometheus metrics through prometheus-net
Telemetry through Application Insights TelemetryClient

Some users may rely on TelemetryClient for custom instrumentation today (and app-lib does some),
but Prometheus metrics are not being shipped yet (PR for helm chart).

Deployment plan:

Deprecate TelemetryClient - e.g. using a Roslyn analyzer, or just documenting it
Deprecate or just remove prometheus-net depending on current use (some people might implement it even if the metrics don't go anywhere)
Add OpenTelemetry libraries, and start instrumenting app-lib with BCL System.Diagnostics APIs

Design considerations

As developers of app libraries we are responsible for developing/configuring/exposing abstractions that are well suited and flexible in use for instrumentation of code to gain observability.
We are also responsible for shipping and processing the telemetry in such a way that developers can make use of this capability in multiple phases of software development:

Debugging and development - distributed tracing in particular is a useful tool for understanding the real behavior of code during runtime
Testing - by analyzing telemetry during testing one can validate that the code behaves as expected - e.g. the expected number of calls to external APIs, the lifecycle of components running as expected, etc.
Bug fixing/production - root causing bugs and issues
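
The testing point can be sketched as follows: a test collects spans in memory and then asserts on them, for example on the number of calls to an external API. This is a stdlib-only Python illustration of the idea. `InMemoryRecorder` stands in for an in-memory exporter, and `submit_form` and the span names are hypothetical, not code from app-lib.

```python
import time
from contextlib import contextmanager


class InMemoryRecorder:
    """Collects finished spans in a list so that tests can inspect them."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        start = time.perf_counter()
        try:
            yield
        finally:
            # A span is recorded when it ends, so inner spans appear first.
            self.spans.append({
                "name": name,
                "attributes": attributes,
                "duration_s": time.perf_counter() - start,
            })

    def count(self, name):
        return sum(1 for span in self.spans if span["name"] == name)


def submit_form(recorder, attachments):
    """Hypothetical code under test: one storage call per attachment."""
    with recorder.span("submit_form"):
        for attachment in attachments:
            with recorder.span("storage.upload", attachment=attachment):
                pass  # the real call to the external API would go here


if __name__ == "__main__":
    recorder = InMemoryRecorder()
    submit_form(recorder, ["a.pdf", "b.pdf"])
    # Validate behavior through telemetry: exactly one upload per attachment.
    assert recorder.count("storage.upload") == 2
    assert recorder.count("submit_form") == 1
```

The same assertions catch regressions such as an accidental duplicate call, which a test that only checks return values would miss.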

In addition, telemetry and monitoring can be useful in the planning and delivery phases of the software development lifecycle:

Planning - what kind of metrics or telemetry do we need to prove the level of service (i.e. for an SLA) the app delivers? And what is the SLA we want to deliver?
Delivery - dashboards and monitoring for verifying SLA based on metrics discovered above (SLOs, SLIs if using an SRE process)
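
To show how an SLO turns into numbers that a dashboard can verify, here is a small worked example in Python. The 99.9% target and the 30 day window are hypothetical and are not proposed service levels for Altinn apps. The SLI used is plain request availability (successful requests divided by total requests).

```python
def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing was violated
    return good_requests / total_requests


def error_budget_minutes(slo_target, window_days):
    """Minutes of full unavailability the SLO tolerates within the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target, good_requests, total_requests):
    """Fraction of the error budget still unspent (negative if the SLO is breached)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical SLO: 99.9% over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999, 30), 1))
    # 1,000,000 requests with 400 failures spends about 40% of the budget.
    print(round(budget_remaining(0.999, 999_600, 1_000_000), 2))
```

Framing the SLA this way during planning also tells us which metrics must exist: at minimum, a request count that can be split by success and failure.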

Why otel?

Better API than prometheus-net and App Insights
- Standardized (can either use BCL directly or otel shim)
- Less diversity of concepts (thinking of app insights requests|dependencies|traces|events|exceptions|...)
- Better API for labels/attributes (prometheus-net depends on ordering when adding labels vs names)
Better config management
- Otel can be configured from the outside using env variables
Simpler localtest setup - we can run infra locally that matches deployed environments
Instrumenting with fewer dependencies in app-lib
Access to cheaper and/or better platforms without having to change abstractions or transport mechanism
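
The point about labels/attributes can be illustrated with a small Python sketch. It contrasts a counter whose label values are matched to label names by position with one that takes key/value attributes. Both classes are hypothetical stand-ins written for this example. They mimic the two API styles and are not the prometheus-net or OpenTelemetry APIs.

```python
class PositionalCounter:
    """Label names are declared once; values are matched to them by position."""

    def __init__(self, label_names):
        self.label_names = tuple(label_names)
        self.values = {}

    def inc(self, *label_values):
        if len(label_values) != len(self.label_names):
            raise ValueError("wrong number of label values")
        key = tuple(zip(self.label_names, label_values))
        self.values[key] = self.values.get(key, 0) + 1


class KeyedCounter:
    """Attributes are passed as key/value pairs, so ordering is irrelevant."""

    def __init__(self):
        self.values = {}

    def add(self, amount, attributes):
        key = tuple(sorted(attributes.items()))
        self.values[key] = self.values.get(key, 0) + amount


if __name__ == "__main__":
    positional = PositionalCounter(["method", "status"])
    positional.inc("200", "GET")  # arguments swapped: silently mislabeled
    print(positional.values)

    keyed = KeyedCounter()
    keyed.add(1, {"status": "200", "method": "GET"})
    keyed.add(1, {"method": "GET", "status": "200"})  # same series either way
    print(keyed.values)
```

With the positional style, a swapped argument produces a valid-looking but wrong time series, and nothing fails until someone reads the dashboard.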

Kinds of telemetry

Logs - useful for ad hoc searching and filtering based on a known predicate (e.g. you filter by the severity and the log message)
Metrics - analysis based on timeseries/graphing and aggregation (e.g. what is the average request latency)
Traces - useful for inspecting specific operations or requests, understanding lifecycle, how often something runs or how long functions or subfunctions take
Continuous profiling - some vendors have started to build offerings here
Frontend RUM? Correlation?

When standard telemetry fails

Sometimes you can verify that there are issues, and what the nature of those issues is, but you need more information to fix them.
Examples:

The app is seemingly deadlocked and frozen. Metrics can tell us that the number of threads spiked leading up to the outage. A common cause of this kind of issue is thread pool starvation; taking a process dump can help root-cause the issue by inspecting the call stacks of managed threads.
Memory usage keeps increasing until the application runs out of memory (OOM) and is killed by k8s. There is no telemetry to tell you what is being allocated but not deallocated. A process dump can also help in this case.

Tasks

Draft PR for lib changes
Draft PR for localtest changes


martinothamar · 2024-05-14T06:24:43Z

ADR proposal: Altinn/altinn-decision-log#3

martinothamar · 2024-06-10T08:26:32Z

Relevant issue for client side analytics: Altinn/app-frontend-react#853

martinothamar added the status/draft and Epic labels Apr 11, 2024

github-actions bot added the status/triage label Apr 11, 2024

acn-sbuad added this to Team Apps Apr 11, 2024

RonnyB71 mentioned this issue Apr 11, 2024

Monitoring and surveillance of Altinn 3 applications (original title: Monitorering og overvåking av Altinn 3 applikasjoner) digdir/roadmap#159

Open

This was referenced Apr 12, 2024

Instrumentation and monitoring for apps - Draft PR for lib changes #585

Closed

OpenTelemetry observability #586

Closed

Support for OpenTelemetry in localtest #587

Closed

martinothamar mentioned this issue Apr 12, 2024

Local monitoring setup with OTEL and Grafana LGTM-stack Altinn/app-localtest#97

Merged

5 tasks

This was referenced Apr 15, 2024

Removing TelemetryClient and prometheus-net #592

Open

Instrument app-lib for OpenTelemetry #593

Closed

martinothamar mentioned this issue May 8, 2024

Monitoring/operational docs for app developers and other stakeholders #638

Open

RonnyB71 removed the status/draft and status/triage labels May 21, 2024

martinothamar moved this to 👷 In Progress in Team Apps Jun 10, 2024

bbrandt mentioned this issue Feb 5, 2025

Allow Prometheus Rules to Publish Rules to Azure Managed Prometheus pyrra-dev/pyrra#1185

Open
