Instrumentation and monitoring for apps #580

Open
2 tasks
Tracked by #159
martinothamar opened this issue Apr 11, 2024 · 2 comments

martinothamar commented Apr 11, 2024

Description

This is a big draft/braindump issue for problems surrounding monitoring and telemetry across the SDLC (software development lifecycle) for Altinn 3.

Prompted by TEs and users, we need to clarify responsibilities regarding monitoring apps running in environments.
A line can be drawn between what should be the responsibility of the platform owners and the platform users/application developers.

In addition, we need to advise TEs on how to monitor their application.
We need to provide an instrumentation and diagnostics setup that is usable and flexible both for us as library developers
and for app developers - where correlation and contextualization are mostly automatic, but still customizable.
Flexibility must also be present in usage of the telemetry.
We use one vendor today, but might want to completely switch vendors or use two different vendors at the same time.

In scope

To deliver good DevEx related to monitoring and operational aspects we need:

  • Library abstractions to configure, instrument and ship telemetry
  • Infrastructure to process, enrich and reliably deliver telemetry to backends
  • Documentation on
    • How to instrument, monitor and alert
    • Separation of responsibilities, ways of working, incident response, etc.
  • Process for onboarding
  • ?

Out of scope

?

Additional Information

Who should monitor what

Platform team should monitor infrastructure components such as AKS/k8s and related infrastructure

  • Monitor state of security and respond to any CVE (Common Vulnerabilities and Exposures) that requires us to upgrade k8s version for example, or patch VMs
  • LB (load balancing)/reverse proxies - everything from security to CPU/memory and other resource usage, certificate renewal for TLS, etc.
  • Autoscaling components - we should not run out of compute resources unexpectedly, issues with autoscaling should be detected by platform team
  • ... anything that the platform offers that may affect users, but the app developers don't have knowledge or ownership over

The platform team is primarily a team from Digdir, but in close cooperation with TE especially regarding scaling and security issues.

Service owners should monitor their application, for example

  • Unexpected rates of HTTP 5xx errors (status code >= 500)
  • Errors in any background processes from app code
  • Performance issues caused by app code
  • Memory or other resource leaks originating from app code
  • Security issues related to dependencies
    • need to make sure app-lib is up to date
    • apps need to be rebuilt periodically so that base image dependencies don't become stale

Digdir Team Apps should monitor based on library code, for example

  • Unexpected rates of HTTP 5xx errors (status code >= 500)
  • Errors in any background processes from library code
  • Performance issues caused by library code
  • Memory or other resource leaks originating from library code
  • Security issues related to library dependencies

Since issues discovered in applications may originate both from app and library code, and it requires investigation to know which, library code in particular needs to be well instrumented, and there should be a process in place for efficient collaboration between the teams/incident responders during incidents.
Both Team Apps and the app development teams need to be able to access telemetry through some monitoring and analysis tool such as Azure App Insights or Grafana.

Instrumentation guidance for app developers

It's tempting to just ask TEs what they want to monitor, but usually they don't know: the primary competence app teams bring is less about operational aspects such as monitoring and more about building good products. A good culture and process for instrumentation and monitoring can significantly improve the operational performance of systems, lead to fewer bugs and deliver better value.

We should put Dev & Ops together and build some deliverables that make it simple for app teams to be good at operating their applications.

Technical design

To ensure flexibility in vendor selection and instrumentation, we should use OpenTelemetry as the standard and library abstraction, as it has great .NET support (all APIs stable). Most of the library infrastructure and abstractions we need are built into the BCL, so we can drop some dependencies and stick to the BCL and the core OpenTelemetry libraries.

Proposed design

We have two mechanisms for telemetry today

  • Prometheus metrics through prometheus-net
  • Telemetry through Application Insights TelemetryClient

Some users may rely on TelemetryClient for custom instrumentation today (and app-lib does some),
but Prometheus metrics are not being shipped yet (PR for helm chart).

Deployment plan:

  • Deprecate TelemetryClient - e.g. using a Roslyn analyzer, or just documenting it
  • Deprecate or just remove prometheus-net depending on current use (some people might implement it even if the metrics don't go anywhere)
  • Add OpenTelemetry libraries, and start instrumenting app-lib with BCL System.Diagnostics APIs

Design considerations

As developers of app libraries we are responsible for developing/configuring/exposing abstractions that are well suited and flexible in use for instrumentation of code to gain observability.
We are also responsible for shipping and processing the telemetry in such a way that developers can make use of this capability in multiple phases of software development:

  • Debugging and development - distributed tracing in particular is a useful tool for understanding the real behavior of code during runtime
  • Testing - by analyzing telemetry during testing one can validate that the code behaves as expected - e.g. the expected number of calls to external APIs, the lifecycle of components running as expected, etc.
  • Bug fixing/production - root causing bugs and issues
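
The testing point above can be illustrated with a minimal, language-agnostic sketch. It is written in Python using only the standard library, purely for illustration; it is not the proposed app-lib design. `InMemoryTracer`, `submit_form` and the span name `storage.upload` are hypothetical names invented for this example. A real setup would use the tracing API and an in-memory exporter from the chosen telemetry library instead.

```python
import time
from contextlib import contextmanager


class InMemoryTracer:
    """Hypothetical stand-in for a tracer wired to an in-memory exporter."""

    def __init__(self):
        # Each finished span is recorded as (name, attributes, duration_seconds).
        self.finished = []

    @contextmanager
    def span(self, name, **attributes):
        start = time.perf_counter()
        try:
            yield
        finally:
            duration = time.perf_counter() - start
            self.finished.append((name, dict(attributes), duration))


def submit_form(tracer, attachments):
    """Hypothetical code under test: one upload span per attachment."""
    with tracer.span("submit_form", attachment_count=len(attachments)):
        for file_name in attachments:
            with tracer.span("storage.upload", file=file_name):
                pass  # the call to the external API would happen here


# In a test: run the code, then assert on the telemetry it produced.
tracer = InMemoryTracer()
submit_form(tracer, ["a.pdf", "b.pdf"])
uploads = [s for s in tracer.finished if s[0] == "storage.upload"]
assert len(uploads) == 2, "expected exactly one upload call per attachment"
```

The test never mocks the external API directly; it only inspects recorded spans, so the same instrumentation serves both debugging and test validation.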

In addition, telemetry and monitoring can be useful in the planning and delivery phases of the software development lifecycle:

  • Planning - what kind of metrics or telemetry do we need to prove the level of service (i.e. for an SLA) the app delivers? And what is the SLA we want to deliver?
  • Delivery - dashboards and monitoring for verifying SLA based on metrics discovered above (SLOs, SLIs if using an SRE process)
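
As a rough illustration of the planning and delivery points above, the sketch below computes an availability SLI from HTTP status codes and compares it with an SLO. It is Python with the standard library only, and every specific in it is an assumption made for the example: the function names, the choice of "non-5xx responses" as the SLI, and the 99% target. None of these have been decided in this issue.

```python
def availability_sli(status_codes):
    """Fraction of requests that did not fail with an HTTP 5xx status."""
    if not status_codes:
        return 1.0  # no traffic: treat as fully available (an assumption)
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


def error_budget_remaining(sli, slo):
    """Share of the error budget still unspent.

    1.0 means the budget is untouched; 0 or less means it is exhausted.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical sample: 1000 requests, 3 of which failed with a 5xx status.
codes = [200] * 997 + [500, 502, 503]
sli = availability_sli(codes)
remaining = error_budget_remaining(sli, slo=0.99)
print(f"SLI={sli:.3f}, error budget remaining={remaining:.0%}")
```

A dashboard or alert rule would typically evaluate the same ratio over a rolling window in the metrics backend rather than in application code.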

Why otel?

  • Better API than prometheus-net and App Insights
    • Standardized (can either use BCL directly or otel shim)
    • Less diversity of concepts (thinking of app insights requests|dependencies|traces|events|exceptions|...)
    • Better API for labels/attributes (in prometheus-net, label values are passed positionally and must match the order in which the label names were declared)
  • Better config management
    • Otel can be configured from the outside using env variables
  • Simpler localtest setup - we can run infra locally that matches deployed environments
  • Instrumenting with fewer dependencies in app-lib
  • Access to cheaper and/or better platforms without having to change abstractions or transport mechanism
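
To make the "configured from the outside" point concrete, here is a small sketch of environment-driven telemetry configuration. It is Python with the standard library only and is not how the OpenTelemetry SDK is implemented; the real SDKs read these variables themselves. `OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_RESOURCE_ATTRIBUTES` are environment variables defined by the OpenTelemetry specification. The function names, the fallback values and the sample values are assumptions for illustration.

```python
import os


def parse_resource_attributes(raw):
    """Parse 'key1=value1,key2=value2' into a dict, skipping malformed pairs."""
    attributes = {}
    for pair in raw.split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def load_telemetry_config(env=os.environ):
    """Build telemetry settings from environment variables (hypothetical helper)."""
    return {
        "service_name": env.get("OTEL_SERVICE_NAME", "unknown_service"),
        "endpoint": env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
        "resource_attributes": parse_resource_attributes(
            env.get("OTEL_RESOURCE_ATTRIBUTES", "")
        ),
    }


# The same code runs unchanged in localtest and in deployed environments;
# only the environment differs. The values below are made up.
config = load_telemetry_config(
    {
        "OTEL_SERVICE_NAME": "my-app",
        "OTEL_RESOURCE_ATTRIBUTES": "deployment.environment=test, team=apps",
    }
)
print(config)
```

Because the settings come from the environment, switching backend or running a local collector is a deployment change, not a code change.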

Kinds of telemetry

  • Logs - useful for ad hoc searching and filtering based on a known predicate (e.g. you filter by the severity and the log message)
  • Metrics - analysis based on timeseries/graphing and aggregation (e.g. what is the average request latency)
  • Traces - useful for inspecting specific operations or requests, understanding lifecycle, how often something runs or how long functions or subfunctions take
  • Continuous profiling - some vendors have started to build offerings here
  • Frontend RUM? Correlation?

When standard telemetry fails

Sometimes you can verify that there are issues, and what the nature of those issues is, but need more information to fix them.
Examples:

  • The app is seemingly deadlocked and frozen. Metrics can tell us that the number of threads spiked leading up to the outage. A common cause for this kind of issue is threadpool starvation; taking a process dump can help root cause the issue by inspecting the call stacks of managed threads.
  • Memory usage keeps increasing until the application OOMs and is killed by k8s. There is no telemetry to tell you what is being allocated but not deallocated. A process dump can also help in this case.

Tasks

  • Draft PR for lib changes
  • Draft PR for localtest changes
@martinothamar

ADR proposal: Altinn/altinn-decision-log#3

@RonnyB71 RonnyB71 removed status/draft Status: When you create an issue before you have enough info to properly describe the issue. status/triage labels May 21, 2024
@martinothamar

Relevant issue for client side analytics: Altinn/app-frontend-react#853

Status: 👷 In Progress