doc: logger: add design document
add design/arch document how the kni logger works and
the implementation choices.

Signed-off-by: Francesco Romani <fromani@redhat.com>
ffromani committed Mar 6, 2024
1 parent 5733878 commit eee13a0
Showing 1 changed file with 70 additions and 0 deletions.
70 changes: 70 additions & 0 deletions pkg-kni/logger/DESIGN.md
@@ -0,0 +1,70 @@
# Taming the scheduler logging
Owner: fromani@redhat.com

## Summary
Logging is not a solved problem in a complex system, especially in a complex distributed system.
Focusing on the kubernetes ecosystem, the most commonly experienced pain points are excessive or
insufficient verbosity, which in turn creates the need to change the verbosity level during
the component lifetime.
Keeping the verbosity high produces a large amount of logs, while keeping it low makes it much
harder to troubleshoot an issue: one first has to increase the verbosity, restart the affected
components and re-create the issue, which can take time and effort.

The scheduler logs are affected by all these issues. Keeping the log level high is, as it stands
today (March 2024), still discouraged and impractical. The matter is further complicated by the
fact that the NUMA-aware scheduler is a new component which takes novel approaches, out of necessity,
and whose behavior is still under scrutiny. So it is especially important to have enough
data to troubleshoot issues, which once again calls for high verbosity.

We would like to improve the current flow, which is basically: keep verbosity=2 and, in case
of incidents (but note: always after the fact), bump the verbosity to 4 or more,
reproduce the issue, and send the logs.

## Motivation
We want to improve the supportability of the NUMA-aware scheduler. Having detailed logs is key
to troubleshooting this component, because it is new and takes a novel approach (in the k8s ecosystem)
due to the characteristics of the problem space. Detailed logs are thus a key enabler to
shorten the support cycle, or to make support possible at all.

The work described here explicitly targets the NUMA-aware scheduler plugin, which is a very small
subset of the code running in a (secondary) scheduler process.
For insights into the rest of the framework code running in the NUMA-aware scheduler process,
we have to rely on the k8s ecosystem.

We believe this is a fair trade-off because the k8s framework is battle-tested and has a
huge ecosystem backing it. Out of practicality, we cannot land nontrivial changes in that codebase.
Furthermore, most of the novel code is contained in the NUMA-aware scheduler plugin portion,
so focusing on this area for extra logging seems to be the sweet spot.


## Goals
- Make it possible/easier to correlate all the logs pertaining to a container during a scheduling cycle
- Improve the signal-to-noise ratio of the logs, filtering out all the content not relevant to
NUMA-aware scheduling
- Avoid excessive storage consumption
- Minimize the slowdown caused by logging

## Non-Goals
- Change the logging system (e.g. migrate away from klog)
- Introduce a replacement logger (e.g. module with klog-like API)
- Break source code compatibility (e.g. no changes to the scheduler plugins source code)
- Move to traces (independent effort not mutually exclusive)
- Make the verbosity tunable at runtime (no consensus about how to do it securely and safely;
it would require a new logging package)

## Proposal
- Introduce and use extensively a logID key/value pair to enable correlation of all the log entries
pertaining to a scheduling cycle, or to a scheduling activity in general
- Introduce a new logging backend plugging into the logr framework, which klog >= 2.100 fully supports
- Let the new logging backend handle the logging demultiplexing
- Aggregate the logs per-object
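
As a rough illustration of how the points above could fit together, here is a minimal sketch using only the Go standard library. It is not the design this document commits to (the design details are still TBD). The `demuxSink` type, its `Flush` method, the `newDemuxSink` constructor and the literal `"logID"` key name are hypothetical; `Enabled` and `Info` only mirror the general shape of the corresponding `logr.LogSink` methods, and the remaining methods a real sink needs are omitted for brevity.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// logIDKey is an assumed key name; the proposal only says "a logID key/value pair".
const logIDKey = "logID"

// demuxSink is a hypothetical backend that routes each entry to a
// per-logID bucket instead of writing it straight to the main log stream.
type demuxSink struct {
	mu       sync.Mutex
	maxLevel int                 // entries above this verbosity are dropped
	perID    map[string][]string // aggregated entries, keyed by logID value
	other    []string            // entries that carry no logID
}

func newDemuxSink(maxLevel int) *demuxSink {
	return &demuxSink{maxLevel: maxLevel, perID: make(map[string][]string)}
}

// Enabled reports whether entries at the given verbosity are kept.
func (s *demuxSink) Enabled(level int) bool { return level <= s.maxLevel }

// Info stores the entry under its logID, or in the catch-all bucket.
func (s *demuxSink) Info(level int, msg string, keysAndValues ...any) {
	if !s.Enabled(level) {
		return
	}
	id, found := findLogID(keysAndValues)
	line := formatEntry(msg, keysAndValues)
	s.mu.Lock()
	defer s.mu.Unlock()
	if !found {
		s.other = append(s.other, line)
		return
	}
	s.perID[id] = append(s.perID[id], line)
}

// Flush returns all the entries aggregated for id and forgets them.
func (s *demuxSink) Flush(id string) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	entries := s.perID[id]
	delete(s.perID, id)
	return entries
}

// findLogID scans the alternating key/value list for the logID key.
func findLogID(kv []any) (string, bool) {
	for i := 0; i+1 < len(kv); i += 2 {
		if key, ok := kv[i].(string); ok && key == logIDKey {
			return fmt.Sprint(kv[i+1]), true
		}
	}
	return "", false
}

// formatEntry renders the message followed by key=value pairs.
func formatEntry(msg string, kv []any) string {
	var b strings.Builder
	b.WriteString(msg)
	for i := 0; i+1 < len(kv); i += 2 {
		fmt.Fprintf(&b, " %v=%v", kv[i], kv[i+1])
	}
	return b.String()
}
```

In this sketch, a caller would log with something like `sink.Info(4, "filter done", "logID", "ns/pod-a")` during a scheduling cycle and call `sink.Flush("ns/pod-a")` once the cycle ends, getting back every entry for that object in order. Where the flushed entries would go, and when the flush is triggered, are left open here.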

## Risks and Mitigations
TBD

## Design details
TBD

## Discarded alternatives
TBD
