doc: logger: add design document

Add a design/architecture document describing how the kni logger works and the implementation choices.

Signed-off-by: Francesco Romani <fromani@redhat.com>
openshift-kni · Mar 6, 2024 · eee13a0
1 parent 5733878
commit eee13a0
Showing 1 changed file with 70 additions and 0 deletions.
diff --git a/pkg-kni/logger/DESIGN.md b/pkg-kni/logger/DESIGN.md
@@ -0,0 +1,70 @@
+# Taming the scheduler logging
+Owner: fromani@redhat.com
+
+## Summary
+Logging is not a solved problem in a complex system, especially in a complex distributed system.
+In the kubernetes ecosystem, the most commonly experienced pain points are excessive or
+insufficient verbosity, which in turn creates the need to change the verbosity level during
+the component lifetime.
+Keeping the verbosity high creates a large amount of logs, while keeping it low makes it
+much harder to troubleshoot an issue: one must first increase the verbosity, restart the
+affected components and re-create the issue, which can take time and effort.
+
+The scheduler logs are affected by all these issues. Keeping the log level high is, as it stands
+today (March 2024), still discouraged and impractical. The matter is further complicated by the
+fact that the NUMA-aware scheduler is a new component which takes novel approaches, out of necessity,
+and whose behavior is still under scrutiny. So it is especially important to have enough
+data to troubleshoot issues, which once again calls for high verbosity.
+
+We would like to improve the current flow, which is basically: keep verbosity=2 and, in case
+of incidents (but note: always after the fact), bump the verbosity to 4 or more,
+reproduce the issue, and send the logs.
+
+## Motivation
+We want to improve the supportability of the NUMA-aware scheduler. Having detailed logs is key
+to troubleshooting this component, because it is new and takes a novel approach (in the k8s ecosystem)
+due to the characteristics of the problem space. Detailed logs are thus a key enabler to
+shorten the support cycle, or to make support possible at all.
+
+The work described here explicitly targets the NUMA-aware scheduler plugin, which is a very small
+subset of the code running in a (secondary) scheduler process.
+We have to trust the k8s ecosystem to provide insights into all the framework code used in the
+NUMA-aware scheduler process.
+
+We believe this is a fair trade-off because the k8s framework is very battle-tested and has a
+huge ecosystem backing it. Out of practicality, we cannot land nontrivial changes in that codebase.
+Furthermore, most of the novel code is contained in the NUMA-aware scheduler plugin portion,
+so focusing on this area for extra logging seems to be the sweet spot.
+
+
+## Goals
+- Make it possible/easier to correlate all the logs pertaining to a container during a scheduling cycle
+- Improve the signal-to-noise ratio of the logs, filtering out all the content not relevant to
+  the NUMA-aware scheduling
+- Avoid excessive storage consumption
+- Minimize the slowdown caused by logging
+
+## Non-Goals
+- Change the logging system (e.g. migrate away from klog)
+- Introduce a replacement logger (e.g. module with klog-like API)
+- Break source code compatibility (e.g. no changes to the scheduler plugins source code)
+- Move to traces (independent effort, not mutually exclusive)
+- Make the verbosity tunable at runtime (no consensus about how to do it securely and safely,
+  will require a new logging package)
+
+## Proposal
+- Introduce and use extensively a logID key/value pair to enable correlation of all the log entries
+  pertaining to a scheduling cycle, or to a scheduling activity in general
+- Introduce a new logging backend plugging into the logr framework, which klog >= 2.100 fully supports
+- Let the new logging backend handle the logging demultiplexing
+  - Aggregate the logs per-object
+
+## Risks and Mitigations
+TBD
+
+## Design details
+TBD
+
+## Discarded alternatives
+TBD
+