forked from kubernetes-sigs/scheduler-plugins
add design/arch document how the kni logger works and the implementation choices. Signed-off-by: Francesco Romani <fromani@redhat.com>
Showing 1 changed file with 70 additions and 0 deletions.
@@ -0,0 +1,70 @@
# Taming the scheduler logging
Owner: fromani@redhat.com

## Summary
Logging is not a solved problem in a complex system, especially in a complex distributed system.
In the Kubernetes ecosystem, the most commonly experienced pain points are excessive or
insufficient verbosity, which in turn creates the need to change the verbosity level during
the component's lifetime.
Keeping the verbosity high produces a large volume of logs, while keeping it low makes it much
harder to troubleshoot an issue without first increasing the verbosity, restarting the affected
components, and re-creating the issue, all of which takes time and effort.

The scheduler logs are affected by all these issues. Keeping the log level high is, as it stands
today (March 2024), still discouraged and impractical. The matter is further complicated by the
fact that the NUMA-aware scheduler is a new component which takes novel approaches, out of
necessity, and whose behavior is still under scrutiny. So it is especially important to have
enough data to troubleshoot issues, which once again calls for high verbosity.

We would like to improve the current flow, which is basically: keep verbosity=2 and, in case
of incidents (but note: always after the fact), bump the verbosity to 4 or more,
reproduce the issue, and send the logs.

## Motivation
We want to improve the supportability of the NUMA-aware scheduler. Having detailed logs is key
to troubleshooting this component, because it is new and takes a novel approach (in the k8s
ecosystem) due to the characteristics of the problem space. Detailed logs are thus a key enabler
to shorten the support cycle, or to make support possible at all.

The work described here explicitly targets the NUMA-aware scheduler plugin, which is a very small
subset of the code running in a (secondary) scheduler process.
We have to trust the k8s ecosystem to provide insight into the framework code used in the
NUMA-aware scheduler process.

We believe this is a fair trade-off because the k8s framework is very battle-tested and has a
huge ecosystem backing it. As a practical matter, we cannot land nontrivial changes in that
codebase. Furthermore, most of the novel code is contained in the NUMA-aware scheduler plugin
portion, so focusing on this area for extra logging seems to be the sweet spot.

## Goals
- Make it possible, or easier, to correlate all the logs pertaining to a container during a scheduling cycle
- Improve the signal-to-noise ratio of the logs by filtering out all content not relevant to
  NUMA-aware scheduling
- Avoid excessive storage consumption
- Minimize the slowdown caused by logging

## Non-Goals
- Change the logging system (e.g. migrate away from klog)
- Introduce a replacement logger (e.g. a module with a klog-like API)
- Break source code compatibility (e.g. by requiring changes to the scheduler plugins source code)
- Move to traces (an independent effort, not mutually exclusive)
- Make the verbosity tunable at runtime (there is no consensus on how to do this securely and
  safely, and it would require a new logging package)

## Proposal
- Introduce and extensively use a logID key/value pair to enable correlation of all the log entries
  pertaining to a scheduling cycle, or to a scheduling activity in general
- Introduce a new logging backend plugging into the logr framework, which klog >= 2.100 fully supports
- Let the new logging backend handle the logging demultiplexing
- Aggregate the logs per-object
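
The proposal items above can be illustrated with a minimal, stdlib-only Go sketch. It is not the actual backend: `demuxSink`, `findLogID`, `formatEntry`, `Flush`, and the sample IDs are hypothetical names chosen for illustration. The `Info` method only mirrors the shape of logr's `LogSink.Info` so that the idea is visible without importing logr, and the routing policy (entries with a `logID` are grouped per ID, everything else goes to a shared stream) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// logIDKey is the key searched for in the key/value pairs.
// The exact spelling is an assumption based on the proposal text.
const logIDKey = "logID"

// demuxSink is a hypothetical stand-in for a logging backend.
// It groups entries by logID instead of writing one interleaved stream.
type demuxSink struct {
	mu    sync.Mutex
	byID  map[string][]string // logID -> entries, in arrival order
	other []string            // entries that carry no logID
}

func newDemuxSink() *demuxSink {
	return &demuxSink{byID: make(map[string][]string)}
}

// Info has the same shape as logr's LogSink.Info method.
// Entries with a logID are appended to that ID's bucket (assumed policy).
func (s *demuxSink) Info(level int, msg string, keysAndValues ...any) {
	line := formatEntry(level, msg, keysAndValues)
	id, ok := findLogID(keysAndValues)
	s.mu.Lock()
	defer s.mu.Unlock()
	if !ok {
		s.other = append(s.other, line)
		return
	}
	s.byID[id] = append(s.byID[id], line)
}

// findLogID scans the alternating key/value list for logIDKey.
func findLogID(kv []any) (string, bool) {
	for i := 0; i+1 < len(kv); i += 2 {
		if k, ok := kv[i].(string); ok && k == logIDKey {
			return fmt.Sprint(kv[i+1]), true
		}
	}
	return "", false
}

// formatEntry renders one entry as a single text line.
func formatEntry(level int, msg string, kv []any) string {
	var b strings.Builder
	fmt.Fprintf(&b, "V(%d) %q", level, msg)
	for i := 0; i+1 < len(kv); i += 2 {
		fmt.Fprintf(&b, " %v=%v", kv[i], kv[i+1])
	}
	return b.String()
}

// Flush returns all entries aggregated for one logID and forgets them.
func (s *demuxSink) Flush(id string) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	lines := s.byID[id]
	delete(s.byID, id)
	return lines
}

func main() {
	sink := newDemuxSink()
	sink.Info(2, "filter start", "logID", "ns1/pod-a", "node", "worker-0")
	sink.Info(4, "cache resync") // no logID: lands in the shared stream
	sink.Info(4, "filter done", "logID", "ns1/pod-a", "status", "ok")
	sink.Info(2, "filter start", "logID", "ns1/pod-b", "node", "worker-1")

	// All entries for one object come out together, in order.
	for _, line := range sink.Flush("ns1/pod-a") {
		fmt.Println(line)
	}
}
```

In a real backend the same routing would presumably live behind the full logr sink interface, with some policy deciding when a per-object bucket is written out or discarded. Those choices are left open here, since the design details are still TBD.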

## Risks and Mitigations
TBD

## Design details
TBD

## Discarded alternatives
TBD