RFC - Pipeline Component Telemetry #11406

Merged
merged 18 commits into from
Nov 27, 2024

djaglowski
Member

@djaglowski djaglowski commented Oct 9, 2024

This PR adds an RFC for normalized telemetry across all pipeline components. See #11343

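edit by @mx-psi:
- Announced on #otel-collector-dev on 2024-10-23: https://cloud-native.slack.com/archives/C07CCCMRXBK/p1729705290741179
- Announced at the Collector SIG meeting on 2024-10-30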

codecov bot commented Oct 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.61%. Comparing base (c6828f0) to head (cb72f2a).
Report is 8 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #11406   +/-   ##
=======================================
  Coverage   91.61%   91.61%           
=======================================
  Files         443      443           
  Lines       23770    23770           
=======================================
  Hits        21776    21776           
  Misses       1620     1620           
  Partials      374      374           


@djaglowski djaglowski force-pushed the component-telemetry-rfc branch from 99e3086 to 5df52e1 October 10, 2024 13:05
@djaglowski djaglowski marked this pull request as ready for review October 10, 2024 13:36
@djaglowski djaglowski requested a review from a team as a code owner October 10, 2024 13:36
@djaglowski djaglowski requested a review from songy23 October 10, 2024 13:36
@djaglowski djaglowski added the "Skip Changelog" (PRs that do not require a CHANGELOG.md entry) and "Skip Contrib Tests" labels Oct 10, 2024
Contributor

@codeboten codeboten left a comment

Thanks for opening this as an RFC @djaglowski!

@djaglowski djaglowski changed the title from "RFC - Auto-instrumentation of pipeline components" to "RFC - Pipeline Component Telemetry" Oct 16, 2024
@djaglowski
Member Author

Based on some offline feedback, I've broadened the scope of the RFC, while simultaneously clarifying that it is intended to evolve as we identify additional standards.

Contributor

@jaronoff97 jaronoff97 left a comment

a few questions, I really like this proposal overall :)

@jpkrohling
Member

Some of my comments might have been discussed before, in which case, feel free to ignore me and just mark the items as resolved.

Member

@bogdandrutu bogdandrutu left a comment

Approved with comments.

@djaglowski djaglowski force-pushed the component-telemetry-rfc branch from 7d7a75b to a7a15e5 November 21, 2024 16:00
@djaglowski djaglowski added the rfc:final-comment-period label (This RFC is in the final comment period phase) Nov 21, 2024
Member

@dmitryax dmitryax left a comment

LGTM

codeboten added a commit to codeboten/opentelemetry-collector that referenced this pull request Nov 21, 2024
This sets the level of all metrics that were not previously stabilized as
alpha. Since many of these metrics will change as a result of
open-telemetry#11406, it made
sense to me to set their stability as alpha.

Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
@mx-psi
Member

mx-psi commented Nov 22, 2024

This has enough approvals and has entered the 'final comment period'. I will merge this on 2024-11-27 if nobody blocks it before then.

cc @open-telemetry/collector-approvers

Member

@jpkrohling jpkrohling left a comment

There are a couple of things to iron out, but I'm giving my approval already, as those are details that could be part of a follow-up PR. I don't want to block progress on dependent tasks because of those two rather small points.

codeboten added a commit that referenced this pull request Nov 22, 2024
This sets the level of all metrics that were not previously stabilized
as alpha. Since many of these metrics will change as a result of
#11406, it
made sense to me to set their stability as alpha.

---------

Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
@djaglowski
Member Author

I believe all feedback has been addressed.

#11743 tracks two follow-up items raised by @jpkrohling, but I believe the RFC is clear that some changes are anticipated.

Contributor

@codeboten codeboten left a comment

Thanks @djaglowski

Member

@jpkrohling jpkrohling left a comment

I approved this before, but I'll approve again, to make it explicit that I'm OK with the latest state of this PR.

@mx-psi
Member

mx-psi commented Nov 27, 2024

Per #11406 (comment) I am merging this 🎉

@mx-psi mx-psi merged commit 79357e8 into open-telemetry:main Nov 27, 2024
36 checks passed
@github-actions github-actions bot added this to the next release milestone Nov 27, 2024
@djaglowski djaglowski deleted the component-telemetry-rfc branch November 27, 2024 15:28
github-merge-queue bot pushed a commit that referenced this pull request Dec 16, 2024
## Description

This PR defines observability requirements for components at the
"Stable" stability level. The goal is to ensure that Collector
pipelines are properly observable, to help in debugging configuration
issues.

#### Approach

- The requirements are deliberately not too specific, in order to be
adaptable to each specific component, and so as to not over-burden
component authors.
- After discussing it with @mx-psi, this list of requirements explicitly
includes things that may end up being emitted automatically as part of
the Pipeline Instrumentation RFC (#11406), with only a note at the
beginning explaining that not everything may need to be implemented
manually.

Feel free to share if you don't think this is the right approach for
these requirements.

#### Link to tracking issue
Resolves #11581

## Important note regarding the Pipeline Instrumentation RFC

I included this paragraph in the part about error count metrics:
> The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so this should either:
> - only include errors internal to the component, or;
> - allow distinguishing said errors from ones originating in an external service, or propagated from downstream Collector components.

The [Pipeline Instrumentation
RFC](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md)
(hereafter abbreviated "PI"), once implemented, should allow monitoring
component errors via the `outcome` attribute, which is either `success`
or `failure`, depending on whether the `Consumer` API call returned an
error.

Note that this does not work for receivers, or allow differentiating
between different types of errors; for that reason, I believe additional
component-specific error metrics will often still be required, but it
would be nice to cover as many cases as possible automatically.

However, at the moment, errors are (usually) propagated upstream through
the chain of `Consume` calls, so in case of error the `failure` state
will end up applied to all components upstream of the actual source of
the error. This means the PI metrics do not fit the first bullet point.
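
The propagation effect described above can be illustrated with a small simulation. This is a minimal sketch in Python, not the Collector's Go API: `Component`, `Instrumented`, and the `outcomes` counter are hypothetical names, and Python exceptions stand in for errors returned by `Consume` calls.

```python
from collections import Counter


class Component:
    """Hypothetical pipeline component: forwards items downstream unless it fails."""

    def __init__(self, name, downstream=None, fails=False):
        self.name = name
        self.downstream = downstream
        self.fails = fails

    def consume(self, items):
        if self.fails:
            raise RuntimeError(f"{self.name}: internal error")
        if self.downstream is not None:
            self.downstream.consume(items)


class Instrumented:
    """Hypothetical wrapper recording an outcome for each consume call."""

    def __init__(self, component, outcomes):
        self.component = component
        self.outcomes = outcomes

    def consume(self, items):
        try:
            self.component.consume(items)
        except Exception:
            self.outcomes[(self.component.name, "failure")] += len(items)
            raise  # the error keeps travelling upstream
        self.outcomes[(self.component.name, "success")] += len(items)


def run_pipeline():
    outcomes = Counter()
    # Only the exporter at the end of the chain actually fails.
    exporter = Instrumented(Component("exporter", fails=True), outcomes)
    processor = Instrumented(Component("processor", downstream=exporter), outcomes)
    first = Instrumented(Component("filter", downstream=processor), outcomes)
    try:
        first.consume(["a", "b", "c"])
    except RuntimeError:
        pass  # a receiver would handle or report the error here
    return dict(outcomes)


outcomes = run_pipeline()
```

In this sketch all three components record three failed items, even though only the exporter had an internal error.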

Moreover, I would argue that even post-processing the PI metrics does
not reliably allow distinguishing the ultimate source of errors (the
second bullet point). One simple idea is to compute
`consumed.items{outcome:failure} - produced.items{outcome:failure}` to
get the number of errors originating in a component. But this only works
if output items map one-to-one to input items: if a processor or
connector outputs fewer items than it consumes (because it aggregates
them, or translates to a different signal type), this formula will
return false positives. If these false positives are mixed with real
errors from the component and/or from downstream, the situation becomes
impossible to analyze by just looking at the metrics.
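
The arithmetic behind this objection can be worked through with made-up numbers. The sketch below is illustrative only: `errors_attributed` is a hypothetical helper, and the item counts are assumed values rather than measurements.

```python
def errors_attributed(consumed_failed, produced_failed):
    """Apply the formula: failed consumed items minus failed produced items."""
    return consumed_failed - produced_failed


# One-to-one processor, downstream failure: all 100 inputs fail because
# all 100 outputs were rejected downstream. Nothing is attributed to it.
one_to_one = errors_attributed(consumed_failed=100, produced_failed=100)

# Aggregating processor, downstream failure: 100 inputs were merged into
# 10 outputs, which downstream rejected. The processor did nothing wrong,
# yet the formula attributes 90 errors to it.
aggregating = errors_attributed(consumed_failed=100, produced_failed=10)

# Internal failure: the processor itself fails before producing anything.
internal = errors_attributed(consumed_failed=100, produced_failed=0)
```

Here `one_to_one` is 0 and `internal` is 100, as intended, but `aggregating` is 90 without any internal error. From the metrics alone, that value cannot be told apart from 90 genuine internal failures.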

For these reasons, I believe we should do one of four things:
1. Change the way we use the `Consumer` API to no longer propagate
   errors, making the PI metric outcomes more precise.
   We could catch errors in whatever wrapper we already use to emit the PI
   metrics, log them for posterity, and simply not propagate them.
   Note that some components already more or less do this, such as the
   `batchprocessor`, but this option may in principle break components
   which rely on downstream errors (for retry purposes for example).
2. Keep propagating errors, but modify or extend the RFC to require
   distinguishing between internal and propagated errors (maybe add a third
   `outcome` value, or add another attribute).
   This could be implemented by somehow propagating additional state from
   one `Consume` call to another, allowing us to establish the first
   appearance of a given error value in the pipeline.
3. Loosen this requirement so that the PI metrics suffice in their
   current state.
4. Leave everything as-is and make component authors implement their own
   somewhat redundant error count metrics.
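
One way the "keep propagating, but distinguish" option could work is sketched below. This is a Python simulation under stated assumptions, not the Collector's Go API or the RFC's wording: `Component`, `Instrumented`, `PropagatedError`, and the `"propagated"` outcome value are all hypothetical names chosen for illustration. The wrapper that first sees an error records `failure` and marks the error; wrappers further upstream see the mark and record the hypothetical third outcome instead.

```python
from collections import Counter


class PropagatedError(Exception):
    """Hypothetical marker: the wrapped error was already counted downstream."""


class Component:
    """Hypothetical pipeline component: forwards items downstream unless it fails."""

    def __init__(self, name, downstream=None, fails=False):
        self.name = name
        self.downstream = downstream
        self.fails = fails

    def consume(self, items):
        if self.fails:
            raise RuntimeError(f"{self.name}: internal error")
        if self.downstream is not None:
            self.downstream.consume(items)


class Instrumented:
    """Hypothetical wrapper that separates first-seen errors from propagated ones."""

    def __init__(self, component, outcomes):
        self.component = component
        self.outcomes = outcomes

    def consume(self, items):
        name = self.component.name
        try:
            self.component.consume(items)
        except PropagatedError:
            # Already attributed to a downstream component.
            self.outcomes[(name, "propagated")] += len(items)
            raise
        except Exception as err:
            # First appearance of this error: attribute it here, then mark it.
            self.outcomes[(name, "failure")] += len(items)
            raise PropagatedError(str(err)) from err
        self.outcomes[(name, "success")] += len(items)


def run_pipeline():
    outcomes = Counter()
    exporter = Instrumented(Component("exporter", fails=True), outcomes)
    processor = Instrumented(Component("processor", downstream=exporter), outcomes)
    first = Instrumented(Component("filter", downstream=processor), outcomes)
    try:
        first.consume(["a", "b", "c"])
    except PropagatedError:
        pass  # upstream callers can still react, for example by retrying
    return dict(outcomes)


outcomes = run_pipeline()
```

In this sketch only the exporter records `failure`, while the two upstream components record the hypothetical `propagated` outcome, and the error still reaches the caller so retry logic is not broken.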

---------

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
HongChenTW pushed a commit to HongChenTW/opentelemetry-collector that referenced this pull request Dec 19, 2024
…ry#11729)

HongChenTW pushed a commit to HongChenTW/opentelemetry-collector that referenced this pull request Dec 19, 2024
This PR adds an RFC for normalized telemetry across all pipeline
components. See
open-telemetry#11343

edit by @mx-psi:
- Announced on #otel-collector-dev on 2024-10-23:
https://cloud-native.slack.com/archives/C07CCCMRXBK/p1729705290741179
- Announced at the Collector SIG meeting on 2024-10-30

---------

Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>
Co-authored-by: Damien Mathieu <42@dmathieu.com>
Co-authored-by: William Dumont <william.dumont@grafana.com>
Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>
HongChenTW pushed a commit to HongChenTW/opentelemetry-collector that referenced this pull request Dec 19, 2024
…ry#11772)

@jade-guiton-dd
Contributor

To make sure everyone involved is aware: I filed a PR (#11956) to amend this RFC. I am proposing adding a third `outcome` attribute value to make tracing the source of errors easier.
