Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sdk-metrics] Refactor and improve interlocking code for doubles in MetricPoint #5378

Conversation

CodeBlanch
Copy link
Member

Changes

  • Refactored the custom interlocking code for double values in MetricPoint into a helper and improved them slightly. With help from @stephentoub.

@CodeBlanch CodeBlanch added the metrics Metrics signal related label Feb 21, 2024
@CodeBlanch CodeBlanch requested a review from a team February 21, 2024 21:04
Copy link

codecov bot commented Feb 21, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (6250307) 83.38% compared to head (19abd86) 83.36%.
Report is 92 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5378      +/-   ##
==========================================
- Coverage   83.38%   83.36%   -0.03%     
==========================================
  Files         297      278      -19     
  Lines       12531    11936     -595     
==========================================
- Hits        10449     9950     -499     
+ Misses       2082     1986      -96     
Flag Coverage Δ
unittests ?
unittests-Solution-Experimental 83.31% <89.47%> (?)
unittests-Solution-Stable 83.27% <89.47%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/OpenTelemetry/Internal/InterlockedHelper.cs 100.00% <100.00%> (ø)
src/OpenTelemetry/Metrics/MetricPoint.cs 67.81% <66.66%> (-0.66%) ⬇️

... and 47 files with indirect coverage changes

Comment on lines +32 to +33
public static double Read(ref double location)
=> Interlocked.CompareExchange(ref location, double.NaN, double.NaN);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this to be a CompareExchange rather than just Volatile.Read? You need a full fence for some reason?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code before was...

this.snapshotValue.AsDouble = Interlocked.CompareExchange(ref this.runningValue.AsDouble, 0.0, double.NegativeInfinity);

Now it is calling this helper...

this.snapshotValue.AsDouble = InterlockedHelper.Read(ref this.runningValue.AsDouble);

What that code needs is no stale reads. Does Interlocked.CompareExchange (full fence) give us 100% guaranteed up-to-date values? My understanding with Volatile.Read is it could give us back something stale.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What that code needs is no stale reads.

Define stale. Volatile.Read will ensure it's not just using some enregistered value. But beyond that, I don't see how CompareExchange(ref location, sentinel, sentinel) would provide strong guarantees about "up-to-date". The exact moment after you read the value but before you do anything with it it could be updated concurrently, at which point the value isn't "up to date".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How OpenTelemetry SDK is working is many threads are recording values. For this double case it is either of these:

            case AggregationType.DoubleSumIncomingDelta:
                {
                    InterlockedHelper.Add(ref this.runningValue.AsDouble, number);
                    break;
                }

            case AggregationType.DoubleSumIncomingCumulative:
                {
                    Interlocked.Exchange(ref this.runningValue.AsDouble, number);
                    break;
                }

There is a dedicated thread which does a collection operation periodically. Its job is to grab a snapshot of the current/running state. That is this block:

            case AggregationType.DoubleSumIncomingDelta:
            case AggregationType.DoubleSumIncomingCumulative:
                {
                    if (outputDelta)
                    {
                        double initValue = InterlockedHelper.Read(ref this.runningValue.AsDouble);
                        this.snapshotValue.AsDouble = initValue - this.deltaLastValue.AsDouble;
                        this.deltaLastValue.AsDouble = initValue;
                        this.MetricPointStatus = MetricPointStatus.NoCollectPending;

                        // Check again if value got updated, if yes reset status.
                        // This ensures no Updates get Lost.
                        if (initValue != InterlockedHelper.Read(ref this.runningValue.AsDouble))
                        {
                            this.MetricPointStatus = MetricPointStatus.CollectPending;
                        }
                    }
                    else
                    {
                        this.snapshotValue.AsDouble = InterlockedHelper.Read(ref this.runningValue.AsDouble);
                    }

                    break;
                }

The delta case needs to compute the change since the last collect:

                        double initValue = InterlockedHelper.Read(ref this.runningValue.AsDouble);
                        this.snapshotValue.AsDouble = initValue - this.deltaLastValue.AsDouble;
                        this.deltaLastValue.AsDouble = initValue;

It's OK if the value changes right before or right after but we need to make sure whatever we got was true and not some cached or local thing. Does that help?

Copy link

@stephentoub stephentoub Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's OK if the value changes right before or right after but we need to make sure whatever we got was true and not some cached or local thing.

I still don't understand. With a CompareExchange (that's not also updating anything atomically, just reading), you might read the value, then the thread might be context switched out, and 10 minutes later it might start running again and you continue with the 10 minute old value. What's the difference between that and "some cached" thing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephentoub

Here is our version for long measurements:

            case AggregationType.LongSumIncomingDelta:
            case AggregationType.LongSumIncomingCumulative:
                {
                    if (outputDelta)
                    {
                        long initValue = Interlocked.Read(ref this.runningValue.AsLong);
                        this.snapshotValue.AsLong = initValue - this.deltaLastValue.AsLong;
                        this.deltaLastValue.AsLong = initValue;
                        this.MetricPointStatus = MetricPointStatus.NoCollectPending;

                        // Check again if value got updated, if yes reset status.
                        // This ensures no Updates get Lost.
                        if (initValue != Interlocked.Read(ref this.runningValue.AsLong))
                        {
                            this.MetricPointStatus = MetricPointStatus.CollectPending;
                        }
                    }
                    else
                    {
                        this.snapshotValue.AsLong = Interlocked.Read(ref this.runningValue.AsLong);
                    }

                    break;
                }

Also uses Interlocked.Read, do you think it would be better to use Volatile.Read there too?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is our main motivation behind doing our best to capture as many updates as possible before running collect. Any updates that we miss, would only be reported in the next collect cycle (which would be like a minute later).

Can you share the data you have that demonstrates using CompareExchange is actually improving this?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also uses Interlocked.Read, do you think it would be better to use Volatile.Read there too?

Interlocked.Read(ref value) is just CompareExchange(ref value, 0, 0), so it's effectively the same thing you're doing with the doubles. Personally, I'd just use Volatile.Read. I'm really interested in seeing the data that suggests there's any meaningful benefit to using CompareExchange instead in these scenarios; otherwise, I'd stick with the idiomatic answer that's also cheaper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share the data you have that demonstrates using CompareExchange is actually improving this?

We haven't tested this, and we most likely would not anytime soon. It would require elaborate testing setups with different kinds of hardware. That might not be the best use of our time. Ideally, we would like to rely on the official docs for this kind of information instead of running experiments ourselves and gathering this kind of data.

In this particular case, our understanding/assumption is in line with what you mentioned in one of your previous comments:

Using an Interlocked.CompareExchange here may give you a value more synchronized with the other uses of those variables, but that's because it's actually synchronizing,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is you'd find it makes essentially zero difference, other than slowing down the writing code.

@CodeBlanch CodeBlanch merged commit 09dd46f into open-telemetry:main Feb 22, 2024
37 checks passed
@CodeBlanch CodeBlanch deleted the sdk-metrics-double-interlocking-refactor branch February 22, 2024 00:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
metrics Metrics signal related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants