[sdk-metrics] Exemplar spec & perf improvements #5364
Conversation
Codecov Report
Attention: Patch coverage is …

```diff
@@            Coverage Diff             @@
##             main    #5364      +/-   ##
==========================================
+ Coverage   83.38%   84.01%   +0.62%
==========================================
  Files         297      283      -14
  Lines       12531    11972     -559
==========================================
- Hits        10449    10058     -391
+ Misses       2082     1914     -168
```
Happy to see the multiple refactorings/introduction of ExemplarReservoir/FixedSizeExemplarReservoir to make Exemplar more extensible and spec-compliant!
```csharp
switch (this.aggType)
{
    case AggregationType.LongSumIncomingDelta:
    {
        this.mpComponents!.AcquireLock();
```
The current code was using a lock (implemented with CAS calls/spin-wait, not the lock() statement) and then doing everything inside that lock: updating the sum and doing the exemplar storage.
Now we don't take that lock, but instead use multiple interlocked calls: one to update the sum, one to update measurementsSeen, one (or possibly more) to get the right buffer, and one more to make sure no one else is updating simultaneously.
Despite this, the perf gain in the stress test is significant. What would be a good explanation for this?
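For reference, a lock "implemented with CAS calls/spin-wait" is essentially a spin lock along these lines. This is an illustrative sketch, not the repository's actual AcquireLock/ReleaseLock code:

```csharp
using System.Threading;

// Illustrative sketch of a lock built from CAS + spin-wait;
// names are assumptions, not the actual source.
internal sealed class CasLockSketch
{
    private int occupied; // 0 = free, 1 = held

    public void AcquireLock()
    {
        var spinner = default(SpinWait);

        // Every losing thread spins here until the holder releases.
        while (Interlocked.CompareExchange(ref this.occupied, 1, 0) != 0)
        {
            spinner.SpinOnce();
        }
    }

    public void ReleaseLock()
    {
        Interlocked.Exchange(ref this.occupied, 0);
    }
}
```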
There is just a lot less contention/spinning going on.
The stress test runs 20 threads on my machine, all hammering measurements into the same counter. The measurements all write the same tag "DimName1" with one of ten random values.
Before this PR, if two measurements happen on the same random value, one of those threads is going to have to spin for the duration of the value update + the exemplar update. The exemplar update can be very slow. Best case, you wait for a random value and then you don't record. Worst case, you wait for a random value and then you record, which is a bunch of writes.
After this PR, all measurements do an Interlocked operation for the value-record portion, so that spin is eliminated. Then they enter exemplar world and get a thread-local random. If there isn't going to be a record, the thread is done. If there is a record, we now have Environment.ProcessorCount buckets available. Each thread may write to one of those buckets without blocking anything. If some other thread happens to be writing to the same bucket, we just abort the record (we don't spin). Essentially, in this design there is no spinning for a measurement, so the threads spend more time doing work and we get better throughput.
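A minimal sketch of that claim-or-abort offer path (thread-local random, then a non-blocking bucket claim). The names and the sampling rule here are illustrative assumptions, not the PR's actual reservoir code:

```csharp
using System;
using System.Threading;

// Sketch of the no-spin offer path described above (assumed names).
internal sealed class ReservoirSketch
{
    [ThreadStatic]
    private static Random? threadRandom; // per-thread: no contention

    private readonly int[] bucketBusy;    // 0 = free, 1 = being written
    private readonly double[] bucketValues;
    private int measurementsSeen;

    public ReservoirSketch()
    {
        this.bucketBusy = new int[Environment.ProcessorCount];
        this.bucketValues = new double[Environment.ProcessorCount];
    }

    public void Offer(double value)
    {
        // Interlocked bump instead of holding a lock for the whole update.
        int seen = Interlocked.Increment(ref this.measurementsSeen);

        Random random = threadRandom ??= new Random();

        // Classic reservoir sampling: keep the i-th measurement with
        // probability capacity / i. If not sampled, the thread is done.
        int slot = random.Next(seen);
        if (slot >= this.bucketValues.Length)
        {
            return;
        }

        // Try to claim the bucket. If another thread is writing to the
        // same bucket, abort the record instead of spinning.
        if (Interlocked.CompareExchange(ref this.bucketBusy[slot], 1, 0) != 0)
        {
            return;
        }

        this.bucketValues[slot] = value;

        Interlocked.Exchange(ref this.bucketBusy[slot], 0); // release
    }
}
```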
Thanks, that explanation does make sense.
As you can see, my original intuition was that if we are going to need multiple interlocked operations anyway, it would be better to grab a lock equivalent via CAS, do the work normally (simple additions instead of Interlocked.Add and so on), and then release the lock.
I'll also run the stress test on a weaker machine (4 cores) to see if I still see some gains!
Can you add the stress test you used to this PR (we can remove it before merging to avoid bloating the stress tests)?
How about histograms? Even in this PR, Update() is going to enter/exit a critical section, and then for the exemplar we do that again (roughly the shape sketched below). Would we be better off doing the exemplar work inside the already-entered critical section? The default histogram reservoir doesn't have much overhead (there is no random generation, just an update at the already-known bucket index).
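A hypothetical simplification of the update path being discussed; all names here are illustrative, not the repository's actual code:

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the reservoir so the sketch is self-contained.
internal abstract class ExemplarReservoirSketch
{
    public abstract void Offer(
        double value,
        ReadOnlySpan<KeyValuePair<string, object?>> tags,
        int bucketIndex);
}

internal sealed class HistogramPointSketch
{
    private readonly object lockObject = new(); // stands in for AcquireLock/ReleaseLock
    private readonly double[] bucketBounds;
    private readonly long[] runningBucketCounts;
    private readonly ExemplarReservoirSketch? reservoir;
    private double runningSum;

    public HistogramPointSketch(double[] bucketBounds, ExemplarReservoirSketch? reservoir)
    {
        this.bucketBounds = bucketBounds;
        this.runningBucketCounts = new long[bucketBounds.Length + 1];
        this.reservoir = reservoir;
    }

    public void Update(double value, ReadOnlySpan<KeyValuePair<string, object?>> tags)
    {
        // The bucket index is already known before any locking happens.
        int bucketIndex = Array.BinarySearch(this.bucketBounds, value);
        if (bucketIndex < 0)
        {
            bucketIndex = ~bucketIndex;
        }

        lock (this.lockObject) // critical section #1: aggregation update
        {
            this.runningSum += value;
            this.runningBucketCounts[bucketIndex]++;
        }

        // Critical section #2: the reservoir now synchronizes internally,
        // so the offer happens outside the lock above.
        this.reservoir?.Offer(value, tags, bucketIndex);
    }
}
```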
I ran stress tests against the latest version of this PR.
- Counters are improved in all cases.
- Histograms:
  - Improved when there is contention (the lock is held for less time).
  - Slightly degraded when there isn't a lot of contention. This is due to a few extra operations being executed now in the exemplar path. I could special-case something to fix this. But if you look at the overall throughput of histograms vs. counters, they suck 🤣 I would like to look at that in a follow-up and see if I can use the same active/passive buffer tricks to help histograms across the board.
Counters

New version using counter w/ tag-filtering view + exemplars:
- Total Running Time (seconds): 300
- Total Loops: 9,899,082,415
- Average Loops/Second: 32,964,636
- Average CPU Cycles/Loop: 1,727
- GC Total Allocated Bytes: 10,643,640

Old version using counter w/ tag-filtering view + exemplars:
- Total Running Time (seconds): 300
- Total Loops: 5,076,211,789
- Average Loops/Second: 16,909,264
- Average CPU Cycles/Loop: 869
- GC Total Allocated Bytes: 11,098,120

New version using counter w/ exemplars:
- Total Running Time (seconds): 300
- Total Loops: 10,881,621,737
- Average Loops/Second: 36,250,443
- Average CPU Cycles/Loop: 1,573
- GC Total Allocated Bytes: 14,448,912

Old version using counter w/ exemplars:
- Total Running Time (seconds): 300
- Total Loops: 8,958,303,265
- Average Loops/Second: 29,855,139
- Average CPU Cycles/Loop: 1,855
- GC Total Allocated Bytes: 11,546,904

Histograms

New version using histogram w/ tag-filtering view + exemplars:
- Total Running Time (seconds): 300
- Total Loops: 3,069,842,084
- Average Loops/Second: 10,227,489
- Average CPU Cycles/Loop: 3,781
- GC Total Allocated Bytes: 11,073,856

Old version using histogram w/ tag-filtering view + exemplars:
- Total Running Time (seconds): 300
- Total Loops: 2,160,807,746
- Average Loops/Second: 7,197,558
- Average CPU Cycles/Loop: 2,160
- GC Total Allocated Bytes: 16,985,496

New version using histogram w/ exemplars:
- Total Running Time (seconds): 300
- Total Loops: 5,576,901,325
- Average Loops/Second: 18,583,167
- Average CPU Cycles/Loop: 3,001
- GC Total Allocated Bytes: 177,120,704

Old version using histogram w/ exemplars:
- Total Running Time (seconds): 300
- Total Loops: 5,767,724,875
- Average Loops/Second: 19,220,944
- Average CPU Cycles/Loop: 2,733
- GC Total Allocated Bytes: 265,700,816
This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or pushing will instruct the bot to automatically remove the label. This bot runs once per day.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
Changes

ExemplarReservoir
- Offer method signature no longer includes the "index" parameter, which is not part of the spec. Instead it now accepts struct ExemplarMeasurement, which only exposes the value & tags (to match the spec) but also tracks the explicit histogram index as an internal thing.
- Collect method signature no longer includes "actualTags" or "reset", which are not part of the spec. The "reset" flag has become a bool ResetOnCollect property on ExemplarReservoir. Tag filtering is now handled by struct ReadOnlyFilteredTagCollection, which is exposed on Exemplar. Return type changed from Exemplar[] to struct ReadOnlyExemplarCollection.
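Putting those bullets together, the reshaped surface looks roughly like the following. This is a sketch assembled from the description above, not verbatim source:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the reshaped API as described above (approximate shapes).
public abstract class ExemplarReservoir
{
    // Replaces the old "reset" parameter that was passed to Collect.
    public virtual bool ResetOnCollect => false;

    // The "index" parameter is gone; value & tags travel in a struct.
    public abstract void Offer(in ExemplarMeasurement<long> measurement);

    public abstract void Offer(in ExemplarMeasurement<double> measurement);

    // Previously returned Exemplar[] and took "actualTags"/"reset".
    public abstract ReadOnlyExemplarCollection Collect();
}

// Only the value & tags are exposed (to match the spec); the explicit
// histogram bucket index is tracked internally and is not shown here.
public readonly ref struct ExemplarMeasurement<T>
    where T : struct
{
    public ExemplarMeasurement(T value, ReadOnlySpan<KeyValuePair<string, object?>> tags)
    {
        this.Value = value;
        this.Tags = tags;
    }

    public T Value { get; }

    public ReadOnlySpan<KeyValuePair<string, object?>> Tags { get; }
}

// Stand-in declaration so the sketch is self-contained.
public readonly struct ReadOnlyExemplarCollection
{
}
```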
Exemplar
- FilteredTags property is now struct ReadOnlyFilteredTagCollection instead of List<KeyValuePair<string, object?>>?.
- Now exposes long LongValue in addition to double DoubleValue.
SimpleFixedSizeExemplarReservoir
- Default size changed from 1 to Environment.ProcessorCount, which is allowed by the spec.

Concurrency
- ExemplarReservoir methods made safe to be called concurrently (required by the spec). Also fixed a bunch of race conditions with collect. Need to prove this out more, probably using Coyote tests (follow-up).
- MetricPoint had dual implementations of all aggregation types which acquired locks when exemplars were in play. This exploded the size/complexity of the code while also destroying throughput. Now that ExemplarReservoir can be called concurrently, all of that has been removed.

Misc improvements
- ReadOnlyExemplarCollection when enumerated only returns Exemplars which have been updated (Timestamp != default). Previously each exporter had to understand that behavior and check for it (see the consumption sketch after this list).
- … Exemplar.FilteredTags.
- Introduced FixedSizeExemplarReservoir to house the shared logic both SimpleFixedSizeExemplarReservoir & AlignedHistogramBucketExemplarReservoir share.
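A sketch of how an exporter might consume these collections, per the bullets above; enumerator/member details are assumptions for illustration:

```csharp
using System;

// Sketch: consuming the new collection types described above.
internal static class ExemplarWriter
{
    public static void Write(ReadOnlyExemplarCollection exemplars)
    {
        foreach (var exemplar in exemplars)
        {
            // No Timestamp != default check required anymore; the
            // collection only yields exemplars that were updated.
            Console.WriteLine($"{exemplar.Timestamp}: {exemplar.DoubleValue}");

            foreach (var tag in exemplar.FilteredTags) // ReadOnlyFilteredTagCollection
            {
                Console.WriteLine($"  {tag.Key} = {tag.Value}");
            }
        }
    }
}
```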
Benchmarks
When Exemplars are NOT enabled, perf is indistinguishable from before:

Before:
After:

Exemplar benchmarks show slight improvement:

Before:
After:
Those Exemplar benchmarks above are very misleading. The Exemplar implementation before this PR forces all aggregation types into locking mode inside MetricPoint. This destroys throughput for all non-histograms. That really won't show up on benchmarks because that is one thread hammering measurements.

Here are some stress test results:
Before:
After:
That is running the stress test using a Counter<long> + a view filtering tags + AlwaysOnExemplarFilter + an OtlpExporter tweaked to send exemplars for counters, transmitting every 1000ms. A hedged sketch of that wiring follows.
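A hypothetical reconstruction of that setup; the meter/instrument names are assumptions, and the "tweaked to send exemplars for counters" exporter change is not shown:

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Assumed names: "StressMeter", "stress_counter".
var meter = new Meter("StressMeter");
var counter = meter.CreateCounter<long>("stress_counter");

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter(meter.Name)
    .SetExemplarFilter(new AlwaysOnExemplarFilter())
    // View that keeps only one tag key, exercising the tag-filtering path.
    .AddView("stress_counter", new MetricStreamConfiguration
    {
        TagKeys = new[] { "DimName1" },
    })
    // Stock OtlpExporter options with a 1000ms export interval; the
    // counter-exemplar tweak from the comment is not reproduced here.
    .AddOtlpExporter((exporterOptions, readerOptions) =>
    {
        readerOptions.PeriodicExportingMetricReaderOptions
            .ExportIntervalMilliseconds = 1000;
    })
    .Build();

// Example measurement matching the stress test's tag shape.
counter.Add(1, new KeyValuePair<string, object?>("DimName1", "value0"));
```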
This new Exemplar implementation blows away the old one due to the concurrency changes. The reduction in allocations is due to the FilteredTags improvements.

Merge requirement checklist
- CHANGELOG.md files updated for non-trivial changes