[sdk-metrics] Exemplar spec & perf improvements #5364
Conversation
Codecov Report
Attention: Patch coverage is …

```diff
@@            Coverage Diff             @@
##             main    #5364      +/-   ##
==========================================
+ Coverage   83.38%   84.01%   +0.62%
==========================================
  Files         297      283      -14
  Lines       12531    11972     -559
==========================================
- Hits        10449    10058     -391
+ Misses       2082     1914     -168
```
Happy to see the multiple refactorings/introduction of ExemplarReservoir/FixedSizeExemplarReservoir to make Exemplar more extensible and spec-compliant!
```csharp
switch (this.aggType)
{
    case AggregationType.LongSumIncomingDelta:
    {
        this.mpComponents!.AcquireLock();
```
The current code was using a lock (implemented with CAS calls/spin-wait, not the lock() statement) and then doing everything inside that lock: updating the sum and doing the exemplar storage.
Now we don't take that lock, but instead use multiple interlocked calls: one to update the sum, one to update measurementsSeen, one (or possibly more) to get the right buffer, and one more to make sure no one else is updating simultaneously.
Despite this, the perf gain in the stress test is significant. What would be a good explanation for this?
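For reference, a lock "implemented with CAS calls/spin-wait" is essentially a spin lock along these lines. This is an illustrative sketch, not the repository's actual AcquireLock/ReleaseLock code:

```csharp
using System.Threading;

// Illustrative sketch of a lock built from CAS + spin-wait;
// names are assumptions, not the actual source.
internal sealed class CasLockSketch
{
    private int occupied; // 0 = free, 1 = held

    public void AcquireLock()
    {
        var spinner = default(SpinWait);

        // Every losing thread spins here until the holder releases.
        while (Interlocked.CompareExchange(ref this.occupied, 1, 0) != 0)
        {
            spinner.SpinOnce();
        }
    }

    public void ReleaseLock()
    {
        Interlocked.Exchange(ref this.occupied, 0);
    }
}
```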
There is just a lot less contention/spinning going on.
The stress test runs 20 threads on my machine, all hammering measurements into the same counter. The measurements all write the same tag "DimName1" with one of ten random values.
Before this PR, if two measurements happen on the same random value, one of those threads is going to have to spin for the duration of the value update + the exemplar update. The exemplar update can be very slow. Best case, you wait for a random value and then you don't record. Worst case, you wait for a random value and then you record, which is a bunch of writes.
After this PR, all measurements do an Interlocked operation for the value-record portion, so that spin is eliminated. Then they enter exemplar world and get a thread-local random. If there isn't going to be a record, the thread is done. If there is a record, we now have Environment.ProcessorCount buckets available. Each thread may write to one of those buckets without blocking anything. If some other thread happens to be writing to the same bucket, we just abort the record (we don't spin). Essentially, in this design there is no spinning for a measurement, so the threads spend more time doing work and we get better throughput.
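A minimal sketch of that claim-or-abort offer path (thread-local random, then a non-blocking bucket claim). The names and the sampling rule here are illustrative assumptions, not the PR's actual reservoir code:

```csharp
using System;
using System.Threading;

// Sketch of the no-spin offer path described above (assumed names).
internal sealed class ReservoirSketch
{
    [ThreadStatic]
    private static Random? threadRandom; // per-thread: no contention

    private readonly int[] bucketBusy;    // 0 = free, 1 = being written
    private readonly double[] bucketValues;
    private int measurementsSeen;

    public ReservoirSketch()
    {
        this.bucketBusy = new int[Environment.ProcessorCount];
        this.bucketValues = new double[Environment.ProcessorCount];
    }

    public void Offer(double value)
    {
        // Interlocked bump instead of holding a lock for the whole update.
        int seen = Interlocked.Increment(ref this.measurementsSeen);

        Random random = threadRandom ??= new Random();

        // Classic reservoir sampling: keep the i-th measurement with
        // probability capacity / i. If not sampled, the thread is done.
        int slot = random.Next(seen);
        if (slot >= this.bucketValues.Length)
        {
            return;
        }

        // Try to claim the bucket. If another thread is writing to the
        // same bucket, abort the record instead of spinning.
        if (Interlocked.CompareExchange(ref this.bucketBusy[slot], 1, 0) != 0)
        {
            return;
        }

        this.bucketValues[slot] = value;

        Interlocked.Exchange(ref this.bucketBusy[slot], 0); // release
    }
}
```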
Thanks, that explanation does make sense.
As you can see, my original intuition was that if we are going to need multiple interlocked operations anyway, it would be better to grab a lock equivalent via CAS, do the work normally (simple additions instead of Interlocked.Add and so on), and then release the lock.
I'll also run the stress test on a weaker machine (4 cores) to see if I still see some gains!
Can you add the stress test you used to this PR (we can remove it before merging to avoid bloating the stress tests)?
How about histograms? Even in this PR, Update() is going to enter/exit a critical section, and then for the exemplar we do that again (roughly the shape sketched below). Would we be better off doing the exemplar work inside the already-entered critical section? The default histogram reservoir doesn't have much overhead (there is no random generation, just an update at the already-known bucket index).
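A hypothetical simplification of the update path being discussed; all names here are illustrative, not the repository's actual code:

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the reservoir so the sketch is self-contained.
internal abstract class ExemplarReservoirSketch
{
    public abstract void Offer(
        double value,
        ReadOnlySpan<KeyValuePair<string, object?>> tags,
        int bucketIndex);
}

internal sealed class HistogramPointSketch
{
    private readonly object lockObject = new(); // stands in for AcquireLock/ReleaseLock
    private readonly double[] bucketBounds;
    private readonly long[] runningBucketCounts;
    private readonly ExemplarReservoirSketch? reservoir;
    private double runningSum;

    public HistogramPointSketch(double[] bucketBounds, ExemplarReservoirSketch? reservoir)
    {
        this.bucketBounds = bucketBounds;
        this.runningBucketCounts = new long[bucketBounds.Length + 1];
        this.reservoir = reservoir;
    }

    public void Update(double value, ReadOnlySpan<KeyValuePair<string, object?>> tags)
    {
        // The bucket index is already known before any locking happens.
        int bucketIndex = Array.BinarySearch(this.bucketBounds, value);
        if (bucketIndex < 0)
        {
            bucketIndex = ~bucketIndex;
        }

        lock (this.lockObject) // critical section #1: aggregation update
        {
            this.runningSum += value;
            this.runningBucketCounts[bucketIndex]++;
        }

        // Critical section #2: the reservoir now synchronizes internally,
        // so the offer happens outside the lock above.
        this.reservoir?.Offer(value, tags, bucketIndex);
    }
}
```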
I ran stress tests against the latest version of this PR.
- Counters are improved in all cases.
- Histograms:
  - Improved when there is contention (the lock is held for less time).
  - Slightly degraded when there isn't a lot of contention. This is due to a few extra operations being executed now in the exemplar path. I could special-case something to fix this. But if you look at the overall throughput of histograms vs. counters, they suck 🤣 I would like to look at that in a follow-up and see if I can use the same active/passive buffer tricks to help histograms across the board.
Counters

New version using counter w/ tag-filtering view + exemplars:
- Total Running Time (seconds): 300
- Total Loops: 9,899,082,415
- Average Loops/Second: 32,964,636
- Average CPU Cycles/Loop: 1,727
- GC Total Allocated Bytes: 10,643,640

Old version using counter w/ tag-filtering view + exemplars:
- Total Running Time (seconds): 300
- Total Loops: 5,076,211,789
- Average Loops/Second: 16,909,264
- Average CPU Cycles/Loop: 869
- GC Total Allocated Bytes: 11,098,120

New version using counter w/ exemplars:
- Total Running Time (seconds): 300
- Total Loops: 10,881,621,737
- Average Loops/Second: 36,250,443
- Average CPU Cycles/Loop: 1,573
- GC Total Allocated Bytes: 14,448,912

Old version using counter w/ exemplars:
- Total Running Time (seconds): 300
- Total Loops: 8,958,303,265
- Average Loops/Second: 29,855,139
- Average CPU Cycles/Loop: 1,855
- GC Total Allocated Bytes: 11,546,904

Histograms

New version using histogram w/ tag-filtering view + exemplars:
- Total Running Time (seconds): 300
- Total Loops: 3,069,842,084
- Average Loops/Second: 10,227,489
- Average CPU Cycles/Loop: 3,781
- GC Total Allocated Bytes: 11,073,856

Old version using histogram w/ tag-filtering view + exemplars:
- Total Running Time (seconds): 300
- Total Loops: 2,160,807,746
- Average Loops/Second: 7,197,558
- Average CPU Cycles/Loop: 2,160
- GC Total Allocated Bytes: 16,985,496

New version using histogram w/ exemplars:
- Total Running Time (seconds): 300
- Total Loops: 5,576,901,325
- Average Loops/Second: 18,583,167
- Average CPU Cycles/Loop: 3,001
- GC Total Allocated Bytes: 177,120,704

Old version using histogram w/ exemplars:
- Total Running Time (seconds): 300
- Total Loops: 5,767,724,875
- Average Loops/Second: 19,220,944
- Average CPU Cycles/Loop: 2,733
- GC Total Allocated Bytes: 265,700,816
This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or pushing will instruct the bot to automatically remove the label. This bot runs once per day.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
Changes

ExemplarReservoir
- Offer method signature no longer includes the "index" parameter, which is not part of the spec. Instead it now accepts struct ExemplarMeasurement, which only exposes the value & tags (to match the spec) but also tracks the explicit histogram index as an internal thing.
- Collect method signature no longer includes "actualTags" or "reset", which are not part of the spec. The "reset" flag has become a bool ResetOnCollect property on ExemplarReservoir. Tag filtering is now handled by struct ReadOnlyFilteredTagCollection, which is exposed on Exemplar. Return type changed from Exemplar[] to struct ReadOnlyExemplarCollection.
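Putting those bullets together, the reshaped surface looks roughly like the following. This is a sketch assembled from the description above, not verbatim source:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the reshaped API as described above (approximate shapes).
public abstract class ExemplarReservoir
{
    // Replaces the old "reset" parameter that was passed to Collect.
    public virtual bool ResetOnCollect => false;

    // The "index" parameter is gone; value & tags travel in a struct.
    public abstract void Offer(in ExemplarMeasurement<long> measurement);

    public abstract void Offer(in ExemplarMeasurement<double> measurement);

    // Previously returned Exemplar[] and took "actualTags"/"reset".
    public abstract ReadOnlyExemplarCollection Collect();
}

// Only the value & tags are exposed (to match the spec); the explicit
// histogram bucket index is tracked internally and is not shown here.
public readonly ref struct ExemplarMeasurement<T>
    where T : struct
{
    public ExemplarMeasurement(T value, ReadOnlySpan<KeyValuePair<string, object?>> tags)
    {
        this.Value = value;
        this.Tags = tags;
    }

    public T Value { get; }

    public ReadOnlySpan<KeyValuePair<string, object?>> Tags { get; }
}

// Stand-in declaration so the sketch is self-contained.
public readonly struct ReadOnlyExemplarCollection
{
}
```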
Exemplar
- FilteredTags property is now struct ReadOnlyFilteredTagCollection instead of List<KeyValuePair<string, object?>>?.
- Now exposes long LongValue in addition to double DoubleValue.
SimpleFixedSizeExemplarReservoir
- Default size changed from 1 to Environment.ProcessorCount, which is allowed by the spec.

Concurrency
- ExemplarReservoir methods made safe to be called concurrently (required by the spec). Also fixed a bunch of race conditions with collect. Need to prove this out more, probably using Coyote tests (follow-up).
- MetricPoint had dual implementations of all aggregation types which acquired locks when exemplars were in play. This exploded the size/complexity of the code while also destroying throughput. Now that ExemplarReservoir can be called concurrently, all of that has been removed.

Misc improvements
- ReadOnlyExemplarCollection when enumerated only returns Exemplars which have been updated (Timestamp != default). Previously each exporter had to understand that behavior and check for it (see the consumption sketch after this list).
- … Exemplar.FilteredTags.
- Introduced FixedSizeExemplarReservoir to house the shared logic both SimpleFixedSizeExemplarReservoir & AlignedHistogramBucketExemplarReservoir share.
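A sketch of how an exporter might consume these collections, per the bullets above; enumerator/member details are assumptions for illustration:

```csharp
using System;

// Sketch: consuming the new collection types described above.
internal static class ExemplarWriter
{
    public static void Write(ReadOnlyExemplarCollection exemplars)
    {
        foreach (var exemplar in exemplars)
        {
            // No Timestamp != default check required anymore; the
            // collection only yields exemplars that were updated.
            Console.WriteLine($"{exemplar.Timestamp}: {exemplar.DoubleValue}");

            foreach (var tag in exemplar.FilteredTags) // ReadOnlyFilteredTagCollection
            {
                Console.WriteLine($"  {tag.Key} = {tag.Value}");
            }
        }
    }
}
```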
Benchmarks
When Exemplars are NOT enabled, perf is indistinguishable from before:

Before:
After:

Exemplar benchmarks show slight improvement:

Before:
After:
Those Exemplar benchmarks above are very misleading. The Exemplar implementation before this PR forces all aggregation types into locking mode inside MetricPoint. This destroys throughput for all non-histograms. That really won't show up on benchmarks because that is one thread hammering measurements.

Here are some stress test results:
Before:
After:
That is running the stress test using a Counter<long> + a view filtering tags + AlwaysOnExemplarFilter + an OtlpExporter tweaked to send exemplars for counters, transmitting every 1000ms. A hedged sketch of that wiring follows.
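A hypothetical reconstruction of that setup; the meter/instrument names are assumptions, and the "tweaked to send exemplars for counters" exporter change is not shown:

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Assumed names: "StressMeter", "stress_counter".
var meter = new Meter("StressMeter");
var counter = meter.CreateCounter<long>("stress_counter");

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter(meter.Name)
    .SetExemplarFilter(new AlwaysOnExemplarFilter())
    // View that keeps only one tag key, exercising the tag-filtering path.
    .AddView("stress_counter", new MetricStreamConfiguration
    {
        TagKeys = new[] { "DimName1" },
    })
    // Stock OtlpExporter options with a 1000ms export interval; the
    // counter-exemplar tweak from the comment is not reproduced here.
    .AddOtlpExporter((exporterOptions, readerOptions) =>
    {
        readerOptions.PeriodicExportingMetricReaderOptions
            .ExportIntervalMilliseconds = 1000;
    })
    .Build();

// Example measurement matching the stress test's tag shape.
counter.Add(1, new KeyValuePair<string, object?>("DimName1", "value0"));
```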
This new Exemplar implementation blows away the old one due to the concurrency changes. The reduction in allocations is due to the FilteredTags improvements.

Merge requirement checklist
- CHANGELOG.md files updated for non-trivial changes