
[sdk-metrics] Clean up locking in MetricPoint #5368

Merged
CodeBlanch merged 7 commits into open-telemetry:main from sdk-metricpoint-lock-fixes on Feb 20, 2024

Conversation

@CodeBlanch (Member) commented Feb 16, 2024

Changes

  • Use the isCriticalSectionOccupied field on MetricPointOptionalComponents for both types of histograms instead of maintaining two additional flags (a little less memory consumed; see the sketch below).
  • Try to obtain the lock cheaply first, before initializing SpinWait.
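
As a rough sketch of the first bullet (simplified names and layout, not the actual SDK code):

    internal sealed class MetricPointOptionalComponents
    {
        // One lock flag shared by both histogram variants, instead of a
        // separate flag on HistogramBuckets and on
        // Base2ExponentialBucketHistogram. Only one of the two histogram
        // references below is ever non-null for a given MetricPoint.
        internal int IsCriticalSectionOccupied;

        internal HistogramBuckets? HistogramBuckets;
        internal Base2ExponentialBucketHistogram? Base2ExponentialBucketHistogram;
    }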

Merge requirement checklist

  • CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)

@CodeBlanch added the metrics (Metrics signal related) label Feb 16, 2024
@CodeBlanch CodeBlanch requested a review from a team February 16, 2024 20:13

codecov bot commented Feb 16, 2024

Codecov Report

Attention: 31 lines in your changes are missing coverage. Please review.

Comparison is base (6250307) 83.38% compared to head (2530c2a) 83.35%.
Report is 90 commits behind head on main.


@@            Coverage Diff             @@
##             main    #5368      +/-   ##
==========================================
- Coverage   83.38%   83.35%   -0.03%     
==========================================
  Files         297      277      -20     
  Lines       12531    11928     -603     
==========================================
- Hits        10449     9943     -506     
+ Misses       2082     1985      -97     
Flag                             Coverage Δ
unittests                        ?
unittests-Solution-Experimental  83.33% <61.72%> (?)
unittests-Solution-Stable        83.02% <61.72%> (?)

Flags with carried forward coverage won't be shown.

Files                                                   Coverage Δ
...lemetry/Metrics/Base2ExponentialBucketHistogram.cs   100.00% <ø> (ø)
src/OpenTelemetry/Metrics/HistogramBuckets.cs           100.00% <ø> (ø)
...Telemetry/Metrics/MetricPointOptionalComponents.cs   88.88% <100.00%> (+11.11%) ⬆️
src/OpenTelemetry/Metrics/MetricPoint.cs                68.07% <56.94%> (-0.40%) ⬇️

... and 48 files with indirect coverage changes

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void AcquireLock()
    {
        if (Interlocked.CompareExchange(ref this.isCriticalSectionOccupied, 1, 0) != 0)
Member commented:

Let's add some code comments to help with intuition/future readers? Something like:

isCriticalSectionOccupied = 0 or 1 indicates whether some thread currently occupies the critical section.
The CAS is to do the following atomically:

If the original value was 0 (i.e. unoccupied), set it to 1 (i.e. occupied) and return (lock successfully acquired).
If the original value was 1 (i.e. already occupied), don't change it, but spin and keep trying indefinitely until we succeed.
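
A sketch of how those comments might read in place (assuming the slow path is the AcquireLockRare helper that appears later in this thread):

    // isCriticalSectionOccupied: 0 = unoccupied, 1 = occupied by some thread.
    // The CAS performs the following atomically:
    //  - If the value was 0 (unoccupied), set it to 1 (occupied); the lock
    //    was acquired and we return immediately.
    //  - If the value was 1 (already occupied), leave it unchanged and take
    //    the slow path, which spins until the lock is acquired.
    if (Interlocked.CompareExchange(ref this.isCriticalSectionOccupied, 1, 0) != 0)
    {
        this.AcquireLockRare();
    }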

@cijothomas (Member) commented:

@CodeBlanch The volatile switch seems unnecessary. The benchmark proves it is faster iff the state was 0. But a thread would never release the lock unless the state was 1...

    {
        if (Interlocked.Exchange(ref this.isCriticalSectionOccupied, 1) != 0)
        {
            this.AcquireLockRare();
Contributor commented:

Would this really be rare? 😄

CodeBlanch (Member, Author) replied:

Define rare 😄

Let me explain why I split this method up. Before we had this:

        var sw = default(SpinWait);
        while (Interlocked.Exchange(ref isCriticalSectionOccupied, 1) != 0)
        {
            sw.SpinOnce();
        }

C# initializes all locals by default as part of its safety system, so we pay for the initialization of sw even though we might not use it. The split essentially makes acquisition optimistic: if we can get the lock immediately, we're done. Otherwise we jump into the rarer case where we pay for the SpinWait because we know we need it.

For the thread that got the lock, it saved a tiny bit of CPU so it can release the lock even faster. For threads that need to wait, we're going to spin them anyway so it seems OK to me to ask them to do an extra call/jump. Thoughts?
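
For reference, a sketch of the split version pieced together from the snippets in this thread (the NoInlining attribute and the exact body of AcquireLockRare are assumptions):

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void AcquireLock()
    {
        // Optimistic fast path: no SpinWait is initialized when the lock is free.
        if (Interlocked.Exchange(ref this.isCriticalSectionOccupied, 1) != 0)
        {
            this.AcquireLockRare();
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private void AcquireLockRare()
    {
        // Slow path: only contended threads pay for the SpinWait.
        var sw = default(SpinWait);
        do
        {
            sw.SpinOnce();
        }
        while (Interlocked.Exchange(ref this.isCriticalSectionOccupied, 1) != 0);
    }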

Contributor replied:

Define rare 😄

I think having apps where multiple threads update the same MetricPoint would be a common thing.

For the thread that got the lock, it saved a tiny bit of CPU so it can release the lock even faster.

I don't think splitting the acquire-lock code would speed up the release. In the existing code, the thread that acquires the lock would have already created the SpinWait struct before acquiring the lock, so nothing changes in what happens after a thread takes the lock. Or would the speed-up come from using if vs. while?
 
Having said all that, I'm okay with this change to avoid creating SpinWait struct for the threads that need to wait.

@CodeBlanch (Member, Author) commented:

@cijothomas

The volatile switch seems unnecessary. The benchmark proves it is faster iff the state was 0. But a thread would never release the lock unless the state was 1...

I think you are misreading the benchmark.

These are testing calls to Interlocked.Exchange and then either Volatile.Write or another Interlocked.Exchange to release the lock:

Method            LockState  Mean      Error      StdDev
ExchangeVolatile  0          3.950 ns  0.0382 ns  0.0339 ns
Exchange          0          7.181 ns  0.0737 ns  0.0576 ns

These are testing calls to Interlocked.Exchange and then no other calls are made because the lock was already taken:

Method            LockState  Mean      Error      StdDev
ExchangeVolatile  1          3.739 ns  0.0541 ns  0.0506 ns
Exchange          1          3.708 ns  0.0701 ns  0.0547 ns

@utpilla (Contributor) commented Feb 16, 2024

  • Use the isCriticalSectionOccupied field on MetricPointOptionalComponents for both types of histograms instead of maintaining two additional flags (a little less memory consumed).

I don't think this reduces memory consumption. For any given MetricPoint, you either initialize HistogramBuckets or Base2ExponentialBucketHistogram. We should never initialize both.

  • Try to obtain the lock cheaply first before initializing SpinWait.

Does it really make locking cheaper? Creating a simple default SpinWait struct shouldn't cause any measurable impact.

  • Switch ReleaseLock from Interlocked.Exchange to Volatile.Write.

Volatile.Write comes with weaker guarantees than Interlocked.Exchange. Are you sure that our logic would not be affected by these weaker guarantees?

@cijothomas (Member) commented:

Use the isCriticalSectionOccupied field on MetricPointOptionalComponents for both types of histograms instead of maintaining two additional flags (a little less memory consumed).

I don't think this reduces memory consumption. For any given MetricPoint, you either initialize HistogramBuckets or Base2ExponentialBucketHistogram. We should never initialize both.

This PR removes the locking field from the histogram types and relies on the one on MetricPointOptionalComponents, since MetricPointOptionalComponents already has that field and always exists for any type of histogram.

@CodeBlanch (Member, Author) commented:

I don't think this reduces memory consumption. For any given MetricPoint, you either initialize HistogramBuckets or Base2ExponentialBucketHistogram. We should never initialize both.

If you are going to end up with either HistogramBuckets or Base2ExponentialBucketHistogram, you already have a MetricPointOptionalComponents (which already has an isCriticalSectionOccupied). So the two objects you get are either MetricPointOptionalComponents + HistogramBuckets or MetricPointOptionalComponents + Base2ExponentialBucketHistogram. For histograms we'll save 4 bytes × the number of metric points.

@utpilla (Contributor) commented Feb 16, 2024

This PR removes the locking field from the histogram types and relies on the one on MetricPointOptionalComponents, since MetricPointOptionalComponents already has that field and always exists for any type of histogram.

If you are going to end up with either HistogramBuckets or Base2ExponentialBucketHistogram, you already have a MetricPointOptionalComponents (which already has an isCriticalSectionOccupied). So the two objects you get are either MetricPointOptionalComponents + HistogramBuckets or MetricPointOptionalComponents + Base2ExponentialBucketHistogram. For histograms we'll save 4 bytes × the number of metric points.

My bad! I missed the already existing IsCriticalSectionOccupied field on the MetricPointOptionalComponents class.

@CodeBlanch (Member, Author) commented:

@utpilla

Volatile.Write comes with weaker guarantees than Interlocked.Exchange. Are you sure that our logic would not be affected by these weaker guarantees?

I think it is good but this always comes up 😄

From the docs: https://learn.microsoft.com/dotnet/api/system.threading.volatile

Volatile reads and writes ensure that a value is read or written to memory and not cached (for example, in a processor register). Thus, you can use these operations to synchronize access to a field that can be updated by another thread or by hardware.

That is really what we want.

Interlocked.Exchange gives you a read + write as one atomic operation. But for releasing the lock we don't need the read part, only the write. I think that's why it is faster.
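
In code, the two release variants being compared (a sketch; the enclosing ReleaseLock method is elided):

    // Release as it was: an atomic read + write (the old value is read back
    // and then discarded).
    Interlocked.Exchange(ref this.isCriticalSectionOccupied, 0);

    // Release as proposed here: just the write, with volatile semantics.
    Volatile.Write(ref this.isCriticalSectionOccupied, 0);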

@utpilla (Contributor) commented Feb 17, 2024

From the docs: https://learn.microsoft.com/dotnet/api/system.threading.volatile

Volatile reads and writes ensure that a value is read or written to memory and not cached (for example, in a processor register). Thus, you can use these operations to synchronize access to a field that can be updated by another thread or by hardware.

That is really what we want.

Interlocked.Exchange gives you a read + write as one atomic operation. But for releasing the lock we don't need the read part, only the write. I think that's why it is faster.

@CodeBlanch There are a few other things mentioned in the same doc:

  1. Reordering of memory operations

A volatile write operation prevents earlier memory operations on the thread from being reordered to occur after the volatile write. A volatile read operation prevents later memory operations on the thread from being reordered to occur before the volatile read.

Since we are using a volatile write, later memory operations on the thread could be reordered to occur before the volatile write. We have to review our code to see if this reordering could affect the MetricPoint updates or snapshot. The reordering may or may not be an issue for us, but we need to review that carefully first. A concrete example of how the reorderings may or may not affect us would probably help.

  2. The second example on the doc says:

The volatile write to y does not guarantee that a following volatile read of y on a different processor will see the updated value.

This sounds problematic. In our case, this would be equivalent to having a thread release the lock but another thread on a different processor not finding the lock released. The doc doesn't clearly mention whether this would only be a temporary issue and another thread on a different processor would eventually see it updated.

@CodeBlanch (Member, Author) commented:

@utpilla

I put ReleaseLock back to Interlocked.Exchange.

Ran it by @stephentoub. If I understand him correctly, Volatile.Write guarantees nothing will move out of the lock, so it is pretty safe to switch to Volatile.Write. But it doesn't guarantee other things won't move into the lock. If things move into the lock, it could be held longer. Since the goal here was to reduce the time we hold the lock, I decided to put it back so it is at least deterministic.

PS: Another interesting tidbit Stephen shared with me is the movement isn't a concern on x86/64 (due to its strong memory model) but could be a concern on ARM.
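
A hypothetical sketch of the concern (field names invented for illustration):

    this.runningValue += value;                             // protected update, inside the lock
    Volatile.Write(ref this.isCriticalSectionOccupied, 0);  // release
    this.unrelatedWork++;                                   // logically after the lock

    // A volatile write only prevents *earlier* operations from moving after it.
    // The later increment is still free to be reordered *before* the release,
    // i.e. into the critical section, lengthening the time the lock is held.
    // Interlocked.Exchange is a full barrier, so nothing moves across it in
    // either direction.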

@cijothomas (Member) left a comment:

LGTM.

@utpilla (Contributor) commented Feb 20, 2024

From the docs: https://learn.microsoft.com/dotnet/api/system.threading.volatile

  2. The second example on the doc says:

The volatile write to y does not guarantee that a following volatile read of y on a different processor will see the updated value.

@stephentoub The doc doesn't explicitly mention whether this updated value will eventually (in subsequent reads) be made available to another processor. Is it fair to assume that the unavailability of the updated value would only be a transient issue?

@stephentoub commented:

Is it fair to assume that the unavailability of the updated value would only be a transient issue?

Yes. And there's not really any difference in that regard between a volatile or interlocked write.

@CodeBlanch (Member, Author) commented:

The doc is strange. Doesn't this seem contradictory?

[screenshot of the System.Threading.Volatile docs]

@stephentoub commented:

doesn't this seem contradictory?

Contradictory how?

@CodeBlanch CodeBlanch merged commit 4ae0aaf into open-telemetry:main Feb 20, 2024
37 checks passed
@CodeBlanch CodeBlanch deleted the sdk-metricpoint-lock-fixes branch February 20, 2024 21:45
@CodeBlanch (Member, Author) commented:

Contradictory how?

The content in the "Note" seems to say you won't see a cached value on some other processor.

@stephentoub commented:

Contradictory how?

The content in the "Note" seems to say you won't see a cached value on some other processor.

The "is read or written to memory and not cached" is from the point of view of the thread/core doing the reading/writing.

@utpilla (Contributor) commented Feb 20, 2024

@stephentoub A few follow-up questions:

Is it fair to assume that the unavailability of the updated value would only be a transient issue?

Yes. And there's not really any difference in that regard between a volatile or interlocked write.

Interlocked methods are supposed to perform atomic operations, right? So they shouldn't lead to any intermediate states where one thread updates the value of a variable but another thread (that might or might not be on the same processor) doesn't see the updated value?

The "is read or written to memory and not cached" is from the point of view of the thread/core doing the reading/writing.

The note seems contradictory because we have been assuming that if an update is written to memory (instead of just to a processor register), then it should be discoverable/available to every other thread. Is that assumption wrong?

@stephentoub commented:

So they shouldn't lead to any intermediate states where one thread updates the value of a variable but another thread (that might or might not be on the same processor) doesn't see the updated value?

If both cores are using interlocked/atomic operations to mutate the same shared value, then they need to be coordinating, yes. That doesn't say anything, though, about one thread manipulating a memory location with an interlocked and another thread just reading that value... there's no guarantee that that other thread/core will see that value any sooner with it having been written with some atomic interlocked exchange than having been written with a volatile write. Imagine that other thread is just spinning in a while (*memoryLocation) { }... the value could have been enregistered so it doesn't see any changes, it might still be reading from a cache, the compiler may even have hoisted the read and changed the loop to be while (true).

then it should be discoverable/available to every other thread

And what if that other thread is reading the value from a register?
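
A sketch of the failure mode being described (hypothetical type and field names):

    using System.Threading;

    public class SpinExample
    {
        private int flag = 1; // cleared by another thread

        public void WaitPlainRead()
        {
            // Without a volatile read, the JIT may keep 'flag' in a register or
            // hoist the read out of the loop, effectively turning this into
            // while (true) { } no matter what another thread writes.
            while (this.flag != 0)
            {
            }
        }

        public void WaitVolatileRead()
        {
            // Volatile.Read forces a fresh read of the field on every iteration.
            while (Volatile.Read(ref this.flag) != 0)
            {
            }
        }
    }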

@utpilla (Contributor) commented Feb 21, 2024

@stephentoub

So they shouldn't lead to any intermediate states where one thread updates the value of a variable but another thread (that might or might not be on the same processor) doesn't see the updated value?

If both cores are using interlocked/atomic operations to mutate the same shared value, then they need to be coordinating, yes. That doesn't say anything, though, about one thread manipulating a memory location with an interlocked and another thread just reading that value

then it should be discoverable/available to every other thread

And what if that other thread is reading the value from a register?

Okay so it looks like both the questions that I asked are kind of pointing to the same scenario:

Let's say thread A writes to a variable using Interlocked or Volatile methods, so the write goes to memory. Now, if another thread B (which may or may not be on the same processor that made the write) tries to read the data, it may or may not observe the latest update depending on how it reads:

  1. If it reads the data using an Interlocked operation, then it should immediately see the update made by thread A
  2. If it's a simple read without using any synchronization construct, then there are no guarantees whether thread B would see the update by thread A as it might always read the value from its processor register.
  3. What if it's a volatile read, is it guaranteed that thread B should immediately see the update made by thread A?

I tried to summarize my understanding of this issue. Could you please confirm if my understanding of points 1 and 2 is correct? Also, could you answer what should be expected for point 3?

@stephentoub commented:

In general, there aren't guarantees being made about how quickly data becomes visible to other threads. The guarantees that are made are around the semantics of the operations. Volatile.Read/Write make guarantees about the order in which operations will be visible to other threads; they don't promise that the results of operations will be visible quickly or slowly, but that when they are visible, they'll be visible in the guaranteed order. Let's say you have this code using Volatile:

using System.Threading;

public class C
{
    private int _value;
    
    public int M1() => _value;
    public int M2() => Volatile.Read(ref _value);
}

Here's what SharpLab produces, on x64:

C.M1()
    L0000: mov eax, [rcx+8]
    L0003: ret

C.M2()
    L0000: mov eax, [rcx+8]
    L0003: ret

Note that they're identical. That's because the x64 memory model already prohibits the kinds of reorderings that volatile also prevents, so the JIT doesn't need to emit anything specific for volatile. And as such, there can't be any additional guarantees here about the speed at which data becomes visible to other threads, because if there were, such guarantees would be needed for all writes, or else these wouldn't be identical.

Interlocked.Exchange et al similarly don't make any guarantees about how quickly or slowly data will be visible, but they do make atomicity guarantees around the semantics of the operations (e.g. a new value will be written and the old one returned, atomically), and those guarantees mean synchronization is in play that necessitates certain behavior when interacting with operations being performed on other threads/cores, i.e. you might not see the updated value on another thread unless you perform an operation there where the semantics demand it and force the synchronization for that specific memory location / cache line.

So I'm struggling to answer the questions that are in terms of timing / immediacy.
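
To illustrate the ordering guarantee being described, a classic publish/consume sketch (not code from this PR):

    using System;
    using System.Threading;

    public class Publication
    {
        private int data;
        private int ready;

        public void Publish()
        {
            this.data = 42;                     // ordinary write
            Volatile.Write(ref this.ready, 1);  // the earlier write to 'data'
                                                // cannot be reordered after this
        }

        public void Consume()
        {
            if (Volatile.Read(ref this.ready) == 1) // later reads cannot move
            {                                       // before this read
                // If ready == 1 was observed, data == 42 is guaranteed to be
                // observed too. Nothing is promised about *when* ready becomes
                // visible, only that the writes become visible in this order.
                Console.WriteLine(this.data);
            }
        }
    }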

@utpilla (Contributor) commented Feb 21, 2024

@stephentoub Thanks a lot for that explanation!

you might not see the updated value on another thread unless you perform an operation there where the semantics demand it and force the synchronization for that specific memory location / cache line.

What would these operations be whose semantics demand seeing the updated value (by forcing synchronization for that memory location/cache line)? Would any of these qualify: a lock statement, Interlocked methods, or a volatile read?
