Can an object with `thread_scope` be used by distinct thread groups? #325

jrhemstad · 2022-10-14T01:47:06Z

jrhemstad
Oct 14, 2022
Maintainer

libcu++ introduces the notion of a "thread scope" on its synchronization primitives like atomic, barrier, and pipeline.

A thread scope specifies the kind of threads that can synchronize with each other using a primitive such as an atomic or a barrier.

For example, a `cuda::atomic<int, thread_scope_block>` can be used to synchronize among threads in the same block, but not among threads in different blocks.

However, it is not clear whether an object with a particular scope can or should be reused, at different times, by threads that belong to different instances of that scope (or, as below, by different threads when the scope is a single thread). For example:

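__shared__ cuda::atomic<int, cuda::thread_scope_thread> a;

// Thread 0 constructs the thread-scoped atomic and stores to it.
if (threadIdx.x == 0) {
   new (&a) cuda::atomic<int, cuda::thread_scope_thread>{};
   a.store(42);
}

// Block-wide barrier: thread 0's store happens before thread 1's load.
__syncthreads();

// Thread 1 loads from the same thread-scoped atomic.
if (threadIdx.x == 1) {
   int result = a.load();
}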

Here we have an atomic scoped to a single thread, but used from different threads.

The current specification for thread scopes says that:

The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic at a scope that includes the thread that performed the other operation,

In other words, so long as threads in different scopes don't attempt to concurrently use the same object, there appears to be no problem. The __syncthreads() guarantees that there are no concurrent accesses from threads in different scopes.

Note that this code "works", but is it intentional to allow this behavior?

Consider another example where we have a barrier in global memory, but with block scope. Two different blocks use the same barrier object, but separated by a grid synchronization such that we are guaranteed there are no concurrent accesses from each block.

// block scope barrier in global memory
__device__ cuda::barrier<cuda::thread_scope_block> bar;

__global__ void kernel(){
   // initialize and use barrier on block 0
   if(threadIdx.x == 0 && blockIdx.x == 0){
      init(&bar, 1);   // init takes a pointer to the barrier
      bar.arrive_and_wait();
   }

   // grid-wide sync (requires a cooperative launch)
   cooperative_groups::this_grid().sync();

   // Use barrier on block 1
   if(threadIdx.x == 0 && blockIdx.x == 1){
      bar.arrive_and_wait();
   }
}

The spec says nothing that would exclude this from being a well-defined program. Is this intentional? If not, do we think it is important to allow this behavior?

@ogiroux I'm especially interested in hearing what you think about this.

Answered by jrhemstad on Aug 21, 2023. The accepted answer is the final comment in this thread.

ogiroux · 2022-10-18T23:52:39Z

ogiroux
Oct 18, 2022

I think it's important that it be allowed in C++, so this was intentional when it was written.

Consider the first example with the atomic int, except replace it with a plain non-atomic int instead -- that is obviously correct. I think it's important to be able to explain atomic int as a superset of an int, and not a super-subset of an int.

The barrier case looks different on the surface, but I think it should be allowed that 1) barrier objects can be constructed once in a pool and reused across many blocks in the life of a grid, and 2) the CPU can construct this pool.

TL;DR: I think it would be aberrant if atomic types were somehow weaker than non-atomic ones; thread scopes determine whether you'll get atomic or non-atomic semantics for a given conflict, but they don't change what those semantics mean.
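To make the plain-int comparison concrete, here is a minimal sketch of the first example with the atomic replaced by an ordinary `int`. The kernel name `plain_int_handoff` and the host helper `run_plain_int_handoff` are made up for illustration and are not part of libcu++. The point is only that the two accesses are ordered by `__syncthreads()`, so no atomicity is needed for this hand-off.

```cuda
#include <cuda_runtime.h>

// Thread 0 writes a plain (non-atomic) shared int; thread 1 reads it after a
// block-wide barrier has ordered the two accesses.
__global__ void plain_int_handoff(int* out) {
    __shared__ int a;

    if (threadIdx.x == 0) {
        a = 42;
    }

    // Orders thread 0's write before thread 1's read.
    __syncthreads();

    if (threadIdx.x == 1) {
        *out = a;
    }
}

// Hypothetical host helper: launches one block of two threads and returns
// the value that thread 1 observed. Error checking is omitted for brevity.
int run_plain_int_handoff() {
    int* d_out = nullptr;
    int h_out = -1;
    cudaMalloc(&d_out, sizeof(int));
    plain_int_handoff<<<1, 2>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

If the atomic version were disallowed while this one is fine, the atomic type would be strictly less usable than the `int` it wraps, which is the asymmetry being argued against.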

1 reply

jrhemstad Oct 19, 2022
Maintainer Author

Thanks @ogiroux, that makes sense.

What if there were particular optimizations that could be made if one knew that an object were only going to be used by the thread group that initialized it? How would you enable such optimizations? Perhaps require passing a group argument to the constructor of such an object to make it explicit that the object is scoped to that group?

hcedwar · 2022-10-19T16:35:47Z

hcedwar
Oct 19, 2022

Question is regarding the thread_scope state composed into the atomic.
E.g., atomic<int> vs. atomic<int,thread_scope_block>
Should the following be legal:

/* thread in block 'A' */ 
  atomic<int,thread_scope_block> a ;
  /* other operations */
  a.store(1,memory_order_release);

/* thread in block 'B' */
  while ( 1 != a.load(memory_order_acquire) );
  /* expect visibility to those other operations */

Assertion is that construction of an object (atomic, barrier, ...) with specified thread_scope allows (implies) that the object may be constructed to have state specific to the thread_scope in which it was constructed.
E.g., The pool scenario must destruct/construct (recycle) the raw memory for the new scope.

8 replies

hcedwar Oct 19, 2022

Point of example is this should be illegal because the thread_scope of the release->acquire is too weak.
But if we say that 'a' can be pulled from a pool without destruct/construct (recycle) to associate with the new thread_scope, then the above example comes into question.

griwes Oct 19, 2022
Maintainer

Right. The example is invalid. But you're jumping straight to "it will be valid if you destroy and reconstruct the object" here, which, yes, it will be valid (though not particularly useful), but this argument is skipping the interesting question which is the essence of this discussion: if you do not destroy and reconstruct, and instead insert an appropriately scoped sync between the operations, is it valid then? @ogiroux's reply says that that being valid has indeed been the intent of the specification of thread scopes.

hcedwar Oct 19, 2022

What is the appropriately scoped sync to make it valid? Something which will make the sync via 'a' superfluous?
Original design (intent) for composition of thread_scope into atomic had two options:

atomic<T>::op( ..., memory_order , thread_scope ); /* explicit on every op */
atomic<T,thread_scope>::op( ... , memory_order ); /* implicit on every op */

The "implicit on every op" proposal won the day because the explicit option was assessed to be far too bug-prone: developers would use the atomic without tracking and applying the appropriate thread_scope.
Thus the new form of atomic became stateful with thread_scope - where state was incorporated into type-system vs. member-data.
What we did not address at the original design time was implications of violating the thread_scope state, as per this discussion thread.

griwes Oct 19, 2022
Maintainer

Yes, something external to the atomic itself. Like a CG sync of the group containing both blocks.

To my eyes the spec addresses the violation of the state, in that it makes it a data race when there are concurrent accesses from threads not within the same scope as the one specified.

I think if we could go back in time, an answer to this would be to indeed incorporate explicit group objects into the creation of synchronization objects, and then we could do specialized block scope implementations when the scope is block and the group is this_block or a subset of such. But this would definitely be a big perf regression for current users of block scope barriers in shmem.
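As a rough sketch of the "something external to the atomic" idea, the following reuses one block-scoped atomic from two blocks, with a cooperative-groups grid sync between the two uses so that they are never concurrent. The names `block_atomic`, `reuse_across_blocks`, and `run_reuse_across_blocks` are hypothetical. It assumes a device and launch configuration that support cooperative launches, and it reflects my reading of the intent described in this thread rather than a statement of the specification.

```cuda
#include <cooperative_groups.h>
#include <cuda/atomic>
#include <cuda_runtime.h>
#include <new>

namespace cg = cooperative_groups;

// Hypothetical alias for illustration: an atomic int with block scope.
using block_atomic = cuda::atomic<int, cuda::thread_scope_block>;

__global__ void reuse_across_blocks(block_atomic* a, int* out) {
    cg::grid_group grid = cg::this_grid();

    // Block 0 constructs the atomic in global memory and stores to it.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        new (a) block_atomic{0};
        a->store(1, cuda::std::memory_order_relaxed);
    }

    // External, grid-wide synchronization: block 0's accesses are ordered
    // before block 1's, so the two blocks never touch `a` concurrently.
    grid.sync();

    // Block 1 reads the same block-scoped object.
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        *out = a->load(cuda::std::memory_order_relaxed);
    }
}

// Hypothetical host helper; grid.sync() requires a cooperative launch.
// Error checking is omitted for brevity.
int run_reuse_across_blocks() {
    block_atomic* d_a = nullptr;
    int* d_out = nullptr;
    int h_out = -1;
    cudaMalloc(&d_a, sizeof(block_atomic));
    cudaMalloc(&d_out, sizeof(int));

    void* args[] = {&d_a, &d_out};
    cudaLaunchCooperativeKernel(reinterpret_cast<void*>(reuse_across_blocks),
                                dim3(2), dim3(32), args, 0, nullptr);

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_out);
    return h_out;
}
```

Note that the grid sync, not the atomic, carries the ordering between the blocks here, which is exactly why the sync via `a` looks superfluous in this form.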

hcedwar Oct 19, 2022

To my eyes the spec does not sufficiently address violation of thread_scope state.
Specifically, should the spec permit an implementation that incorporates thread_scope state more deeply than simply forwarding it, implicitly, as a qualifier on the memory_order of each operation?

atomic<T,thread_scope>::op( ... , memory_order ); /* implicit on every op */
  /* implicitly apply as */ atomic<T>::op( ..., memory_order , thread_scope );

jrhemstad · 2023-08-21T14:12:17Z

jrhemstad
Aug 21, 2023
Maintainer Author

For posterity, we decided to place a specific limitation on cuda::barrier<cuda::thread_scope_block> in __shared__ memory such that it can only be used by the threads in the CTA of the thread that constructed it. This was a necessary compromise in order to leverage certain hardware acceleration features.

For all other data structures and state spaces, it remains valid to reuse those objects across different thread groups.

See: NVIDIA/cccl#75
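For reference, a minimal sketch of the usage that stays within this limitation: a block-scoped barrier in `__shared__` memory that is initialized and used only by threads of the same CTA. The kernel name `shared_block_barrier` and the host helper `run_shared_block_barrier` are hypothetical, and the sketch says nothing about which hardware features the implementation actually uses.

```cuda
#include <cuda/barrier>
#include <cuda_runtime.h>

__global__ void shared_block_barrier(int* out) {
    // Block-scoped barrier in shared memory: per the limitation above, only
    // threads of the constructing CTA may use it. (nvcc may warn about
    // dynamic initialization of this shared variable.)
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    __shared__ int value;

    // One thread initializes the barrier for all threads in this block.
    if (threadIdx.x == 0) {
        init(&bar, blockDim.x);
    }
    __syncthreads();  // make the initialized barrier visible to the block

    if (threadIdx.x == 0) {
        value = 42;
    }

    // Every thread of the block arrives; the write above is visible to all
    // threads once the phase completes.
    bar.arrive_and_wait();

    if (threadIdx.x == 1) {
        *out = value;
    }
}

// Hypothetical host helper: one block of 32 threads. Error checking omitted.
int run_shared_block_barrier() {
    int* d_out = nullptr;
    int h_out = -1;
    cudaMalloc(&d_out, sizeof(int));
    shared_block_barrier<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```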
