CUB 2.0.0
Summary
The CUB 2.0.0 major release adds a dependency on libcu++ and contains several breaking changes. These include new diagnostics when inspecting device-only lambdas from the host, an updated method of determining accumulator types for algorithms like Reduce and Scan, and a compile-time replacement for the runtime debug_synchronous
debugging flags.
This release also includes several new features. DeviceHistogram
now supports __half
and better handles various edge cases. WarpReduce
now performs correctly when restricted to a single-thread “warp”, and will use the __reduce_add_sync
accelerated intrinsic (introduced with Ampere) when appropriate. DeviceRadixSort
learned to handle the case where begin_bit == end_bit
.
Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.
Breaking Changes
- #448 Add libcu++ dependency (v1.8.0+).
- #448: The following macros are no longer defined by default. They can be re-enabled by defining
CUB_PROVIDE_LEGACY_ARCH_MACROS
. These will be completely removed in a future release.CUB_IS_HOST_CODE
: Replace withNV_IF_TARGET
.CUB_IS_DEVICE_CODE
: Replace withNV_IF_TARGET
.CUB_INCLUDE_HOST_CODE
: Replace withNV_IF_TARGET
.CUB_INCLUDE_DEVICE_CODE
: Replace withNV_IF_TARGET
.
- #486: CUB’s CUDA Runtime support macros have been updated to support
NV_IF_TARGET
. They are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.CUB_RUNTIME_FUNCTION
: Execution space annotations for functions that invoke CUDA Runtime APIs.- Old behavior:
- RDC enabled: Defined to
__host__ __device__
- RDC not enabled:
- NVCC host pass: Defined to
__host__ __device__
- NVCC device pass: Defined to
__host__
- NVCC host pass: Defined to
- RDC enabled: Defined to
- New behavior:
- RDC enabled: Defined to
__host__ __device__
- RDC not enabled: Defined to
__host__
- RDC enabled: Defined to
- Old behavior:
CUB_RUNTIME_ENABLED
: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior:- RDC enabled: Macro is defined.
- RDC not enabled:
- NVCC host pass: Macro is defined.
- NVCC device pass: Macro is not defined.
CUB_RDC_ENABLED
: New macro, may be combined withNV_IF_TARGET
to replace most usages ofCUB_RUNTIME_ENABLED
. Behavior:- RDC enabled: Macro is defined.
- RDC not enabled: Macro is not defined.
- #509: A compile-time error is now emitted when a
__device__
-only lambda’s return type is queried from host code (requires libcu++ ≥ 1.9.0).- Due to limitations in the CUDA programming model, the result of this query is unreliable, and will silently return an incorrect result. This leads to difficult to debug errors.
- When using libcu++ 1.9.0, an error will be emitted with information about work-arounds:
- Use a named function object with a
__device__
-only implementation ofoperator()
. - Use a
__host__ __device__
lambda. - Use
cuda::proclaim_return_type
(Added in libcu++ 1.9.0)
- Use a named function object with a
- #509: Use the result type of the binary reduction operator for accumulating intermediate results in the
DeviceReduce
algorithm, following guidance from http://wg21.link/P2322R6.- This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
- In addition to the behavioral changes, the interfaces for the
Dispatch*Reduce
layer have changed:DispatchReduce
:- Now accepts accumulator type as last parameter.
- Now accepts initializer type instead of output iterator value type.
- Constructor now accepts
init
as initial type instead of output iterator value type.
DispatchSegmentedReduce
:- Accepts accumulator type as last parameter.
- Accepts initializer type instead of output iterator value type.
- Thread operators now accept parameters using different types:
Equality
,Inequality
,InequalityWrapper
,Sum
,Difference
,Division
,Max
,ArgMax
,Min
,ArgMin
. ThreadReduce
now accepts accumulator type and uses a different type forprefix
.
- #511: Use the result type of the binary operator for accumulating intermediate results in the
DeviceScan
,DeviceScanByKey
, andDeviceReduceByKey
algorithms, following guidance from http://wg21.link/P2322R6.- This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
- In addition to the behavioral changes, the interfaces for the
Dispatch
layer have changed:DispatchScan
now accepts accumulator type as a template parameter.DispatchScanByKey
now accepts accumulator type as a template parameter.DispatchReduceByKey
now accepts accumulator type as the last template parameter.
- #527: Deprecate the
debug_synchronous
flags on device algorithms.- This flag no longer has any effect. Define
CUB_DEBUG_SYNC
during compilation to enable these checks. - Moving this option from run-time to compile-time avoids the compilation overhead of unused debugging paths in production code.
- This flag no longer has any effect. Define
New Features
- #514: Support
__half
inDeviceHistogram
. - #516: Add support for single-threaded invocations of
WarpReduce
. - #516: Use
__reduce_add_sync
hardware acceleration forWarpReduce
on supported architectures.
Bug Fixes
- #481: Fix the device-wide radix sort implementations to simply copy the input to the output when
begin_bit == end_bit
. - #487: Fix
DeviceHistogram::Even
for a variety of edge cases:- Bin ids are now correctly computed when mixing different types for
SampleT
andLevelT
. - Bin ids are now correctly computed when
LevelT
is an integral type and the number of levels does not evenly divide the level range.
- Bin ids are now correctly computed when mixing different types for
- #508: Ensure that
temp_storage_bytes
is properly set in theAdjacentDifferenceCopy
device algorithms. - #508: Remove excessive calls to the binary operator given to the
AdjacentDifferenceCopy
device algorithms. - #533: Fix debugging utilities when RDC is disabled.
Other Enhancements
- #448: Removed special case code for unsupported CUDA architectures.
- #448: Replace several usages of
__CUDA_ARCH__
with<nv/target>
to handle host/device code divergence. - #448: Mark unused PTX arch parameters as legacy.
- #476: Enabled additional debug logging for the onesweep radix sort implementation. Thanks to @canonizer for this contribution.
- #480: Add
CUB_DISABLE_BF16_SUPPORT
to avoid including thecuda_bf16.h
header or using the__nv_bfloat16
type. - #486: Add debug log messages for post-kernel debug synchronizations.
- #490: Clarify documentation for in-place usage of
DeviceScan
algorithms. - #494: Clarify documentation for in-place usage of
DeviceHistogram
algorithms. - #495: Clarify documentation for in-place usage of
DevicePartition
algorithms. - #499: Clarify documentation for in-place usage of
Device*Sort
algorithms. - #500: Clarify documentation for in-place usage of
DeviceReduce
algorithms. - #501: Clarify documentation for in-place usage of
DeviceRunLengthEncode
algorithms. - #503: Clarify documentation for in-place usage of
DeviceSelect
algorithms. - #518: Fix typo in
WarpMergeSort
documentation. - #519: Clarify segmented sort documentation regarding the handling of elements that are not included in any segment.