- Last time
- A three-stop journey through the evolution of the CUDA memory model
- Zero-copy (Z-C) accesses on the host; the UVA milestone; the unified memory model, which allows the use of managed memory
- Today
- GPU computing, from a distance (via thrust & CUB)
- Motivation
- Increase programmer productivity
- Do not sacrifice execution speed
- What is thrust?
- A template library for parallel computing on GPU and CPU
- Heavy use of C++ containers
- Provides ready-to-use algorithms
- To avoid name collisions, use the thrust:: vs. std:: namespaces
- Two vector containers: host_vector and device_vector
- Just like those in the C++ STL
- Manage both host & device memory
- Auto allocation & deallocation
- Iterators: act like pointers into the vector containers
- Can be converted to raw pointers (e.g., with thrust::raw_pointer_cast)
- Raw pointers can also be wrapped with device_ptr
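
A minimal sketch of the container/iterator/pointer interplay; the kernel name scaleKernel is just illustrative:

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <cstdio>

__global__ void scaleKernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    thrust::host_vector<float> h(8, 1.0f);   // host allocation, auto-managed
    thrust::device_vector<float> d = h;      // copies host -> device

    // Iterator -> raw pointer, usable in a hand-written kernel
    float* raw = thrust::raw_pointer_cast(d.data());
    scaleKernel<<<1, 8>>>(raw, (int)d.size());

    // Raw pointer -> thrust, by wrapping it with device_ptr
    thrust::device_ptr<float> wrapped(raw);

    h = d;                                   // copies device -> host
    float first = h[0];
    printf("h[0] = %f\n", first);            // 2.0
    return 0;
}
```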
- Element-wise operations
- for_each, transform, gather, scatter
- Example: SAXPY, implemented as a functor applied via transform (sketch below)
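
The classic SAXPY example (y = a*x + y): the scale factor is captured in a functor, which thrust::transform applies element-wise to the two sequences:

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// y <- a*x + y, expressed as a binary functor
struct saxpy_functor {
    const float a;
    saxpy_functor(float a_) : a(a_) {}
    __host__ __device__ float operator()(const float& x, const float& y) const {
        return a * x + y;
    }
};

int main() {
    thrust::device_vector<float> x(1 << 20, 1.0f);
    thrust::device_vector<float> y(1 << 20, 2.0f);
    // Applies the functor to each (x[i], y[i]) pair, writing back into y
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(2.0f));
    // now every y[i] == 2*1 + 2 == 4
    return 0;
}
```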
- Reductions
- reduce, inner_product, reduce_by_key
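
A short sketch of the reduction primitives: reduce with thrust::plus computes a sum, and inner_product computes a dot product in one call:

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/inner_product.h>
#include <thrust/functional.h>

int main() {
    thrust::device_vector<float> x(1000, 1.0f);
    thrust::device_vector<float> y(1000, 2.0f);

    // Sum of all elements (init = 0, op = plus)
    float sum = thrust::reduce(x.begin(), x.end(), 0.0f, thrust::plus<float>());

    // Dot product: sum over x[i] * y[i]
    float dot = thrust::inner_product(x.begin(), x.end(), y.begin(), 0.0f);

    (void)sum; (void)dot;  // sum == 1000, dot == 2000
    return 0;
}
```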
- Prefix sums (scans)
- inclusive_scan, inclusive_scan_by_key
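
A minimal scan sketch: inclusive_scan replaces each element with the running sum up to and including it:

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main() {
    thrust::device_vector<int> d(4);
    d[0] = 1; d[1] = 2; d[2] = 3; d[3] = 4;

    // In-place inclusive prefix sum: {1, 2, 3, 4} -> {1, 3, 6, 10}
    thrust::inclusive_scan(d.begin(), d.end(), d.begin());
    return 0;
}
```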
- Sorting
- sort, stable_sort, sort_by_key
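
A sort_by_key sketch: the values are permuted along with the keys, which is handy for sorting structure-of-arrays data:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main() {
    thrust::device_vector<int>   keys(4);
    thrust::device_vector<float> vals(4);
    keys[0] = 3; keys[1] = 1; keys[2] = 4; keys[3] = 2;
    vals[0] = 30.f; vals[1] = 10.f; vals[2] = 40.f; vals[3] = 20.f;

    // Sorts keys and reorders vals to match:
    // keys -> {1, 2, 3, 4}, vals -> {10, 20, 30, 40}
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
    return 0;
}
```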
- Zipping
- Takes in multiple distinct sequences and zips them into a single sequence of tuples
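
A sketch using thrust::make_zip_iterator; the sum_tuple functor (an illustrative name) consumes one (x, y) tuple per element instead of two separate sequences:

```cpp
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/transform.h>

// Unary functor over the zipped sequence of (x, y) tuples
struct sum_tuple {
    __host__ __device__ float operator()(const thrust::tuple<float, float>& t) const {
        return thrust::get<0>(t) + thrust::get<1>(t);
    }
};

int main() {
    thrust::device_vector<float> x(100, 1.0f);
    thrust::device_vector<float> y(100, 2.0f);
    thrust::device_vector<float> z(100);

    // zip_iterator presents (x[i], y[i]) as a single sequence of tuples
    auto first = thrust::make_zip_iterator(thrust::make_tuple(x.begin(), y.begin()));
    auto last  = thrust::make_zip_iterator(thrust::make_tuple(x.end(),   y.end()));
    thrust::transform(first, last, z.begin(), sum_tuple());  // z[i] = x[i] + y[i]
    return 0;
}
```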
- Fusing
- Just like zipping, but it reorganizes computation (instead of data) for efficient thrust processing
- Increases the arithmetic intensity
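
One common fusion idiom is thrust::transform_reduce, which applies the functor and reduces in a single pass, so the intermediate sequence never touches global memory. A sketch computing a sum of squares:

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

struct square {
    __host__ __device__ float operator()(const float& x) const { return x * x; }
};

int main() {
    thrust::device_vector<float> x(1000, 2.0f);

    // Unfused: transform into a temp vector, then reduce -> two passes over memory.
    // Fused: transform_reduce squares and sums in one pass, with no temp vector.
    float sum_sq = thrust::transform_reduce(x.begin(), x.end(),
                                            square(), 0.0f,
                                            thrust::plus<float>());
    (void)sum_sq;  // 1000 * 4 == 4000
    return 0;
}
```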
- Not covered in class
- CUB: CUDA UnBound
- CUB is on GitHub
- thrust is built on top of CUB
- What CUB does
- Parallel primitives
- Warp-wide "collective" primitives
- Block-wide "collective" primitives
- Device-wide primitives
- Utilities
- Fancy iterators
- Thread and thread block I/O
- PTX intrinsics
- Device, kernel, and storage management
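
A minimal sketch of a block-wide collective, using cub::BlockReduce to sum 128 values within one thread block (blockSumKernel is an illustrative name):

```cpp
#include <cub/cub.cuh>
#include <cstdio>

// One thread block cooperatively reduces 128 values
__global__ void blockSumKernel(const int* in, int* out) {
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int my_val = in[threadIdx.x];
    int block_sum = BlockReduce(temp_storage).Sum(my_val);  // valid in thread 0

    if (threadIdx.x == 0) *out = block_sum;
}

int main() {
    int *d_in, *d_out;
    cudaMalloc(&d_in, 128 * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));

    int h_in[128];
    for (int i = 0; i < 128; ++i) h_in[i] = 1;
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    blockSumKernel<<<1, 128>>>(d_in, d_out);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("block sum = %d\n", h_out);  // 128

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```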