Lecture 15: CUDA Case Studies. (3) Parallel Prefix Scan on the GPU. Using Multiple Streams in CUDA.

Lecture Summary

Last time
- Case studies: parallel reduction on the GPU & 1D convolution
- Looking beyond today: some more GPU computing feature, but looking for a while into optimization features
Today
- One more cast study: parallel prefix scan
- Using streams in GPU computing: increasing problem size; improving execution speeds

Convoluted, but O(N)
Balanced trees: A common parallel algorithm pattern
- Upsweep from roots to the main trunk, and then down sweep from trunk to root
- "Tree": Just a concept--the actual data structure is not used

A CUDA-enabled GPU has 2 engines
- An execution engine
- A copy engine (which contains 2 sub-engines that can work simultaneously)
  - A H2D copy sub-engine
  - A D2H copy sub-engine
Async execution
- Examples: Kernel launches, D2D mem copies, mem copies by functions with the Async suffix, etc
Overlapping Host <--> Device data transfer with device execution
- Issue: The device execution stack is FIFO
  - Addressed by the usage of CUDA "streams"
Concurrency can be managed through streams
- Concurrency means one of two things:
  - The copy and the execution engines of GPU working at the same time
  - Several different kernels being executed at the same time on the GPU
A stream is a sequence of CUDA commands issued by the host that executes on the GPU in issue-order
- CUDA operations in different streams may run concurrently
- CUDA operations from different streams may be interleaved
As soon as a CUDA function is invoked, a default stream (stream 0) is created
Create using cudaStreamCreate(), destroy using cudaStreamDestroy()