- Last time
- Case studies: parallel reduction on the GPU & 1D convolution
- Looking beyond today: a few more GPU computing features, then an extended look at optimization
- Today
- One more case study: parallel prefix scan
- Using streams in GPU computing: increasing problem size; improving execution speed
- Simple, but suboptimal: O(N*log2(N)) work (sketched below)
- Convoluted, but O(N) work (see the balanced-tree sketch further down)
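A minimal sketch of the simple variant, in the style of a Hillis-Steele inclusive scan; the kernel name and the single-block restriction (n a power of two, at most the block size) are illustrative assumptions, not from the slides:

```cuda
// Hypothetical single-block inclusive scan (Hillis-Steele style).
// Each of the log2(n) passes performs up to n additions, which is
// where the O(N*log2(N)) total work comes from.
__global__ void naiveScan(float *out, const float *in, int n) {
    extern __shared__ float temp[];        // double buffer: 2*n floats
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[tid] = in[tid];                   // buffer 0 holds the input
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;                   // swap the double-buffer roles
        pin  = 1 - pout;
        if (tid >= offset)
            temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
        else
            temp[pout * n + tid] = temp[pin * n + tid];
        __syncthreads();
    }
    out[tid] = temp[pout * n + tid];       // inclusive prefix sum
}
// Launch sketch: naiveScan<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);
```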
- Balanced trees: A common parallel algorithm pattern
- Up-sweep from the leaves to the root, then down-sweep from the root back to the leaves (see the sketch after this list)
- "Tree": Just a concept--the actual data structure is not used
- A CUDA-enabled GPU has 2 engines
- An execution engine
- A copy engine (which contains 2 sub-engines that can work simultaneously; see the capability query after this list)
- An H2D copy sub-engine
- A D2H copy sub-engine
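The engine counts can be checked at runtime with cudaGetDeviceProperties(); a small query sketch, assuming device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query how many copy (async) engines the device has; on devices reporting
// asyncEngineCount = 2, an H2D and a D2H transfer can be in flight at once.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: asyncEngineCount = %d, concurrentKernels = %d\n",
           prop.name, prop.asyncEngineCount, prop.concurrentKernels);
    return 0;
}
```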
- Async execution
- Examples: kernel launches, D2D memory copies, memory copies by functions with the Async suffix, etc. (see the sketch below)
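A short sketch of this asynchronous behavior; myKernel and the sizes are placeholders. Note that the host buffer must be pinned (page-locked) for cudaMemcpyAsync() to be truly asynchronous:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *d, int n) {        // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *h_data, *d_data;
    cudaMallocHost(&h_data, N * sizeof(float));    // pinned host memory
    cudaMalloc(&d_data, N * sizeof(float));

    // Both calls below return control to the host immediately; the GPU
    // processes them in issue order on the default stream.
    cudaMemcpyAsync(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
    myKernel<<<(N + 255) / 256, 256>>>(d_data, N);

    // ... the host is free to do other CPU work here ...

    cudaDeviceSynchronize();                       // block until the GPU is done
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```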
- Overlapping Host <--> Device data transfer with device execution
- Issue: the device's default work queue is FIFO
- Addressed through the use of CUDA "streams"
- Concurrency can be managed through streams
- Concurrency means one of two things:
- The copy and the execution engines of GPU working at the same time
- Several different kernels being executed at the same time on the GPU (sketched below)
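A minimal sketch of the second kind of concurrency: two independent kernels (placeholders kernelA/kernelB) issued into different streams, which may run at the same time on devices that support concurrent kernel execution:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }  // placeholder
__global__ void kernelB(float *y) { y[threadIdx.x] *= 2.0f; }  // placeholder

int main() {
    float *d_x, *d_y;
    cudaMalloc(&d_x, 256 * sizeof(float));
    cudaMalloc(&d_y, 256 * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Independent kernels in different streams: no ordering between them,
    // so they may run concurrently if the device supports it.
    kernelA<<<1, 256, 0, s0>>>(d_x);
    kernelB<<<1, 256, 0, s1>>>(d_y);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```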
- A stream is a sequence of CUDA commands, issued by the host, that execute on the GPU in issue order
- CUDA operations in different streams may run concurrently
- CUDA operations from different streams may be interleaved
- As soon as a CUDA function is invoked, a default stream (stream 0) is created
- Create with cudaStreamCreate(), destroy with cudaStreamDestroy() (usage sketched below)
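Putting the pieces together, a sketch of copy/compute overlap with two streams; myKernel and the sizes are illustrative placeholders:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *out, const float *in, int n) {  // placeholder
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int N = 1 << 20, HALF = N / 2;
    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  N * sizeof(float));   // pinned, required for overlap
    cudaMallocHost(&h_out, N * sizeof(float));
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&stream[i]);

    // Each stream handles half the data; while one stream's chunk is being
    // copied, the other stream's kernel can run on the execution engine.
    for (int i = 0; i < 2; ++i) {
        int off = i * HALF;
        cudaMemcpyAsync(d_in + off, h_in + off, HALF * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        myKernel<<<(HALF + 255) / 256, 256, 0, stream[i]>>>(d_out + off,
                                                            d_in + off, HALF);
        cudaMemcpyAsync(h_out + off, d_out + off, HALF * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[i]);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);        // wait for this stream's work
        cudaStreamDestroy(stream[i]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```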