Lecture 18: GPU Computing with thrust and cub.

Lecture Summary

  • Last time
    • A three-stop journey through the evolution of the CUDA memory model
      • Zero-copy (Z-C) accesses on the host; the UVA milestone; the unified memory model, which enabled the use of managed memory
  • Today
    • GPU computing, from a distance (via thrust & CUB)

Thrust

  • Motivation
    • Increase programmer productivity
    • Do not sacrifice execution speed
  • What is thrust?
    • A template library for parallel computing on GPU and CPU
    • Heavy use of C++ containers
    • Provides ready-to-use algorithms

Namespaces, containers, iterators

  • To avoid name collisions, Thrust places its names in the thrust namespace, separate from std
  • 2 vector containers: host_vector and device_vector
    • Just like those in the C++ STL
    • Manage both host & device memory
    • Auto allocation & deallocation
  • Iterators: Act like pointers for vector containers
    • Can be converted to raw pointers (e.g., via raw_pointer_cast)
    • Raw pointers can also be wrapped with device_ptr (see the sketch after this list)
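
A minimal sketch of the container and pointer interplay described above; the vector size and fill values are arbitrary placeholders, not from the lecture:

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/sequence.h>
#include <thrust/fill.h>
#include <thrust/copy.h>

int main() {
    // Host-side container: lives in host memory, behaves like std::vector
    thrust::host_vector<int> h(4, 1);

    // Device-side container: allocation and deallocation are automatic
    thrust::device_vector<int> d = h;          // copies host -> device

    // Fill the device vector with 0, 1, 2, 3 on the GPU
    thrust::sequence(d.begin(), d.end());

    // A device_vector's data handle can be converted to a raw pointer,
    // e.g., to pass into a hand-written CUDA kernel
    int* raw = thrust::raw_pointer_cast(d.data());

    // A raw device pointer can be wrapped back into a device_ptr so that
    // Thrust algorithms can operate on it
    thrust::device_ptr<int> wrapped = thrust::device_pointer_cast(raw);
    thrust::fill(wrapped, wrapped + d.size(), 7);

    // Copy the results back into the host container
    thrust::copy(d.begin(), d.end(), h.begin());
    return 0;
}
```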

Algorithms

  • Element-wise operations
    • for_each, transform, gather, scatter
    • Example: SAXPY implemented as a functor passed to transform (see the sketch after this list)
  • Reductions
    • reduce, inner_product, reduce_by_key
  • Prefix sums (scans)
    • inclusive_scan, inclusive_scan_by_key
  • Sorting
    • sort, stable_sort, sort_by_key
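
A minimal sketch of the four algorithm families above, assuming arbitrary vector sizes and fill values: an element-wise SAXPY via transform with a user-defined functor, a reduce, an inclusive_scan, and a sort_by_key:

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/functional.h>

// SAXPY functor: returns a*x + y for each element pair
struct saxpy_functor {
    float a;
    saxpy_functor(float a_) : a(a_) {}
    __host__ __device__
    float operator()(float x, float y) const { return a * x + y; }
};

int main() {
    const int N = 1 << 20;
    thrust::device_vector<float> x(N, 1.0f);
    thrust::device_vector<float> y(N, 2.0f);

    // Element-wise: y = 2*x + y, using transform and the functor above
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(2.0f));

    // Reduction: sum of all elements of y
    float total = thrust::reduce(y.begin(), y.end(), 0.0f, thrust::plus<float>());

    // Prefix sum: in-place inclusive scan of y
    thrust::inclusive_scan(y.begin(), y.end(), y.begin());

    // Sorting: sort keys in descending order and carry the values along
    thrust::device_vector<int>   keys(N);
    thrust::device_vector<float> vals = x;
    thrust::sequence(keys.begin(), keys.end());
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin(), thrust::greater<int>());

    (void)total;   // silence unused-variable warning in this sketch
    return 0;
}
```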

General transformations: zipping & fusing

Problem at hand

  • Zipping
    • Takes in multiple distinct sequences and zips them into a single sequence of tuples
  • Fusing
    • Just like zipping, but it reorganizes computation (instead of data) for efficient Thrust processing (see the sketch after this list)
    • Increases the arithmetic intensity
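
A minimal sketch of zipping and fusing, assuming 3D points stored as three separate coordinate arrays (the array names and the magnitude computation are illustrative choices, not from the lecture): a zip_iterator turns the three sequences into one sequence of tuples, and a single functor fuses the per-point arithmetic into one transform pass instead of several.

```cpp
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>
#include <math.h>

// Fused functor: consumes one zipped (x, y, z) tuple and produces the point's
// magnitude in a single pass, raising arithmetic intensity versus separate passes
struct magnitude_functor {
    __host__ __device__
    float operator()(const thrust::tuple<float, float, float>& p) const {
        float x = thrust::get<0>(p);
        float y = thrust::get<1>(p);
        float z = thrust::get<2>(p);
        return sqrtf(x * x + y * y + z * z);
    }
};

int main() {
    const int N = 1000;
    thrust::device_vector<float> X(N, 1.0f), Y(N, 2.0f), Z(N, 2.0f);
    thrust::device_vector<float> mag(N);

    // Zipping: three distinct sequences become one sequence of tuples
    auto first = thrust::make_zip_iterator(thrust::make_tuple(X.begin(), Y.begin(), Z.begin()));
    auto last  = thrust::make_zip_iterator(thrust::make_tuple(X.end(),   Y.end(),   Z.end()));

    // Fusing: one transform over the zipped range does all the work in one kernel
    thrust::transform(first, last, mag.begin(), magnitude_functor());
    return 0;
}
```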

Thrust example: Processing rainfall data

Not covered in class

CUB

  • CUB: CUDA UnBound
  • CUB is on GitHub
  • Thrust's CUDA backend is built on top of CUB
  • What CUB does
    • Parallel primitives
      • Warp-wide "collective" primitives
      • Block-wide "collective" primitives
      • Device-wide primitives (see the sketch after this list)
    • Utilities
      • Fancy iterators
      • Thread and thread block I/O
      • PTX intrinsics
      • Device, kernel, and storage management
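
A minimal sketch of a device-wide CUB primitive; the input size and zero-initialization are placeholders, not from the lecture. It shows cub::DeviceReduce::Sum and its two-call idiom, where the first call only queries the required temporary-storage size.

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int num_items = 1 << 20;

    // Allocate and initialize input/output on the device
    int *d_in, *d_out;
    cudaMalloc(&d_in,  num_items * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_in, 0, num_items * sizeof(int));   // all zeros, so the sum is 0

    // Call 1: with a null temp-storage pointer, CUB only reports how much
    // scratch space the reduction needs
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    // Call 2: allocate the scratch space and run the actual device-wide reduction
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    int result;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", result);

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_temp_storage);
    return 0;
}
```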