- Last time
- GPU Computing: Execution Scheduling
- Block scheduler (at the GPU level)
- Warp scheduler (at the SM level)
- Thread Divergence
- Today
- Aspects of how GPU memory operations take place
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
- Read only per-grid surface memory
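To make the list above concrete, here is a minimal sketch (not from the lecture; the kernel and variable names are made up) showing where each kind of variable lives:
#include <cuda_runtime.h>

__constant__ float c_coeff[4];  // per-grid constant memory, written by the host

// gIn/gOut point into per-grid global memory
__global__ void whereVariablesLive(const float* gIn, float* gOut)
{
    __shared__ float s_tile[128];                   // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // i lives in a per-thread register
    s_tile[threadIdx.x] = gIn[i];                   // global -> shared (assumes blockDim.x == 128)
    gOut[i] = s_tile[threadIdx.x] * c_coeff[0];     // shared + constant -> global
}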
Some aspects of Local Memory:
- Local memory does not exist as a separate physical memory
- In reality, data placed in local memory by the compiler ends up in global memory at run time, and may be cached
- It's specific to one thread and not visible to any other thread
- Local memory has the same latency as global memory, unless cached
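One common way a variable ends up in local memory: a per-thread array indexed with a runtime value usually cannot be kept in registers, so the compiler spills it. A hypothetical sketch (kernel name and sizes are made up):
__global__ void localMemoryDemo(float* gOut, int k)
{
    // Per-thread array with a runtime-dependent index: the compiler typically
    // cannot keep it in registers, so it lands in local memory, i.e., in
    // global memory (possibly cached), private to this thread
    float scratch[64];
    for (int i = 0; i < 64; ++i)
        scratch[i] = (float)(i * threadIdx.x);
    gOut[blockIdx.x * blockDim.x + threadIdx.x] = scratch[k % 64];
}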
Different memories:
- Global memory: Main means of communicating R/W data between host and device. cudaMalloc(), cudaFree(), and cudaMemcpy() operate here. Note that there are four types of cudaMemcpy transfers ({host/device} to {host/device}), and host-device transfers travel over a PCIe connection.
- Texture and Constant memories: Contents initialized by the host and available, read-only, to all threads.
Global, texture, and constant memories are accessible by the host (at high latency and low bandwidth).
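A minimal sketch of the usual global-memory round trip (illustrative names and sizes):
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1024;
    const size_t bytes = N * sizeof(float);
    float hostData[1024];
    for (int i = 0; i < N; ++i) hostData[i] = (float)i;

    float* devData = NULL;
    cudaMalloc((void**)&devData, bytes);                           // allocate in global memory
    cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice);  // host -> device, over PCIe
    // ... launch kernels that read/write devData here ...
    cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(devData);                                             // release global memory
    return 0;
}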
Purpose:
- See an example where the use of multiple blocks of threads plays a central role
- Highlight the use/role of the shared memory
- Point out the __syncthreads() function call (synchronizes all threads in a block)
- The previous example: low arithmetic intensity, with a lot of unnecessary data traffic out of global memory
- Rule of thumb: if the data that you, as a thread, use can also be used by another thread in your block, then you should consider using shared memory
To use shared memory (see the tiled kernel sketch below):
- Partition data into data subsets (tiles) that each fit into shared memory
- Handle each data subset (tile) with one thread block by:
  - Loading the tile from global memory into shared memory, using multiple threads to exploit memory-level parallelism
  - Performing the computation on the tile from shared memory; each thread can efficiently make multiple passes over any data element of the tile
__syncthreads():
- Synchronizes all threads in a block
- Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
- Be very careful when using it in a conditional: all threads of the block must reach the same __syncthreads() call, otherwise the behavior is undefined
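Putting the tiling recipe and __syncthreads() together, here is a standard tiled matrix-multiply sketch (a generic illustration, not necessarily the lecture's exact example; it assumes square n x n matrices with n a multiple of TILE):
#define TILE 16

// C = A * B for n x n matrices, one TILE x TILE output tile per block
__global__ void tiledMatMul(const float* A, const float* B, float* C, int n)
{
    __shared__ float sA[TILE][TILE];  // tile of A, shared by the block
    __shared__ float sB[TILE][TILE];  // tile of B, shared by the block

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of each tile: the block's loads
        // proceed together (memory-level parallelism)
        sA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // the full tile must be loaded before anyone reads it (RAW)

        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();  // everyone must finish reading before the tile is overwritten (WAR)
    }
    C[row * n + col] = acc;
}
Compared to a naive kernel, each element of A and B is now fetched from global memory n/TILE times instead of n times, which addresses the low arithmetic intensity noted above.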
- 3 ways to set aside shared memory:
  - Statically: declare inside a kernel
  - Through the execution configuration (see code block below)
  - Dynamically, via the CUDA driver API call cuFuncSetSharedSize() (out of scope)
__global__ void MyFunc(float*) // __device__ or __global__ function
{
extern __shared__ float shMemArray[];
// Size of shMemArray determined through the execution configuration
// You can use shMemArray as you wish here...
}
// Invoke like this; Ns indicates the size in bytes to be allocated in shared memory
MyFunc<<<Dg, Db, Ns>>>(parameter);
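For instance, to give each of 256 threads one float of dynamically allocated shared memory (illustrative values; devPtr is assumed to be a valid device pointer):
int threads = 256;                         // Db
int blocks  = 4;                           // Dg
size_t shBytes = threads * sizeof(float);  // Ns, in bytes
MyFunc<<<blocks, threads, shBytes>>>(devPtr);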
- Each SM has shared memory organized in 32 memory banks
- Successive 32-bit words map to successive banks
- Each bank has a bandwidth of 32 bits per clock cycle
- ShMem and L1 cache draw on the same physical memory inside an SM
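Since successive 32-bit words map to successive banks, a warp reading down a column of a 32-wide shared array hits the same bank 32 times (a 32-way bank conflict); padding each row by one word is the classic fix. A small sketch (illustrative kernel, assumes a 32 x 32 thread block):
__global__ void bankPaddingDemo(float* out)
{
    // 32 x 32 tile padded to 32 x 33: with the extra word per row, the
    // elements of one column fall into 32 different banks
    __shared__ float tile[32][32 + 1];

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = (float)(y * 32 + x);  // row-wise write: conflict-free
    __syncthreads();                   // make the whole tile visible to the block
    out[y * 32 + x] = tile[x][y];      // column-wise read: conflict-free thanks to the padding
}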