- Last time
- OpenMP: Tasks, variable scoping, synchronization (barrier & critical constructs)
- Today
- Wrap up synchronization
- OpenMP rules of thumb
- Parallel computing w/ OpenMP: NUMA aspects & how caches come into play
- The atomic directive
- A guarded memory access operation
- Can only protect a single assignment
- Applies only to simple update of memory
- Is a special case of a critical section with significantly less overhead, since it typically maps to a hardware atomic instruction (sketch below)
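- Example: a minimal sketch (not from the slides; the counter is illustrative) of atomic guarding a single simple update:

    #include <stdio.h>

    int main(void) {
        long hits = 0;
        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++) {
            /* atomic guards exactly one simple memory update */
            #pragma omp atomic
            hits += 1;
        }
        printf("hits = %ld\n", hits);  /* always 1000000 */
        return 0;
    }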
- The reduction construct (see example down below)
- Each thread engaged in the reduction gets a private local copy of sum
- Each local copy is initialized to the identity element of the operator in play; here the operator is "+", so the initial value is 0
- At the end, all local copies of sum are combined and stored in the "global" (shared) variable
- #pragma omp for reduction(op:list)
- The variables in list will be shared in the enclosing parallel region
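- Example: a minimal sketch of the reduction described above (the array x and its contents are illustrative):

    #include <stdio.h>

    int main(void) {
        double x[1000], sum = 0.0;
        for (int i = 0; i < 1000; i++) x[i] = 1.0;

        /* each thread gets a private sum initialized to 0 (identity of +);
           the private copies are combined into the shared sum at the end */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000; i++)
            sum += x[i];

        printf("sum = %f\n", sum);  /* 1000.0 */
        return 0;
    }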
- The simd directive
- #pragma omp for simd reduction(+:sum)
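- Example: a minimal sketch combining worksharing and vectorization (the array a is illustrative):

    #include <stdio.h>

    int main(void) {
        float a[4096], sum = 0.0f;
        for (int i = 0; i < 4096; i++) a[i] = 1.0f;

        #pragma omp parallel
        {
            /* for simd: split iterations across threads, and ask the
               compiler to vectorize each thread's chunk;
               sum is still combined via the reduction clause */
            #pragma omp for simd reduction(+:sum)
            for (int i = 0; i < 4096; i++)
                sum += a[i];
        }
        printf("sum = %f\n", sum);  /* 4096.0 */
        return 0;
    }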
- Common causes of poor parallel performance:
- Too much sequential code in your app
- Seek to reduce the amount of execution time where only one thread executes code
- Too much communication
- Difficult to pin down costly memory operations
- Load imbalance
- One thread gets too much work, while others idle waiting for it
- For OpenMP for loops, one can use schedule(runtime) and select the schedule via the OMP_SCHEDULE environment variable
- Example:
setenv OMP_SCHEDULE "dynamic,5"
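- Example: a minimal sketch of schedule(runtime) (the loop body is a placeholder for unevenly sized work):

    #include <stdio.h>

    int main(void) {
        static double work[100000];
        /* schedule(runtime): the schedule is read from OMP_SCHEDULE when
           the program starts, e.g. "dynamic,5" as set above, so different
           schedules can be tried without recompiling */
        #pragma omp parallel for schedule(runtime)
        for (int i = 0; i < 100000; i++)
            work[i] = i * 0.5;
        printf("work[42] = %f\n", work[42]);
        return 0;
    }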
- Synchronization
- Barriers can be expensive
- Avoid them using:
- Careful use of the nowait clause (see the sketch below)
- Parallelize at the outermost level possible
- Use critical or atomic
- Use other OpenMP facilities like reduction
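- Sketch of nowait (the arrays a and b are illustrative; their independence is what makes dropping the barrier safe):

    #include <stdio.h>

    int main(void) {
        double a[1000], b[1000];
        #pragma omp parallel
        {
            /* nowait drops the implicit barrier after this loop: safe
               here because the next loop never touches a[] */
            #pragma omp for nowait
            for (int i = 0; i < 1000; i++)
                a[i] = i * 2.0;

            #pragma omp for
            for (int i = 0; i < 1000; i++)
                b[i] = i * 3.0;
        }
        printf("a[1] = %f, b[1] = %f\n", a[1], b[1]);
        return 0;
    }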
- Compiler (non-)optimizations
- Sometimes the addition of parallel directives can prevent the compiler from performing sequential optimizations
- Symptom: parallel code running with 1 thread has a longer execution time and a higher instruction count than the sequential code
- Up to this point, we have been using the Symmetric Multi-Processing (SMP) model and we haven't been concerned about the mechanics of shared memory access
- In today's servers/clusters, nodes have multiple CPUs, each with many cores (multi-socket configurations, as opposed to one chip per motherboard), and not all memory accesses are equal
- NUMA: Non-uniform memory access
- Cost of memory access depends on which memory bank stores your data
- The NUMA factor: the ratio between the longest and shortest average time it takes a thread running on a particular core to reach data in memory
- A low NUMA factor is desirable (it makes little difference which bank the data is stored in)
- NUMA factor = 1: an SMP system
- Accessing memory outside a NUMA node: 20% slowdown for reads, 30% slowdown for writes
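- Worked example with the figures above: if a local read averages 100 ns, a remote read averages about 120 ns (20% slower), giving a NUMA factor for reads of roughly 120/100 = 1.2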
- NUMA aspects where OS comes into play
- When a thread mallocs memory, how should this memory be allocated?
- Affinity: How the runtime/OS assigns a thread to a certain core
- OMP_PROC_BIND: Allows you to dictate a distribution policy
- master: Collocate threads with the master thread
- close: Place threads close to the master in the places list
- Useful if the code is compute-bound and doesn't make many trips to main memory
- Reduce synchronization costs (single, barrier, etc.)
- spread (default): Spread out threads as much as possible
- Useful if code is memory-bound as it improves aggregate system memory bandwidth
- false: Set no binding
- true: Lock threads to cores
- OMP_PLACES: Allows you to control locations. OMP_PLACES can assume one of these values
- threads: Hardware threads (assuming hyper-threading is on)
- cores: Core
- sockets: Node (socket)
- A place list: Defined by user, explicitly referencing the underlying hardware of the machine
- An extensive list of examples can be found in the slides
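- Example: a minimal sketch (not from the slides) that reports where each thread runs, using the standard OpenMP 4.5 place API; run it with, e.g., setenv OMP_PLACES cores and setenv OMP_PROC_BIND close:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        #pragma omp parallel
        {
            /* omp_get_place_num(): the place (hardware thread, core, or
               socket, depending on OMP_PLACES) this thread is bound to */
            printf("thread %d of %d runs in place %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads(),
                   omp_get_place_num(), omp_get_num_places());
        }
        return 0;
    }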