- Last time: OpenMP generalities
- This time: OpenMP nuts & bolts
- OpenMP: a portable and scalable model for shared-memory parallel applications
- No need to dive deep and work directly with POSIX threads (pthreads)
- Under the hood, the compiler translates OpenMP functions and directives into pthread calls
- A structured block and an OpenMP construct are the two sides of the “parallel region” coin
- In a structured block, the only "branch" allowed out of the block is a call to exit(); no other jumps into or out of it are permitted. There is an implicit barrier at the end of the parallel region, where threads wait for each other (see the sketch below)
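- A minimal sketch of a parallel region; the braces delimit the structured block, and the closing comment marks the implicit barrier:

  ```c
  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      #pragma omp parallel
      {   // structured block: one entry at the top, one exit at the bottom
          printf("Hello from thread %d of %d\n",
                 omp_get_thread_num(), omp_get_num_threads());
      }   // implicit barrier: every thread waits here before the region ends
      return 0;
  }
  ```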
- The nested parallelism behavior can be controlled by using the OpenMP runtime API, e.g. as sketched below
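- A sketch of one way to do this: omp_set_max_active_levels bounds the nesting depth (the thread counts here are illustrative):

  ```c
  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      omp_set_max_active_levels(2);        // allow two levels of active parallelism
      #pragma omp parallel num_threads(2)  // outer team of 2 threads
      {
          #pragma omp parallel num_threads(3)  // each member forks an inner team of 3
          printf("outer thread %d, inner thread %d\n",
                 omp_get_ancestor_thread_num(1), omp_get_thread_num());
      }
      return 0;
  }
  ```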
- The single directive identifies a section of the code that must be run by a single thread
- The difference between single and master is that in single, the code is executed by whichever thread reaches the region first
- Another difference is that single has an implicit barrier upon completion of the region (removable with a nowait clause), while master does not; both are contrasted below
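- A minimal sketch contrasting the two:

  ```c
  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      #pragma omp parallel
      {
          #pragma omp single
          printf("single: one thread (%d) runs this; the others wait at its implicit barrier\n",
                 omp_get_thread_num());

          #pragma omp master
          printf("master: only thread 0 runs this; no barrier afterwards\n");
      }
      return 0;
  }
  ```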
- Work sharing is a general term used in OpenMP to describe the distribution of work across threads
- The three main constructs for automatic work division are (a sections sketch follows the list):
- omp for
- omp sections
- omp task
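- A minimal sketch of sections, assuming two independent jobs work_a and work_b (hypothetical names); each section is handed to one thread:

  ```c
  #include <omp.h>
  #include <stdio.h>

  void work_a(void) { printf("A on thread %d\n", omp_get_thread_num()); }
  void work_b(void) { printf("B on thread %d\n", omp_get_thread_num()); }

  int main(void) {
      #pragma omp parallel
      {
          #pragma omp sections
          {
              #pragma omp section
              work_a();
              #pragma omp section
              work_b();
          }   // implicit barrier at the end of sections
      }
      return 0;
  }
  ```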
- A #pragma omp for inside a #pragma omp parallel is equivalent to #pragma omp parallel for
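- For instance, the two variants below (function names are illustrative) distribute the loop iterations identically:

  ```c
  #include <omp.h>

  // Worksharing loop inside an explicit parallel region...
  void scale_v1(double *x, int n, double a) {
      #pragma omp parallel
      {
          #pragma omp for
          for (int i = 0; i < n; i++)
              x[i] *= a;
      }
  }

  // ...is equivalent to the combined construct
  void scale_v2(double *x, int n, double a) {
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
          x[i] *= a;
  }
  ```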
- Most OpenMP implementations use default block partitioning, where each thread is assigned roughly n/thread_count iterations. This may lead to load imbalance if the work per iteration varies
- The schedule clause comes to the rescue!
- Usage example: #pragma omp parallel for schedule(static, 8)
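- A sketch of the clause at work, assuming a hypothetical heavy(i) whose cost grows with i; schedule(dynamic, 8) hands out chunks of 8 iterations on demand instead of pre-assigning blocks:

  ```c
  #include <math.h>
  #include <stdio.h>

  // Hypothetical workload whose cost grows with i: under default block
  // partitioning, the thread holding the last block does most of the work
  double heavy(int i) {
      double s = 0.0;
      for (int k = 0; k < i * 100; k++)
          s += sin((double)k);
      return s;
  }

  int main(void) {
      double total = 0.0;
      // idle threads grab the next chunk of 8 iterations as they finish,
      // trading a little scheduling overhead for much better load balance
      #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
      for (int i = 0; i < 1000; i++)
          total += heavy(i);
      printf("total = %f\n", total);
      return 0;
  }
  ```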
- OpenMP will only parallelize for loops that are in canonical form: integer index, loop-invariant bounds and increment, so the trip count is computable on entry. Loops outside this form are rejected, and borderline cases can behave counterintuitively
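- A sketch contrasting a canonical loop with one OpenMP cannot workshare (function names are illustrative):

  ```c
  // Canonical form: integer index, loop-invariant bound, simple increment,
  // so the trip count is known when the loop starts
  void scale_ok(double *x, int n) {
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
          x[i] *= 2.0;
  }

  // Not canonical: the trip count depends on the data, so the compiler
  // rejects #pragma omp for here
  void zero_until_negative(double *x) {
      int i = 0;
      while (x[i] > 0.0)
          x[i++] = 0.0;
  }
  ```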
- The collapse clause supports collapsing nested loops into one uber loop (the loops must be perfectly nested)
- For example, if the outer loop has 10 iters, the inner loop has 10^7 iters, and we have 32 threads: parallelizing the outer loop is bad (10 < 32, so 22 threads sit idle); parallelizing the inner loop is good, but we can do better using collapse (sketched below)
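- A sketch of that scenario (fill and the array sizing are assumptions of the sketch): collapse(2) merges the 10 × 10^7 iterations into one 10^8-iteration space that 32 threads can share evenly:

  ```c
  #define N_OUTER 10
  #define N_INNER 10000000L

  // a must hold N_OUTER * N_INNER doubles (an assumption of this sketch)
  void fill(double *a) {
      // the two perfectly nested loops become a single 10^8-iteration loop
      #pragma omp parallel for collapse(2) num_threads(32)
      for (int i = 0; i < N_OUTER; i++)
          for (long j = 0; j < N_INNER; j++)
              a[i * N_INNER + j] = i + 1e-7 * j;
  }
  ```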