- Last time
- HPC via MPI
- MPI point-to-point communication: The blocking flavor
- Today
- Wrap up point-to-point communication
- Collective communication
- Different "send" modes:
- Synchronous send: MPI_SSEND
- Risk of deadlock/waiting -> idle time
- High latency, but better bandwidth than MPI_BSEND (no intermediate buffer copy)
- Buffered (async) send: MPI_BSEND
- Low latency, but lower bandwidth (the extra buffer copy costs bandwidth)
- Standard send: MPI_SEND
- Up to the MPI implementation to decide whether to use the rendezvous or the eager protocol
- Less overhead if in eager mode
- Blocks in rendezvous mode, effectively behaving like a synchronous send
- Ready send: MPI_RSEND
- Works only if the matching receive has been posted
- Rarely used; erroneous if the matching receive has not yet been posted
- Receiving, all modes: MPI_RECV (see the blocking send/receive sketch below)
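A minimal sketch of blocking point-to-point communication, assuming two ranks; the message contents, the count of 4 doubles, and tag 0 are arbitrary choices for illustration.

```c
/* Blocking send/receive sketch: rank 0 sends an array to rank 1.
 * Compile: mpicc send_recv.c -o send_recv ; run: mpirun -np 2 ./send_recv
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 4;
    double data[4] = {0.0, 1.0, 2.0, 3.0};

    if (rank == 0) {
        /* Standard send: the implementation picks eager or rendezvous. */
        MPI_Send(data, N, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        /* Blocking receive: returns once the message has landed in 'data'. */
        MPI_Recv(data, N, MPI_DOUBLE, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD, &status);
        printf("rank 1 received data[0] = %f\n", data[0]);
    }

    MPI_Finalize();
    return 0;
}
```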
- Buffered send
- Reduces overhead associated with data transmission
- Relies on the existence of a buffer. Buffering incurs an extra memory copy
- Return from an MPI_Bsend does not guarantee the message was sent: the message may remain in the buffer until a matching receive is posted (see the sketch below)
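A minimal buffered-send sketch, assuming two ranks; the message size and the use of malloc for the attached buffer are illustrative choices.

```c
/* Buffered send sketch: the sender attaches a user buffer so that
 * MPI_Bsend can return as soon as the message has been copied into it.
 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1000;
    double msg[1000];

    if (rank == 0) {
        for (int i = 0; i < N; i++) msg[i] = (double)i;

        /* Buffer must hold the message plus MPI's bookkeeping overhead. */
        int bufsize = (int)(N * sizeof(double)) + MPI_BSEND_OVERHEAD;
        char *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);

        /* Returns after copying 'msg' into 'buf'; transmission may happen later. */
        MPI_Bsend(msg, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

        /* Detach blocks until all buffered messages have been transmitted. */
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(msg, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```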
- Blocking send: Covered above. Upon return from a send, you can safely modify the send buffer, since the data has already been delivered or copied out of it
- Non-blocking send: The call returns immediately, with no guarantee that the data has been transmitted
- Routine names start with MPI_I (e.g., MPI_Isend, MPI_Irecv)
- The caller can do useful work after the call returns, overlapping communication with computation
- Use synchronization call to wait for communication to complete
- MPI_Wait: Blocks until a certain request is completed
- Wait for multiple requests: MPI_Waitall, MPI_Waitany, MPI_Waitsome
- MPI_Test: Non-blocking, returns quickly with status information
- int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status);
- MPI_Probe: Allows incoming messages to be queried (e.g., for source, tag, or size) before receiving them (see the non-blocking sketch below)
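A minimal non-blocking sketch: post MPI_Irecv/MPI_Isend, do other work, then synchronize with MPI_Wait/MPI_Waitall (or poll with MPI_Test). The ring-neighbor exchange pattern is just one illustrative use, not something prescribed by the lecture.

```c
/* Non-blocking sketch: each rank sends its rank to the right neighbor
 * and receives from the left neighbor, overlapping communication with work.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int send_val = rank, recv_val = -1;
    int right = (rank + 1) % size;        /* neighbor to send to   */
    int left  = (rank - 1 + size) % size; /* neighbor to recv from */

    MPI_Request reqs[2];
    MPI_Irecv(&recv_val, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&send_val, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... overlap: do computation that does not touch send_val/recv_val ... */

    /* Optional poll: flag becomes 1 if the receive has already completed. */
    int flag = 0;
    MPI_Test(&reqs[0], &flag, MPI_STATUS_IGNORE);

    /* Block until both requests complete. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received %d from rank %d\n", rank, recv_val, left);

    MPI_Finalize();
    return 0;
}
```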
- Three types of collective actions:
- Synchronization (barrier)
- Communication (e.g., broadcast)
- Operation (e.g., reduce)
- The tutorial "Writing Distributed Applications with PyTorch" is a good reference for these collective patterns
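To make the three types concrete, a minimal sketch using MPI_Barrier (synchronization), MPI_Bcast (communication), and MPI_Reduce (operation); the values being broadcast and reduced are arbitrary.

```c
/* The three collective flavors in one place. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Synchronization: nobody proceeds until everyone has reached this point. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Communication: root (rank 0) broadcasts a value to all ranks. */
    int value = (rank == 0) ? 42 : 0;
    MPI_Bcast(&value, 1, MPI_INT, /*root=*/0, MPI_COMM_WORLD);

    /* Operation: sum each rank's contribution; only the root gets the result. */
    int contribution = rank, total = 0;
    MPI_Reduce(&contribution, &total, 1, MPI_INT, MPI_SUM, /*root=*/0, MPI_COMM_WORLD);
    if (rank == 0) printf("broadcast value = %d, sum of ranks = %d\n", value, total);

    MPI_Finalize();
    return 0;
}
```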
- Broadcast: MPI_Bcast
- Gather: MPI_Gather
- Scatter: MPI_Scatter
- Reduce: MPI_Reduce
- Result is collected by the root only
- Allreduce: MPI_Allreduce
- Result is sent out to all ranks in the communicator
- Prefix scan: MPI_Scan
- User-defined reduction operations: Register using MPI_Op_create() (see the sketch below)
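A minimal sketch of a user-defined reduction registered with MPI_Op_create and used in an MPI_Allreduce; the "maximum of absolute values" operation and the per-rank values are arbitrary examples, not from the lecture.

```c
/* Custom reduction: every rank ends up with the max absolute value across ranks. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

/* Signature required by MPI_Op_create: combine 'in' into 'inout' elementwise. */
void absmax(void *in, void *inout, int *len, MPI_Datatype *type) {
    double *a = (double *)in, *b = (double *)inout;
    for (int i = 0; i < *len; i++) {
        double x = fabs(a[i]), y = fabs(b[i]);
        b[i] = (x > y) ? x : y;
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (rank % 2 == 0) ? -1.0 * rank : rank; /* arbitrary local value */
    double result;

    MPI_Op op;
    MPI_Op_create(absmax, /*commute=*/1, &op);

    /* Allreduce: every rank gets the result (with MPI_Reduce, only the root would). */
    MPI_Allreduce(&local, &result, 1, MPI_DOUBLE, op, MPI_COMM_WORLD);
    printf("rank %d: absmax over all ranks = %f\n", rank, result);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}
```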