- Last time
- HPC via MPI
- MPI point-to-point communication: The blocking flavor
- Today
- Wrap up point-to-point communication
- Collective communication
- Different "send" modes:
- Synchronous send: MPI_SSEND
- Risk of deadlock/waiting -> idle time
- High latency, but better bandwidth than MPI_BSEND (no intermediate buffer copy)
- Buffered (async) send: MPI_BSEND
- Low latency, but lower bandwidth (the extra buffer copy costs bandwidth)
- Standard send: MPI_SEND
- Up to the MPI implementation to decide whether to use the rendezvous or the eager protocol
- Less overhead if in eager mode
- Blocks in rendezvous mode, effectively behaving like a synchronous send
- Ready send: MPI_RSEND
- Works only if the matching receive has been posted
- Rarely used; erroneous if the matching receive has not yet been posted
- Receiving, all modes: MPI_RECV (see the blocking send/receive sketch below)
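A minimal sketch of blocking point-to-point communication, assuming two ranks; the message contents, the count of 4 doubles, and tag 0 are arbitrary choices for illustration.

```c
/* Blocking send/receive sketch: rank 0 sends an array to rank 1.
 * Compile: mpicc send_recv.c -o send_recv ; run: mpirun -np 2 ./send_recv
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 4;
    double data[4] = {0.0, 1.0, 2.0, 3.0};

    if (rank == 0) {
        /* Standard send: the implementation picks eager or rendezvous. */
        MPI_Send(data, N, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        /* Blocking receive: returns once the message has landed in 'data'. */
        MPI_Recv(data, N, MPI_DOUBLE, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD, &status);
        printf("rank 1 received data[0] = %f\n", data[0]);
    }

    MPI_Finalize();
    return 0;
}
```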
- Buffered send
- Reduces overhead associated with data transmission
- Relies on the existence of a buffer. Buffering incurs an extra memory copy
- Return from an MPI_Bsend does not guarantee the message was sent: the message may remain in the buffer until a matching receive is posted (see the sketch below)
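A minimal buffered-send sketch, assuming two ranks; the message size and the use of malloc for the attached buffer are illustrative choices.

```c
/* Buffered send sketch: the sender attaches a user buffer so that
 * MPI_Bsend can return as soon as the message has been copied into it.
 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1000;
    double msg[1000];

    if (rank == 0) {
        for (int i = 0; i < N; i++) msg[i] = (double)i;

        /* Buffer must hold the message plus MPI's bookkeeping overhead. */
        int bufsize = (int)(N * sizeof(double)) + MPI_BSEND_OVERHEAD;
        char *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);

        /* Returns after copying 'msg' into 'buf'; transmission may happen later. */
        MPI_Bsend(msg, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

        /* Detach blocks until all buffered messages have been transmitted. */
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(msg, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```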
- Blocking send: Covered above. Upon return from a send, you can safely modify the send buffer, since the data has already been delivered or copied out of it
- Non-blocking send: The call returns immediately, with no guarantee that the data has been transmitted
- Routine names start with MPI_I (e.g., MPI_Isend, MPI_Irecv)
- The caller can do useful work after the call returns, overlapping communication with computation
- Use synchronization call to wait for communication to complete
- MPI_Wait: Blocks until a certain request is completed
- Wait for multiple requests: MPI_Waitall, MPI_Waitany, MPI_Waitsome
- MPI_Test: Non-blocking, returns quickly with status information
- int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status);
- MPI_Probe: Allows incoming messages to be queried (e.g., for source, tag, or size) before receiving them (see the non-blocking sketch below)
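A minimal non-blocking sketch: post MPI_Irecv/MPI_Isend, do other work, then synchronize with MPI_Wait/MPI_Waitall (or poll with MPI_Test). The ring-neighbor exchange pattern is just one illustrative use, not something prescribed by the lecture.

```c
/* Non-blocking sketch: each rank sends its rank to the right neighbor
 * and receives from the left neighbor, overlapping communication with work.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int send_val = rank, recv_val = -1;
    int right = (rank + 1) % size;        /* neighbor to send to   */
    int left  = (rank - 1 + size) % size; /* neighbor to recv from */

    MPI_Request reqs[2];
    MPI_Irecv(&recv_val, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&send_val, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... overlap: do computation that does not touch send_val/recv_val ... */

    /* Optional poll: flag becomes 1 if the receive has already completed. */
    int flag = 0;
    MPI_Test(&reqs[0], &flag, MPI_STATUS_IGNORE);

    /* Block until both requests complete. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received %d from rank %d\n", rank, recv_val, left);

    MPI_Finalize();
    return 0;
}
```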
- Three types of collective actions:
- Synchronization (barrier)
- Communication (e.g., broadcast)
- Operation (e.g., reduce)
- The tutorial "Writing Distributed Applications with PyTorch" is a good reference for these collective patterns
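To make the three types concrete, a minimal sketch using MPI_Barrier (synchronization), MPI_Bcast (communication), and MPI_Reduce (operation); the values being broadcast and reduced are arbitrary.

```c
/* The three collective flavors in one place. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Synchronization: nobody proceeds until everyone has reached this point. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Communication: root (rank 0) broadcasts a value to all ranks. */
    int value = (rank == 0) ? 42 : 0;
    MPI_Bcast(&value, 1, MPI_INT, /*root=*/0, MPI_COMM_WORLD);

    /* Operation: sum each rank's contribution; only the root gets the result. */
    int contribution = rank, total = 0;
    MPI_Reduce(&contribution, &total, 1, MPI_INT, MPI_SUM, /*root=*/0, MPI_COMM_WORLD);
    if (rank == 0) printf("broadcast value = %d, sum of ranks = %d\n", value, total);

    MPI_Finalize();
    return 0;
}
```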
- Broadcast: MPI_Bcast
- Gather: MPI_Gather
- Scatter: MPI_Scatter
- Reduce: MPI_Reduce
- Result is collected by the root only
- Allreduce: MPI_Allreduce
- Result is sent out to all ranks in the communicator
- Prefix scan: MPI_Scan
- User-defined reduction operations: Register using MPI_Op_create() (see the sketch below)
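A minimal sketch of a user-defined reduction registered with MPI_Op_create and used in an MPI_Allreduce; the "maximum of absolute values" operation and the per-rank values are arbitrary examples, not from the lecture.

```c
/* Custom reduction: every rank ends up with the max absolute value across ranks. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

/* Signature required by MPI_Op_create: combine 'in' into 'inout' elementwise. */
void absmax(void *in, void *inout, int *len, MPI_Datatype *type) {
    double *a = (double *)in, *b = (double *)inout;
    for (int i = 0; i < *len; i++) {
        double x = fabs(a[i]), y = fabs(b[i]);
        b[i] = (x > y) ? x : y;
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (rank % 2 == 0) ? -1.0 * rank : rank; /* arbitrary local value */
    double result;

    MPI_Op op;
    MPI_Op_create(absmax, /*commute=*/1, &op);

    /* Allreduce: every rank gets the result (with MPI_Reduce, only the root would). */
    MPI_Allreduce(&local, &result, 1, MPI_DOUBLE, op, MPI_COMM_WORLD);
    printf("rank %d: absmax over all ranks = %f\n", rank, result);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}
```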