An NCCL wrapper for porting distributed CUDA code directly to CPU clusters. It reimplements the NCCL API calls with MPI under the hood.