%\subsection{GPUDirect}
%NVIDIA's GPUDirect technology was introduced to reduce the overhead of data
%movement across GPUs~\cite{GPUDirect, shainer2011development}. GPUDirect supports both networking as
%well as peer-to-peer interfaces for single node multi-GPU systems. The most
%recent implementation of GPUDirect, version 3, adds support for RDMA over
%InfiniBand for Kepler-class GPUs.
%The networking component of GPUDirect relies on three key technologies: CUDA 5
%(and up), a CUDA-enabled MPI implementation, and a Kepler-class GPU (RDMA only).
%Both MVAPICH and OpenMPI support GPUDirect. Support for RDMA over GPUDirect is
%enabled by the MPI library, given supported hardware, and does not depend on
%application-level changes to a user's code.
%In this paper, our GPUDirect work focuses on GPUDirect v3 for multi-node RDMA
%support. We demonstrate scaling for up to 4 nodes connected via QDR InfiniBand
%and show that GPUDirect RDMA improves both scalability and overall performance
%by approximately 9\% at no cost to the end user.
\section{Benchmarks}\label{benchmarks}
We selected two molecular dynamics (MD) applications for evaluation in this study:
LAMMPS and HOOMD~\cite{plimpton2007lammps,anderson2010hoomd}. These applications were chosen to represent a class of advanced parallel computation for several fundamental reasons:
\begin{itemize}
\item MD simulations provide a practical representation of N-Body simulations, which are one of the major computational \emph{Dwarfs} \cite{asanovic2006landscape} in parallel and distributed computing.
\item MD simulations are among the most widely deployed applications on large-scale supercomputers today.
\item Many MD simulations use a hybrid MPI+CUDA programming model, which has become commonplace in HPC as the use of accelerators increases; a minimal sketch of this pattern appears below.
\end{itemize}
As such, we look to LAMMPS and HOOMD to provide real-world examples of running cutting-edge parallel programs on virtualized infrastructure. While these applications by no means represent all parallel scientific computing efforts (see the 13 Dwarfs~\cite{asanovic2006landscape}), we believe these MD simulators offer a more pragmatic viewpoint than traditional synthetic HPC benchmarks such as High Performance Linpack.
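The following is a minimal sketch of the hybrid MPI+CUDA pattern referenced above: one MPI rank is bound to one GPU, with per-rank computation on the device and communication via MPI. The rank-to-device mapping is illustrative, assumes ranks are packed per node, and is not taken from either application's source.
\begin{verbatim}
/* Hybrid MPI+CUDA skeleton: one MPI rank per GPU (illustrative). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);

    /* Bind this rank to a local GPU (assumes ranks are packed per node). */
    cudaSetDevice(rank % ndev);

    /* ... per-rank force computation on the GPU, halo exchange via MPI ... */

    MPI_Finalize();
    return 0;
}
\end{verbatim}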
\paragraph{LAMMPS} The Large-scale Atomic/Molecular Massively Parallel Simulator is a
well-understood, highly parallel molecular dynamics simulator that supports both
CPU- and GPU-based workloads. Unlike many simulators, MD and otherwise,
LAMMPS is heterogeneous: it uses GPUs and multicore CPUs concurrently.
For this study, this heterogeneous functionality introduces additional load on
the host, since LAMMPS utilizes all available cores on a given system.
Networking in LAMMPS follows a typical MPI model: data is copied from the GPU
back to the host and sent over the InfiniBand fabric. GPUDirect RDMA requires
changes to the application, and the specific LAMMPS package used does not
support it at the time of writing; therefore, no RDMA is used in these experiments.
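The fragment below sketches this staged exchange pattern; the buffer names and the helper function are illustrative and do not come from the LAMMPS source.
\begin{verbatim}
#include <mpi.h>
#include <cuda_runtime.h>

/* Staged halo send without GPUDirect: device data is first copied into a
 * host buffer, and the host pointer is handed to MPI (illustrative names). */
void send_halo_staged(const double *d_halo, double *h_stage,
                      size_t n, int neighbor)
{
    cudaMemcpy(h_stage, d_halo, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(h_stage, (int)n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD);
}

/* The receiving rank mirrors this: MPI_Recv into a host buffer, then
 * cudaMemcpy(..., cudaMemcpyHostToDevice) onto its GPU. */
\end{verbatim}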
\paragraph{HOOMD-blue} The Highly Optimized Object-oriented Many-particle
Dynamics -- Blue Edition is a particle dynamics simulator capable of
scaling to thousands of GPUs. HOOMD can execute on either CPUs or GPUs,
but unlike LAMMPS it is homogeneous and does not mix GPUs and CPUs.
HOOMD supports GPUDirect through a CUDA-enabled MPI implementation.
In this paper we focus on HOOMD's GPUDirect support and show its benefits
as cluster size increases.
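For contrast with the staged pattern above, the sketch below shows the CUDA-aware MPI style of exchange that GPUDirect enables: the device pointer is passed directly to MPI and the library moves the data, using RDMA when the hardware supports it. The function name is illustrative and is not HOOMD's API.
\begin{verbatim}
#include <mpi.h>

/* CUDA-aware MPI send: d_halo is a device pointer, passed straight to MPI
 * with no explicit host staging (illustrative; not HOOMD's internal code). */
void send_halo_cuda_aware(const double *d_halo, size_t n, int neighbor)
{
    MPI_Send((void *)d_halo, (int)n, MPI_DOUBLE, neighbor, 0,
             MPI_COMM_WORLD);
}
\end{verbatim}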