%\documentclass[10pt, conference, compsocconf]{IEEEtran}
%\documentclass[times,10pt,twocolumn,conference]{IEEEtran}
%\documentclass{sigplanconf}
\documentclass[10pt]{sigplanconf}
\usepackage{setspace}
\usepackage{times}
\usepackage{comment}
\usepackage{graphicx}
\usepackage{fancyvrb}
\usepackage{listings}
\usepackage{rotating}
\usepackage[usenames]{color}
\usepackage{tabularx,colortbl}
\usepackage{subfigure}
%\usepackage{cite}
\usepackage{multicol}
%ACM SIGPLAN
\usepackage{amsmath}
%\PassOptionsToPackage{hyphens}{url}\usepackage{hyperref}
\usepackage[hyphens]{url}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{authblk}
\pagenumbering{gobble}
%ACM
\special{papersize=8.5in,11in}
\setlength{\pdfpageheight}{\paperheight}
\setlength{\pdfpagewidth}{\paperwidth}
\conferenceinfo{VEE~'15}{March 14--15, 2015, Istanbul, Turkey.}
\copyrightyear{2015}
\copyrightdata{978-1-4503-3450-1/15/03}
%\doi{nnnnnnn.nnnnnnn}
\doi{2731186.2731194}
\input{tex/macros}
\title{Supporting High Performance Molecular Dynamics in Virtualized Clusters using IOMMU, SR-IOV, and GPUDirect}
%A bit of a hack, but the only way to get 4 authors and 2 institutions in good space
\subtitle{\large{Andrew J. Younge$^1$, John Paul Walters$^2$, Stephen P. Crago$^2$, Geoffrey C. Fox$^1$}}
\authorinfo{$^1$ School of Informatics and Computing \\
Indiana University }
{Bloomington, IN 47408}
{\{ajyounge,gcf\}@indiana.edu}
\authorinfo{$^2$ Information Sciences Institute \\
University of Southern California }
{Arlington, VA 22203}
{\{jwalters,crago\}@isi.edu}
%Broken or bad ideas for author list
\begin{comment}
\authorinfo{Andrew J. Younge, John Paul Walters, Stephen P. Crago, Geoffrey C. Fox}
{
\begin{multicols}{2}{
School of Informatics \& Computing\\
Indiana University\\
Bloomington, IN 47408}
{Information Sciences Institute\\
University of Southern California\\
Arlington, VA 22203}
}
{ajyounge@indiana.edu}
\begin{comment}
\authorinfo{Andrew J. Younge}
{School of Informatics \& Computing\\
Indiana University \\
Bloomington, IN 47408}
{ajyounge@indiana.edu}
\authorinfo{John Paul Walters}
{Information Sciences Institute\\
University of Southern California\\
Arlington, VA 22203}
{jwalters@isi.edu}
\authorinfo{Stephen P. Crago}
{Information Sciences Institute\\
University of Southern California\\
Arlington, VA 22203}
{crago@isi.edu}
\authorinfo{Geoffrey C. Fox}
{School of Informatics \& Computing\\
Indiana University \\
Bloomington, IN 47408}
{gcf@indiana.edu}
\end{comment}
\begin{document}
\maketitle
\begin{abstract}
Cloud Infrastructure-as-a-Service paradigms have recently shown their utility for a vast array of computational problems, ranging from advanced web service architectures to high throughput computing. However, many scientific computing applications have been slow to adapt to virtualized cloud frameworks. This is due to performance impacts of virtualization technologies, coupled with the lack of advanced hardware support necessary for running many high performance scientific applications at scale.
By using KVM virtual machines that leverage both Nvidia GPUs and InfiniBand, we show that molecular dynamics simulations with LAMMPS and HOOMD run at near-native speeds. This experiment also illustrates how virtualized environments can support the latest parallel computing paradigms, including both MPI+CUDA and new GPUDirect RDMA functionality. Specific findings show initial promise for scaling such applications to larger production deployments targeting large-scale computational workloads.
%While these experiments do go beyond a single-node, their early implementation is limited to only 4 nodes due to the lack of feasible resources. Currently efforts are under way to scale the deployment to hundreds of cores and 32 GPUs within the FutureGrid testbed, which we look to demonstrate at Supercomputing 2014 in November.
%and Cloud IaaS platforms may be well suited for supporting larger scale scientific applications, including support for the long tail of science.
%* High performance Cloud IaaS \\
%* Support mid-tier scientific computing \\
%* Long tail of science \\
%* MD simulations \\
%* Good results \\
\end{abstract}
\keywords
Cloud computing; IaaS; Virtualization; KVM; IOMMU; SR-IOV; GPUDirect; Molecular Dynamics; OpenStack
\section{Introduction}
The cloud computing paradigm has become pervasive for commodity computing, but has not yet become widely accepted in the High Performance Computing (HPC) community.
%At present, we inevitably stand at the intersection between High Performance Computing (HPC) and clouds.
Various platform tools such as Hadoop and MapReduce, among others, have already percolated into data intensive computing within HPC~\cite{jha2014apache}. In addition, there are efforts to support traditional HPC-centric scientific computing applications in virtualized cloud infrastructure. There are a multitude of reasons for supporting parallel computation in the cloud~\cite{Armbrust2010}, including features such as dynamic scalability, specialized operating environments, simple management interfaces, fault tolerance, and enhanced quality of service, to name a few. There is a growing push to support advanced scientific computing on virtualized infrastructure, as seen in a variety of new efforts, including the Comet resource within XSEDE at the San Diego Supercomputer Center \cite{moore2014gateways}.
Nevertheless, there exists a past notion that virtualization used in
cloud infrastructure is inherently inefficient. Historically, cloud
infrastructure has done little to provide the necessary advanced hardware
capabilities that have become almost mandatory in supercomputers today, most
notably advanced GPUs and high-speed, low-latency interconnects. Instead, cloud
infrastructure providers have favored homogeneous commodity systems. Together, these notions have hindered the use of virtualized environments for parallel computation, where performance is paramount.
This is starting to change, however, as today's cloud providers seek improved
performance at lower power. This has resulted in a heterogeneous cloud.
Amazon offers GPU accelerators in EC2~\cite{www-amazon-gpu}, and
OpenStack supports heterogeneity using flavors~\cite{www-openstack-flavors}.
These advancements in cloud-level support for heterogeneity, combined with better
support for high-performance virtualization, make the use of clouds for HPC much
more feasible for a wider range of applications and platforms.
Still, performance remains a concern within virtualized environments. To that
end, a growing effort is underway to systematically identify and
reduce the overheads of virtualization technologies. While some of the first efforts to investigate HPC applications on cloud infrastructure, such as the DOE Magellan project \cite{yelick2011magellan}, documented many shortcomings in performance, recent efforts have proven largely successful~\cite{Younge2011cloud, Luszczek:2011:EHC}, though further research is needed to address
issues of scalability and I/O.
Thus, we see constantly diminishing overhead
with virtualization, not only with traditional cloud workloads
\cite{huber2011evaluating} but also with HPC workloads. While virtualization
will almost always introduce some overhead in exchange for its dynamic
features, the goal for supporting HPC in virtualized environments is to
minimize that overhead wherever possible. To advance the placement of
HPC applications on virtual machines, new efforts are emerging which focus
specifically on key hardware now commonplace in supercomputers. By leveraging
new virtualization tools such as I/O memory management unit (IOMMU) device passthrough and Single Root I/O Virtualization (SR-IOV), we can now
support such advanced hardware as the latest Nvidia Tesla GPUs
\cite{Walters2014cloud} as well as InfiniBand fabrics for high performance networking
and I/O~\cite{jose2013sr,Musleh2014cloud}. While previous efforts have
focused on single-node advancements, our contribution in this paper is to show
that real-world applications can operate efficiently in multi-node clusters and cloud
infrastructure.
%With the advances in hypervisor performance coupled with the newfound availability of HPC hardware in virtual machines analogous to the most powerful supercomputers used today, we can see the possibility of a high performance cloud infrastructure using virtualization. While our previous efforts in this area have focused on single-node advancements, it is now imperative to ensure real-world applications can also operate in distributed environments as found in today's cluster and cloud infrastructures.
The remainder of this paper is organized as follows. In
Section~\ref{Background} we describe the background and related work necessary for high performance
virtualization. In Section~\ref{openstack}, we describe a heterogeneous cloud platform, based on OpenStack. This
effort has been under development at USC/ISI since 2011~\cite{crago2011heterogeneous}.
We describe our work towards integrating GPU and InfiniBand support into
OpenStack, and we describe the heterogeneous scheduling additions that are
necessary to support not only attached accelerators, but any cloud composed of
heterogeneous elements.
In Sections~\ref{benchmarks} and~\ref{setup} we describe the LAMMPS and HOOMD
benchmarks and our experimental setup. In Section~\ref{results} we characterize the
performance of LAMMPS and HOOMD in a virtual infrastructure complete with both
Kepler GPUs and QDR InfiniBand. Both HOOMD and LAMMPS are used extensively on
some of the world's fastest supercomputers and are representative of the simulations
HPC systems support today. We show that these applications are able to run at
near-native speeds within a completely virtualized environment.
Furthermore, we demonstrate the ability of such a virtualized environment to
support cutting edge technologies such as RDMA GPUDirect, illustrating that the
latest HPC technologies are also possible in a virtualized environment.
In Section~\ref{discussion}, we provide a brief discussion before concluding in
Section~\ref{conclusion}.
%Following these efforts, we hope to ensure upstream infrastructure projects such as OpenStack \cite{www-openstack, pepple2011deploying} are able to make effective and quick use of these features, allowing users to build private cloud infrastructure to support high performance distributed computational workloads.
%Furthermore, the tighet and exact integration into an open source Cloud infrastructure framework such as OpenStack also becomes a critical next step.
%This manuscript demonstrates
%The tight and exact integration into an open source cloud IaaS framework such as OpenStack \cite{www-openstack} becomes critical.
%* Broadly talk about clouds, OpenStack, some of our earlier HPC and GPU work\\
%* We've shown single node GPU performance at nearly 100\% efficieny (we'll need to be more accurate/precise than that in the actual submission). \\
%* In this work we demonstrate two molecular dynamics simuations running in a virtual infrastructure: LAMMPS and HOOMD \\
%* We show that both perform near-native, and we show GPU Direct RDMA for the first time in the cloud \\
\section{Background and Related Work}\label{Background}
%NOTE: first introduce virtualization, and I/O featuresets. Then enable GPUs. Then bring in SR-IOv and infiniBand. Finally, discuss applications and GPUDirect.
Virtualization technologies and hypervisors have seen
widespread deployment in support of a vast array of applications. This ranges from public commercial cloud deployments such as Amazon EC2 \cite{amazon2010}, Microsoft Azure \cite{jennings2010cloud}, and Google's Cloud Platform \cite{www-google-platform} to private deployments within colocation facilities, corporate data centers, and even national-scale cyberinfrastructure initiatives. All of these deployments support various use cases and applications such as web servers, ACID and BASE databases, online object storage, and distributed systems, to name a few.
The use of virtualization and hypervisors to support various HPC solutions has been studied with mixed results. In~\cite{Younge2011cloud}, it is found that there is a great deal of variance between hypervisors when running various distributed-memory and MPI applications, and that, overall, KVM performed well across an array of HPC benchmarks. Furthermore, some applications may not fit well into default virtualized environments, such as High Performance Linpack \cite{Luszczek:2011:EHC}. Other studies have looked at interconnect performance under virtualization and found the performance of conventional techniques to be lacking even in the best-case scenario, with up to a 60\% performance penalty \cite{Ramakrishnan2012}.
Recently, various CPU architectures have added support for I/O virtualization mechanisms through the use of an I/O memory management unit (IOMMU). Often, this is referred to as PCI Passthrough, as it enables devices on the PCI-Express bus to be passed directly to a specific virtual machine (VM). Specific hardware implementations include Intel's VT-d \cite{intelvirtualization} and AMD's IOMMU \cite{amdiommu} for x86\_64 architectures, and more recently the ARM System MMU \cite{armmmu}. All of these implementations allow DMA-capable hardware to be assigned to and used directly within a specific virtual machine. By using these features, a wide array of hardware can be utilized directly within VMs, enabling fast and efficient computation and I/O.
%%commented out of camera ready to safe space. Perhaps more detail than necessary
%With PCI Passthrough, a PCI device is handed directly to a running (or booting) VM, thereby relinquishing control of the device within the host entirely. This is different from typical VM usage where hardware is emulated in the host and used in a guest VM, such as with bridged Ethernet adapters or emulated VGA devices. Performing PCI Passthrough requires the host to seize the device upon boot using a specialized driver to effectively block normal driver initialization. In the instance of the KVM hypervisor, this is done using the \emph{vfio} and \emph{pci\_stub} drivers. Then, this driver relinquishes control to the VM, whereby normal device drivers initiate the hardware and enable the device for use by the guest OS.
\subsection{GPU Passthrough}
Nvidia GPUs comprise the single most common accelerator in the November 2014 Top500 list \cite{www-top500} and represent an increasing shift towards accelerators for HPC applications. Historically, GPU usage in a virtualized environment has been difficult, especially for scientific computation. Various front-end remote API implementations have been developed to provide CUDA and OpenCL libraries in VMs, translating library calls to a back-end or remote GPU. One common implementation is rCUDA \cite{duato2011enabling}, which provides a front-end CUDA API within a VM or any compute node and then sends the calls via Ethernet or InfiniBand to a separate node with one or more GPUs. While this method provides the desired functionality, it has the drawback of relying on the interconnect and its available bandwidth, which can be especially problematic on an Ethernet-based network. Furthermore, as this method consumes bandwidth, it can leave little remaining for MPI or RDMA routines, creating a bottleneck for MPI+CUDA applications that depend on inter-process communication. Another mechanism for using GPUs in VMs is hypervisor-based virtualization \cite{suzuki2014gpuvm}.
Recently, efforts have emerged to support such GPU accelerators within VMs using IOMMU technologies, with implementations now available for KVM, Xen, and VMware \cite{Walters2014cloud, Younge2014hpgc, tian2014full, Vu2014}. These efforts have shown that GPUs can achieve up to 99\% of their bare-metal performance when passed to a virtual machine using PCI Passthrough. %VMWare specifically shows how such PCI Passthrough solutions perform well and are likely to outperform front-end Remote API solutions such as rCUDA within a VM\cite{Vu2014}.
While it has been demonstrated that using PCI Passthrough results in high performance across a range of hypervisors and GPUs, these efforts have been limited to investigating single-node performance until now.
\subsection{SR-IOV and InfiniBand}
For almost all parallel HPC applications, the interconnect fabric that enables fast and efficient communication between processors is a central requirement for achieving good performance. Specifically, a high-bandwidth link is needed for distributed processors to share large amounts of data across the system. Furthermore, low latency is equally important for ensuring quick delivery of small messages and for resolving large collective barriers within many parallelized codes. One such interconnect, InfiniBand, has become the most common interconnect within the Top500 list. %However previously InfiniBand was inaccessible to virtualized environments.
Supporting I/O interconnects in VMs has been aided by Single Root I/O Virtualization (SR-IOV), whereby multiple virtual PCI functions are created in hardware to represent a single PCI device. These virtual functions (VFs) can then be passed to a VM and used by the guest as if it had direct access to that PCI device. SR-IOV allows for the virtualization and multiplexing to be done within the hardware, effectively providing higher performance and greater control than software solutions.
SR-IOV has been used in conjunction with Ethernet devices to provide high performance 10Gb TCP/IP connectivity within VMs \cite{Liu2010}, offering near-native bandwidth and advanced QoS features not easily obtained through emulated Ethernet offerings. Currently, Amazon EC2 offers a high performance VM solution utilizing SR-IOV enabled 10Gb Ethernet adapters. While SR-IOV enabled 10Gb Ethernet solutions offer a big step forward in performance, Ethernet still does not provide the high bandwidth or low latency typically found with InfiniBand solutions.
Recently, SR-IOV support for InfiniBand has been added by Mellanox in the ConnectX series adapters. Initial evaluation of SR-IOV InfiniBand within KVM VMs has demonstrated near-native point-to-point bandwidth, but with up to 30\% latency overhead for very small messages \cite{jose2013sr, RuivoAGTKNR14}. However, even with the noted overhead, this still signifies up to an order of magnitude difference in latency between InfiniBand and Ethernet within VMs. Furthermore, advanced configuration of SR-IOV enabled InfiniBand fabrics is taking shape, with recent research showing up to a 30\% reduction in the latency overhead \cite{Musleh2014cloud}. However, real application performance has not been well understood until now.
\subsection{GPUDirect}
NVIDIA's GPUDirect technology was introduced to reduce the overhead of data movement across GPUs~\cite{GPUDirect, shainer2011development}.
%GPUDirect supports both networking as well as peer-to-peer interfaces for single node multi-GPU systems. The most recent implementation of GPUDirect, version 3, adds support for RDMA over InfiniBand for Kepler-class GPUs.
Currently, there exist three distinct versions of GPUDirect. GPUDirect v1 adds
accelerated communication with network and storage devices through the use of a
single CPU buffer, and GPUDirect v2 provides peer-to-peer communication between
discrete GPUs on a single node. GPUDirect v3, the most recent version and the one
used in this manuscript, provides support for direct RDMA between GPUs across
an InfiniBand interconnect for Kepler-class GPUs. This alleviates the need for
staging data to/from host memory in order to transmit data via InfiniBand
between GPUs on separate nodes, although it does require application code to
target GPUDirect.
GPUDirect relies on three key technologies: CUDA 5
(and up), a CUDA-enabled MPI implementation, and a Kepler-class GPU (RDMA only).
Both MVAPICH and OpenMPI support GPUDirect. Support for RDMA over GPUDirect is
enabled by the MPI library, given supported hardware, without requiring changes
to the MPI calls.
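For illustration, the following minimal sketch (our own illustration, not drawn from either application's source) shows this usage model: a device-resident buffer is handed directly to MPI, and a CUDA-aware library such as MVAPICH GDR moves the data, using GPUDirect RDMA when the hardware supports it.
\begin{lstlisting}[language=C,basicstyle=\small\ttfamily]
/* Sketch: exchange a GPU-resident buffer with a CUDA-aware MPI.
 * Assumes an MPI build with CUDA support (e.g., MVAPICH GDR);
 * a plain MPI would require staging through host memory. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                 /* 1M floats per rank */
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    if (rank == 0) {
        /* Device pointer is passed directly to MPI; the library
         * (and GPUDirect RDMA, when available) moves the data. */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
\end{lstlisting}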
In this paper, we demonstrate scaling an MD simulation to 4 nodes connected via QDR InfiniBand
and show that GPUDirect RDMA improves both scalability and overall performance
by approximately 9\% at no cost to the end user.
\input{openstack}
\input{benchmarks}
%Similarly, Infiniband SR-IOV (Single Root I/O Virtualization) has been evaluated within the context of microbenchmarks~\cite{SRIOVInfiniband,Musleh2014cloud}, but performance for real applications is not yet well-understood.
%moved to related work. right placement?
%Recent work recent work has focused on single-node performance. In \cite{walters2014}, we've shown how the latest Kepler GPUs from Nvidia with Sandy-Bridge Intel Xeon CPUs can perform at near-native performance running various workloads across wide range of hypervisors. Furthermore, advanced configuration of SR-IOV enabled Infiniband fabric has taken shape, with recent research showing up to a 30\% reduction latency \cite{musleh2014}.
\section{Experimental Setup}\label{setup}
Using two molecular dynamics tools, LAMMPS~\cite{plimpton2007lammps} and HOOMD~\cite{anderson2010hoomd}, we demonstrate a high performance \textit{system}. That is, we combine PCI passthrough for Nvidia Kepler-class GPUs with SR-IOV QDR InfiniBand and show that high performance molecular dynamics simulations are achievable within a virtualized environment.
For the first time, we also demonstrate Nvidia GPUDirect technology within such a virtual environment. Thus, we look to not only illustrate that virtual machines provide a flexible high performance infrastructure for scaling scientific workloads, including MD simulations, but also that the latest HPC features and programming environments are available and efficient in this same model.
\subsection{Node configuration}
To support the use of Nvidia GPUs and InfiniBand within a VM, specific host configuration is needed. This node configuration is illustrated in Figure \ref{F:passthrough}. While our implementation is specific to the KVM hypervisor, the setup represents a design that can be hypervisor-agnostic.
Each node in the testbed uses CentOS 6.4 with a 3.13 upstream Linux kernel for the host OS, along with the latest KVM hypervisor, QEMU 2.1, and the \emph{vfio} driver. Each guest VM runs CentOS 6.4 with a stock 2.6.32-358.23.2 kernel. A Kepler GPU is passed through using PCI Passthrough and initialized directly within the VM via the Nvidia 331.20 driver and CUDA release 5.5. While this specific implementation used only a single GPU, it is also possible to include as many GPUs as will fit within the PCI Express bus if desired. Because the GPU is dedicated to the VM, an on-board VGA device was used by the host and a standard Cirrus VGA device was emulated in the guest OS.
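Within the guest, the passed-through GPU appears as a native CUDA device; a simple device query (a minimal sketch, assuming only the CUDA runtime installed inside the VM) is sufficient to confirm that the driver has initialized it:
\begin{lstlisting}[language=C,basicstyle=\small\ttfamily]
/* Sketch: verify the passed-through GPU is visible to the guest. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "No CUDA device visible in this guest\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, %d SMs, %.1f GB\n", i, prop.name,
               prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
\end{lstlisting}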
OFED version 2.1-1.0.0 drivers are used with a Mellanox ConnectX-3 VPI adapter with firmware 2.31.5050. The host driver instantiates 4 VFs, one of which is passed through to the VM, where the default OFED mlnx\_ib drivers are loaded.
\FIGURE{!htb}
{images/host-pci-passthrough.png}
{1.0}
{Node PCI Passthrough of GPUs and InfiniBand}
{F:passthrough}
%The native bare-metal base system and all guest VMs are composed of a CentOS 6.4 installation with a 2.6.32-358.23.2 stock kernel, MVAPICH 2.0 GDR, and CUDA version 5.5. Each guest was allocated 20 GB of RAM and a full socket (8 cores) as well as a single InfiniBand virtual function and 1 Kepler GPU per VM.
%CentOS 6.4 with a 3.13 upstream Linux kernel was used as the host OS with the KVM hypervisor. The native bare-metal base system and all guest VMs are composed of a CentOS 6.4 installation with a 2.6.32-358.23.2 stock kernel, MVAPICH 2.0 GDR, and CUDA version 5.5. Each guest was allocated 20 GB of RAM and a full socket (8 cores) as well as a single InfiniBand virtual function and 1 Kepler GPU per VM.
\subsection{Cluster Configuration}
Our test environment is composed of 4 servers each with a single Nvidia
Kepler-class GPU. Two servers are equipped with K20 GPUs, while the other two
servers are equipped with K40 GPUs, demonstrating the potential for a more
heterogeneous deployment. Each server is composed of 2 Intel Xeon E5-2670 CPUs,
48GB of DDR3 memory, and Mellanox ConnectX-3 QDR InfiniBand. CPU sockets and
memory are split evenly between the two NUMA nodes on each system. All
InfiniBand adapters use a single Voltaire 4036 QDR switch with a software subnet
manager for IPoIB functionality.
For these experiments, both the GPUs and InfiniBand adapters are attached to NUMA node 1, and both the guest VMs and the base system utilized identical software stacks. Each guest was allocated 20 GB of RAM and a full socket of 8 cores, and was pinned to NUMA node 1 to ensure optimal hardware usage.
%While all VMs are capable of login via the InfiniBand IPoIB setup, a 1Gb Ethernet network was used for all management and login tasks.
For a fair and effective comparison, we also use a native environment without any virtualization. This native environment employs the same hardware configuration and, like the guest OS, runs CentOS 6.4 with the stock 2.6.32-358.23.2 kernel.
%Our test environment is composed of 4 servers each with a single Nvidia Kepler-class GPU. Two servers are equipped with K20 GPUs, while the other two servers are equipped with K40 GPUs, demonstrating the potenti
%In order to effectively test MD simulations in LAMMPS and HOOMD beyond single-node tests, ronment. Bespin includes 4 blades, each with 2 Intel Xeron E5-2670 CPUs, 48Gb DDR3 memory, Mellanox ConnectX3 QDR InfiniBand cards, and a mixture of Nvidia Kepler series K20 and K40 GPUs. While the effective experimental hardware allocation remains relatively low compared to production runs of either application, it does allow for a useful evaluation at a larger scale than preivously evaluated as well as a valued extrapolation to larger resources.
\section{Results}\label{results}
In this section, we discuss the performance of both the LAMMPS and HOOMD molecular dynamics simulation tools when running within a virtualized environment.
%Specifically, we scale each application to 4 nodes4 GPUs, both in a native bare-metal and virtualized environments.
Each application set was run 10 times, with the results averaged accordingly.
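Throughout this section, efficiency is reported as the ratio of virtualized to bare-metal performance (the working definition we adopt here),
\begin{equation*}
\mathrm{efficiency} = \frac{P_{\mathrm{VM}}}{P_{\mathrm{native}}} \times 100\%,
\end{equation*}
where $P$ denotes the application-reported performance (e.g., simulation timesteps per second) averaged over the ten runs.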
\subsection{LAMMPS}
\FIGURE{!htb}
{images/lammps-lj-scale.png}
{1.0}
{LAMMPS LJ Performance}
{F:lammps-lj}
\FIGURE{!htb}
{images/lammps-rhodo-scale.png}
{1.0}
{LAMMPS RHODO Performance}
{F:lammps-rhodo}
Figure~\ref{F:lammps-lj} shows results for one of the most common LAMMPS benchmarks, the Lennard-Jones potential (LJ). This benchmark is deployed in two main configurations, a 1:1 core-to-GPU mapping and an 8:1 core-to-GPU mapping, labeled in Figures~\ref{F:lammps-lj} and~\ref{F:lammps-rhodo} as 4c/4g and 32c/4g, respectively. With the LAMMPS GPU implementation, a delicate balance between GPUs and CPUs is required to find the ratio that yields the fastest computation; here we consider only the two most obvious choices. With small problem sizes, the 1:1 mapping outperforms the more complex core deployment, as the problem does not require the additional parallelism provided by a multi-core solution. As expected, the multi-core configuration quickly offers better performance for larger problem sizes, achieving roughly twice the performance with all 32 available cores.
%This is largely due to the availability of all 8 cores to keep the GPU running 100\% with continual computation.
This is largely due to the availability of all 8 cores to keep the GPU fully utilized.
The important factor for this manuscript is the relative performance of the virtualized environment. From the results, it is clear that the VM solution performs very well compared to the best-case native deployment. For the multi-core configuration across all problem sizes, the virtualized deployment averaged 98.5\% efficiency compared to native. The single core per GPU deployment reported better-than-native performance, at 100\% of native or slightly above. This is likely due to caching effects, but further investigation is needed to fully explain this behavior.
%and the Rhodopsin protein in solvated lipid bilayer benchmark (Rhodo), both running with the GPU package across 8 cores per GPU. Here we see that both benchmarks scale remarkably well in the virtualized KVM guest environment.
%Compared to the base system performance, the VMs running LAMMPs acheive 96.7\% and 99.3\% efficiency for LJ and Rhodo, respectively when running across all nodes. These low overheads illustrate the utility of running LAMPS on cloud infrastructure, and also hold promise for other hybrid MPI + CUDA applications to also scale well in a virtualized environment.
Another common LAMMPS algorithm, the Rhodopsin protein in solvated lipid bilayer benchmark (Rhodo), was also run, with results given in Figure \ref{F:lammps-rhodo}. As with the LJ runs, we see the multi-core per GPU configuration resulting in higher computational performance for the larger problem sizes compared to the single core per GPU configuration, as expected.
Again, the overhead of the virtualized configuration remains low across all configurations and problem sizes, with an average 96.4\% efficiency compared to native. We also see the gap in performance decrease as the problem size increases, with the 512k problem size yielding 99.3\% of native performance. This finding leads us to extrapolate that a virtualized MPI+CUDA implementation could scale to a larger computational resource with similar success.
\subsection{HOOMD}
% Compared to the base system's performance, we see overheads of 3.2\% and 0.6\% for the LJ and Rhodo benchmarks, respectively, when running 8 cores per GPU at experimental scale.
In Figure~\ref{F:HOOMD} we show the performance of a Lennard-Jones liquid
simulation with 256K particles running under HOOMD. HOOMD includes support for
CUDA-aware MPI implementations via GPUDirect. The MVAPICH 2.0 GDR
implementation enables a further optimization by supporting RDMA for GPUDirect.
From Figure~\ref{F:HOOMD} we can see that HOOMD simulations, both with and
without GPUDirect, perform at very near-native speeds. The GPUDirect results at 4 nodes
achieve 98.5\% of the base system's performance, and the non-GPUDirect results
achieve 98.4\% efficiency at 4 nodes. These results indicate that the virtualized HPC
environment is able to support such complex workloads. While the effective
testbed size is relatively small, the results suggest that such workloads may scale
equally well to hundreds or thousands of nodes. The advantage of using GPUDirect
RDMA is also evident in Figure~\ref{F:HOOMD}, with a 9\% performance boost realized for both virtualized and non-virtualized experiments.
\FIGURE{!htb}
{images/hoomd.png}
{1.0}
{HOOMD LJ Performance with 256k Simulation}
{F:HOOMD}
\section{Discussion}\label{discussion}
From the results, we see the potential for running HPC applications in a virtualized environment using GPUs and an InfiniBand interconnect fabric. Across all LAMMPS runs, we found only a 1.9\% overhead between the KVM virtualized environment and native. For HOOMD, we found a similar 1.5\% overhead, both with and without GPUDirect. These results go against the conventional wisdom that HPC workloads perform poorly in VMs. In fact, we show that two N-body simulations implemented with MPI+CUDA run at near-native performance in tuned KVM virtual machines.
With HOOMD, we see how GPUDirect RDMA shows a clear advantage over the
non-GPUDirect implementation, achieving a 9\% performance boost in both the
native and virtualized experiments. While GPUDirect's performance impact has been well evaluated previously \cite{GPUDirect}, it is the authors' belief that this manuscript represents the first time GPUDirect has been utilized in a virtualized environment.
Another interesting finding of running LAMMPS in a virtualized environment is that, as the workload scales from a single node to 32 cores, the overhead does not increase. These results lend credence to the notion that this solution would also work for a much larger deployment, assuming system jitter can be minimized \cite{Seelam2010}. Specifically, it would be possible to expand such computational problems to a larger deployment on FutureGrid \cite{fox2013futuregrid}, Chameleon Cloud \cite{www-chameleon}, or even the planned NSF Comet machine at SDSC, scheduled to provide up to 2 petaflops of computational power. Effectively, these results provide evidence that HPC computations can be supported in a virtualized environment with minimal overhead.
\section{Conclusion}\label{conclusion}
Running large-scale parallel scientific applications in the cloud has become possible, but has historically been limited by both performance concerns and infrastructure availability.
In this work we show that advanced HPC-oriented hardware, such as the latest Nvidia GPUs and InfiniBand fabrics, is now available within a virtualized infrastructure. Our results show that MPI+CUDA applications, such as molecular dynamics simulations, run at near-native performance compared to traditional non-virtualized HPC infrastructure, with average overheads of just 1.9\% and 1.5\% for LAMMPS and HOOMD, respectively. We also show the utility of GPUDirect RDMA for the first time in a cloud environment with HOOMD.
%Effectively, we look to pave the way for large-scale virtualized cloud infrastructure to support a wide array of advanced scientific computation commonly found running on many supercomputers today.
Support for these workloads coalesces within an open source
Infrastructure-as-a-Service framework based on OpenStack. Our future efforts will be twofold. We first hope to increase the scale to significantly larger cloud infrastructure on a new experimental system and confirm that near-native performance is achievable in a virtualized environment. We also look to expand the workload diversity and support many other HPC and big data applications running efficiently in virtual clusters.
%The tight and exact integration into an open source cloud IaaS framework such as OpenStack \cite{www-openstack} becomes critical.
%* Multi-node virtualized MD at low overhead \\
%* First to show GDR in virtual machine \\
\acks
This work was developed with support from the National Science Foundation (NSF)
under grant \#0910812 to Indiana University and with support from the Office
of Naval Research under grant \#N00014-14-1-0035 to USC/ISI. Andrew J. Younge also
acknowledges support from The Persistent Systems Fellowship of the School of
Informatics and Computing at Indiana University.
%This document was developed with support from the National Science Foundation
%(%NSF) under Grant No. 0910812 to Indiana University for ``FutureGrid: An
%Experimental, High Performance Grid Test-bed.'' %Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
%Andrew J. Younge also acknowledges support from The Persistent Systems Fellowship of the School of Informatics and Computing at Indiana University.
%\bibliographystyle{tex/IEEEtran}
\bibliographystyle{abbrvnat}
\bibliography{biblio}
\end{document}