This is the para-virtualized front-end driver for CUDA-supported QEMU, together with test cases.
The wrapper runtime library in the VM (guest OS) provides CUDA runtime access: it exposes interfaces for memory allocation and CUDA commands, and passes those commands on to the driver.
The front-end driver is responsible for memory management and data transfer; it analyzes the ioctl commands issued by the customized library and passes them through the control channel, as the sketch below illustrates.
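As a rough picture of that flow, the sketch below shows how a wrapper-library call could be packed up and handed to the front-end driver. The device node `/dev/qcuda`, the `VirtioQCArg` layout, and the `VIRTQC_CMD_MALLOC` command code are hypothetical placeholders, not the actual interface of this repo.

```c
/* Hypothetical sketch: forwarding cudaMalloc from the guest wrapper
 * library to the front-end driver. /dev/qcuda, VirtioQCArg, and
 * VIRTQC_CMD_MALLOC are illustrative placeholders only. */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

typedef struct {
    uint64_t cmd;     /* which CUDA runtime call is requested    */
    uint64_t param1;  /* e.g. the requested allocation size      */
    uint64_t result;  /* filled in by the driver: device pointer */
} VirtioQCArg;

#define VIRTQC_CMD_MALLOC 0x100  /* placeholder ioctl command number */

static int qcu_fd = -1;

/* The wrapper's cudaMalloc: pack the arguments, hand them to the
 * front-end driver via ioctl, and return the device pointer that the
 * back-end sends back through the control channel. */
int wrapped_cudaMalloc(void **devPtr, size_t size)
{
    if (qcu_fd < 0)
        qcu_fd = open("/dev/qcuda", O_RDWR);
    if (qcu_fd < 0)
        return -1;

    VirtioQCArg arg = { .cmd = VIRTQC_CMD_MALLOC, .param1 = size, .result = 0 };
    if (ioctl(qcu_fd, VIRTQC_CMD_MALLOC, &arg) < 0)
        return -1;

    *devPtr = (void *)(uintptr_t)arg.result;
    return 0;  /* cudaSuccess */
}
```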
Our experiment environment is as follows:
- Ubuntu 16.04.5 LTS (kernel v4.15.0-29-generic x86_64)
- cuda-9.1
- PATH

```shell
echo 'export PATH=$PATH:/usr/local/cuda/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64' >> ~/.bashrc
source ~/.bashrc
sudo bash -c "echo /usr/local/cuda/lib64/ > /etc/ld.so.conf.d/cuda.conf"
sudo ldconfig
```
- Install required packages

```shell
sudo apt-get install -y pkg-config bridge-utils uml-utilities zlib1g-dev \
    libglib2.0-dev autoconf automake libtool libsdl1.2-dev libsasl2-dev \
    libcurl4-openssl-dev libaio-dev libvde-dev libspice-server-dev
```
- Ubuntu 16.04 x86_64 image (guest OS)
- cuda-9.1 toolkit
- Our QEMU was modified from QEMU 2.12.0; for further information, please refer to the QEMU installation steps.
- Clone this repo.
[to do]
In the guest OS, nvcc compiles source files containing host/device code and
standard CUDA runtime APIs. Unlike on a native OS, compiling a CUDA program
in the guest VM requires the nvcc flag "--cudart=shared", so that the CUDA
runtime is dynamically linked as a shared library and can be replaced by the
userspace library. The wrapper library accordingly provides functions that
intercept the dynamic memory allocations of the CPU code as well as the CUDA
runtime APIs, as the generic sketch below illustrates.
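The interception itself can be pictured with the standard LD_PRELOAD pattern below. This is a minimal generic sketch, not the actual source of libvcuda.so, and the bookkeeping comment stands in for whatever the real wrapper does with the allocation.

```c
/* Generic LD_PRELOAD interception sketch (not this repo's actual code).
 * Because the preloaded library is resolved first, the symbols it defines
 * shadow those of libc and of the shared CUDA runtime; the original
 * function remains reachable via dlsym(RTLD_NEXT, ...). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

/* Intercept the CPU code's dynamic memory allocation. */
void *malloc(size_t size)
{
    static void *(*real_malloc)(size_t) = NULL;
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

    void *p = real_malloc(size);
    /* ... record the block so it can later be handed to the
     * para-virtualized memory manager (bookkeeping elided) ... */
    return p;
}
```

Built with something like `gcc -shared -fPIC -o libintercept.so intercept.c -ldl`, such a library takes effect for any dynamically linked binary started with LD_PRELOAD.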
After installing qCUdriver and qCUlibrary in the guest OS, modify the
internal flags in the Makefile as below:

```makefile
# internal flags
NVCCFLAGS := -m${TARGET_SIZE} --cudart=shared
```
Finally, run make and execute the binary via LD_PRELOAD, without changing any
source code, or alternatively change the LD_LIBRARY_PATH:

```shell
LD_PRELOAD=/path/to/libvcuda.so ./vectorAdd
```
- Benchmarking vectorAdd

The command-line benchmarking tool hyperfine is recommended. To run a benchmark, simply call `hyperfine <command>...`, for example:

```shell
hyperfine 'LD_PRELOAD=/path/to/libvcuda.so ./vectorAdd'
```

By default, hyperfine performs at least 10 benchmarking runs. To change this, use the -m/--min-runs or -M/--max-runs option.
Our current version implements the necessary CUDA runtime APIs, listed below (a minimal usage example follows the table):
| Classification | supported CUDA runtime API |
|---|---|
| Memory Management | cudaMalloc |
| | cudaMemset |
| | cudaMemcpy |
| | cudaMemcpyAsync |
| | cudaFree |
| | cudaMemGetInfo |
| | cudaMemcpyToSymbol |
| | cudaMemcpyFromSymbol |
| Device Management | cudaGetDevice |
| | cudaGetDeviceCount |
| | cudaSetDevice |
| | cudaSetDeviceFlags |
| | cudaGetDeviceProperties |
| | cudaDeviceSynchronize |
| | cudaDeviceReset |
| Stream Management | cudaStreamCreate |
| | cudaStreamCreateWithFlags |
| | cudaStreamDestroy |
| | cudaStreamSynchronize |
| | cudaStreamWaitEvent |
| Event Management | cudaEventCreate |
| | cudaEventCreateWithFlags |
| | cudaEventRecord |
| | cudaEventSynchronize |
| | cudaEventElapsedTime |
| | cudaEventDestroy |
| | cudaEventQuery |
| Error Handling | cudaGetLastError |
| | cudaGetErrorString |
| Zero-copy | cudaHostRegister |
| | ~~cudaHostGetDevicePointer~~ |
| | cudaHostUnregister |
| | cudaHostAlloc |
| | cudaMallocHost |
| | cudaFreeHost |
| | cudaSetDeviceFlags |
| Thread Management | cudaThreadSynchronize |
| Module & Execution Control | __cudaRegisterFatBinary |
| | __cudaUnregisterFatBinary |
| | __cudaRegisterFunction |
| | __cudaRegisterVar |
| | cudaConfigureCall |
| | cudaSetupArgument |
| | cudaLaunch |
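For reference, a minimal vectorAdd-style program that stays inside the subset above (cudaMalloc, cudaMemcpy, the kernel-launch path, cudaGetLastError, cudaFree) could look like the sketch below. This is a generic CUDA example, not a file shipped with this repo.

```cuda
// Minimal vector addition restricted to the API subset listed above.
// Compile in the guest with: nvcc -o vectorAdd vectorAdd.cu --cudart=shared
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);                     // supported: cudaMalloc
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // supported: cudaMemcpy
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);     // cudaConfigureCall/cudaLaunch path

    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[42] = %f (expected %f)\n", h_c[42], h_a[42] + h_b[42]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);          // supported: cudaFree
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```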
To support Caffe, we implement the CUBLAS and CURAND APIs in libcudart.so (a small usage example follows the table):
| Classification | supported API |
|---|---|
| CUBLAS API | cublasCreate |
| | cublasDestroy |
| | cublasSetVector |
| | cublasGetVector |
| | cublasSetStream |
| | cublasGetStream |
| | cublasSasum |
| | cublasDasum |
| | cublasScopy |
| | cublasDcopy |
| | cublasSdot |
| | cublasDdot |
| | cublasSaxpy |
| | cublasDaxpy |
| | cublasSscal |
| | cublasDscal |
| | cublasSgemv |
| | cublasDgemv |
| | cublasSgemm |
| | cublasDgemm |
| | cublasSetMatrix |
| | cublasGetMatrix |
| CURAND API | curandCreateGenerator |
| | curandCreateGeneratorHost |
| | curandGenerate |
| | curandGenerateUniform |
| | curandGenerateUniformDouble |
| | curandGenerateNormal |
| | curandGenerateNormalDouble |
| | curandDestroyGenerator |
| | curandSetGeneratorOffset |
| | curandSetPseudoRandomGeneratorSeed |
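Similarly, a small SAXPY (y = alpha*x + y) example that exercises only the CUBLAS calls listed above might look like the sketch below; it is a standard cuBLAS usage pattern, not code from this repo.

```cuda
// Minimal SAXPY restricted to the CUBLAS subset listed above.
// Compile in the guest with: nvcc -o saxpy saxpy.cu --cudart=shared -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int n = 1024;
    const float alpha = 2.0f;
    float h_x[1024], h_y[1024];
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = (float)i; }

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);                              // supported: cublasCreate

    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);  // supported: cublasSetVector
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);     // supported: cublasSaxpy

    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);  // supported: cublasGetVector
    printf("y[10] = %f (expected %f)\n", h_y[10], 10.0f + alpha);

    cublasDestroy(handle);                              // supported: cublasDestroy
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```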
- Part of NVIDIA_CUDA-9.1_Samples
- Rodinia benchmark
- Caffe: a fast open framework for deep learning.
Last but not least, thanks to qcuda for the inspiration.
Also, the message channels are built on [chan: Pure C implementation of Go channels](https://github.com/tylertreat/chan.git).