Legend is a lightweight heterogeneous system for efficient and cost-effective graph embedding learning, comprising a CPU, a GPU, and an NVMe SSD. It adopts a novel workflow that redefines data placement and assigns tasks to leverage the strengths of each hardware component. A prefetch-friendly order is proposed to support embedding prefetching from the NVMe SSD to the GPU, and a high-throughput GPU-NVMe SSD direct access driver is designed to reduce I/O overhead. Legend also optimizes GPU computing with a customized parallel execution strategy.
- Python 3.8
- CUDA 11.1
- torch 1.7.1
- Samsung 980 NVMe SSD
- Nvidia A100 GPU
The Nvidia driver kernel sources are typically installed in /usr/src/. The following commands are used to get the kernel symbols.
$ cd /usr/src/nvidia-550.54.15
$ sudo make
The default NVMe driver must be unbound before the customized NVMe driver is installed. This requires the PCI ID of the NVMe SSD, which can be found using lspci.
We assume the PCI ID is 86:00.0. The NVMe driver can be unbound using the following command, which writes to sysfs and therefore needs root privileges.
$ echo -n "0000:86:00.0" > /sys/bus/pci/devices/0000\:86\:00.0/driver/unbind
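The lookup of the PCI ID can also be scripted. The following is a minimal sketch, not part of Legend: it scans lspci text for lines whose class is "Non-Volatile memory controller" and builds the matching sysfs unbind path. The helper names and the sample text are hypothetical, and the default PCI domain 0000 is an assumption taken from the command above.

```python
import re

# One lspci line looks like "86:00.0 <class>: <vendor and device>".
# A PCI domain prefix ("0000:") is only printed when lspci is run with -D.
_LSPCI_LINE = re.compile(
    r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+(.*)$"
)

# Illustrative text only; real output depends on the machine.
SAMPLE_LSPCI = """\
00:1f.0 ISA bridge: Example Vendor Example Bridge
86:00.0 Non-Volatile memory controller: Example Vendor Example NVMe SSD
"""


def find_nvme_pci_ids(lspci_output):
    """Return the PCI IDs of all NVMe controllers found in lspci text."""
    ids = []
    for line in lspci_output.splitlines():
        match = _LSPCI_LINE.match(line.strip())
        if match and "Non-Volatile memory controller" in match.group(2):
            ids.append(match.group(1))
    return ids


def unbind_target(pci_id, domain="0000"):
    """Return (full PCI address, sysfs unbind path) for a PCI ID.

    Assumes the default domain when the ID has no domain prefix.
    """
    full = pci_id if pci_id.count(":") == 2 else f"{domain}:{pci_id}"
    return full, f"/sys/bus/pci/devices/{full}/driver/unbind"


if __name__ == "__main__":
    for pci_id in find_nvme_pci_ids(SAMPLE_LSPCI):
        print(unbind_target(pci_id))
```

On a real machine, the text could come from `subprocess.run(["lspci"], capture_output=True, text=True).stdout`. The sketch only computes the path; writing the address to it still has to be done as root.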
From the project root directory, do the following:
$ mkdir build; cd build
$ cmake ..
$ make libnvm
$ make benchmarks
After this, the libnvm kernel module has to be compiled. In the build directory, do the following:
$ cd module
$ make
Next, load the custom libnvm kernel module from the module directory. It can be loaded and unloaded with the following commands:
$ sudo make load
$ sudo make unload
Loading the module should create a /dev/libnvm0 device file, representing the disk's BAR0.
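Before training, it can help to confirm that the device file is present. The check below is a small sketch under one assumption: that the module exposes /dev/libnvm0 as a character device, which is common for driver-created device nodes but is not stated in this README. The function name is hypothetical.

```python
import os
import stat


def is_char_device(path):
    """Return True if path exists and is a character device node."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing file or no permission to stat it.
        return False
    return stat.S_ISCHR(mode)


if __name__ == "__main__":
    ctrl = "/dev/libnvm0"  # path used in the training command
    if is_char_device(ctrl):
        print(f"{ctrl} is ready")
    else:
        print(f"{ctrl} not found; is the libnvm module loaded?")
```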
Each dataset can be obtained from the following links.
Dataset | Nodes | Edges | Relations | Link |
---|---|---|---|---|
FB15k | 15k | 592k | 1345 | https://dl.fbaipublicfiles.com/starspace/fb15k.tgz |
LiveJournal | 4.8M | 68M | - | https://snap.stanford.edu/data/soc-LiveJournal1.txt.gz |
Twitter | 41.6M | 1.46B | - | https://snap.stanford.edu/data/twitter-2010.txt.gz |
Freebase86m | 86.1M | 304.7M | 14824 | https://data.dgl.ai/dataset/Freebase.zip |
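To get a feel for why the larger graphs benefit from SSD-backed storage, the node counts in the table can be turned into a rough embedding-table size. This is a back-of-the-envelope sketch: the embedding dimension of 100 and 4-byte float32 values are assumptions, not Legend's settings. The node counts are the rounded figures from the table, and relation embeddings and optimizer state are ignored.

```python
# Rounded node counts copied from the dataset table.
DATASET_NODES = {
    "FB15k": 15_000,
    "LiveJournal": 4_800_000,
    "Twitter": 41_600_000,
    "Freebase86m": 86_100_000,
}


def embedding_table_bytes(num_nodes, dim, bytes_per_value=4):
    """Size of a dense node-embedding table: one dim-sized vector per node."""
    return num_nodes * dim * bytes_per_value


if __name__ == "__main__":
    dim = 100  # assumed embedding dimension, for illustration only
    for name, nodes in DATASET_NODES.items():
        size_gb = embedding_table_bytes(nodes, dim) / 1e9
        print(f"{name}: about {size_gb:.2f} GB of node embeddings")
```

Under these assumptions the node embeddings alone grow from a few megabytes for FB15k to tens of gigabytes for Twitter and Freebase86m.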
The executable files are located in build/bin.
To train without the NVMe SSD, run the following command.
$ ./nvm-train-nonvme
The training dataset is selected by editing the code in ./benchmarks/train_nonvme/main.cu.
To train with the NVMe SSD, run the following command.
$ ./nvm-train-nvme --ctrl=/dev/libnvm0 --threads=4096 --page_size=32768 --queue_pairs=8 --queue_depth=1024
- ctrl is the path of the custom NVMe controller file.
- threads specifies the number of threads used to load data from the NVMe SSD.
- page_size denotes the size of each I/O request.
- queue_pairs indicates the number of queue pairs in the NVMe controller.
- queue_depth specifies the depth of each queue.
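The parameters interact: the queue settings bound how many I/O requests can be outstanding at once. The sketch below assembles the command line shown above and computes that upper bound. It assumes page_size is given in bytes and that every queue slot could be occupied (real NVMe queues hold slightly fewer entries than their depth). The function names are hypothetical, and nothing here launches the trainer.

```python
def build_train_argv(ctrl="/dev/libnvm0", threads=4096, page_size=32768,
                     queue_pairs=8, queue_depth=1024,
                     binary="./nvm-train-nvme"):
    """Assemble the argument list for the NVMe training binary."""
    return [
        binary,
        f"--ctrl={ctrl}",
        f"--threads={threads}",
        f"--page_size={page_size}",
        f"--queue_pairs={queue_pairs}",
        f"--queue_depth={queue_depth}",
    ]


def max_outstanding_requests(queue_pairs, queue_depth):
    """Upper bound on in-flight requests if every queue slot is used."""
    return queue_pairs * queue_depth


def max_bytes_in_flight(page_size, queue_pairs, queue_depth):
    """Upper bound on in-flight bytes, assuming page_size is in bytes."""
    return page_size * max_outstanding_requests(queue_pairs, queue_depth)


if __name__ == "__main__":
    print(" ".join(build_train_argv()))
    print(max_outstanding_requests(8, 1024), "requests at most")
    print(max_bytes_in_flight(32768, 8, 1024) // 2**20, "MiB in flight at most")
```

With the values from the command above, the bound is 8192 requests, or 256 MiB of data in flight.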
The training dataset is selected by editing the code in ./benchmarks/train_nvme/main.cu.
The GPU-SSD direct access module of this project is built on top of an open-source codebase available here. We employ the framework of this codebase and develop a customized queue management mechanism (submission queue inserting, doorbell ringing, and completion queue polling) to improve the throughput between GPU and SSD.
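For readers unfamiliar with the three queue operations named above, the following toy model illustrates them on the host side in plain Python. It is a conceptual sketch of a generic NVMe-style queue pair, not Legend's GPU implementation: the class, its fields, and the simulated device are all hypothetical, and the completion-queue head doorbell that a real driver also writes is omitted.

```python
class ToyQueuePair:
    """Toy model of one submission/completion queue pair (not driver code).

    The caller must poll completions before submitting more commands than
    the completion ring can hold; the toy does not guard against overflow.
    """

    def __init__(self, depth):
        self.depth = depth
        self.sq = [None] * depth       # submission queue ring
        self.cq = [(None, 0)] * depth  # completion entries: (command id, phase)
        self.sq_tail = 0               # next free SQ slot (host side)
        self.sq_head = 0               # next SQ slot the device consumes
        self.sq_doorbell = 0           # last tail value made visible to the device
        self.cq_head = 0               # next CQ slot the host polls
        self.cq_tail = 0               # next CQ slot the device fills
        self.host_phase = 1            # phase the host expects on a new entry
        self.dev_phase = 1             # phase the device writes on this pass

    def submit(self, cid):
        """Insert a command into the SQ; return False if the ring is full."""
        nxt = (self.sq_tail + 1) % self.depth
        if nxt == self.sq_head:        # one slot stays empty to mark "full"
            return False
        self.sq[self.sq_tail] = cid
        self.sq_tail = nxt
        return True

    def ring_doorbell(self):
        """Publish the new tail so the device sees the inserted commands."""
        self.sq_doorbell = self.sq_tail

    def device_step(self):
        """Simulated device: consume published commands, post completions."""
        while self.sq_head != self.sq_doorbell:
            cid = self.sq[self.sq_head]
            self.sq_head = (self.sq_head + 1) % self.depth
            self.cq[self.cq_tail] = (cid, self.dev_phase)
            self.cq_tail = (self.cq_tail + 1) % self.depth
            if self.cq_tail == 0:      # wrapped: flip the phase for the next pass
                self.dev_phase ^= 1

    def poll(self):
        """Collect completions whose phase bit matches the expected phase."""
        done = []
        while self.cq[self.cq_head][1] == self.host_phase:
            done.append(self.cq[self.cq_head][0])
            self.cq_head = (self.cq_head + 1) % self.depth
            if self.cq_head == 0:      # wrapped: expect the opposite phase
                self.host_phase ^= 1
        return done


if __name__ == "__main__":
    qp = ToyQueuePair(depth=4)
    for cid in range(3):
        qp.submit(cid)
    qp.ring_doorbell()   # a single doorbell write covers all three commands
    qp.device_step()
    print(qp.poll())     # prints [0, 1, 2]
```

The phase bit lets the poller distinguish new completions from stale ones without clearing entries, and batching several insertions behind one doorbell write reduces the number of doorbell updates.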