(For implementation details see the Report.pdf. For Compilation instructions see README.pdf.)
For this project, we implemented the Apriori algorithm for Frequent pattern mining on the DCS cluster of AiMOS using CUDA and MPI. The most computationally expensive part of the Apriori algorithm is scanning over the whole dataset at each level. This process can be massively parallelized since it requires performing independent subset operations on different datapoints for different candidate patterns. To implement this algorithm in CUDA used the bitset representations of the itemsets which allowed us to efficiently perform operations like subset and set intersection using simple bitwise operations which are very fast on both CPU and GPU. Each thread on the GPU is assigned to perform subset operation for each candidate pattern, allowing for simultaneous scanning of the datasets for all candidate patterns. On the advent of multiple MPI ranks and GPUs on AiMOS, we partitioned the dataset and allocated different parts to different MPI ranks. After computing support within each rank, the total support was calculated by adding up supports found by other ranks – which was performed by the MPI all-reduce operation. To facilitate the reading of large datasets, the data file is read in sequential blocks parallelly by each MPI rank – using MPI read. To make the output file write more efficient MPI write was used. We have verified the scalability of this implementation both for massively parallel systems and for large datasets by extensive experiments. We have evaluated the performance of different parts of the program and identified potential limiting factors. The current implementation which can be further improved, still achieves admirable results on benchmark datasets.