diff --git a/README.md b/README.md
index b71c458..1c2546f 100644
--- a/README.md
+++ b/README.md
@@ -3,11 +3,291 @@ CUDA Stream Compaction
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Mauricio Mutai
+* Tested on: Windows 10, i7-7700HQ @ 2.80GHz 16GB, GTX 1050Ti 4GB (Personal Computer)
 
-### (TODO: Your README)
+### Overview
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+#### Introduction
+
+The main aim of this project was to implement a few simple but crucial GPU algorithms, understand their performance, and learn how to improve it. The algorithms implemented were: exclusive scan, stream compaction, and radix sort.
+
+For the exclusive scan (hereafter referred to as simply "scan"), several implementations were made. A serial CPU version was created to check for correctness and to serve as a comparison point. Then, a "naive" GPU version and a more "work-efficient" GPU version were implemented. Finally, a version of the work-efficient implementation was modified to use shared memory.
+
+Below is a full list of the features implemented in this project:
+
+* Serial CPU scan
+* Naive GPU scan
+* Work-efficient GPU scan with global memory (handles arbitrary arrays large enough to fit in the GPU's global memory)
+* Work-efficient GPU scan with shared memory (only handles arrays small enough to be processed by a single block's worth of threads)
+* Thrust GPU scan
+* GPU stream compaction
+* GPU radix sort
+
+### Performance Analysis
+
+#### Rough optimization of block size
+
+Below are two graphs showing measurements of the runtime of the naive and work-efficient GPU scans, taken at different block sizes (that is, the maximum number of threads allocated to each block). The measurements were made with an array of size 1024.
+
+![](img/block-naive.png)
+
+![](img/block-workeff.png)
+
+For the naive scan, the optimal block size appears to be 32, while for the work-efficient version, it is 16.
+
+#### Comparison of GPU scans vs. CPU scans
+
+Below is a graph showing the runtime of the CPU scan along with the naive, work-efficient (global memory), and Thrust GPU scans. The logarithmic x-axis shows the size of the array used for each measurement.
+
+![](img/array-all-full.png)
+
+This full view is hard to analyze, so below is the lower half of the graph, where the x-axis only goes up to 10000:
+
+![](img/array-all-lower.png)
+
+Here, we can see that for relatively small arrays (size < 10000), the serial CPU algorithm is clearly superior. Even Thrust's optimized GPU implementation cannot outperform the CPU. Interestingly, the work-efficient GPU scan also performs worse than the naive version.
+
+Regarding the CPU's superiority, we have to consider that the main benefit of the GPU implementations is that their parallel nature allows us to reduce a serial O(n) computation into one that is performed in something closer to O(log n) steps, since at each step, we process approximately half of the input we processed in the previous iteration. However, this incurs a significant overhead in terms of global memory accesses and thread scheduling. It is likely that, for smaller arrays, this overhead is far too large and outweighs the benefit of the parallel algorithms.
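+
+To make the step count concrete, here is a simplified sketch of the naive scan's host-side loop, condensed from `naive.cu` in this repository (the real code ping-pongs between two buffers with a counter instead of `std::swap`). Each of the O(log n) iterations launches a kernel over all n elements, paying global memory and launch overhead every time:
+
+```
+for (int offset = 1; offset < n; offset *= 2) {
+    // each pass adds the element `offset` slots to the left into each element
+    scanIteration<<<numBlocks, blockSize>>>(n, dev_out, dev_in, offset);
+    std::swap(dev_in, dev_out); // next pass reads the freshest results
+}
+```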
+
+Similarly, the work-efficient GPU scan has increased overhead compared to the naive version, since it performs two passes over the array ("sweeps") instead of just one. This probably causes it to be slower than the naive implementation at these sizes.
+
+![](img/array-all-higher.png)
+
+On the other hand, for larger arrays (size >= 2^17), the Thrust and work-efficient GPU implementations start outperforming the CPU scan. The naive GPU scan, unfortunately, is never faster than the CPU scan. It is also noteworthy that the work-efficient GPU scan has performance comparable to the Thrust scan for sizes between 2^17 and 2^18.
+
+For larger sizes, the parallel nature of the GPU algorithms allows them to scale better than the CPU scan, making them outperform it in spite of the overhead mentioned above. The lack of work-efficiency in the naive scan amplifies the effect of this overhead, such that it still cannot truly benefit from the parallel algorithm.
+
+#### Thrust implementation analysis
+
+Below is an image taken from the Nsight timeline after running only the Thrust scan, and below it is a zoomed-in view:
+
+![](img/thrust-full.png)
+
+![](img/thrust-zoom.png)
+
+This was run with an array size of 2^20, which explains the 4 MB copy from the host and, later, back to the host.
+
+The timeline alone does not make Thrust's memory access pattern very clear. However, from the GPU vs. CPU graphs above, we can see that the Thrust scan has a noticeable drop in performance in the area around array sizes of 2^16 to 2^18. Perhaps this is due to an algorithm involving shared memory that can no longer function optimally for such large arrays, requiring, for example, more blocks to be run and their results to be combined later.
+
+#### Possible bottlenecks
+
+There is good evidence here that the main bottleneck for the GPU scans is memory access. We know global memory access is very slow, but this cost can be amortized by having additional warps in flight. As the array size grows, we can hide more and more of the memory access latency, until the GPU implementations (except for the naive one) begin outperforming the CPU.
+
+#### Shared memory
+
+Below is a graph comparing the work-efficient GPU scan with shared memory against the version with global memory and Thrust's scan.
+
+![](img/shared.png)
+
+We can clearly see that using the much faster on-chip shared memory greatly improves performance. In fact, it is likely that due to device-specific optimizations, such as padding shared-memory indices based on the number of banks to avoid bank conflicts (sketched below), our implementation of scan is even faster than Thrust's here.
+
+Unfortunately, this specific implementation is limited in that it only runs correctly when executed in a single block. It is definitely possible to expand this to use multiple blocks ("scan of scans"), though.
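+
+The padding scheme, condensed from `efficientShared.cu` (which adapts GPU Gems 3, chapter 39), shifts every shared-memory index up by one slot per 32 indices, so threads in a warp that would otherwise hit the same bank land on different ones:
+
+```
+#define NUM_BANKS 32
+#define LOG_NUM_BANKS 5
+#define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)
+
+// every shared-memory access adds the padding offset to its index:
+sharedBuf[aIdx + CONFLICT_FREE_OFFSET(aIdx)] = g_idata[aIdx];
+```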
+
+### Optimization to improve work-efficient GPU approach ("Part 5")
+
+I noticed I could invoke my up-sweep and down-sweep kernels with a different number of blocks and threads per block at each iteration, since each iteration operates on a different number of elements in the array. This, together with the calculation of a "nodeIdx" in those kernels, allows me to launch far fewer threads when the algorithm permits it (i.e., when the sweeps are closer to the root). A condensed sketch of this per-level launch sizing is shown below.
+
+This adds value by improving the performance of the GPU approach enough that, even with global memory accesses, it can outperform the CPU scan for large enough arrays.
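+
+Here is the up-sweep half of that sizing logic, condensed from `scan` in `efficient.cu` (the down-sweep mirrors it, with the thread count doubling instead of halving):
+
+```
+int numThreads = bufSize / 2;
+for (int offset = 2; offset < bufSize; offset *= 2) {
+    // level d only touches bufSize / 2^d nodes, so shrink the launch to match
+    int threadsPerBlock = (numThreads > MAX_BLOCK_SIZE) ? MAX_BLOCK_SIZE : numThreads;
+    dim3 numBlocks((numThreads > MAX_BLOCK_SIZE) ? numThreads / MAX_BLOCK_SIZE : 1);
+    upSweepIteration<<<numBlocks, threadsPerBlock>>>(bufSize, dev_buf, offset, offset / 2);
+    numThreads /= 2;
+}
+```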
+
+### Radix Sort ("Part 6")
+
+I implemented a GPU radix sort using my work-efficient GPU scan with global memory. It may be invoked with:
+
+```
+StreamCompaction::RadixSort::radixSort(size, out, in, true);
+```
+
+The radix sort is tested in the main test program by comparing its output with `std::sort`. Below is a graph comparing the runtimes of the GPU radix sort and the CPU `std::sort`:
+
+![](img/radix.png)
+
+We can see the CPU sort starts out faster, but the parallel nature of the GPU sort allows it to scale better and perform better for larger arrays. Again, this is likely because the overhead of launching multiple threads and accessing global memory slows down the GPU approach initially; as the arrays grow, more warps can be launched in order to amortize those costs.
+
+It should be noted that the elements in the array were all <= 256. This is perhaps an advantage for the radix sort, as it thrives when it has to compare keys that require fewer bits to be represented in memory.
+
+This radix sort adds value by providing a relatively fast way of sorting small keys, while also showcasing an additional application of the GPU scan algorithm.
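+
+The core of the sort is one scan-based pass per bit, condensed and slightly simplified here from `radixSort` in `radixSort.cu` (the device-to-device copies between passes are omitted for brevity):
+
+```
+for (int bitIdx = 0; bitIdx <= maxBitIdx; ++bitIdx) {
+    // 1 for every element that belongs in the first bucket this pass
+    computeEnumerate<<<numBlocks, blockSize>>>(n, dev_enumBuf, dev_originalData, passMask, ascending);
+    // exclusive scan turns those flags into destination indices for that bucket
+    scan(n, dev_falseBuf, dev_enumBuf, true);
+    computeTotalFalses<<<1, 1>>>(n, dev_totalFalsesBuf, dev_enumBuf, dev_falseBuf);
+    // remaining elements go after the first bucket, preserving relative order
+    computeTrueAndScatter<<<numBlocks, blockSize>>>(n, dev_postScatterBuf, dev_falseBuf,
+        dev_originalData, dev_totalFalsesBuf, passMask, ascending);
+    passMask <<= 1; // move on to the next bit
+}
+```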
+
+### GPU scan with shared memory ("Part 7")
+
+The performance of the work-efficient GPU scan with shared memory has been analyzed above. The value it adds can be seen exactly in that performance improvement. It may be invoked as follows:
+
+```
+StreamCompaction::EfficientShared::scan(size, out, in);
+```
+
+On my GPU, it can only handle arrays of size up to 2^11.
+
+### Output of test program
+
+The output below was generated with an array size of 2^20. It includes tests for the **GPU radix sort**, but not the work-efficient GPU scan with shared memory, since arrays this large exceed that implementation's single-block limit.
+
+```
+****************
+** SCAN TESTS **
+****************
+    [ 32 17 17 36 3 8 3 30 39 24 10 45 22 ... 0 0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 2.52317ms    (std::chrono Measured)
+    [ 0 32 49 66 102 105 113 116 146 185 209 219 264 ... 25664601 25664601 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 2.00643ms    (std::chrono Measured)
+    [ 0 32 49 66 102 105 113 116 146 185 209 219 264 ... 25664563 25664573 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 3.7417ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 3.7159ms    (CUDA Measured)
+    passed
+==== (Skipping efficient shared tests due to large array size... ====
+==== work-efficient scan, power-of-two ====
+   elapsed time: 1.09261ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 1.08237ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.388096ms    (CUDA Measured)
+    [ 0 32 49 66 102 105 113 116 146 185 209 219 264 ... 25664601 25664601 ]
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.384ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [ 0 1 3 2 1 2 3 0 1 2 2 1 2 ... 2 0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 3.63104ms    (std::chrono Measured)
+    [ 1 3 2 1 2 3 1 2 2 1 2 2 3 ... 2 2 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 3.01912ms    (std::chrono Measured)
+    [ 1 3 2 1 2 3 1 2 2 1 2 2 3 ... 3 2 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 12.9889ms    (std::chrono Measured)
+    [ 1 3 2 1 2 3 1 2 2 1 2 2 3 ... 2 2 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 1.89542ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 1.64707ms    (CUDA Measured)
+    passed
+
+****************
+** SORT TESTS **
+****************
+    [ 236 229 255 78 121 226 91 132 61 166 26 101 126 ... 18 0 ]
+==== cpu std::sort, power-of-two ====
+   elapsed time: 12.9889ms    (std::chrono Measured)
+    [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 255 255 ]
+==== gpu radix sort, power-of-two ====
+   elapsed time: 14.8291ms    (CUDA Measured)
+    [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 255 255 ]
+    passed
+==== cpu std::sort, non-power-of-two ====
+   elapsed time: 12.9889ms    (std::chrono Measured)
+    [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 255 255 ]
+==== gpu radix sort, non-power-of-two ====
+   elapsed time: 14.8316ms    (CUDA Measured)
+    [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 255 255 ]
+    passed
+Press any key to continue . . .
+```
+
+The output below was generated with a size of 2^10, and does include the **shared memory scan**.
+
+```
+****************
+** SCAN TESTS **
+****************
+    [ 28 25 45 12 4 47 26 36 15 40 46 40 46 ... 22 0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 0.001824ms    (std::chrono Measured)
+    [ 0 28 53 98 110 114 161 187 223 238 278 324 364 ... 25226 25248 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0.001824ms    (std::chrono Measured)
+    [ 0 28 53 98 110 114 161 187 223 238 278 324 364 ... 25178 25181 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.026624ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.03584ms    (CUDA Measured)
+    passed
+==== EFFICIENT SHARED scan, power-of-two ====
+   elapsed time: 0.009216ms    (CUDA Measured)
+    passed
+==== EFFICIENT SHARED scan, NON-power-of-two ====
+   elapsed time: 0.009728ms    (CUDA Measured)
+    [ 0 28 53 98 110 114 161 187 223 238 278 324 364 ... 25178 25181 ]
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0.047104ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0.053248ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.012288ms    (CUDA Measured)
+    [ 0 28 53 98 110 114 161 187 223 238 278 324 364 ... 25226 25248 ]
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.012288ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [ 0 3 1 2 0 1 0 2 3 0 0 0 0 ... 0 0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 0.004376ms    (std::chrono Measured)
+    [ 3 1 2 1 2 3 3 3 1 3 2 3 3 ... 2 1 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 0.010941ms    (std::chrono Measured)
+    [ 3 1 2 1 2 3 3 3 1 3 2 3 3 ... 3 2 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 0.02735ms    (std::chrono Measured)
+    [ 3 1 2 1 2 3 3 3 1 3 2 3 3 ... 2 1 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 0.153984ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 0.365568ms    (CUDA Measured)
+    passed
+
+****************
+** SORT TESTS **
+****************
+    [ 220 195 217 230 192 97 96 114 255 60 72 140 116 ... 216 0 ]
+==== cpu std::sort, power-of-two ====
+   elapsed time: 0.02735ms    (std::chrono Measured)
+    [ 0 0 0 0 0 0 0 0 1 1 2 2 3 ... 255 255 ]
+==== gpu radix sort, power-of-two ====
+   elapsed time: 1.62816ms    (CUDA Measured)
+    [ 0 0 0 0 0 0 0 0 1 1 2 2 3 ... 255 255 ]
+    passed
+==== cpu std::sort, non-power-of-two ====
+   elapsed time: 0.02735ms    (std::chrono Measured)
+    [ 0 0 0 0 0 0 0 1 2 2 3 3 3 ... 255 255 ]
+==== gpu radix sort, non-power-of-two ====
+   elapsed time: 2.15584ms    (CUDA Measured)
+    [ 0 0 0 0 0 0 0 1 2 2 3 3 3 ... 255 255 ]
+    passed
+Press any key to continue . . .
+```
+
+### Miscellaneous details
+
+I added a "dummy" run of Thrust's scan at the beginning of the test program. Nothing is printed or checked for this run; it exists to work around a problem where the runtime of the first run of Thrust's scan is much higher than it should be (on the order of tens of milliseconds).
+
+The naive and work-efficient (global memory) GPU scans take an optional parameter `internalUse` that specifies whether the scan is used on its own or as part of another GPU algorithm, such as compact or radix sort.
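+
+For example, with the work-efficient scan (the `dev_in`/`dev_out` names below are hypothetical device pointers):
+
+```
+// standalone: host pointers, device copies and timing handled internally
+StreamCompaction::Efficient::scan(n, out, in);
+// internal (e.g. from compact): device pointers, no timing; dev_out must
+// already be a power-of-two-sized device buffer holding the input
+StreamCompaction::Efficient::scan(n, dev_out, dev_in, true);
+```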
diff --git a/img/array-all-full.png b/img/array-all-full.png
new file mode 100644
index 0000000..7b497d1
Binary files /dev/null and b/img/array-all-full.png differ
diff --git a/img/array-all-higher.png b/img/array-all-higher.png
new file mode 100644
index 0000000..0b3c214
Binary files /dev/null and b/img/array-all-higher.png differ
diff --git a/img/array-all-lower.png b/img/array-all-lower.png
new file mode 100644
index 0000000..afa7172
Binary files /dev/null and b/img/array-all-lower.png differ
diff --git a/img/block-naive.png b/img/block-naive.png
new file mode 100644
index 0000000..51ce3b7
Binary files /dev/null and b/img/block-naive.png differ
diff --git a/img/block-workeff.png b/img/block-workeff.png
new file mode 100644
index 0000000..55c4c00
Binary files /dev/null and b/img/block-workeff.png differ
diff --git a/img/radix.png b/img/radix.png
new file mode 100644
index 0000000..fcdcaa8
Binary files /dev/null and b/img/radix.png differ
diff --git a/img/shared.png b/img/shared.png
new file mode 100644
index 0000000..a99a436
Binary files /dev/null and b/img/shared.png differ
diff --git a/img/thrust-full.png b/img/thrust-full.png
new file mode 100644
index 0000000..a514fa2
Binary files /dev/null and b/img/thrust-full.png differ
diff --git a/img/thrust-zoom.png b/img/thrust-zoom.png
new file mode 100644
index 0000000..259093d
Binary files /dev/null and b/img/thrust-zoom.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 7305641..20ab6c4 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -10,16 +10,19 @@
 #include <cstdio>
 #include <stream_compaction/cpu.h>
 #include <stream_compaction/efficient.h>
+#include <stream_compaction/efficientShared.h>
 #include <stream_compaction/naive.h>
+#include <stream_compaction/radixSort.h>
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 20; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
-int a[SIZE], b[SIZE], c[SIZE];
+int a[SIZE], b[SIZE], c[SIZE], sorted[SIZE];
 
 int main(int argc, char* argv[]) {
+    // dummy thrust run to get rid of slow first run
+    StreamCompaction::Thrust::scan(2, c, a);
     // Scan tests
-    printf("\n");
     printf("****************\n");
     printf("** SCAN TESTS **\n");
     printf("****************\n");
@@ -59,6 +62,25 @@
     //printArray(SIZE, c, true);
     printCmpResult(NPOT, b, c);
 
+    if (SIZE <= (1 << 11)) {
+        zeroArray(SIZE, c);
+        printDesc("EFFICIENT SHARED scan, power-of-two");
+        StreamCompaction::EfficientShared::scan(SIZE, c, a);
+        printElapsedTime(StreamCompaction::EfficientShared::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        //printArray(SIZE, c, true);
+        printCmpResult(SIZE, b, c);
+
+        zeroArray(SIZE, c);
+        printDesc("EFFICIENT SHARED scan, NON-power-of-two");
+        StreamCompaction::EfficientShared::scan(NPOT, c, a);
+        printElapsedTime(StreamCompaction::EfficientShared::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+        printArray(NPOT, c, true);
+        printCmpResult(NPOT, b, c);
+    }
+    else {
+        printDesc("(Skipping efficient shared tests due to large array size...");
+    }
+
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, power-of-two");
     StreamCompaction::Efficient::scan(SIZE, c, a);
@@ -77,7 +99,7 @@
     printDesc("thrust scan, power-of-two");
     StreamCompaction::Thrust::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
@@ -139,5 +161,40 @@
     //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
-    system("pause"); // stop Win32 console from closing on exit
+    printf("\n");
+    printf("****************\n");
+    printf("** SORT TESTS **\n");
+    printf("****************\n");
+
+    genArray(SIZE - 1, a, 257);  // Leave a 0 at the end to test that edge case
+    a[SIZE - 1] = 0;
+    printArray(SIZE, a, true);
+
+    zeroArray(SIZE, sorted);
+    printDesc("cpu std::sort, power-of-two");
+    StreamCompaction::CPU::sort(SIZE, sorted, a);
+    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+    printArray(SIZE, sorted, true);
+
+    zeroArray(SIZE, b);
+    printDesc("gpu radix sort, power-of-two");
+    StreamCompaction::RadixSort::radixSort(SIZE, b, a, true);
+    printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    printArray(SIZE, b, true);
+    printCmpResult(SIZE, b, sorted);
+
+    zeroArray(SIZE, sorted);
+    printDesc("cpu std::sort, non-power-of-two");
+    StreamCompaction::CPU::sort(NPOT, sorted, a);
+    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+    printArray(NPOT, sorted, true);
+
+    zeroArray(SIZE, b);
+    printDesc("gpu radix sort, non-power-of-two");
+    StreamCompaction::RadixSort::radixSort(NPOT, b, a, true);
+    printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    printArray(NPOT, b, true);
+    printCmpResult(NPOT, b, sorted);
+
+    //system("pause"); // stop Win32 console from closing on exit
 }
diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
index cdbef77..7fea251 100644
--- a/stream_compaction/CMakeLists.txt
+++ b/stream_compaction/CMakeLists.txt
@@ -7,6 +7,10 @@ set(SOURCE_FILES
     "naive.cu"
     "efficient.h"
     "efficient.cu"
+    "efficientShared.h"
+    "efficientShared.cu"
+    "radixSort.h"
+    "radixSort.cu"
     "thrust.h"
     "thrust.cu"
     )
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index 55f1b38..1b33d57 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -27,7 +27,7 @@ inline int ilog2(int x) {
 }
 
 inline int ilog2ceil(int x) {
-    return ilog2(x - 1) + 1;
+    return (x == 1) ? 0 : ilog2(x - 1) + 1;
 }
 
 namespace StreamCompaction {
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 05ce667..93c22fc 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -2,6 +2,9 @@
 #include "cpu.h"
 #include "common.h"
+#include <algorithm> // std::sort, used by sort() below
+#include <cstring>   // memcpy
+#define TEST_TIME_IN_SCAN 0
 
 namespace StreamCompaction {
     namespace CPU {
@@ -17,10 +18,18 @@ namespace StreamCompaction {
          * For performance analysis, this is supposed to be a simple for loop.
          * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
          */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startCpuTimer();
-            // TODO
-            timer().endCpuTimer();
+        void scan(int n, int *odata, const int *idata, bool internalUse) {
+            if (!internalUse) {
+                timer().startCpuTimer();
+            }
+            odata[0] = 0;
+            for (int i = 1; i < n; ++i) {
+                odata[i] = idata[i - 1] + odata[i - 1];
+            }
+            if (!internalUse) {
+                timer().endCpuTimer();
+            }
         }
 
         /**
@@ -29,10 +38,15 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
        int compactWithoutScan(int n, int *odata, const int *idata) {
+            int nextEmptyIdx = 0;
             timer().startCpuTimer();
-            // TODO
+            for (int i = 0; i < n; ++i) {
+                if (idata[i]) {
+                    odata[nextEmptyIdx++] = idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return nextEmptyIdx;
         }
 
         /**
@@ -41,10 +55,35 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
         int compactWithScan(int n, int *odata, const int *idata) {
+            int *temp = (int*)malloc(n * sizeof(int));
+            int *scannedTemp = (int*)malloc(n * sizeof(int));
             timer().startCpuTimer();
-            // TODO
+            // map each element to 0/1 depending on whether it is kept
+            for (int i = 0; i < n; ++i) {
+                temp[i] = (idata[i]) ? 1 : 0;
+            }
+            // scan
+            scan(n, scannedTemp, temp, true);
+            // scatter
+            for (int i = 0; i < n; ++i) {
+                if (temp[i]) {
+                    odata[scannedTemp[i]] = idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            int count = scannedTemp[n - 1] + (temp[n - 1] ? 1 : 0);
+            free(temp);
+            free(scannedTemp);
+            return count;
+        }
+
+        /**
+         * CPU sort, used as a reference for the GPU radix sort.
+         */
+        void sort(int n, int *odata, const int *idata) {
+            memcpy(odata, idata, n * sizeof(int));
+            timer().startCpuTimer();
+            std::sort(odata, odata + n);
+            timer().endCpuTimer();
         }
     }
 }
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
index 236ce11..1248ccd 100644
--- a/stream_compaction/cpu.h
+++ b/stream_compaction/cpu.h
@@ -6,10 +6,12 @@ namespace StreamCompaction {
     namespace CPU {
         StreamCompaction::Common::PerformanceTimer& timer();
 
-        void scan(int n, int *odata, const int *idata);
+        void scan(int n, int *odata, const int *idata, bool internalUse = false);
 
         int compactWithoutScan(int n, int *odata, const int *idata);
 
         int compactWithScan(int n, int *odata, const int *idata);
+
+        void sort(int n, int *odata, const int *idata);
     }
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 36c5ef2..347155a 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -12,13 +12,144 @@
             return timer;
         }
 
+#define blockSize 32
+#define MAX_BLOCK_SIZE 16
+#define checkCUDAErrorWithLine(msg) ((void)0)
+        //checkCUDAError(msg, __LINE__)
+#define USE_CUDA_DEV_SYNC 0
+
+        // One up-sweep level: each thread owns one tree node at this depth and
+        // accumulates its left child's partial sum into it.
+        __global__ void upSweepIteration(int n, int *odata, const int offset, const int halfOffset) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            int nodeIdx = (idx + 1) * offset - 1;
+            if (nodeIdx < n) {
+                odata[nodeIdx] = odata[nodeIdx] + odata[nodeIdx - halfOffset];
+            }
+        }
+
+        __global__ void setRootToZero(int n, int *odata) {
+            odata[n - 1] = 0;
+        }
+
+        // One down-sweep level: swap the left child's value up and push the
+        // running sum down, yielding an exclusive scan after the last level.
+        __global__ void downSweepIteration(int n, int *odata, const int offset, const int halfOffset) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            int nodeIdx = (idx + 1) * offset - 1;
+            if (nodeIdx < n) {
+                int originalNodeValue = odata[nodeIdx];
+                odata[nodeIdx] = odata[nodeIdx] + odata[nodeIdx - halfOffset];
+                odata[nodeIdx - halfOffset] = originalNodeValue;
+            }
+        }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         * internalUse specifies whether this is used as a helper function,
+         * for example, in compact. If so, it assumes idata and odata are in
+         * device memory and does not use gpuTimer.
          */
-        void scan(int n, int *odata, const int *idata) {
+        void scan(int n, int *odata, const int *idata, bool internalUse) {
+            if (n == 1) {
+                odata[0] = 0;
+                return;
+            }
+            // nearest power of two
+            const int bufSize = 1 << ilog2ceil(n);
+
+            int *dev_buf;
+            if (internalUse) {
+                // caller already provides a device buffer of bufSize elements
+                // holding the input, so scan it in place
+                dev_buf = odata;
+            }
+            else {
+                cudaMalloc((void**)&dev_buf, bufSize * sizeof(int));
+                checkCUDAErrorWithLine("malloc dev_buf error!!!");
+
+                if (n != bufSize) {
+                    cudaMemset(dev_buf + n, 0, (bufSize - n) * sizeof(int));
+                    checkCUDAErrorWithLine("memset dev_buf to 0 error!!!");
+                }
+
+                cudaMemcpy(dev_buf, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+                checkCUDAErrorWithLine("memcpy dev_buf error!!!");
+            }
+
+            int halfOffset = 1;
+            int numThreads = bufSize / 2;
+            dim3 numBlocks(1);
+            int threadsPerBlock;
+
+            cudaDeviceSynchronize();
+            checkCUDAErrorWithLine("cuda sync error!!!");
+
+            if (!internalUse) {
                 timer().startGpuTimer();
-            // TODO
+            }
+            // skip offset = n because we overwrite the root's value anyway
+            for (int offset = 2; offset < bufSize; offset *= 2) {
+                if (numThreads > MAX_BLOCK_SIZE) {
+                    numBlocks.x = numThreads / MAX_BLOCK_SIZE;
+                    threadsPerBlock = MAX_BLOCK_SIZE;
+                }
+                else {
+                    numBlocks.x = 1;
+                    threadsPerBlock = numThreads;
+                }
+                upSweepIteration<<<numBlocks, threadsPerBlock>>>(bufSize, dev_buf, offset, halfOffset);
+                checkCUDAErrorWithLine("upSweep error!!!");
+                halfOffset = offset;
+                numThreads /= 2;
+            }
+
+            setRootToZero<<<1, 1>>>(bufSize, dev_buf);
+
+            int offset = bufSize;
+            numThreads = 1;
+            for (int halfOffset = bufSize / 2; halfOffset >= 1; halfOffset /= 2) {
+                if (numThreads > MAX_BLOCK_SIZE) {
+                    numBlocks.x = numThreads / MAX_BLOCK_SIZE;
+                    threadsPerBlock = MAX_BLOCK_SIZE;
+                }
+                else {
+                    numBlocks.x = 1;
+                    threadsPerBlock = numThreads;
+                }
+                downSweepIteration<<<numBlocks, threadsPerBlock>>>(bufSize, dev_buf, offset, halfOffset);
+                checkCUDAErrorWithLine("downSweep error!!!");
+                offset = halfOffset;
+                numThreads *= 2;
+            }
+
+            if (!internalUse) {
                 timer().endGpuTimer();
+                cudaMemcpy(odata, dev_buf, n * sizeof(int), cudaMemcpyDeviceToHost);
+                checkCUDAErrorWithLine("memcpy dev_buf to host error!!!");
+                cudaFree(dev_buf);
+                checkCUDAErrorWithLine("free dev_buf error!!!");
+            }
         }
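+
+        /* Illustrative trace (not part of the original file) for bufSize = 8
+         * and input {1,2,3,4,5,6,7,8}:
+         *   after up-sweep:    {1,3,3,10,5,11,7,36}
+         *   root zeroed:       {1,3,3,10,5,11,7,0}
+         *   after down-sweep:  {0,1,3,6,10,15,21,28}
+         * which is the exclusive scan of the input. */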
+
+        __global__ void map(int n, int *odata, const int *idata) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx < n) {
+                odata[idx] = (idata[idx] != 0) ? 1 : 0;
+            }
+        }
+
+        __global__ void scatter(int n, int *odata, const int *postMapData, const int *postScanData, const int *originalData) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx < n && postMapData[idx]) {
+                odata[postScanData[idx]] = originalData[idx];
+            }
+        }
+
+        __global__ void getCompactedSize(int n, int *odata, const int *postMapData, const int *postScanData) {
+            *odata = postScanData[n - 1] + (postMapData[n - 1] ? 1 : 0);
+        }
+
         /**
@@ -31,10 +162,81 @@ namespace StreamCompaction {
          * @returns The number of elements remaining after compaction.
          */
         int compact(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-            return -1;
+            int *dev_originalData;
+            int *dev_postMapBuf;
+            int *dev_postScanBuf;
+            int *dev_postScatterBuf;
+            int *dev_scatteredSize;
+
+            cudaMalloc((void**)&dev_originalData, n * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_originalData error!!!");
+
+            cudaMalloc((void**)&dev_postMapBuf, n * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_postMapBuf error!!!");
+
+            // needs to be power of 2 for scan to work
+            const int postScanBufSize = 1 << ilog2ceil(n);
+
+            cudaMalloc((void**)&dev_postScanBuf, postScanBufSize * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_postScanBuf error!!!");
+
+            if (postScanBufSize != n) {
+                cudaMemset(dev_postScanBuf, 0, postScanBufSize * sizeof(int));
+                checkCUDAErrorWithLine("memset dev_postScanBuf to 0 error!!!");
+            }
+
+            cudaMalloc((void**)&dev_postScatterBuf, n * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_postScatterBuf error!!!");
+
+            cudaMalloc((void**)&dev_scatteredSize, sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_scatteredSize error!!!");
+
+            cudaMemcpy(dev_originalData, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAErrorWithLine("memcpy dev_originalData from host error!!!");
+
+            dim3 numBlocks((n + blockSize - 1) / blockSize);
+
+            timer().startGpuTimer();
+
+            map<<<numBlocks, blockSize>>>(n, dev_postMapBuf, dev_originalData);
+            checkCUDAErrorWithLine("map error!!!");
+
+            cudaMemcpy(dev_postScanBuf, dev_postMapBuf, n * sizeof(int), cudaMemcpyDeviceToDevice);
+            checkCUDAErrorWithLine("memcpy map to scan error!!!");
+
+            scan(n, dev_postScanBuf, dev_postMapBuf, true);
+            checkCUDAErrorWithLine("scan error!!!");
+            scatter<<<numBlocks, blockSize>>>(n, dev_postScatterBuf, dev_postMapBuf, dev_postScanBuf, dev_originalData);
+            checkCUDAErrorWithLine("scatter error!!!");
+            getCompactedSize<<<1, 1>>>(n, dev_scatteredSize, dev_postMapBuf, dev_postScanBuf);
+            checkCUDAErrorWithLine("get size error!!!");
+
+            timer().endGpuTimer();
+
+            int scatteredSize;
+
+            cudaMemcpy(&scatteredSize, dev_scatteredSize, sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAErrorWithLine("memcpy dev_scatteredSize to host error!!!");
+
+            cudaMemcpy(odata, dev_postScatterBuf, scatteredSize * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAErrorWithLine("memcpy dev_postScatterBuf to host error!!!");
+
+            cudaFree(dev_originalData);
+            checkCUDAErrorWithLine("free dev_originalData error!!!");
+
+            cudaFree(dev_postMapBuf);
+            checkCUDAErrorWithLine("free dev_postMapBuf error!!!");
+
+            cudaFree(dev_postScanBuf);
+            checkCUDAErrorWithLine("free dev_postScanBuf error!!!");
+
+            cudaFree(dev_postScatterBuf);
+            checkCUDAErrorWithLine("free dev_postScatterBuf error!!!");
+
+            cudaFree(dev_scatteredSize);
+            checkCUDAErrorWithLine("free dev_scatteredSize error!!!");
+
+            return scatteredSize;
         }
     }
 }
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..afca471 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -6,7 +6,7 @@ namespace StreamCompaction {
     namespace Efficient {
         StreamCompaction::Common::PerformanceTimer& timer();
 
-        void scan(int n, int *odata, const int *idata);
+        void scan(int n, int *odata, const int *idata, bool internalUse = false);
 
         int compact(int n, int *odata, const int *idata);
     }
 }
diff --git a/stream_compaction/efficientShared.cu b/stream_compaction/efficientShared.cu
new file mode 100644
index 0000000..9ddf6f8
--- /dev/null
+++ b/stream_compaction/efficientShared.cu
@@ -0,0 +1,127 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include "common.h" +#include "efficient.h" + +namespace StreamCompaction { + namespace EfficientShared { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + +#define blockSize 32 +#define MAX_BLOCK_SIZE 32 +#define MAX_CHUNK_SIZE 32 +#define checkCUDAErrorWithLine(msg) ((void)0) + //checkCUDAError(msg, __LINE__) +#define USE_CUDA_DEV_SYNC 0 + +/* Macros below adapated from: + * https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html/ + * Example 39-3 */ +#define NUM_BANKS 32 +#define LOG_NUM_BANKS 5 +#define CONFLICT_FREE_OFFSET(n) \ + ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS)) + + __global__ void scanChunk(int n, int *g_odata, const int *g_idata) { + extern __shared__ float sharedBuf[]; + int idx = threadIdx.x; + int offset = 1; + // compute indices and offsets to write to banks without conflict + int aIdx = idx; + int bIdx = idx + n / 2; + int bankOffsetA = CONFLICT_FREE_OFFSET(aIdx); + int bankOffsetB = CONFLICT_FREE_OFFSET(bIdx); + sharedBuf[aIdx + bankOffsetA] = g_idata[aIdx]; + sharedBuf[bIdx + bankOffsetB] = g_idata[bIdx]; + // begin up-sweep + for (int d = n / 2; d > 0; d /= 2) { + __syncthreads(); + if (idx < d) { + aIdx = offset * (2 * idx + 1) - 1; + bIdx = aIdx + offset;//offset * (2 * idx + 2) - 1; + aIdx += CONFLICT_FREE_OFFSET(aIdx); + bIdx += CONFLICT_FREE_OFFSET(bIdx); + + sharedBuf[bIdx] += sharedBuf[aIdx]; + } + + offset *= 2; + } + + // set last idx to 0 + if (idx == 0) { + sharedBuf[n - 1 + CONFLICT_FREE_OFFSET(n - 1)] = 0; + } + + for (int d = 1; d < n; d *= 2) { + offset /= 2; + __syncthreads(); + if (idx < d) { + aIdx = offset * (2 * idx + 1) - 1; + bIdx = aIdx + offset;//offset * (2 * idx + 2) - 1; + aIdx += CONFLICT_FREE_OFFSET(aIdx); + bIdx += CONFLICT_FREE_OFFSET(bIdx); + + float originalNodeValue = sharedBuf[aIdx]; + sharedBuf[aIdx] = sharedBuf[bIdx]; + sharedBuf[bIdx] += originalNodeValue; + } + } + + __syncthreads(); + + g_odata[aIdx] = sharedBuf[aIdx + bankOffsetA]; + g_odata[bIdx] = sharedBuf[bIdx + bankOffsetB]; + } + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + * internalUse specifies whether this is used as a helper function, + * for example, in compact. If so, it assumes idata and odata are in + * device memory and does not use gpuTimer. + */ + void scan(int n, int *odata, const int *idata) { + if (n == 1) { + odata[0] = 0; + return; + } + // TODO: handle n <= 2 ??? 
+
+            int *dev_buf;
+
+            cudaMalloc((void**)&dev_buf, bufSize * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_buf error!!!");
+
+            if (n < bufSize) {
+                cudaMemset(dev_buf + n, 0, (bufSize - n) * sizeof(int));
+                checkCUDAErrorWithLine("memset dev_buf to 0 error!!!");
+            }
+
+            cudaMemcpy(dev_buf, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAErrorWithLine("memcpy dev_buf error!!!");
+
+            timer().startGpuTimer();
+
+            // one block of bufSize / 2 threads; the shared-memory size below is
+            // reconstructed to include room for the bank-conflict padding
+            const int sharedBytes = (bufSize + (bufSize >> LOG_NUM_BANKS)) * sizeof(int);
+            scanChunk<<<1, bufSize / 2, sharedBytes>>>(bufSize, dev_buf, dev_buf);
+            checkCUDAErrorWithLine("scan chunk error!!!");
+
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_buf, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAErrorWithLine("memcpy dev_buf to host error!!!");
+            cudaFree(dev_buf);
+            checkCUDAErrorWithLine("free dev_buf error!!!");
+        }
+
+    }
+}
diff --git a/stream_compaction/efficientShared.h b/stream_compaction/efficientShared.h
new file mode 100644
index 0000000..466a40b
--- /dev/null
+++ b/stream_compaction/efficientShared.h
@@ -0,0 +1,12 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace EfficientShared {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void scan(int n, int *odata, const int *idata);
+
+    }
+}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 9218f8e..a189c17 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -11,15 +11,76 @@ namespace StreamCompaction {
             static PerformanceTimer timer;
             return timer;
         }
-        // TODO: __global__
+
+#define blockSize 32
+#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)
+#define USE_CUDA_DEV_SYNC 0
+
+        // Shift everything right by one and set [0] to 0, turning the input
+        // into the base of an exclusive scan.
+        __global__ void shiftRight(int n, int *odata, const int *idata) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx < n) {
+                odata[idx] = (idx == 0) ? 0 : idata[idx - 1];
+            }
+        }
+
+        // One naive scan pass: add the element `offset` slots to the left.
+        __global__ void scanIteration(int n, int *odata, const int *idata, const int offset) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx < n) {
+                odata[idx] = idata[idx] + ((idx >= offset) ? idata[idx - offset] : 0);
+            }
+        }
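+
+        /* Illustrative trace (not part of the original file) for input {3,1,7,0}:
+         *   shiftRight:  {0,3,1,7}
+         *   offset 1:    {0,3,4,8}
+         *   offset 2:    {0,3,4,11}
+         * i.e. the exclusive scan of the input. */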
+
+        int *dev_bufA;
+        int *dev_bufB;
 
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+            cudaMalloc((void**)&dev_bufA, n * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_bufA error!!!");
+            cudaMalloc((void**)&dev_bufB, n * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_bufB error!!!");
+            cudaMemcpy(dev_bufB, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAErrorWithLine("memcpy dev_bufB error!!!");
+            dim3 numBlocks((n + blockSize - 1) / blockSize);
+
+            timer().startGpuTimer();
+            // shift right by one, set [0] to 0
+            shiftRight<<<numBlocks, blockSize>>>(n, dev_bufA, dev_bufB);
+#if USE_CUDA_DEV_SYNC
+            cudaDeviceSynchronize();
+            checkCUDAErrorWithLine("sync error");
+#endif
+
+            // start scan iterations
+            int pingPongCount = 0;
+            for (int offset = 1; offset < n; offset *= 2) {
+                pingPongCount = 1 - pingPongCount;
+                if (pingPongCount) {
+                    scanIteration<<<numBlocks, blockSize>>>(n, dev_bufB, dev_bufA, offset);
+                }
+                else {
+                    scanIteration<<<numBlocks, blockSize>>>(n, dev_bufA, dev_bufB, offset);
+                }
+#if USE_CUDA_DEV_SYNC
+                cudaDeviceSynchronize();
+                checkCUDAErrorWithLine("sync error");
+#endif
+            }
+            timer().endGpuTimer();
+
+            if (pingPongCount) {
+                // output is in bufB
+                cudaMemcpy(odata, dev_bufB, n * sizeof(int), cudaMemcpyDeviceToHost);
+            }
+            else {
+                // output is in bufA
+                cudaMemcpy(odata, dev_bufA, n * sizeof(int), cudaMemcpyDeviceToHost);
+            }
+            cudaFree(dev_bufA);
+            cudaFree(dev_bufB);
         }
     }
 }
diff --git a/stream_compaction/radixSort.cu b/stream_compaction/radixSort.cu
new file mode 100644
index 0000000..013f4dc
--- /dev/null
+++ b/stream_compaction/radixSort.cu
@@ -0,0 +1,195 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include "common.h"
+#include "radixSort.h"
+#include "efficient.h"
+
+namespace StreamCompaction {
+    namespace RadixSort {
+        // access to "efficient" scan
+        using StreamCompaction::Efficient::scan;
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+#define blockSize 32
+#define MAX_BLOCK_SIZE 32
+#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)
+#define USE_CUDA_DEV_SYNC 0
+#define BITS_IN_INT (sizeof(int) * 8)
+#define INT_WITH_HIGHEST_BIT_SET (1u << (BITS_IN_INT - 1))
+
+        /* Sorts a tile of size _tileSize_ using radix sort in ascending order
+         * if _ascending_ is true, descending otherwise.
+         * This kernel should be called in blocks with size == tileSize, due to
+         * the use of __syncthreads(). */
+        /* could do the whole thing at once, instead of tiles then merge?? */
+        __global__ void radixSortTile(int tileSize, int *odata, const int *idata, bool ascending) {
+
+        }
+
+        // flag each element that belongs in the "false" (first) bucket for
+        // this pass: bit == 0 when ascending, bit == 1 when descending, so
+        // that the scanned flags give valid destinations in either direction
+        __global__ void computeEnumerate(int n, int *enumBuf, const int *originalData, const int passMask, bool ascending) {
+            int idx = threadIdx.x + (blockDim.x * blockIdx.x);
+            if (idx < n) {
+                enumBuf[idx] = (((originalData[idx] & passMask) == 0) == ascending) ? 1 : 0;
+            }
+        }
+
+        __global__ void computeTotalFalses(int n, int *totalFalsesBuf, const int *enumBuf, const int *falseBuf) {
+            *totalFalsesBuf = enumBuf[n - 1] + falseBuf[n - 1];
+        }
+
+        __global__ void computeTrueAndScatter(int n, int *odata, const int *falseBuf, const int *originalData, const int *totalFalsesBuf, const int passMask, bool ascending) {
+            int idx = threadIdx.x + (blockDim.x * blockIdx.x);
+            if (idx < n) {
+                int originalValue = originalData[idx];
+                // if masked bit is 0 and sorting by ascending, or if bit is 1 and sorting by descending
+                if (((originalValue & passMask) == 0) == ascending) {
+                    // write to false idx
+                    odata[falseBuf[idx]] = originalValue;
+                }
+                else {
+                    // write to true idx
+                    odata[idx - falseBuf[idx] + *totalFalsesBuf] = originalValue;
+                }
+            }
+        }
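+
+        /* Illustrative trace (not part of the original file) of one ascending
+         * pass on bit 0 with input {4,7,2,6,3}:
+         *   enumBuf (bit==0):  {1,0,1,1,0}
+         *   falseBuf (scan):   {0,1,1,2,3}, totalFalses = 3
+         *   destinations:      4->0, 2->1, 6->2, 7->(1-1+3)=3, 3->(4-3+3)=4
+         *   result:            {4,2,6,7,3}  (stable partition by bit 0) */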
+
+        // For each element, mark which bit position is its highest set bit.
+        __global__ void findFirstHighBit(int n, int *highBits, const int *idata) {
+            int idx = threadIdx.x + (blockDim.x * blockIdx.x);
+            if (idx < n) {
+                int val = idata[idx];
+                if (val == 0) {
+                    return;
+                }
+                // unsigned so that >>= is a logical shift of the single test bit
+                unsigned int mask = INT_WITH_HIGHEST_BIT_SET;
+                for (int i = BITS_IN_INT - 1; i >= 0; --i) {
+                    if (val & mask) {
+                        highBits[i] = 1;
+                        return;
+                    }
+                    mask >>= 1;
+                }
+            }
+        }
+
+        __global__ void findLargestHighBit(int *result, const int *highBits) {
+            for (int i = BITS_IN_INT - 1; i >= 0; --i) {
+                if (highBits[i]) {
+                    *result = i;
+                    return;
+                }
+            }
+        }
+
+        void radixSort(int n, int *odata, const int *idata, bool ascending) {
+            int *dev_highBitsBuf;
+            int *dev_originalData;
+            int *dev_enumBuf;
+            int *dev_falseBuf;
+            int *dev_totalFalsesBuf;
+            int *dev_postScatterBuf;
+
+            cudaMalloc((void**)&dev_highBitsBuf, BITS_IN_INT * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_highBitsBuf error!!!");
+
+            cudaMemset(dev_highBitsBuf, 0, BITS_IN_INT * sizeof(int));
+            checkCUDAErrorWithLine("memset dev_highBitsBuf to 0 error!!!");
+
+            cudaMalloc((void**)&dev_originalData, n * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_originalData error!!!");
+
+            cudaMalloc((void**)&dev_enumBuf, n * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_enumBuf error!!!");
+
+            // needs to be power of 2 for scan to work
+            const int falseBufSize = 1 << ilog2ceil(n);
+
+            cudaMalloc((void**)&dev_falseBuf, falseBufSize * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_falseBuf error!!!");
+
+            if (falseBufSize != n) {
+                cudaMemset(dev_falseBuf, 0, falseBufSize * sizeof(int));
+                checkCUDAErrorWithLine("memset dev_falseBuf to 0 error!!!");
+            }
+
+            cudaMalloc((void**)&dev_totalFalsesBuf, sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_totalFalsesBuf error!!!");
+
+            cudaMalloc((void**)&dev_postScatterBuf, n * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_postScatterBuf error!!!");
+
+            cudaMemcpy(dev_originalData, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAErrorWithLine("memcpy dev_originalData from host error!!!");
+
+            dim3 numBlocks((n + blockSize - 1) / blockSize);
+
+            int bitIdx;
+            int passMask = 1;
+            int maxBitIdx;
+
+            timer().startGpuTimer();
+
+            // find how many bits to search
+            findFirstHighBit<<<numBlocks, blockSize>>>(n, dev_highBitsBuf, dev_originalData);
+            // store result in dev_highBitsBuf itself -- no race conditions
+            findLargestHighBit<<<1, 1>>>(dev_highBitsBuf, dev_highBitsBuf);
+
+            cudaMemcpy(&maxBitIdx, dev_highBitsBuf, sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAErrorWithLine("memcpy maxBitIdx error!!!");
+
+            for (bitIdx = 0; bitIdx <= maxBitIdx; ++bitIdx) {
+                // compute enumerate
+                computeEnumerate<<<numBlocks, blockSize>>>(n, dev_enumBuf, dev_originalData, passMask, ascending);
+                // scan
+                cudaMemcpy(dev_falseBuf, dev_enumBuf, n * sizeof(int), cudaMemcpyDeviceToDevice);
+                checkCUDAErrorWithLine("memcpy enum to false error!!!");
+
+                scan(n, dev_falseBuf, dev_enumBuf, true);
+                checkCUDAErrorWithLine("scan error!!!");
+                // compute totalFalses
+                computeTotalFalses<<<1, 1>>>(n, dev_totalFalsesBuf, dev_enumBuf, dev_falseBuf);
+                // compute trues and scatter
+                computeTrueAndScatter<<<numBlocks, blockSize>>>(n, dev_postScatterBuf, dev_falseBuf, dev_originalData, dev_totalFalsesBuf, passMask, ascending);
+
+                if (bitIdx < maxBitIdx) {
+                    // if not last iteration
+                    cudaMemcpy(dev_originalData, dev_postScatterBuf, n * sizeof(int), cudaMemcpyDeviceToDevice);
+                    checkCUDAErrorWithLine("memcpy postScatter to originalData error!!!");
+
+                    // move passMask to next bit
+                    passMask <<= 1;
+                }
+            }
+
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_postScatterBuf, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAErrorWithLine("memcpy dev_postScatterBuf to host error!!!");
+
+            cudaFree(dev_highBitsBuf);
+            checkCUDAErrorWithLine("free dev_highBitsBuf error!!!");
+
+            cudaFree(dev_originalData);
+            checkCUDAErrorWithLine("free dev_originalData error!!!");
+
+            cudaFree(dev_enumBuf);
+            checkCUDAErrorWithLine("free dev_enumBuf error!!!");
+
+            cudaFree(dev_falseBuf);
+            checkCUDAErrorWithLine("free dev_falseBuf error!!!");
+
+            cudaFree(dev_totalFalsesBuf);
+            checkCUDAErrorWithLine("free dev_totalFalsesBuf error!!!");
+
+            cudaFree(dev_postScatterBuf);
+            checkCUDAErrorWithLine("free dev_postScatterBuf error!!!");
+        }
+
+    }
+}
diff --git a/stream_compaction/radixSort.h b/stream_compaction/radixSort.h
new file mode 100644
index 0000000..1b72397
--- /dev/null
+++ b/stream_compaction/radixSort.h
@@ -0,0 +1,12 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace RadixSort {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void radixSort(int n, int *odata, const int *idata, bool ascending);
+
+    }
+}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 36b732d..f095ebe 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -6,6 +6,8 @@
 #include "common.h"
 #include "thrust.h"
 
+#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)
+
 namespace StreamCompaction {
     namespace Thrust {
         using StreamCompaction::Common::PerformanceTimer;
@@ -18,11 +20,30 @@ namespace StreamCompaction {
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
+            int *dev_idata;
+
+            cudaMalloc((void **)&dev_idata, n * sizeof(int));
+            checkCUDAErrorWithLine("malloc dev_idata!!!");
+
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAErrorWithLine("memcpy dev_idata from host!!!");
+
+            // wrap the raw device pointer so Thrust treats it as device data
+            thrust::device_vector<int> dev_vec_idata(dev_idata, dev_idata + n);
+
+            timer().startGpuTimer();
+            thrust::exclusive_scan(dev_vec_idata.begin(), dev_vec_idata.end(), dev_vec_idata.begin());
             // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
-            timer().endGpuTimer();
+            timer().endGpuTimer();
+
+            thrust::copy(dev_vec_idata.begin(), dev_vec_idata.end(), odata);
+
+            cudaFree(dev_idata);
+            checkCUDAErrorWithLine("free dev_idata!!!");
         }
     }
 }