diff --git a/README.md b/README.md index b71c458..f039fb3 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,76 @@ CUDA Stream Compaction ====================== -**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** +**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2 - Stream Compaction** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Joseph Klinger +* Tested on: Windows 10, i5-7300HQ (4 CPUs) @ ~2.50GHz, GTX 1050 6030MB (Personal Machine) -### (TODO: Your README) +### README -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +This project consists of a series of implementations of the inclusive/exclusive scan and stream compaction across the CPU and GPU. +We implement a sequential CPU scan, a non-work-efficient GPU scan (naive), and an actually work-efficient GPU scan. +A scan is a prefix sum (assuming an array of integers for now), meaning that the index i in the output array will consist of the sum of each previous element +in the input array. Here's a concrete example of each kind of scan, taken from the slides by Patrick Cozzi and Shehzan Mohammed [here](https://docs.google.com/presentation/d/1ETVONA7QDM-WqsEj4qVOGD6Kura5I6E9yqH-7krnwZ0/edit#slide=id.p27) + +![](img/scans.png) + +How do we scan effectively? Well, the naive way would be to run a kernel on every element of the threads and check if a particular element should contribute to the sum. The following image, taken from [this](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html) GPU Gem +shows the algorithm in execution for a sample input array: + +![](img/naive.png) + +Ideally, we don't want to have the GPU do any work for elements that don't contribute to the sum during a later iteration of the algorithm. This naive algorithm requires O(n * log(n)) adds. Turns out, +there is an algorithm that can give us a scan with only O(n) adds (images taken from the aforementioned slides from Patrick and Shehzan): + +In the first step, we perform an "up-sweep" where we can compute some partial sums: + +![](img/upsweep.png) + +After that, we perform a "down-sweep" that completes the other half of our scan: + +![](img/downsweep.png) + +Now that we have a fast way to scan in parallel, we can perform stream compaction in a reasonable amount of time. + +Stream Compaction utilizes scans as a way to reduce an array of any integers to an array consisting only of the integers that meet a certain criteria (say, not equal to 0). +Taking some images from the same slides as before, here is step 1 to stream compaction - create an array of 1s and 0s indicating whether or not we want to keep a certain element, then +run an exclusive scan on that array: + +![](img/compact.png) + +So, in this example, we want to keep elements a, c, d, and g. + +As it turns out, for each element in the input array that has a 1 in the intermediate array, the corresponding value in the summed array is that element's index in the final output array. +This step is called scatter: + +![](img/compact2.png) + +Now we are left with our desired array. + +### Analysis +Here are the analysis results from my implementations: + +![](img/graphAllScans.png) + +Here we can get a general vibe that the naive parallel scan didn't work so well, the CPU scan performed decently (relying on the cache isn't that bad in this case where we are simply iterating down the array sequentially!), adding the work-efficient optimization is a must, and thrust wins easily. + +One major aspect to note is the difference between the two work-efficient parallel scans. One utilizes a naive arrangement of block launching and the other is more intelligent (only launching +threads as needed). To be more specific, during the work-efficient algorithm, we are not updating every element of the input array during every iteration of the algorithm - in fact, we are updating +half as many for each iteration. So, we should only have as many threads do work as are needed. + +In the following graph, we can see that this optimization is what allows the parallel implementation to outperform the CPU implementation: + +![](img/graphGPUandCPU.png) + +Now the work-efficient parallel scan is good, but still, the 3rd party Thrust implementation still by far outperforms all other implementations (as you can see in the first graph too): + +![](img/graphGPU.png) + +There are further optimizations that can be added to the work-efficient parallel scan that can help it compete with Thrust, such as utilizing shared memory to minimize reads from global memory (a major slowdown), but they weren't completed for this project (yet!). + +Here is the raw output from the program: +![](img/output1.png) + +![](img/output2.png) \ No newline at end of file diff --git a/img/compact.png b/img/compact.png new file mode 100644 index 0000000..1c86bc6 Binary files /dev/null and b/img/compact.png differ diff --git a/img/compact2.png b/img/compact2.png new file mode 100644 index 0000000..228dcc0 Binary files /dev/null and b/img/compact2.png differ diff --git a/img/downsweep.png b/img/downsweep.png new file mode 100644 index 0000000..fde90f3 Binary files /dev/null and b/img/downsweep.png differ diff --git a/img/graphAllScans.png b/img/graphAllScans.png new file mode 100644 index 0000000..6dd1605 Binary files /dev/null and b/img/graphAllScans.png differ diff --git a/img/graphGPU.png b/img/graphGPU.png new file mode 100644 index 0000000..af41494 Binary files /dev/null and b/img/graphGPU.png differ diff --git a/img/graphGPUandCPU.png b/img/graphGPUandCPU.png new file mode 100644 index 0000000..8878dd6 Binary files /dev/null and b/img/graphGPUandCPU.png differ diff --git a/img/naive.png b/img/naive.png new file mode 100644 index 0000000..d8e5fc8 Binary files /dev/null and b/img/naive.png differ diff --git a/img/output1.png b/img/output1.png new file mode 100644 index 0000000..d258c5d Binary files /dev/null and b/img/output1.png differ diff --git a/img/output2.png b/img/output2.png new file mode 100644 index 0000000..4076fea Binary files /dev/null and b/img/output2.png differ diff --git a/img/scans.png b/img/scans.png new file mode 100644 index 0000000..3b5addd Binary files /dev/null and b/img/scans.png differ diff --git a/img/upsweep.png b/img/upsweep.png new file mode 100644 index 0000000..09c74df Binary files /dev/null and b/img/upsweep.png differ diff --git a/src/main.cpp b/src/main.cpp index 7305641..dce0ad8 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,11 +13,59 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 12; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int a[SIZE], b[SIZE], c[SIZE]; +// CUDA Device Properties Stuff + +// Print device properties +void printDevProp(cudaDeviceProp devProp) +{ + printf("Major revision number: %d\n", devProp.major); + printf("Minor revision number: %d\n", devProp.minor); + printf("Name: %s\n", devProp.name); + printf("Total global memory: %u\n", devProp.totalGlobalMem); + printf("Total shared memory per block: %u\n", devProp.sharedMemPerBlock); + printf("Total registers per block: %d\n", devProp.regsPerBlock); + printf("Warp size: %d\n", devProp.warpSize); + printf("Maximum memory pitch: %u\n", devProp.memPitch); + printf("Maximum threads per block: %d\n", devProp.maxThreadsPerBlock); + for (int i = 0; i < 3; ++i) + printf("Maximum dimension %d of block: %d\n", i, devProp.maxThreadsDim[i]); + for (int i = 0; i < 3; ++i) + printf("Maximum dimension %d of grid: %d\n", i, devProp.maxGridSize[i]); + printf("Clock rate: %d\n", devProp.clockRate); + printf("Total constant memory: %u\n", devProp.totalConstMem); + printf("Texture alignment: %u\n", devProp.textureAlignment); + printf("Concurrent copy and execution: %s\n", (devProp.deviceOverlap ? "Yes" : "No")); + printf("Number of multiprocessors: %d\n", devProp.multiProcessorCount); + printf("Kernel execution timeout: %s\n", (devProp.kernelExecTimeoutEnabled ? "Yes" : "No")); + return; +} + int main(int argc, char* argv[]) { + // CUDA Device Properties - http://www.cs.fsu.edu/~xyuan/cda5125/examples/lect24/devicequery.cu + // Number of CUDA devices + /*int devCount; + cudaGetDeviceCount(&devCount); + printf("CUDA Device Query...\n"); + printf("There are %d CUDA devices.\n", devCount); + + // Iterate through devices + for (int i = 0; i < devCount; ++i) + { + // Get device properties + printf("\nCUDA Device #%d\n", i); + cudaDeviceProp devProp; + cudaGetDeviceProperties(&devProp, i); + printDevProp(devProp); + } + + printf("\nPress any key to exit..."); + char c1; + scanf("%c", &c1);*/ + // Scan tests printf("\n"); @@ -25,7 +73,7 @@ int main(int argc, char* argv[]) { printf("** SCAN TESTS **\n"); printf("****************\n"); - genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case + genOnesArray(SIZE - 1, a); // Leave a 0 at the end to test that edge case a[SIZE - 1] = 0; printArray(SIZE, a, true); @@ -49,42 +97,42 @@ int main(int argc, char* argv[]) { printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); - + zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); - + zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); printf("\n"); @@ -129,15 +177,15 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); - + zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); system("pause"); // stop Win32 console from closing on exit -} +} \ No newline at end of file diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index ae94ca6..0189cd3 100644 --- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -51,6 +51,12 @@ void genArray(int n, int *a, int maxval) { } } +void genOnesArray(int n, int *a) { + for (int i = 0; i < n; i++) { + a[i] = 1; + } +} + void printArray(int n, int *a, bool abridged = false) { printf(" [ "); for (int i = 0; i < n; i++) { diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 05ce667..256b0df 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,15 +1,15 @@ #include #include "cpu.h" -#include "common.h" +#include "common.h" namespace StreamCompaction { namespace CPU { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** @@ -18,9 +18,15 @@ namespace StreamCompaction { * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. */ void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); + timer().startCpuTimer(); + + odata[0] = 0; + for (int i = 1; i < n; ++i) + { + odata[i] = idata[i - 1] + odata[i - 1]; + } + + timer().endCpuTimer(); } /** @@ -29,10 +35,18 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithoutScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; + timer().startCpuTimer(); + int index = 0; + for (int i = 0; i < n; ++i) + { + if (idata[i]) + { + odata[index] = idata[i]; + index++; + } + } + timer().endCpuTimer(); + return index; } /** @@ -41,10 +55,38 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; + timer().startCpuTimer(); + + int* tempScanArray = (int*)malloc(sizeof(int) * n); + int* tempScanResultArray = (int*)malloc(sizeof(int) * n); + + // Create temporary array + for (int i = 0; i < n; ++i) + { + tempScanArray[i] = (idata[i]) ? 1 : 0; + } + + // Included exclusive scan implementation here in order to avoid the conflict with multiple timers: + tempScanResultArray[0] = 0; + for (int i = 1; i < n; ++i) + { + tempScanResultArray[i] = tempScanArray[i - 1] + tempScanResultArray[i - 1]; + } + + // Scatter + for (int i = 0; i < n; ++i) + { + if (tempScanArray[i]) + { + odata[tempScanResultArray[i]] = idata[i]; + } + } + + int compactSize = (tempScanArray[n - 1]) ? tempScanResultArray[n - 1] + 1 : tempScanResultArray[n - 1]; + timer().endCpuTimer(); + delete []tempScanArray; + delete []tempScanResultArray; + return compactSize; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 36c5ef2..318583f 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -5,20 +5,177 @@ namespace StreamCompaction { namespace Efficient { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + __global__ void kernWorkEfficientScanUpSweep(int n, int d, int* data) + { + /*int blockId = blockIdx.y * gridDim.x + blockIdx.x; + int index = blockId * blockDim.x + threadIdx.x;*/ + int index = blockIdx.x * blockDim.x + threadIdx.x; + int twoPowD = 1 << (d); + int twoPowDPlusOne = 1 << (d + 1); + + if (index >= (n / twoPowDPlusOne)) + { + return; + } + + index = (index + 1) * twoPowDPlusOne - 1; + int add = data[index - twoPowD]; + data[index] += add; + } + + __global__ void kernWorkEfficientScanDownSweep(int n, int d, int* data) + { + /*int blockId = blockIdx.y * gridDim.x + blockIdx.x; + int index = blockId * blockDim.x + threadIdx.x;*/ + int index = blockIdx.x * blockDim.x + threadIdx.x; + int twoPowD = 1 << (d); + int twoPowDPlusOne = 1 << (d + 1); + + if (index >= (n / twoPowDPlusOne)) + { + return; + } + + index = (index + 1) * twoPowDPlusOne - 1; + + int t = data[index - twoPowD]; + data[index - twoPowD] = data[index]; + data[index] += t; + } + + __global__ void kernSetFinalValue(int n, int* data) + { + /*int blockId = blockIdx.y * gridDim.x + blockIdx.x; + int index = blockId * blockDim.x + threadIdx.x;*/ + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) + { + return; + } + + if (index == n - 1) + { + data[n - 1] = 0; + return; + } + } + + __global__ void kernSetTempArray(int n, int* odata, const int* idata) + { + /*int blockId = blockIdx.y * gridDim.x + blockIdx.x; + int index = blockId * blockDim.x + threadIdx.x;*/ + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) + { + return; + } + + odata[index] = (idata[index]) ? 1 : 0; + } + + __global__ void kernScatter(int n, int* odata, const int* tempArray, const int* scannedTempArray, const int* idata) + { + /*int blockId = blockIdx.y * gridDim.x + blockIdx.x; + int index = blockId * blockDim.x + threadIdx.x;*/ + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) + { + return; + } + + if (tempArray[index]) + { + odata[scannedTempArray[index]] = idata[index]; + } + else + { + return; + } } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + int nLog2 = ilog2ceil(n); + + // Pad the array with zeroes in the event that it isn't a power of two + int newN = pow(2, nLog2); + int* paddedArray = (int*) malloc(sizeof(int) * newN); + + for (int i = 0; i < newN; ++i) + { + if (i < newN - n) + { + paddedArray[i] = 0; + } + else + { + paddedArray[i] = idata[i - newN + n]; + } + } + + int numArrayBytes = sizeof(int) * newN; + + // Global memory arrays + int* dev_data; + + // Kernel configuration + float blockSize = 512.f; + dim3 threadsPerBlock(blockSize); + float numBlocks = (((float)newN) + blockSize - 1.f) / blockSize; + dim3 fullBlocksPerGrid((unsigned int) numBlocks); + + cudaMalloc((void**)&dev_data, numArrayBytes); + checkCUDAError("cudaMalloc dev_idata failed!"); + + // Copy the input data array to the device + cudaMemcpy(dev_data, paddedArray, numArrayBytes, cudaMemcpyHostToDevice); + checkCUDAError("memcpy failed!"); + timer().startGpuTimer(); - // TODO + + // Perform the work efficient scan + for (int d = 0; d <= nLog2 - 1; ++d) + { + if (n == 1) break; + float numBlocksDynamic = (((float) newN) / pow(2, d + 1) + blockSize - 1.f) / blockSize; + dim3 fullBlocksPerGrid_efficient((unsigned int) numBlocksDynamic); + kernWorkEfficientScanUpSweep << < fullBlocksPerGrid, threadsPerBlock >> > (newN, d, dev_data); + } + + // Set the final value of the array to 0 as the first step of the down sweep + // Replace this is a device to host memcpy - ask Josh + kernSetFinalValue << < fullBlocksPerGrid, threadsPerBlock >> > (newN, dev_data); + + for (int d = nLog2 - 1; d >= 0; --d) + { + if (n == 1) break; + float numBlocksDynamic = (((float) newN) / pow(2, d + 1) + blockSize - 1.f) / blockSize; + dim3 fullBlocksPerGrid_efficient((unsigned int) numBlocksDynamic); + kernWorkEfficientScanDownSweep << < fullBlocksPerGrid, threadsPerBlock >> > (newN, d, dev_data); + } + timer().endGpuTimer(); + + cudaMemcpy(paddedArray, dev_data, numArrayBytes, cudaMemcpyDeviceToHost); + checkCUDAError("memcpy failed4!"); + + // De-pad the zeroes from the array + for (int i = 0; i < n; ++i) + { + odata[i] = paddedArray[i + newN - n]; + } + + cudaFree(dev_data); + delete(paddedArray); } /** @@ -31,10 +188,111 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. */ int compact(int n, int *odata, const int *idata) { + + int nLog2 = ilog2ceil(n); + + // Pad the array with zeroes in the event that it isn't a power of two + int newN = pow(2, nLog2); + int* paddedArray = (int*) malloc(sizeof(int) * newN); + + for (int i = 0; i < newN; ++i) + { + if (i < newN - n) + { + paddedArray[i] = 0; + } + else + { + paddedArray[i] = idata[i - newN + n]; + } + } + + int numArrayBytes = sizeof(int) * newN; + + // Global memory arrays + int* dev_idata; + int* dev_odata; + int* dev_tempArray; + int* dev_sumTempArray; + + // Kernel configuration + float blockSize = 64.f; + dim3 threadsPerBlock(blockSize); + dim3 fullBlocksPerGrid((((float)newN) + blockSize - 1.f) / blockSize); + + // CUDA Mallocs + cudaMalloc((void**)&dev_odata, numArrayBytes); + checkCUDAError("cudaMalloc dev_odata failed!", __LINE__); + + cudaMalloc((void**)&dev_idata, numArrayBytes); + checkCUDAError("cudaMalloc dev_idata failed!", __LINE__); + + cudaMalloc((void**)&dev_tempArray, numArrayBytes); + checkCUDAError("cudaMalloc dev_tempArray failed!", __LINE__); + + cudaMalloc((void**)&dev_sumTempArray, numArrayBytes); + checkCUDAError("cudaMalloc dev_sumTempArray failed!", __LINE__); + + // Copy the input data array to the device + cudaMemcpy(dev_idata, paddedArray, numArrayBytes, cudaMemcpyHostToDevice); + checkCUDAError("memcpy failed!", __LINE__); + + kernSetTempArray << < fullBlocksPerGrid, threadsPerBlock >> > (newN, dev_tempArray, dev_idata); + + // Copy the input data array to the output array (needs the initial data to be correct) + cudaMemcpy(dev_sumTempArray, dev_tempArray, numArrayBytes, cudaMemcpyDeviceToDevice); + checkCUDAError("memcpy failed!", __LINE__); + timer().startGpuTimer(); - // TODO + + // Perform the work efficient scan - up sweep + for (int d = 0; d <= nLog2 - 1; ++d) + { + dim3 fullBlocksPerGrid_efficient((((float) newN) / pow(2, d + 1) + blockSize - 1.f) / blockSize); + kernWorkEfficientScanUpSweep << < fullBlocksPerGrid, threadsPerBlock >> > (newN, d, dev_sumTempArray); + } + + // Set the final value of the array to 0 as the first step of the down sweep + kernSetFinalValue << < fullBlocksPerGrid, threadsPerBlock >> > (newN, dev_sumTempArray); + + // Perform the work efficient scan - up sweep + for (int d = nLog2 - 1; d >= 0; --d) + { + dim3 fullBlocksPerGrid_efficient((((float) newN) / pow(2, d + 1) + blockSize - 1.f) / blockSize); + kernWorkEfficientScanDownSweep << < fullBlocksPerGrid, threadsPerBlock >> > (newN, d, dev_sumTempArray); + } + + // Scatter + kernScatter << < fullBlocksPerGrid, threadsPerBlock >> > (newN, dev_odata, dev_tempArray, dev_sumTempArray, dev_idata); + timer().endGpuTimer(); - return -1; + + cudaMemcpy(paddedArray, dev_sumTempArray, numArrayBytes, cudaMemcpyDeviceToHost); + checkCUDAError("memcpy failed4!", __LINE__); + + int compactSize = paddedArray[newN - 1]; + + cudaMemcpy(paddedArray, dev_tempArray, numArrayBytes, cudaMemcpyDeviceToHost); + checkCUDAError("memcpy failed4!", __LINE__); + + if (paddedArray[newN - 1]) + { + compactSize++; + } + + cudaMemcpy(paddedArray, dev_odata, numArrayBytes, cudaMemcpyDeviceToHost); + checkCUDAError("memcpy failed4!", __LINE__); + + for (int i = 0; i < n; ++i) + { + odata[i] = paddedArray[i]; + } + + cudaFree(dev_idata); + cudaFree(dev_tempArray); + cudaFree(dev_sumTempArray); + cudaFree(dev_odata); + return compactSize; } } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 9218f8e..1341057 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -5,21 +5,123 @@ namespace StreamCompaction { namespace Naive { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + __global__ void kernNaiveScan(int n, int powTwo, int *odata, const int* idata) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + { + return; + } + + if (index >= powTwo) + { + odata[index] = idata[index - powTwo] + idata[index]; + } + else + { + odata[index] = idata[index]; + } + } + + __global__ void kernExclToInclScan(int n, int* odata, const int* idata) + { + int index = threadIdx.x + blockIdx.x * blockDim.x; + if (index >= n) + { + return; + } + + if (index == 0) + { + odata[0] == 0; + return; + } + + odata[index] = idata[index - 1]; } - // TODO: __global__ /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ - void scan(int n, int *odata, const int *idata) { + void scan(int n, int *odata, const int *idata) + { + // Global memory arrays + int* dev_parallelArrayA; + int* dev_parallelArrayB; + int* dev_odata; + + int numArrayBytes = sizeof(int) * n; + + // CUDA Mallocs + cudaMalloc((void**)&dev_odata, numArrayBytes); + checkCUDAError("cudaMalloc dev_odata failed!", __LINE__); + + cudaMalloc((void**)&dev_parallelArrayA, numArrayBytes); + checkCUDAError("cudaMalloc dev_parallelArrayA failed!", __LINE__); + + cudaMalloc((void**)&dev_parallelArrayB, numArrayBytes); + checkCUDAError("cudaMalloc dev_parallelArrayB failed!", __LINE__); + + // Copy the input data array to the device + cudaMemcpy(dev_parallelArrayA, idata, numArrayBytes, cudaMemcpyHostToDevice); + checkCUDAError("memcpy failed!", __LINE__); + + // Kernel configuration + float blockSize = 64.f; + dim3 threadsPerBlock(blockSize); + dim3 fullBlocksPerGrid((((float)n) + blockSize - 1.f) / blockSize); + int nLog2 = ilog2ceil(n); + bool isReadingFromA = true; + timer().startGpuTimer(); - // TODO + + for (int d = 1; d <= nLog2; ++d) + { + int powTwo = pow(2, d - 1); + if (isReadingFromA) + { + // Perform the *INCLUSIVE* scan + kernNaiveScan << < fullBlocksPerGrid, threadsPerBlock >> > (n, powTwo, dev_parallelArrayB, dev_parallelArrayA); + cudaMemcpy(dev_parallelArrayA, dev_parallelArrayB, numArrayBytes, cudaMemcpyDeviceToDevice); + checkCUDAError("memcpy failed!", __LINE__); + } + else + { + // Perform the *INCLUSIVE* scan + kernNaiveScan << < fullBlocksPerGrid, threadsPerBlock >> > (n, powTwo, dev_parallelArrayA, dev_parallelArrayB); + cudaMemcpy(dev_parallelArrayB, dev_parallelArrayA, numArrayBytes, cudaMemcpyDeviceToDevice); + checkCUDAError("memcpy failed!", __LINE__); + } + isReadingFromA = !isReadingFromA; + } + timer().endGpuTimer(); + + // Copy the output data back into the host array + if (isReadingFromA) + { + kernExclToInclScan << < fullBlocksPerGrid, threadsPerBlock >> > (n, dev_odata, dev_parallelArrayA); + cudaMemcpy(odata, dev_odata, numArrayBytes, cudaMemcpyDeviceToHost); + checkCUDAError("memcpy failed!", __LINE__); + } + else + { + kernExclToInclScan << < fullBlocksPerGrid, threadsPerBlock >> > (n, dev_odata, dev_parallelArrayB); + cudaMemcpy(odata, dev_odata, numArrayBytes, cudaMemcpyDeviceToHost); + checkCUDAError("memcpy failed!", __LINE__); + } + + // CUDA Frees + cudaFree(dev_parallelArrayA); + cudaFree(dev_parallelArrayB); + cudaFree(dev_odata); } } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 36b732d..912ac01 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -8,21 +8,35 @@ namespace StreamCompaction { namespace Thrust { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + + thrust::host_vector host_idata(idata, idata + n); + + thrust::device_vector dev_idata = host_idata; + thrust::device_vector dev_odata(n); + timer().startGpuTimer(); - // TODO use `thrust::exclusive_scan` - // example: for device_vectors dv_in and dv_out: - // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + thrust::exclusive_scan(dev_idata.begin(), dev_idata.end(), dev_odata.begin()); + timer().endGpuTimer(); + + timer().startGpuTimer(); + + thrust::exclusive_scan(dev_idata.begin(), dev_idata.end(), dev_odata.begin()); + + timer().endGpuTimer(); + + thrust::copy(dev_odata.begin(), dev_odata.end(), odata); } } }