Project 2: Joseph Klinger #38

Open · wants to merge 11 commits into base: master
75 changes: 69 additions & 6 deletions README.md
@@ -1,13 +1,76 @@
CUDA Stream Compaction
======================

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2 - Stream Compaction**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Joseph Klinger
* Tested on: Windows 10, i5-7300HQ (4 CPUs) @ ~2.50GHz, GTX 1050 6030MB (Personal Machine)

### (TODO: Your README)
### README

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
This project consists of a series of implementations of inclusive/exclusive scan and stream compaction on both the CPU and the GPU.
We implement a sequential CPU scan, a naive (non-work-efficient) GPU scan, and a work-efficient GPU scan.

A scan is a prefix sum (over an array of integers, for now), meaning that index i of the output array holds the sum of the elements that come before it in the input array (exclusive scan), or of all elements up to and including index i (inclusive scan). Here's a concrete example of each kind of scan, taken from the slides by Patrick Cozzi and Shehzan Mohammed [here](https://docs.google.com/presentation/d/1ETVONA7QDM-WqsEj4qVOGD6Kura5I6E9yqH-7krnwZ0/edit#slide=id.p27):

![](img/scans.png)
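
For reference, here is a minimal sequential CPU sketch of both variants (illustrative code only, not the timed implementation in `stream_compaction/cpu.cu`):

```cpp
#include <vector>

// Exclusive scan: out[i] = sum of in[0 .. i-1], so out[0] = 0.
// Example: in = {3, 1, 7, 0, 4, 1, 6, 3}  ->  out = {0, 3, 4, 11, 11, 15, 16, 22}
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    for (size_t i = 1; i < in.size(); ++i) {
        out[i] = out[i - 1] + in[i - 1];
    }
    return out;
}

// Inclusive scan: out[i] = sum of in[0 .. i] (the exclusive result shifted left
// by one, with the grand total appended).
std::vector<int> inclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in);
    for (size_t i = 1; i < in.size(); ++i) {
        out[i] += out[i - 1];
    }
    return out;
}
```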

How do we scan effectively? The naive way is to launch a thread for every element and, at each iteration, check whether a particular element should contribute to the sum. The following image, taken from [this](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html) GPU Gems chapter, shows the algorithm in execution for a sample input array:

![](img/naive.png)
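
Here is a minimal CUDA sketch of one pass of this naive approach (the kernel and buffer names are illustrative, not the ones used in this repo). The host launches it log2(n) times, doubling `offset` each pass and ping-ponging between two device buffers to avoid read/write races:

```cuda
// One naive (Hillis-Steele) scan pass: every element at index >= offset adds
// the value `offset` positions to its left. After passes with
// offset = 1, 2, 4, ..., the buffer holds an inclusive scan; shifting it right
// by one and inserting 0 at the front turns it into an exclusive scan.
__global__ void kernNaiveScanPass(int n, int offset, int* out, const int* in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = (i >= offset) ? in[i - offset] + in[i] : in[i];
}
```

Every pass still launches a thread per element, which is where the O(n * log(n)) adds come from.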

Ideally, we don't want the GPU to do any work for elements that no longer contribute to a sum in a later iteration of the algorithm. The naive algorithm requires O(n * log(n)) adds. It turns out there is an algorithm that gives us a scan with only O(n) adds (images taken from the aforementioned slides from Patrick and Shehzan):

In the first step, we perform an "up-sweep" that builds partial sums in place:

![](img/upsweep.png)
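
A CUDA sketch of one up-sweep pass (again, the names are illustrative; `twoToD` is 2^d and the array length is assumed to be padded to a power of two):

```cuda
// Up-sweep (reduce) pass at depth d: thread k owns the subtree whose root is
// at index (k + 1) * 2^(d+1) - 1 and adds its left child's partial sum into it.
// Only n / 2^(d+1) threads have any work to do on this pass.
__global__ void kernUpSweepPass(int n, int twoToD, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = twoToD * 2;              // 2^(d+1)
    int index = k * stride + stride - 1;  // root of this thread's subtree
    if (index >= n) return;
    data[index] += data[index - twoToD];  // add the left child's sum
}
```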

After that, we perform a "down-sweep" that completes the other half of our scan:

![](img/downsweep.png)
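
And a matching sketch of one down-sweep pass, which the host runs for d from log2(n) - 1 down to 0 after zeroing the last element (the root of the sum tree):

```cuda
// Down-sweep pass at depth d: each node hands its value to its left child and
// stores (old left-child value + its own value) in the right child, so every
// element ends up holding the sum of everything to its left (an exclusive scan).
__global__ void kernDownSweepPass(int n, int twoToD, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = twoToD * 2;              // 2^(d+1)
    int right = k * stride + stride - 1;
    if (right >= n) return;
    int left = right - twoToD;
    int t = data[left];
    data[left] = data[right];             // pass the parent's value down to the left child
    data[right] += t;                     // right child gets left sum + parent's value
}
```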

Now that we have a fast way to scan in parallel, we can perform stream compaction in a reasonable amount of time.

Stream compaction uses a scan to reduce an array of integers to an array containing only the integers that meet a certain criterion (say, not equal to 0).
Taking some images from the same slides as before, here is step 1 of stream compaction: create an array of 1s and 0s indicating whether or not we want to keep each element, then
run an exclusive scan on that array:

![](img/compact.png)

So, in this example, we want to keep elements a, c, d, and g.
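
A sketch of that first mapping step as a CUDA kernel (hypothetical names):

```cuda
// Mark each element we want to keep with a 1 and everything else with a 0.
// An exclusive scan of `bools` then gives each kept element its output index.
__global__ void kernMapToBoolean(int n, int* bools, const int* idata) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    bools[i] = (idata[i] != 0) ? 1 : 0;
}
```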

As it turns out, for each element of the input array that has a 1 in the intermediate array, the corresponding value in the scanned array is that element's index in the final output array.
This step is called scatter:

![](img/compact2.png)

Now we are left with our desired array.
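
The scatter step is similarly simple to express as a kernel (illustrative sketch):

```cuda
// Scatter: for every element whose flag is 1, the exclusive-scan result
// indices[i] is the slot it lands in within the compacted output array.
__global__ void kernScatter(int n, int* odata, const int* idata,
                            const int* bools, const int* indices) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (bools[i]) {
        odata[indices[i]] = idata[i];
    }
}
```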

### Analysis
Here are the analysis results from my implementations:

![](img/graphAllScans.png)

Here we can see the general trend: the naive parallel scan doesn't perform well, the CPU scan performs decently (relying on the cache isn't bad in this case, where we simply iterate down the array sequentially!), the work-efficient optimization is a must, and Thrust wins easily.

One major aspect to note is the difference between the two work-efficient parallel scans. One uses a naive arrangement of block launching, while the other is smarter and only launches as many threads as are needed. To be specific, the work-efficient algorithm does not update every element of the array on every iteration - in fact, each iteration updates
half as many elements as the previous one. So we should only give work to as many threads as are needed.
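
Here is a host-side sketch of what "only launching threads as needed" means for the up-sweep, reusing the hypothetical `kernUpSweepPass` from above (the block size and the power-of-two padding are assumptions of this sketch, not necessarily what this repo does):

```cuda
void upSweep(int n, int* dev_data) {            // n assumed padded to a power of two
    const int blockSize = 128;
    for (int twoToD = 1; twoToD < n; twoToD *= 2) {
        // Only n / 2^(d+1) tree nodes are touched at this depth, so the grid
        // shrinks by half every pass instead of always covering all n elements.
        int activeThreads = n / (twoToD * 2);
        int blocks = (activeThreads + blockSize - 1) / blockSize;
        kernUpSweepPass<<<blocks, blockSize>>>(n, twoToD, dev_data);
    }
}
```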

In the following graph, we can see that this optimization is what allows the parallel implementation to outperform the CPU implementation:

![](img/graphGPUandCPU.png)

The work-efficient parallel scan is good, but the third-party Thrust implementation still outperforms all of the other implementations by far (as you can see in the first graph as well):

![](img/graphGPU.png)

There are further optimizations that could help the work-efficient parallel scan compete with Thrust, such as using shared memory to minimize reads from global memory (a major slowdown), but they weren't completed for this project (yet!).

Here is the raw output from the program:
![](img/output1.png)

![](img/output2.png)
Binary file added img/compact.png
Binary file added img/compact2.png
Binary file added img/downsweep.png
Binary file added img/graphAllScans.png
Binary file added img/graphGPU.png
Binary file added img/graphGPUandCPU.png
Binary file added img/naive.png
Binary file added img/output1.png
Binary file added img/output2.png
Binary file added img/scans.png
Binary file added img/upsweep.png
76 changes: 62 additions & 14 deletions src/main.cpp
@@ -13,19 +13,67 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 12; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int a[SIZE], b[SIZE], c[SIZE];

// CUDA Device Properties Stuff

// Print device properties
void printDevProp(cudaDeviceProp devProp)
{
printf("Major revision number: %d\n", devProp.major);
printf("Minor revision number: %d\n", devProp.minor);
printf("Name: %s\n", devProp.name);
printf("Total global memory: %u\n", devProp.totalGlobalMem);
printf("Total shared memory per block: %u\n", devProp.sharedMemPerBlock);
printf("Total registers per block: %d\n", devProp.regsPerBlock);
printf("Warp size: %d\n", devProp.warpSize);
printf("Maximum memory pitch: %u\n", devProp.memPitch);
printf("Maximum threads per block: %d\n", devProp.maxThreadsPerBlock);
for (int i = 0; i < 3; ++i)
printf("Maximum dimension %d of block: %d\n", i, devProp.maxThreadsDim[i]);
for (int i = 0; i < 3; ++i)
printf("Maximum dimension %d of grid: %d\n", i, devProp.maxGridSize[i]);
printf("Clock rate: %d\n", devProp.clockRate);
printf("Total constant memory: %u\n", devProp.totalConstMem);
printf("Texture alignment: %u\n", devProp.textureAlignment);
printf("Concurrent copy and execution: %s\n", (devProp.deviceOverlap ? "Yes" : "No"));
printf("Number of multiprocessors: %d\n", devProp.multiProcessorCount);
printf("Kernel execution timeout: %s\n", (devProp.kernelExecTimeoutEnabled ? "Yes" : "No"));
return;
}

int main(int argc, char* argv[]) {
// CUDA Device Properties - http://www.cs.fsu.edu/~xyuan/cda5125/examples/lect24/devicequery.cu
// Number of CUDA devices
/*int devCount;
cudaGetDeviceCount(&devCount);
printf("CUDA Device Query...\n");
printf("There are %d CUDA devices.\n", devCount);

// Iterate through devices
for (int i = 0; i < devCount; ++i)
{
// Get device properties
printf("\nCUDA Device #%d\n", i);
cudaDeviceProp devProp;
cudaGetDeviceProperties(&devProp, i);
printDevProp(devProp);
}

printf("\nPress any key to exit...");
char c1;
scanf("%c", &c1);*/

// Scan tests

printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
printf("****************\n");

genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
genOnesArray(SIZE - 1, a); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

@@ -49,42 +97,42 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
@@ -129,15 +177,15 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
}
}
6 changes: 6 additions & 0 deletions src/testing_helpers.hpp
@@ -51,6 +51,12 @@ void genArray(int n, int *a, int maxval) {
}
}

void genOnesArray(int n, int *a) {
for (int i = 0; i < n; i++) {
a[i] = 1;
}
}

void printArray(int n, int *a, bool abridged = false) {
printf(" [ ");
for (int i = 0; i < n; i++) {
76 changes: 59 additions & 17 deletions stream_compaction/cpu.cu
@@ -1,15 +1,15 @@
#include <cstdio>
#include "cpu.h"

#include "common.h"
#include "common.h"

namespace StreamCompaction {
namespace CPU {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

/**
@@ -18,9 +18,15 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
timer().startCpuTimer();

odata[0] = 0;
for (int i = 1; i < n; ++i)
{
odata[i] = idata[i - 1] + odata[i - 1];
}

timer().endCpuTimer();
}

/**
@@ -29,10 +35,18 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
return -1;
timer().startCpuTimer();
int index = 0;
for (int i = 0; i < n; ++i)
{
if (idata[i])
{
odata[index] = idata[i];
index++;
}
}
timer().endCpuTimer();
return index;
}

/**
@@ -41,10 +55,38 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
return -1;
timer().startCpuTimer();

int* tempScanArray = (int*)malloc(sizeof(int) * n);
int* tempScanResultArray = (int*)malloc(sizeof(int) * n);

// Create temporary array
for (int i = 0; i < n; ++i)
{
tempScanArray[i] = (idata[i]) ? 1 : 0;
}

// Included exclusive scan implementation here in order to avoid the conflict with multiple timers:
tempScanResultArray[0] = 0;
for (int i = 1; i < n; ++i)
{
tempScanResultArray[i] = tempScanArray[i - 1] + tempScanResultArray[i - 1];
}

// Scatter
for (int i = 0; i < n; ++i)
{
if (tempScanArray[i])
{
odata[tempScanResultArray[i]] = idata[i];
}
}

int compactSize = (tempScanArray[n - 1]) ? tempScanResultArray[n - 1] + 1 : tempScanResultArray[n - 1];
timer().endCpuTimer();
free(tempScanArray);   // allocated with malloc, so release with free (not delete[])
free(tempScanResultArray);
return compactSize;
}
}
}