---
title: "Performance Guidelines and Optimization Strategies"
teaching: 60
exercises: 0
questions:
- ""
- ""
- ""
objectives:
- ""
- ""
- ""
keypoints:
- ""
- ""
- ""
---

- [1. Recommended Strategies for Performance Optimization](#1-recommended-strategies-for-performance-optimization)
  - [1.1. Maximization of the Device Utilization](#11-maximization-of-the-device-utilization)
  - [1.2. Maximization of the Memory Throughput](#12-maximization-of-the-memory-throughput)
  - [1.3. Maximization of the Instruction Throughput](#13-maximization-of-the-instruction-throughput)
  - [1.4. Minimization of the Memory Thrashing](#14-minimization-of-the-memory-thrashing)

## 1. Recommended Strategies for Performance Optimization

The [NVIDIA Performance Guidelines](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#performance-guidelines) offer the
following basic strategies for optimizing the performance of an application:

- Maximization of parallel execution in order to achieve maximum utilization of resources on the device(s)
- Optimization of device memory usage in order to maximize the memory throughput
- Improvement of instruction usage in order to gain maximum instruction throughput, and
- Minimization of memory thrashing

The achievable performance gains are usually program- and system-dependent. For example, attempting to improve the compute
performance of a kernel which is mostly limited by its memory accesses will likely not be impactful. As such, all performance
optimization efforts should be guided by quantitative analysis tools such as the [NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/index.html)
and [Nsight Compute](https://docs.nvidia.com/nsight-compute/2021.2/index.html) profilers, which offer a wide variety of performance
metrics for CUDA parallel programs. For instance, the Nsight Compute profiler offers the [GPU Speed of Light section](https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#sections-and-rules),
consisting of metrics which provide a high-level overview of the GPU's memory and compute throughput in terms of the achieved
utilization percentage with respect to the theoretical maximum of the metric being measured. As such, these metrics offer a
great deal of information about how much performance improvement is still possible for a kernel.

In the following sections, let us briefly overview the performance optimization strategies mentioned above.

### 1.1. Maximization of the Device Utilization

In order to maximize the utilization of resources on the device, the developer must expose the program's code to as much
parallelism across the different logical levels of the system as possible. These levels involve: (i) the
[application](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#application-level), (ii) the
[device](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-level), and (iii) the
[multiprocessor](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#multiprocessor-level).

The main goal at the application level is to maximize concurrency in parallel execution between the host and the device(s),
for example by adopting asynchronous CUDA APIs and streams. As such, one attempts to allocate as much parallel work to the
device and as much serial work to the host as possible.

> ## Note
>
> Sometimes the parallelism must be broken for threads to synchronize and share data among themselves.
> If the threads belong to the same thread block, the synchronization can be performed *via* `__syncthreads()` and the data is
> shared through the shared memory within a single kernel execution. However, threads from separate blocks must share data
> *via* different kernel executions *through* the lower-bandwidth global memory. Thus, the second, less performant scenario
> should be minimized due to the kernel execution overheads and slower global memory transfers.
{: .discussion}

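As a minimal sketch of the first, faster scenario (the kernel and the fixed block size of 256 threads are illustrative
assumptions, not part of the lesson), the block-level sum below shares partial results through shared memory and keeps the
threads in lockstep with `__syncthreads()`:

~~~
// Block-level reduction: threads of one block cooperate through shared
// memory. Assumes a power-of-two block size of 256 (illustrative choice).
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    tile[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                        // all shared-memory loads must finish first

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    // One partial sum per block; combining the per-block results requires a
    // second kernel launch through global memory -- the slower scenario above
    if (tid == 0)
        out[blockIdx.x] = tile[0];
}
~~~
{: .language-cuda}
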
The following [list of asynchronous CUDA operations](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution)
can be performed independently and concurrently:

- host computations
- device computations
- host-to-device (HtoD) memory transfer operations
- device-to-host (DtoH) memory transfer operations
- memory transfer operations within an individual device
- memory transfer operations between two or more devices

The CUDA library's asynchronous function calls allow users to dispatch multiple device operations and have them distributed
in queues based on resource availability. Benefiting from this concurrency decreases the device-management workload on the
host and frees it to take part in other simultaneous tasks, which may improve the overall performance.

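A hedged sketch of this pattern is shown below (the kernel `myKernel` and the wrapper function are hypothetical): both the
launch and the stream-ordered copy return immediately, leaving the host free until it explicitly synchronizes.

~~~
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n);   // hypothetical kernel, defined elsewhere

void dispatchAsync(float *d_data, float *h_result, size_t bytes, int n,
                   cudaStream_t stream) {
    // The launch is queued on the device; control returns to the host immediately
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);

    // Stream-ordered copy: also returns immediately
    // (h_result must be page-locked host memory; see the note below)
    cudaMemcpyAsync(h_result, d_data, bytes, cudaMemcpyDeviceToHost, stream);

    // ... the host is now free for other simultaneous tasks ...

    // It blocks only when the result is actually needed
    cudaStreamSynchronize(stream);
}
~~~
{: .language-cuda}
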
Some GPUs with compute capability 2.0 and higher can execute multiple kernels concurrently. The possibility of concurrent
kernel execution can be queried from the device's property variable [`concurrentKernels`](https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_18e2fe2a3b264901816874516af12a097).
The maximum number of concurrent kernel executions also depends on the device's compute capability and can be found in the
[CUDA Toolkit's documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability).
In addition to the concurrent execution of multiple kernels, data transfers/memory copies between the host and the device, as
well as intra-device operations, can also be executed asynchronously among themselves or with kernel launches. The device's
property variable [`asyncEngineCount`](https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_105a89c028bee8fe480d0f44ddd43357b)
can be queried to see whether concurrent data transfer and kernel execution are supported on the available device(s).

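Both properties can be read with `cudaGetDeviceProperties()`; assuming device 0, a minimal query might look as follows:

~~~
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // properties of device 0

    // Non-zero if the device can execute multiple kernels concurrently
    printf("concurrentKernels: %d\n", prop.concurrentKernels);

    // Number of asynchronous copy engines: 1 allows copies to overlap with
    // kernel execution, 2 additionally allows simultaneous HtoD and DtoH copies
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);

    return 0;
}
~~~
{: .language-cuda}
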
> ## Note
>
> The host memory must be page-locked if involved in the overlapped memory copy/data transfer operations.
{: .discussion}

In CUDA applications, concurrent operations, including data transfers and kernel executions, can be handled through
[**streams**](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams). A stream is a sequence of
instructions which execute in order; instructions launched in different streams are independent, and their completion can be
guaranteed *via* synchronization commands.

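Putting the pieces together, the following sketch (the `scale` kernel, the two-way split, and all sizes are illustrative
assumptions) processes the two halves of an array in separate streams, so that on capable devices the copies and kernels of
different streams may overlap. Note the page-locked host allocation with `cudaMallocHost()`, as required by the note above.

~~~
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20, half = n / 2;
    const size_t bytes = half * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost(&h_data, 2 * bytes);   // page-locked host memory (required for overlap)
    cudaMalloc(&d_data, 2 * bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    // Each stream copies its half in, processes it, and copies it back; work
    // issued to different streams may overlap on devices that support it
    for (int s = 0; s < 2; ++s) {
        float *h = h_data + s * half;
        float *d = d_data + s * half;
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream[s]);
        scale<<<(half + 255) / 256, 256, 0, stream[s]>>>(d, half);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    // Synchronization guarantees the independent work in each stream has completed
    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaStreamDestroy(stream[s]);
    }

    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
~~~
{: .language-cuda}
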
### 1.2. Maximization of the Memory Throughput

### 1.3. Maximization of the Instruction Throughput

### 1.4. Minimization of the Memory Thrashing

{% include links.md %}