---
title: "Performance Guidelines and Optimization Strategies"
teaching: 60
exercises: 0
questions:
- ""
- ""
- ""
objectives:
- ""
- ""
- ""
keypoints:
- ""
- ""
- ""
---

- [1. Recommended Strategies for Performance Optimization](#1-recommended-strategies-for-performance-optimization)
  - [1.1. Maximization of the Device Utilization](#11-maximization-of-the-device-utilization)
  - [1.2. Maximization of the Memory Throughput](#12-maximization-of-the-memory-throughput)
  - [1.3. Maximization of the Instruction Throughput](#13-maximization-of-the-instruction-throughput)
  - [1.4. Minimization of the Memory Thrashing](#14-minimization-of-the-memory-thrashing)

## 1. Recommended Strategies for Performance Optimization

The [NVIDIA Performance Guidelines](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#performance-guidelines) offer the
following basic strategies for optimizing the performance of an application:

- Maximization of parallel execution in order to achieve maximum utilization of resources on the device(s)
- Optimization of device memory usage in order to maximize the memory throughput
- Improvement of instruction usage in order to gain maximum instruction throughput, and
- Minimization of memory thrashing

The achievable performance gains are usually program- and system-dependent. For example, attempting to improve the compute
performance of a kernel which is mostly limited by its memory accesses will likely not be impactful. As such, all performance
optimization efforts should be guided by quantitative analysis tools such as the [NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/index.html)
and [Nsight Compute](https://docs.nvidia.com/nsight-compute/2021.2/index.html) profilers, which offer a wide variety of performance
metrics for CUDA parallel programs. For instance, the Nsight Compute profiler offers the [GPU Speed of Light section](https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#sections-and-rules),
consisting of metrics which provide a high-level overview of the GPU's memory and compute throughput in terms of the achieved
utilization percentage with respect to the theoretical maximum of the metric being measured. As such, these metrics offer a
great deal of information about how much performance improvement is still possible for a kernel.

In the following sections, let us briefly overview the performance optimization strategies mentioned above.

### 1.1. Maximization of the Device Utilization

In order to maximize the utilization of resources on the device, the developer must expose the program's code to as much
parallelism across the different logical levels of the system as possible. These levels involve: (i) the
[application](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#application-level), (ii) the
[device](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-level), and (iii) the
[multiprocessor](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#multiprocessor-level).

The main goal at the application level is to maximize concurrency in parallel execution between the host and the device(s),
for example by adopting asynchronous CUDA APIs and streams. As such, one attempts to allocate as much parallel work to the
device and as much serial work to the host as possible.

> ## Note
>
> Sometimes the parallelism must be broken for threads to synchronize and share data among themselves.
> If the threads belong to the same thread block, the synchronization can be performed *via* `__syncthreads()` and the data is
> shared through the shared memory within a single kernel execution. However, threads from separate blocks must share data
> *via* different kernel executions *through* the lower-bandwidth global memory. Thus, the second, less performant scenario
> should be minimized due to the kernel execution overheads and slower global memory transfers.
{: .discussion}

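As a minimal sketch of the first, faster scenario (the kernel and the fixed block size of 256 threads are illustrative
assumptions, not part of the lesson), the block-level sum below shares partial results through shared memory and keeps the
threads in lockstep with `__syncthreads()`:

~~~
// Block-level reduction: threads of one block cooperate through shared
// memory. Assumes a power-of-two block size of 256 (illustrative choice).
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    tile[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                        // all shared-memory loads must finish first

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    // One partial sum per block; combining the per-block results requires a
    // second kernel launch through global memory -- the slower scenario above
    if (tid == 0)
        out[blockIdx.x] = tile[0];
}
~~~
{: .language-cuda}
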
The following [list of asynchronous CUDA operations](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution)
can be performed independently and concurrently:

- host computations
- device computations
- host-to-device (HtoD) memory transfer operations
- device-to-host (DtoH) memory transfer operations
- memory transfer operations within an individual device
- memory transfer operations between two or more devices

The CUDA library's asynchronous function calls allow users to dispatch multiple device operations and have them distributed
in queues based on resource availability. Benefiting from this concurrency decreases the device-management workload on the
host and frees it to take part in other simultaneous tasks, which may improve the overall performance.

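A hedged sketch of this pattern is shown below (the kernel `myKernel` and the wrapper function are hypothetical): both the
launch and the stream-ordered copy return immediately, leaving the host free until it explicitly synchronizes.

~~~
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n);   // hypothetical kernel, defined elsewhere

void dispatchAsync(float *d_data, float *h_result, size_t bytes, int n,
                   cudaStream_t stream) {
    // The launch is queued on the device; control returns to the host immediately
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);

    // Stream-ordered copy: also returns immediately
    // (h_result must be page-locked host memory; see the note below)
    cudaMemcpyAsync(h_result, d_data, bytes, cudaMemcpyDeviceToHost, stream);

    // ... the host is now free for other simultaneous tasks ...

    // It blocks only when the result is actually needed
    cudaStreamSynchronize(stream);
}
~~~
{: .language-cuda}
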
Some GPUs with compute capability 2.0 and higher can execute multiple kernels concurrently. The possibility of concurrent
kernel execution can be queried from the device's property variable [`concurrentKernels`](https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_18e2fe2a3b264901816874516af12a097).
The maximum number of concurrent kernel executions also depends on the device's compute capability and can be found in the
[CUDA Toolkit's documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability).
In addition to the concurrent execution of multiple kernels, data transfers/memory copies between the host and the device, as
well as intra-device operations, can also be executed asynchronously among themselves or with kernel launches. The device's
property variable [`asyncEngineCount`](https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_105a89c028bee8fe480d0f44ddd43357b)
can be queried to see whether concurrent data transfer and kernel execution are supported on the available device(s).

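Both properties can be read with `cudaGetDeviceProperties()`; assuming device 0, a minimal query might look as follows:

~~~
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // properties of device 0

    // Non-zero if the device can execute multiple kernels concurrently
    printf("concurrentKernels: %d\n", prop.concurrentKernels);

    // Number of asynchronous copy engines: 1 allows copies to overlap with
    // kernel execution, 2 additionally allows simultaneous HtoD and DtoH copies
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);

    return 0;
}
~~~
{: .language-cuda}
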
> ## Note
>
> The host memory must be page-locked if involved in the overlapped memory copy/data transfer operations.
{: .discussion}

In CUDA applications, concurrent operations, including data transfers and kernel executions, can be handled through
[**streams**](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams). A stream is a sequence of
instructions which execute in order; instructions launched in different streams are independent, and their completion can be
guaranteed *via* synchronization commands.

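Putting the pieces together, the following sketch (the `scale` kernel, the two-way split, and all sizes are illustrative
assumptions) processes the two halves of an array in separate streams, so that on capable devices the copies and kernels of
different streams may overlap. Note the page-locked host allocation with `cudaMallocHost()`, as required by the note above.

~~~
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20, half = n / 2;
    const size_t bytes = half * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost(&h_data, 2 * bytes);   // page-locked host memory (required for overlap)
    cudaMalloc(&d_data, 2 * bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    // Each stream copies its half in, processes it, and copies it back; work
    // issued to different streams may overlap on devices that support it
    for (int s = 0; s < 2; ++s) {
        float *h = h_data + s * half;
        float *d = d_data + s * half;
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream[s]);
        scale<<<(half + 255) / 256, 256, 0, stream[s]>>>(d, half);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    // Synchronization guarantees the independent work in each stream has completed
    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaStreamDestroy(stream[s]);
    }

    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
~~~
{: .language-cuda}
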
### 1.2. Maximization of the Memory Throughput

### 1.3. Maximization of the Instruction Throughput

### 1.4. Minimization of the Memory Thrashing

{% include links.md %}