diff --git a/_episodes/01-profiling.md b/_episodes/01-profiling.md
index 45822d5..d36d54e 100644
--- a/_episodes/01-profiling.md
+++ b/_episodes/01-profiling.md
@@ -1,31 +1,53 @@
 ---
-title: "Profiling with NVIDIA Nsight"
+title: "NVIDIA Nsight Profilers"
 teaching: 60
 exercises: 0
 questions:
 - "What is profiling? Why and how is it useful for parallelization?"
 - "What are NVIDIA Nsight Systems and Nsight Compute? What do they do and how can I use them?"
+- "What is the difference between Nsight Systems/Compute's command line interface (CLI) and graphical user interface (GUI) profilers?"
 objectives:
-- "Mastering best practices in profiling-driven approach in CUDA C/C++ programming"
+- "Learning the fundamentals of the NVIDIA Nsight Systems CLI profiler"
+- "Basic familiarity with the NVIDIA Nsight Systems GUI profiler"
+- "Mastering the basics of the NVIDIA Nsight Compute CLI profiler"
+- "Basic familiarity with the NVIDIA Nsight Compute GUI profiler"
 keypoints:
-- "NVIDIA Nsight Systems"
-- "Profiling-driven CUDA C/C++ programming"
-- "APOD application design cycle"
+- "NVIDIA Nsight Systems CLI profiler"
+- "NVIDIA Nsight Systems GUI profiler"
+- "NVIDIA Nsight Compute CLI profiler"
+- "NVIDIA Nsight Compute GUI profiler"
 ---
-- [1. Overview](#1-overview)
-- [2. NVIDIA Nsight Systems](#2-nvidia-nsight-systems)
-  - [2.1. Command Line Interface Profiler](#21-command-line-interface-profiler)
-    - [2.1.1. CUDA API Statistics](#211-cuda-api-statistics)
-    - [2.1.2. CUDA Kernel Statistics](#212-cuda-kernel-statistics)
-    - [2.1.3. CUDA Memory Operations Statistics](#213-cuda-memory-operations-statistics)
-    - [2.1.4. Operating System Runtime API Statistics](#214-operating-system-runtime-api-statistics)
-  - [2.2. Graphical User Interface Profiler](#22-graphical-user-interface-profiler)
-- [3. NVIDIA Nsight Compute](#3-nvidia-nsight-compute)
-  - [3.1. Command Line Interface Profiler](#31-command-line-interface-profiler)
-  - [3.2. Graphical User Interface Profiler](#32-graphical-user-interface-profiler)
-- [4. Example: Vector Addition (AXPY)](#4-example-vector-addition-axpy)
-- [5. Example: Matrix Addition](#5-example-matrix-addition)
+> ## Software/Hardware Specifications
+> All profiling chapters adopt the latest versions of NVIDIA Nsight Systems
+> [(2021.3.1.54)](https://developer.nvidia.com/gameworksdownload#?dn=nsight-systems-2021-3-1-54)
+> and Nsight Compute [(2021.2.2.0)](https://developer.nvidia.com/gameworksdownload#?dn=nsight-compute-2021-2-2)
+> available at the time of writing this tutorial. This version of Nsight Systems supports important
+> emerging features such as [**Expert System**](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#expert-systems),
+> designed for automatic detection of performance optimization opportunities in a parallel application's
+> profile, within both the command line interface (CLI) and graphical user interface (GUI) frameworks.
+{: .callout}
+
+> ## Table of Contents
+> - [1. Overview](#1-overview)
+> - [2. NVIDIA Nsight Systems](#2-nvidia-nsight-systems)
+>   - [2.1. Command Line Interface Profiler](#21-command-line-interface-profiler)
+>     - [2.1.1. CUDA API Statistics](#211-cuda-api-statistics)
+>     - [2.1.2. CUDA Kernel Statistics](#212-cuda-kernel-statistics)
+>     - [2.1.3. CUDA Memory Operations Statistics](#213-cuda-memory-operations-statistics)
+>     - [2.1.4. Operating System Runtime API Statistics](#214-operating-system-runtime-api-statistics)
+>   - [2.2. Graphical User Interface Profiler](#22-graphical-user-interface-profiler)
+>     - [2.2.1. Direct Performance Analysis using Nsight Systems' GUI Profiler](#221-direct-performance-analysis-using-nsight-systems-gui-profiler)
+>     - [2.2.2. Importing Report Files in Nsight Systems' GUI Profiler](#222-importing-report-files-in-nsight-systems-gui-profiler)
+> - [3. NVIDIA Nsight Compute](#3-nvidia-nsight-compute)
+>   - [3.1. Command Line Interface Profiler](#31-command-line-interface-profiler)
+>     - [3.1.1. GPU Speed of Light](#311-gpu-speed-of-light)
+>     - [3.1.2. Launch Statistics](#312-launch-statistics)
+>     - [3.1.3. Occupancy](#313-occupancy)
+>   - [3.2. Graphical User Interface Profiler](#32-graphical-user-interface-profiler)
+>     - [3.2.1. Direct Performance Analysis using Nsight Compute's GUI Profiler](#321-direct-performance-analysis-using-nsight-computes-gui-profiler)
+>     - [3.2.2. Importing Report Files in Nsight Compute's GUI Profiler](#322-importing-report-files-in-nsight-computes-gui-profiler)
+{: .prereq}
 
 ## 1. Overview
@@ -318,16 +340,472 @@ In the following sections, we overview the main aspects of the NVIDIA Nsight Sys
 
 ### 2.2. Graphical User Interface Profiler
 
+There are two main scenarios for working with the Nsight Systems GUI framework:
+- Generating report files (.qdrep) with the CLI profiler and importing them into the Nsight Systems GUI profiler for further analysis
+- [Profiling the workload directly from the GUI](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gui-profiling)
+
+In the following subsections, we briefly overview each of these use cases.
+
+#### 2.2.1. Direct Performance Analysis using Nsight Systems' GUI Profiler
+
+Let's fire up the Nsight Systems GUI profiler application by running the following command
+
+~~~
+$ nsys-ui
+~~~
+{: .language-bash}
+
+This should open up the application panel, which looks like the figure below.
+
+![Figure 1](../fig/nsys-ui_main.png)
+
+The *Project Explorer* panel on the left side of the screen will contain all projects (*e.g.*, Project 1) and report files in a tree
+structure. Nsight Systems makes it possible to compare two or more profiling report files and/or projects side by side within the
+same environment.
+
+> ## Note:
+> On a Linux OS, all project folders are physically stored in *~/.nsightsystems/Projects/*.
+{: .discussion}
+
+At this stage, the host machine for performing the profiling process should be specified. 
+In this case, the Nsight Systems GUI profiler has already identified the localhost as an available profiling target and notified us with
+the message "You have the (localhost) target available." Simply click on the *Select* button at the end of that message. Alternatively, one
+can click on the *Manage targets* wrench icon or the *Select target for profiling...* dropdown menu button. Clicking on *Configure targets...*
+in this menu takes you to the *Manage targets* window, which allows one to connect to a remote machine for profiling.
+
+As soon as the target machine is selected, a rather large list of checkboxes shows up, allowing users to customize their profiling process
+and specify which types of information should be collected and reported as a result of the profiling process. As shown in the figure below,
+similar to the CLI profiler command, it is mandatory to specify the working directory and the target application executable to be profiled.
+The aforementioned parameters can be specified in their corresponding entries within the *Sample target process/Target application*
+drop-down menu combobox.
+
+![Figure 2](../fig/nsys-ui_main2.png)
+
+Let us press the *Start* button to begin profiling the program. After the profiling process finishes, the results will be shown in a new
+window, as illustrated below.
+
+![Figure 3](../fig/nsys-ui_rep1.png)
+
+The Project Explorer panel shows that the generated report file (Report 1.qdrep) is now part of Project 1. Therefore, we should be able to
+find the report file in *~/.nsightsystems/Projects/Project 1/* by default. In setting up our profiling preferences, we chose to have the
+profiler collect only CUDA API, memory, and kernel operations during the execution of the program, in order to simplify the analysis. As
+such, only two timeline channels are available for our inspection within the *Timeline View* of the main central panel: (i) Threads, and
+(ii) CUDA Hardware (HW). 
The numbers following the CUDA HW refer to the bus location (here, 0000:01:00.0)
+and the GPU device name/model (in this case, GeForce GTX 1650).
+
+The default zoom level, as well as the sizing of each panel section, might not be ideal for a convenient analysis. The zoom level can easily
+be modified either by using the horizontal slide button at the top of the Timeline View panel or by using Ctrl + mouse scroll. Alternatively,
+the area of interest in the timeline panel can be highlighted by left-clicking at the starting time, dragging the mouse to the final point
+in time, and releasing the mouse button. Then, right-click on the selected area and select *Filter and Zoom in* or *Zoom into Selection*
+from the menu. The right-click menu also provides incremental and reset-to-default options for reversing the aforementioned operations.
+
+![Figure 4](../fig/nsys-ui_rep2.png)
+
+The GUI offers an even more convenient and productive way of inspecting the timelines. The message at the center of the *Events View* panel
+at the bottom of the screen says: *"Right-click a time-line row and select "Show in Events View" to see events here"*. Let's slide the zoom
+button at the top of the screen to widen the timelines a little and then follow the instructions in the message. The resulting screen should
+look like the following screenshot
+
+![Figure 5](../fig/nsys-ui_rep3.png)
+
+Since we have right-clicked on the CUDA HW, which includes both *Memory* and *Kernels* sub-timelines, both CUDA memory and kernel operation
+statistics are included in the report table within the Events View panel. Clicking on each row, which corresponds to an operation on the
+device, will populate the *Description* panel in the bottom-right corner. The populated panel provides a summary that corresponds to the
+selected row in the Events View. 
The exact same description
+can also be obtained by hovering over each block within each CUDA HW timeline, as shown below.
+
+![Figure 6](../fig/nsys-ui_rep4.png)
+
+Using either the Timeline View or the Events View panel, one can describe the main parts of the
+[vector sum example](https://github.com/MolSSI-Education/gpu_programming_beginner/tree/gh-pages/src/gpu_vector_sum/v2_cudaCode)
+CUDA application: there are three host-to-device (HtoD) data transfers to the `d_a`, `d_b`, and `d_c` arrays allocated on the device,
+the `arraySumOnDevice` kernel launch, and the transfer of the results back from the device (DtoH).
+
+#### 2.2.2. Importing Report Files in Nsight Systems' GUI Profiler
+
+The Nsight Systems GUI profiler can also be employed to import report files generated by the CLI profiler or by the GUI profiler itself,
+as shown in the previous subsection. A report file can easily be opened from the *File/Open* menu. Since our description of the profiler
+in the previous subsection remains valid here, we will not repeat it.
+
+> ## Note:
+> Experienced users who have worked with the NVIDIA Visual Profiler (nvvp) and the NVIDIA profiler (nvprof) will probably notice a lot of
+> similarities between their user interfaces and those of Nsight Systems. Both nvvp and nvprof will be deprecated in future CUDA releases
+> and will not be supported by new GPUs with compute capability 8.0 and higher. For more details, see NVIDIA developers' blog posts on
+> [*Migrating to NVIDIA Nsight Tools from NVVP and Nvprof*](https://developer.nvidia.com/blog/migrating-nvidia-nsight-tools-nvvp-nvprof/)
+> and [*Transitioning to Nsight Systems from NVIDIA Visual Profiler/nvprof*](https://developer.nvidia.com/blog/transitioning-nsight-systems-nvidia-visual-profiler-nvprof/).
+{: .discussion}
 
 ## 3. NVIDIA Nsight Compute
 
+[Nsight Compute](https://docs.nvidia.com/nsight-compute/2021.2/index.html) is a highly efficient interactive kernel profiler which, similar
+to NVIDIA Nsight Systems, provides both a [CLI](https://docs.nvidia.com/nsight-compute/2021.2/NsightComputeCli/index.html) and a
+[GUI](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html). Nsight Compute offers a comprehensive list of performance
+metrics, API debugging tools, and the possibility of comparing profiling reports, all in one place. The list of available metrics is huge
+and can be queried using
+
+~~~
+$ ncu --query-metrics
+~~~
+{: .language-bash}
+
+which generates the following output (only the first few lines are shown)
+
+~~~
+Device TU117
+------------------------------------------------------- ----------------------------------------------------------------------------
+Metric Name                                              Metric Description
+------------------------------------------------------- ----------------------------------------------------------------------------
+dram__bytes                                              # of bytes accessed in DRAM
+dram__bytes_read                                         # of bytes read from DRAM
+dram__bytes_write                                        # of bytes written to DRAM
+dram__cycles_active                                      # of cycles where DRAM was active
+dram__cycles_active_read                                 # of cycles where DRAM was active for reads
+dram__cycles_active_write                                # of cycles where DRAM was active for writes
+dram__cycles_elapsed                                     # of elapsed DRAM memory clock cycles
+dram__cycles_in_frame                                    # of cycles in user-defined frame
+dram__cycles_in_region                                   # of cycles in user-defined region
+dram__sectors                                            # of sectors accessed in DRAM
+dram__sectors_read                                       # of sectors read from DRAM
+dram__sectors_write                                      # of sectors written to DRAM
+dram__throughput                                         DRAM throughput
+...
+~~~
+{: .output}
+
+At this point, instead of printing the output to the console, it might be easier to store it in a text file by appending `>> .txt`
+(with a file name of your choice) to the aforementioned command. 
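+
+As a quick sketch of this workflow (the file name `metrics.txt` below is our own choice, not part of the lesson):
+
+~~~
+$ ncu --query-metrics > metrics.txt   # save the full list for offline searching
+$ wc -l metrics.txt                   # count how many metrics the device exposes
+~~~
+{: .language-bash}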
Once we explain the metrics' naming conventions, such queries can be performed much more efficiently,
+based on the logical units involved with the measured metrics. For example, we can narrow down our search to only those metrics pertinent
+to the device's main dynamic random-access memory (DRAM), denoted by the prefix `dram`, as
+
+~~~
+$ ncu --query-metrics | grep dram__
+~~~
+{: .language-bash}
+
+The resulting output is similar to the one shown above, but restricted to the `dram__` metrics.
+
 ### 3.1. Command Line Interface Profiler
 
+Having a long list of available metrics can be very helpful, as it allows choosing from a fine-tuned variety of options to analyze the
+performance of a parallel application. However, for an inexperienced user, making the right choice of metric(s) for performance
+analysis can be overwhelming. NVIDIA deals with this problem by using pre-defined
+[sets and sections](https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#sets-and-sections) of logically associated
+[metrics](https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#metric-collection). 
The current list of available sections can be queried *via*
+
+~~~
+$ ncu --list-sections
+~~~
+{: .language-bash}
+
+which yields
+
+~~~
+--------------------------------- ------------------------------------- ------- --------------------------------------------------
+Identifier                        Display Name                          Enabled Filename
+--------------------------------- ------------------------------------- ------- --------------------------------------------------
+ComputeWorkloadAnalysis           Compute Workload Analysis             no      ...20.3.0/Sections/ComputeWorkloadAnalysis.section
+InstructionStats                  Instruction Statistics                no      ...2020.3.0/Sections/InstructionStatistics.section
+LaunchStats                       Launch Statistics                     yes     ...pute/2020.3.0/Sections/LaunchStatistics.section
+MemoryWorkloadAnalysis            Memory Workload Analysis              no      ...020.3.0/Sections/MemoryWorkloadAnalysis.section
+MemoryWorkloadAnalysis_Chart      Memory Workload Analysis Chart        no      ...0/Sections/MemoryWorkloadAnalysis_Chart.section
+MemoryWorkloadAnalysis_Deprecated (Deprecated) Memory Workload Analysis no      ...tions/MemoryWorkloadAnalysis_Deprecated.section
+MemoryWorkloadAnalysis_Tables     Memory Workload Analysis Tables       no      .../Sections/MemoryWorkloadAnalysis_Tables.section
+Nvlink                            NVLink                                no      ...Nsight Compute/2020.3.0/Sections/Nvlink.section
+Occupancy                         Occupancy                             yes     ...ght Compute/2020.3.0/Sections/Occupancy.section
+SchedulerStats                    Scheduler Statistics                  no      ...e/2020.3.0/Sections/SchedulerStatistics.section
+SourceCounters                    Source Counters                       no      ...ompute/2020.3.0/Sections/SourceCounters.section
+SpeedOfLight                      GPU Speed Of Light                    yes     ... Compute/2020.3.0/Sections/SpeedOfLight.section
+SpeedOfLight_RooflineChart        GPU Speed Of Light Roofline Chart     no      ...3.0/Sections/SpeedOfLight_RooflineChart.section
+WarpStateStats                    Warp State Statistics                 no      ...e/2020.3.0/Sections/WarpStateStatistics.section
+~~~
+{: .output}
+
+Sections, in turn, are grouped into *sets* of metrics, allowing users to choose between faster but less detailed profiles
+and slower but more comprehensive metric collections. 
Available sets can be listed *via*
+
+~~~
+$ ncu --list-sets
+~~~
+{: .language-bash}
+
+which gives
+
+~~~
+Identifier Sections                                                                    Enabled Estimated Metrics
+---------- --------------------------------------------------------------------------- ------- -----------------
+default    LaunchStats, Occupancy, SpeedOfLight                                        yes     36
+detailed   ComputeWorkloadAnalysis, InstructionStats, LaunchStats, MemoryWorkloadAnaly no      173
+           sis, Nvlink, Occupancy, SchedulerStats, SourceCounters, SpeedOfLight, Speed
+           OfLight_RooflineChart, WarpStateStats
+full       ComputeWorkloadAnalysis, InstructionStats, LaunchStats, MemoryWorkloadAnaly no      178
+           sis, MemoryWorkloadAnalysis_Chart, MemoryWorkloadAnalysis_Tables, Nvlink, O
+           ccupancy, SchedulerStats, SourceCounters, SpeedOfLight, SpeedOfLight_Roofli
+           neChart, WarpStateStats
+source     SourceCounters                                                              no      56
+~~~
+{: .output}
+
+The first row of the available-sets table (or the third column of the available-sections table) shown above indicates that the metrics the
+Nsight Compute CLI profiler collects by default include high-level GPU utilization, occupancy, and static launch data. The latter two are
+regularly available without [replaying the kernel launch](https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#kernel-replay).
+When none of the `--set`, `--section`, or `--metrics` options is provided to the CLI profiler, Nsight Compute will collect only the default
+set of metrics. Although the full set of sections can be collected using the `--set full` option, it is important to keep in mind that the
+number and type of the selected metrics directly affect the profiling overhead and performance.
+
+Let us once again run the CLI profiler on the
+[vector sum example](https://github.com/MolSSI-Education/gpu_programming_beginner/tree/gh-pages/src/gpu_vector_sum/v2_cudaCode)
+in order to collect the default set of metrics for the `arraySumOnDevice` kernel. 
+
+~~~
+$ ncu -o output vecSum
+~~~
+{: .language-bash}
+
+This command exports the profiling results to the `output.ncu-rep` file. In the absence of the `-o` option and output file name, the results
+will be printed to the console and stored in temporary files that are deleted after the execution finishes. Apart from the application's
+print statements and the profiler logs (denoted by `==PROF==`),
+
+~~~
+Kicking off /home/sina/MOLSSI/gpu_programming_beginner/src/gpu_vector_sum/v2_cudaCode/vecSum
+
+==PROF== Connected to process 18985 (/home/sina/MOLSSI/gpu_programming_beginner/src/gpu_vector_sum/v2_cudaCode/vecSum)
+GPU device GeForce GTX 1650 with index (0) is set!
+
+Vector size: 16777216 floats (64 MB)
+
+Elapsed time for dataInitializer: 0.765827 second(s)
+Elapsed time for arraySumOnHost: 0.064471 second(s)
+==PROF== Profiling "arraySumOnDevice" - 1: 0%....50%....100% - 8 passes
+Elapsed time for arraySumOnDevice <<< 16384, 1024 >>>: 0.570959 second(s)
+
+Arrays are equal.
+
+==PROF== Disconnected from process 18985
+[18985] vecSum@127.0.0.1
+  arraySumOnDevice(float*, float*, float*, int), 2021-Oct-05 13:48:45, Context 1, Stream 7
+~~~
+{: .output}
+
+the resulting profiler output consists of three main sections: (i) GPU Speed of Light, (ii) kernel Launch Statistics, and (iii) Occupancy.
+
+#### 3.1.1. GPU Speed of Light
+
+The [GPU Speed of Light section](https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#sections-and-rules) offers a
+high-level summary of the device's memory and compute throughput in terms of the achieved utilization percentage with respect to the
+maximum theoretical limit of the metric being measured. 
The following table shows the GPU Speed of Light section of the default metric collection output for the
+Nsight Compute CLI profiler
+
+~~~
+  Section: GPU Speed Of Light
+  ---------------------------------------------------------------------- --------------- ------------------------------
+  DRAM Frequency                                                           cycle/nsecond                           4.00
+  SM Frequency                                                             cycle/nsecond                           1.41
+  Elapsed Cycles                                                                   cycle                      2,499,985
+  Memory [%]                                                                           %                          89.44
+  SOL DRAM                                                                             %                          89.44
+  Duration                                                                       msecond                           1.77
+  SOL L1/TEX Cache                                                                     %                          13.15
+  SOL L2 Cache                                                                         %                          33.45
+  SM Active Cycles                                                                 cycle                   2,269,457.25
+  SM [%]                                                                               %                          10.49
+  ---------------------------------------------------------------------- --------------- ------------------------------
+  OK    The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
+        further improve performance, work will likely need to be shifted from the most utilized to another unit.
+        Start by analyzing workloads in the Memory Workload Analysis section.
+~~~
+{: .output}
+
+Note that the Nsight Compute profiler reports that the `arraySumOnDevice` kernel, with the adopted execution configuration, achieved more
+than 80% of the theoretical limit of the memory/compute throughput. The profiler also provides additional recommendation(s) for further
+performance improvement. We will talk about the strategies recommended by the profiler in the upcoming lessons.
+
+#### 3.1.2. Launch Statistics
+
+The [Launch Statistics section](https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#sections-and-rules) offers
+details of the execution configuration adopted for the profiled kernel launch, such as the number of threads per block, the total number
+of threads, the total number of blocks, *etc.* 
+
+~~~
+  Section: Launch Statistics
+  ---------------------------------------------------------------------- --------------- ------------------------------
+  Block Size                                                                                                      1,024
+  Function Cache Configuration                                                                  cudaFuncCachePreferNone
+  Grid Size                                                                                                      16,384
+  Registers Per Thread                                                   register/thread                             16
+  Shared Memory Configuration Size                                                 Kbyte                          32.77
+  Driver Shared Memory Per Block                                              byte/block                              0
+  Dynamic Shared Memory Per Block                                             byte/block                              0
+  Static Shared Memory Per Block                                              byte/block                              0
+  Threads                                                                         thread                     16,777,216
+  Waves Per SM                                                                                                    1,024
+  ---------------------------------------------------------------------- --------------- ------------------------------
+~~~
+{: .output}
+
+#### 3.1.3. Occupancy
+
+The multiprocessor occupancy is defined as the ratio of active warps to the maximum number of warps supported on a multiprocessor of the
+GPU. Alternatively, occupancy can be viewed as a measure of the device's ability to keep its warps actively in use.
+
+One way to calculate the multiprocessor occupancy is by using the
+[CUDA Occupancy Calculator](https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html), which is a .xls spreadsheet file with
+pre-defined macros. The user populates the required fields in the spreadsheet, and it calculates and returns the multiprocessor occupancy.
+Another way to calculate the multiprocessor occupancy is to use the Nsight Compute kernel profiler. 
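+
+The arithmetic behind such an occupancy calculator can be sketched in a few lines of plain C. The device limits below are illustrative
+values for a Turing-class GPU like the one used in this lesson; they are hard-coded here rather than queried from a real device:
+
+~~~
+#include <stdio.h>
+
+int main(void) {
+    const int warpSize      = 32;   /* threads per warp                     */
+    const int maxWarpsPerSM = 32;   /* illustrative hardware limit per SM   */
+    const int blockSize     = 1024; /* threads per block used by vecSum     */
+
+    int warpsPerBlock = (blockSize + warpSize - 1) / warpSize;       /* 32  */
+    int blocksPerSM   = maxWarpsPerSM / warpsPerBlock;               /* 1   */
+    double occupancy  = 100.0 * blocksPerSM * warpsPerBlock / maxWarpsPerSM;
+
+    printf("Theoretical occupancy: %.0f %%\n", occupancy);           /* 100 */
+    return 0;
+}
+~~~
+{: .language-c}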
The occupancy metric is automatically collected and tabulated in the Occupancy section of the
+output of the Nsight Compute CLI profiler as follows
+
+~~~
+  Section: Occupancy
+  ---------------------------------------------------------------------- --------------- ------------------------------
+  Block Limit SM                                                                   block                             16
+  Block Limit Registers                                                            block                              4
+  Block Limit Shared Mem                                                           block                             16
+  Block Limit Warps                                                                block                              1
+  Theoretical Active Warps per SM                                                   warp                             32
+  Theoretical Occupancy                                                                %                            100
+  Achieved Occupancy                                                                   %                          90.37
+  Achieved Active Warps Per SM                                                      warp                          28.92
+  ---------------------------------------------------------------------- --------------- ------------------------------
+~~~
+{: .output}
+
+This table illustrates that the `arraySumOnDevice` kernel in the
+[vector sum example](https://github.com/MolSSI-Education/gpu_programming_beginner/tree/gh-pages/src/gpu_vector_sum/v2_cudaCode)
+activates more than 90% of the available warps on each streaming multiprocessor (SM).
+
+> ## Note:
+> Higher values of occupancy do not always translate into higher performance. However, low occupancy always indicates the GPU's reduced
+> ability to hide latencies and, thus, performance degradation. Note that a large gap between the achieved and theoretical occupancy
+> during kernel execution implies an imbalanced workload.
+{: .discussion}
+
 ### 3.2. Graphical User Interface Profiler
 
-## 4. Example: Vector Addition (AXPY)
+Similar to its Nsight Systems counterpart, the Nsight Compute GUI also provides two main scenarios for the performance analysis of CUDA kernel(s):
+
+- Generating report files (.ncu-rep) with the CLI profiler and importing them into the Nsight Compute GUI for further analysis
+- [Profiling the workload directly from the GUI](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#quick-start)
+
+In the following subsections, we briefly overview each of these use cases.
+
+#### 3.2.1. Direct Performance Analysis using Nsight Compute's GUI Profiler
+
+Let us start the Nsight Compute GUI profiler application by running the following command
+
+~~~
+$ ncu-ui
+~~~
+{: .language-bash}
+
+This will open up the main application window and an embedded *Welcome Page* window, which looks like the figure below. This window gives
+you multiple options to start your performance analysis, either by opening/creating a project file or by using the last session's settings
+to get straight to the profiling settings without an existing project.
+
+![Figure 7](../fig/ncu-ui_main.png)
+
+Clicking on the Continue button under *Quick Launch* gets us to the *Connect to process*
+[connection dialog](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#connection-dialog) as shown below
+
+![Figure 8](../fig/ncu-ui_connect.png)
+
+The Connect to process window has two panels: (i) *Target Platform*, and (ii) *Activity*. The Target Platform panel's interface is very
+similar to what we saw in Section
+[2.2.1. Direct Performance Analysis using Nsight Systems' GUI Profiler](#221-direct-performance-analysis-using-nsight-systems-gui-profiler)
+for Nsight Systems: the user needs to specify the target platform and the
+local/[remote](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#remote-connections) hosting machine on which the
+profiling process is going to be performed. The absolute address of the parallel application's executable file must be specified in the
+*Application Executable* field. Any additional options can be passed to the executable *via* the *Command Line Arguments* field. 
+
+The Activity panel offers two distinct profiling modes:
+
+- [**Interactive Profile**](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#quick-start-interactive)
+- [**Profile (non-interactive)**](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#quick-start-non-interactive)
+
+The Interactive Profile mode allows users to interactively control the target application's execution, and to isolate and profile the
+kernels of interest in a step-by-step fashion. For the purpose of this tutorial, we will not get into the details of the
+[Interactive Profile Activity](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#quick-start-interactive).
+
+After selecting the (non-interactive) Profile mode from the Activity panel, fill in the name and absolute address of the output report file
+in the *Output File* field within the *Common* tab. This maps to specifying the output file name using the `--export` or `-o` CLI options.
+The provided name will be appended with the `.ncu-rep` suffix. The *Target Processes* field specifies the process(es) to be profiled:
+(i) *Application Only*, which only profiles the application itself, and (ii) *All*, which tells Nsight Compute to profile the target
+application and all its child processes. Setting the options for the Target Processes field maps to specifying the `--target-processes`
+CLI option. In the [*Replay Mode*](https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#kernel-replay) field, the user
+can choose between the *Kernel* and *Application* options. In the Kernel mode, individual kernel launches are replayed for metric
+collection within the course of a single execution of the target application. In the Application mode, however, the target application
+itself is replayed multiple times, which allows the collection of additional data for kernel launches. 
+Setting the Replay Mode option is equivalent to specifying the `--replay-mode` CLI option. The remaining fields also map to their
+[CLI profiler counterpart options](https://docs.nvidia.com/nsight-compute/2021.2/NsightComputeCli/index.html#command-line-options). The
+entire profiling command with the specified options is automatically generated and can be copied from the *Command Line* text box.
+
+The *Filter* tab allows users to select the target kernels to be profiled and includes a kernel regex filter, the number of kernel launches
+to be skipped, and the total number of kernel executions to be profiled. The *Sections* tab allows the specification of the metric
+[section(s)](https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#sets-and-sections) to be collected for each kernel
+launch. Hovering over each section provides its description in a pop-up tooltip.
+
+![Figure 9](../fig/ncu-ui_sections.png)
+
+The *Sampling* tab allows the modification of sampling options for each kernel execution. The *Other* tab offers options for the
+application of rules, controlling the profiling process, custom metrics, and options for collecting
+[NVIDIA Tools Extension (NVTX)](https://developer.nvidia.com/blog/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx/)
+information. It is time to press the *Launch* button at the bottom-right corner of the window to fire up the profiling process.
+
+After Nsight Compute finishes the profiling process, the resulting report file shows up under the corresponding project file in the
+*Project Explorer* panel. 
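+
+For reference, the auto-generated command in the *Command Line* text box for settings like those described above might look similar to the
+following sketch (the report name and executable path are illustrative, not copied from the GUI):
+
+~~~
+$ ncu --export report --target-processes application-only --replay-mode kernel ./vecSum
+~~~
+{: .language-bash}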
+The Nsight Compute profiler report file contains multiple [report pages](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#profiler-report-pages): + +- [**Session**](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#profiler-report-session-page) +- [**Summary**](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#profiler-report-summary-page) +- [**Details**](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#profiler-report-details-page) +- [**Source**](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#profiler-report-source-page) +- [**Comments**](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#profiler-report-comments-page) +- [**NVTX**](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#profiler-report-nvtx-page) +- [**Raw**](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#profiler-report-raw-page) + +The aforementioned report pages can be accessed from the *Page* dropdown button at the top of the report panel. The *Session* page provides preliminary +information about the hosting machine, active process ID(s) to be profiled and device attributes. Switching between various launch instances highlights the +corresponding device attributes. The *Summary* page overviews a list of collected results across all kernel executions. A screenshot of the Summary page +from profiling the `arraySumOnDevice` kernel in the [vector sum example](https://github.com/MolSSI-Education/gpu_programming_beginner/tree/gh-pages/src/ +gpu_vector_sum/v2_cudaCode) is provided below + +![Figure 10](../fig/ncu-ui_summary.png) + +The *Details* page is where the Nsight Compute's GUI profiler automatically lands on by default, once the profiling process is finished. + +![Figure 11](../fig/ncu-ui_rep1.png) + +Similar to the output of the **ncu** CLI profiler we saw in Section [3.1. 
Command Line Interface Profiler](#31-command-line-interface-profiler), there
+are three metric sections collected by the profiler by default: GPU Speed of Light, (kernel) Launch Statistics, and (multiprocessor) Occupancy. The header
+of each section provides a brief description of the metrics being measured within that section.
+
+In comparison with the Nsight Compute CLI profiler, the GUI also provides one or more *body* subsections within the Speed of Light section: the *SOL breakdown*
+and the *SOL chart* for GPU compute/memory utilization. This additional information can be accessed by clicking on the GPU Speed of Light section's dropdown
+arrow button.
+
+![Figure 12](../fig/ncu-ui_rep2.png)
+
+Hovering the mouse pointer over any of the breakdown items provides a description of the corresponding metric in a pop-up tooltip balloon. The user can
+easily customize each section. See the [documentation](https://docs.nvidia.com/nsight-compute/2021.2/CustomizationGuide/index.html#abstract) for further
+details.
+
+By default, Nsight Compute applies all applicable *rules* to the results once the profiling process is complete. The results of applying
+these rules are shown as *Recommendations*. These rule results, often marked by a warning icon, are mostly informative: they warn about performance
+problems and guide the user throughout the performance optimization process. Comments can also be added to each section of the Details page by clicking on the
+comment (balloon) button at the right-hand side of the header tabs. These comments are summarized in the *Comments* page.
+
+The *Source* page incorporates the Shader Assembly (SASS), the high-level code with pertinent metrics, and the [Parallel Thread Execution (PTX) Instruction Set
+Architecture (ISA)](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#abstract). The main focus of the Source page is only on the SASS
+functions that were called during the kernel launch.
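+Note that correlating the SASS with the high-level CUDA C/C++ source in the Source page requires the application to be compiled with embedded source line
+information (the `-lineinfo` flag of `nvcc`). Assuming the vector sum example from the earlier sections (file and executable names here are placeholders),
+the compilation command would resemble:
+
+~~~
+nvcc -O3 -lineinfo -o vectorSum vectorSum.cu
+~~~
+{: .language-bash}
+
+Without `-lineinfo`, the Source page can still display the SASS and PTX views, but the correlation with the high-level source will be unavailable.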
+
+![Figure 13](../fig/ncu-ui_source.png)
+
+At this point, we do not go over the details of the Source page; the interested reader is referred to the [Nsight Compute
+documentation](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#profiler-report-source-page) for further details.
+
+All thread-specific NVTX context data pertinent to each kernel launch is collected in the *NVTX* page if NVTX support is enabled before starting the
+profiling process. The *Raw* page tabulates all collected metrics for each kernel execution and allows exporting the results in the CSV format
+for further investigation.
+
+#### 3.2.2. Importing Report Files in Nsight Compute's GUI Profiler
+
+Similar to the Nsight Systems GUI profiler, which can be used to import the report files generated by its CLI or GUI profiler counterparts,
+the Nsight Compute GUI can also be used to import the report files generated by its CLI/GUI profiler counterparts. The user can import
+report files through the *Open File* button within the *File* menu (or using the `Ctrl+O` keyboard shortcut).
 
-## 5. Example: Matrix Addition
 
+The NVIDIA Nsight Compute [Project files](https://docs.nvidia.com/nsight-compute/2021.2/NsightCompute/index.html#projects) (with the `.ncu-proj` extension)
+can host multiple report files and also incorporate notes and source code for future reference. In any given NVIDIA Nsight Compute session, only one
+Project file can be open, and all collected reports will be assigned to the current project.
 
 {% include links.md %}
\ No newline at end of file
diff --git a/_episodes/02-cuda-memory-model.md b/_episodes/02-cuda-memory-model.md
index 83d4f43..9682923 100644
--- a/_episodes/02-cuda-memory-model.md
+++ b/_episodes/02-cuda-memory-model.md
@@ -474,7 +474,7 @@ Note that this method should only be used within file- and global scopes.
Managed memory can also be allocated dynamically *via* the [`cudaMallocManaged()`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gd228014f19cc0975ebe3e0dd2af6dd1b) function, declared as
 
 ~~~
-cudaError_t cudaMallocManaged (void** devPtr, size_t size, unsigned int flags = cudaMemAttachGlobal);
+cudaError_t cudaMallocManaged(void** devPtr, size_t size, unsigned int flags = cudaMemAttachGlobal);
~~~
{: .language-cuda}
@@ -502,4 +502,4 @@ allocation of the zero-copy memory causes the memory access performance to suffer
 Unified memory system, on the other hand, does not suffer from the aforementioned issues as it can automatically migrate data,
 on demand, between host and device in order to enhance the locality and ultimately, performance.
 
-{% include links.md %}
\ No newline at end of file
+{% include links.md %}
diff --git a/_episodes/03-guidelines.md b/_episodes/03-guidelines.md
new file mode 100644
index 0000000..a4fa7a1
--- /dev/null
+++ b/_episodes/03-guidelines.md
@@ -0,0 +1,109 @@
+---
+title: "Performance Guidelines and Optimization Strategies"
+teaching: 60
+exercises: 0
+questions:
+- "What are the recommended strategies for performance optimization of CUDA applications?"
+- "How can the device utilization be maximized?"
+- "How do streams enable asynchronous concurrent execution?"
objectives:
+- "Overviewing NVIDIA's recommended strategies for performance optimization"
+- "Learning how to maximize the device utilization by exposing parallelism"
+- "Basic familiarity with asynchronous concurrent execution and streams"
keypoints:
+- "Strategies for performance optimization"
+- "Maximization of the device utilization"
+- "Asynchronous concurrent execution and streams"
+---
+
+- [1. Recommended Strategies for Performance Optimization](#1-recommended-strategies-for-performance-optimization)
+  - [1.1. Maximization of the Device Utilization](#11-maximization-of-the-device-utilization)
+  - [1.2. Maximization of the Memory Throughput](#12-maximization-of-the-memory-throughput)
+  - [1.3. Maximization of the Instruction Throughput](#13-maximization-of-the-instruction-throughput)
+  - [1.4. Minimization of the Memory Thrashing](#14-minimization-of-the-memory-thrashing)
+
+## 1.
Recommended Strategies for Performance Optimization
+
+[NVIDIA Performance Guidelines](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#performance-guidelines) offers the
+following basic strategies for the performance optimization of an application:
+
+- Maximization of parallel execution in order to achieve maximum utilization of resources on the device(s)
+- Optimization of the device memory usage in order to maximize the memory throughput
+- Improvement of instruction usage in order to gain maximum instruction throughput, and
+- Minimization of memory thrashing
+
+The achievable performance gains are usually program- and system-dependent. For example, attempts to improve the compute performance of a kernel
+which is mostly limited by its memory accesses will likely not be impactful. As such, all performance optimization efforts
+should be guided by quantitative analysis tools such as the [NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/index.html)
+and [Nsight Compute](https://docs.nvidia.com/nsight-compute/2021.2/index.html) profilers, which offer a wide variety of performance
+metrics for CUDA parallel programs. For instance, the Nsight Compute profiler offers the [GPU Speed of Light section](https://docs.nvidia.com/
+nsight-compute/2021.2/ProfilingGuide/index.html#sections-and-rules) consisting of metrics which provide a high-level overview of the
+GPU's memory and compute throughput in terms of the achieved utilization percentage with respect to the theoretical maximum of
+the metric being measured. As such, these metrics offer a great deal of information indicating how much performance improvement is
+possible for a kernel.
+
+In the following sections, we briefly overview the performance optimization strategies mentioned above.
+
+### 1.1.
Maximization of the Device Utilization
+
+In order to maximize the utilization of resources on the device, the developer must expose the program to as much parallelism across
+the different logical levels of the system as possible. These levels involve: (i) the [application](https://docs.nvidia.com/cuda/
+cuda-c-programming-guide/index.html#application-level), (ii) the [device](https://docs.nvidia.com/cuda/cuda-c-programming-guide/
+index.html#device-level), and (iii) the [multiprocessor](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#multiprocessor-level).
+
+The main goal at the application level is to maximize concurrent execution between the host and the device(s), for example, by adopting
+asynchronous CUDA API functions and streams. As such, one attempts to assign as much parallel work to the device and as much serial work
+to the host as possible.
+
+> ## Note
+>
+> Sometimes the parallelism must be broken for threads to synchronize and share data among themselves.
+> If the threads belong to the same thread block, the synchronization can be performed *via* `__syncthreads()` and the data is
+> shared through the shared memory within a single kernel execution. However, threads from separate blocks must share data
+> *via* different kernel executions *through* the lower-bandwidth global memory. Thus, the second, less performant scenario should
+> be minimized due to the kernel execution overheads and slower global memory transfers.
+{: .discussion}
+
+The following [list of asynchronous CUDA operations](https://docs.nvidia.com/cuda/cuda-c-programming-guide/
+index.html#asynchronous-concurrent-execution) can be performed independently and concurrently:
+
+- host computations
+- device computations
+- host-to-device (HtoD) memory transfer operations
+- device-to-host (DtoH) memory transfer operations
+- memory transfer operations within individual devices
+- memory transfer operations between two or more devices
+
+The CUDA library's asynchronous function calls allow users to dispatch multiple device operations and distribute them in
+queues based on resource availability. Decreasing the device-management workload on the host through such concurrency
+frees it to take part in other simultaneous tasks, which may improve the overall performance.
+
+GPUs with compute capability 2.0 and higher can execute multiple kernels concurrently. The possibility of concurrent
+kernel execution can be queried from the device's property enum variable [`concurrentKernels`](https://docs.nvidia.com/cuda/
+cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_18e2fe2a3b264901816874516af12a097). The maximum number of
+concurrent kernel executions is also dependent on the device's compute capability and can be found in the
+[CUDA Toolkit's documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.
+html#features-and-technical-specifications__technical-specifications-per-compute-capability). In addition to the concurrent
+execution of multiple kernels, data transfer/memory copy operations between the host and the device, as well as intra-device operations,
+can also be executed asynchronously among themselves or with kernel launches. The device's property enum variable
+[`asyncEngineCount`](https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.
+html#structcudaDeviceProp_105a89c028bee8fe480d0f44ddd43357b) can be queried to see whether concurrent kernel execution and
+data transfer are supported on the available device(s).
+
+> ## Note
+>
+> The host memory must be page-locked if it is involved in overlapped memory copy/data transfer operations.
+{: .discussion}
+
+In CUDA applications, concurrent operations, including data transfers and kernel executions, can be handled through
+[**streams**](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams). A stream is a sequence of commands
+which execute in order. The completion of independent commands launched in different streams can be guaranteed *via*
+synchronization commands.
+
+### 1.2. Maximization of the Memory Throughput
+
+### 1.3. Maximization of the Instruction Throughput
+
+### 1.4. Minimization of the Memory Thrashing
+
+{% include links.md %}
diff --git a/fig/ncu-ui_connect.png b/fig/ncu-ui_connect.png
new file mode 100644
index 0000000..5cc26e9
Binary files /dev/null and b/fig/ncu-ui_connect.png differ
diff --git a/fig/ncu-ui_main.png b/fig/ncu-ui_main.png
new file mode 100644
index 0000000..9a84f74
Binary files /dev/null and b/fig/ncu-ui_main.png differ
diff --git a/fig/ncu-ui_rep1.png b/fig/ncu-ui_rep1.png
new file mode 100644
index 0000000..0232eee
Binary files /dev/null and b/fig/ncu-ui_rep1.png differ
diff --git a/fig/ncu-ui_rep2.png b/fig/ncu-ui_rep2.png
new file mode 100644
index 0000000..0001eb7
Binary files /dev/null and b/fig/ncu-ui_rep2.png differ
diff --git a/fig/ncu-ui_sections.png b/fig/ncu-ui_sections.png
new file mode 100644
index 0000000..9a76ae1
Binary files /dev/null and b/fig/ncu-ui_sections.png differ
diff --git a/fig/ncu-ui_source.png b/fig/ncu-ui_source.png
new file mode 100644
index 0000000..c9b30e9
Binary files /dev/null and b/fig/ncu-ui_source.png differ
diff --git a/fig/ncu-ui_summary.png b/fig/ncu-ui_summary.png
new file mode 100644
index 0000000..a354801
Binary files /dev/null and
b/fig/ncu-ui_summary.png differ
diff --git a/fig/nsys-ui_main.png b/fig/nsys-ui_main.png
new file mode 100644
index 0000000..39eb1ce
Binary files /dev/null and b/fig/nsys-ui_main.png differ
diff --git a/fig/nsys-ui_main2.png b/fig/nsys-ui_main2.png
new file mode 100644
index 0000000..125c512
Binary files /dev/null and b/fig/nsys-ui_main2.png differ
diff --git a/fig/nsys-ui_rep1.png b/fig/nsys-ui_rep1.png
new file mode 100644
index 0000000..0e5c404
Binary files /dev/null and b/fig/nsys-ui_rep1.png differ
diff --git a/fig/nsys-ui_rep2.png b/fig/nsys-ui_rep2.png
new file mode 100644
index 0000000..38a7f5e
Binary files /dev/null and b/fig/nsys-ui_rep2.png differ
diff --git a/fig/nsys-ui_rep3.png b/fig/nsys-ui_rep3.png
new file mode 100644
index 0000000..ed58abb
Binary files /dev/null and b/fig/nsys-ui_rep3.png differ
diff --git a/fig/nsys-ui_rep4.png b/fig/nsys-ui_rep4.png
new file mode 100644
index 0000000..3e604e4
Binary files /dev/null and b/fig/nsys-ui_rep4.png differ
diff --git a/reference.md b/reference.md
index a56a058..16058af 100644
--- a/reference.md
+++ b/reference.md
@@ -2,6 +2,16 @@
 layout: reference
 ---
 
+## Blog Posts
+
+1. [Crovella, B. **Using Nsight Compute to Inspect your Kernels** (NVIDIA, 2019)](https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels)
+
+2. [McMillan, S. **Transitioning to Nsight Systems from NVIDIA Visual Profiler / nvprof** (NVIDIA, 2019)](https://developer.nvidia.com/blog/transitioning-nsight-systems-nvidia-visual-profiler-nvprof/)
+
+3. [Wilper, H. **Migrating to NVIDIA Nsight Tools from NVVP and Nvprof** (NVIDIA, 2019)](https://developer.nvidia.com/blog/migrating-nvidia-nsight-tools-nvvp-nvprof/)
+
+4. [Kraus, J. **CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX** (NVIDIA)](https://developer.nvidia.com/blog/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx/)
+
 ## Further Readings
 
 1. [Cheng, J.; Grossman, M.; McKercher, T.
**Professional CUDA C Programming** (Wiley, Indianapolis IN, USA, 2014), ISBN: 978-1-118-73932-7](https://www.wiley.com/en-us/Professional+CUDA+C+Programming-p-9781118739327)