Performance Investigation
- Make sure you are using a RelWithDebInfo build. (A Debug build is significantly slower and should not be used for benchmarking or performance investigation.)
- Turn on logging by setting `self._verbosity = Verbosity.INFO` (or even `Verbosity.VERBOSE`) in ortmodule.py.
- Turn on model dumping by setting `self._save_onnx = True` and `self._save_onnx_prefix = '<MODEL NAME>'` in ortmodule.py.
- Turn on dumping of the optimized graph by adding `session_options.optimized_model_filepath = '<MODEL NAME>_optimized'` in ortmodule.py. (The sketch after this list pulls these edits together.)
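Pulled together, the edits above might look like this inside ortmodule.py. The attribute and option names are exactly the ones listed above; `my_model` is a placeholder prefix you choose.

```python
# Sketch of the debugging edits in ortmodule.py (attribute names as listed above;
# 'my_model' is a placeholder prefix).
self._verbosity = Verbosity.INFO            # or Verbosity.VERBOSE for more detail
self._save_onnx = True                      # dump the exported and training graphs
self._save_onnx_prefix = 'my_model'         # prefix used for the dumped *.onnx files

# Where the session options are built, also dump the final optimized graph:
session_options.optimized_model_filepath = 'my_model_optimized'
```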
Notice:

- `*_inference.onnx` is the ONNX model coming directly out of the exporter, without any graph transformation.
- `*_inference_optimized.onnx` is the optimized version of that inference graph (you need to dump this one explicitly).
- `*_training.onnx` is the training graph built on top of the `*_inference_optimized.onnx` graph.
- `*_optimized.onnx` is the final optimized training graph, i.e. the actual graph executed by the execution engine.
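Once the files above are dumped, a quick way to get oriented is to load one with the `onnx` Python package and look at its node mix. The file name below is a placeholder built from whatever prefix you configured.

```python
# Sketch: summarize the op types in a dumped graph (file name is a placeholder).
from collections import Counter

import onnx

model = onnx.load("my_model_training.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)
for op_type, count in op_counts.most_common():
    print(f"{count:5d}  {op_type}")
```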
- Excessive memcpy nodes
  - Action: look for 'memcpy' nodes in `*_optimized.onnx` (a sketch for scanning the graph follows this item).
  - If the CUDA kernel is missing for an op, you will commonly see that node sandwiched between MemcpyToHost and MemcpyFromHost nodes.
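One way to act on this is a small scan of `*_optimized.onnx` for the MemcpyToHost/MemcpyFromHost op types mentioned above; the file name is again a placeholder.

```python
# Sketch: count Memcpy nodes in the final optimized training graph and show which
# ops consume their outputs (file name is a placeholder).
import onnx

model = onnx.load("my_model_optimized.onnx")
nodes = list(model.graph.node)
memcpy_nodes = [n for n in nodes if n.op_type in ("MemcpyToHost", "MemcpyFromHost")]
print(f"{len(memcpy_nodes)} memcpy nodes out of {len(nodes)} nodes total")

for n in memcpy_nodes:
    # A CPU-only op typically shows up between a MemcpyToHost/MemcpyFromHost pair.
    consumers = [m.op_type for m in nodes if any(out in m.input for out in n.output)]
    print(f"{n.op_type} ({n.name}) -> feeds {consumers}")
```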
- nvprof
  - Try running with/without `--print-gpu-summary`.
  - Try `--profile-child-processes`.
  - Action: profile a training run (a sketch of scripting this launch follows).
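If you prefer to script the "profile a training run" action, a launch along these lines wraps a training script with nvprof and the two flags mentioned above; `train.py` is a placeholder for your actual entry point.

```python
# Sketch: run a training script under nvprof ('train.py' is a placeholder).
import subprocess

cmd = [
    "nvprof",
    "--profile-child-processes",  # also profile worker processes spawned by the launcher
    "--print-gpu-summary",        # print a per-kernel GPU time summary when profiling ends
    "python", "train.py",
]
subprocess.run(cmd, check=True)
```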
- Visual Profiler UI
  - Use the ruler to measure a time span.
  - Identify the top hitters among the kernels.
  - Compare two sets of profiling results to identify the performance gap.
  - Can you identify the start/end of a train_step from the timeline view?
- torch profiler (see the sketch below)
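A minimal sketch of profiling one train step with the PyTorch profiler follows; the tiny model, optimizer, and inputs are placeholders for your real ORTModule-wrapped model and data.

```python
# Sketch: profile one train step with torch.profiler (model/data are placeholders).
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(128, 128)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(32, 128)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, inputs = model.cuda(), inputs.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    loss = model(inputs).sum()   # forward
    loss.backward()              # backward
    optimizer.step()             # optimizer update

# Top operators by self time; compare the hitters here with the nvprof view above.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```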
- Linux perf
Please use the learning roadmap on the home wiki page to build a general understanding of ORT.