
Performance Investigation


Preparation Step

  • Make sure you are using a RelWithDebInfo build. (A Debug build is significantly slower and shouldn't be used for benchmarking or performance investigation.)
  • Turn on logging by setting self._verbosity = Verbosity.INFO (or even Verbosity.VERBOSE) in ortmodule.py.
  • Turn on model dumping by setting self._save_onnx = True and self._save_onnx_prefix = '<MODEL NAME>' in ortmodule.py.
  • Turn on dumping of the optimized graph by adding session_options.optimized_model_filepath = '<MODEL NAME>_optimized' in ortmodule.py (the sketch below collects these edits in one place).
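
A minimal sketch of the edits above, collected in one place. This is not a standalone script: these lines go inside ortmodule.py, the attribute names reflect ortmodule.py as of this page's writing (later versions may expose a different configuration API), and '<MODEL NAME>' is a placeholder for whatever prefix you choose.

```python
# Sketch of the ortmodule.py edits described above (not a standalone script).
# Attribute names are as of this page's writing; '<MODEL NAME>' is a placeholder.

# Raise log verbosity so graph construction and execution details are printed.
self._verbosity = Verbosity.INFO        # or Verbosity.VERBOSE for even more detail

# Dump the exported / transformed / training ONNX graphs to disk.
self._save_onnx = True
self._save_onnx_prefix = '<MODEL NAME>'

# Dump the final optimized training graph that the execution engine actually runs.
session_options.optimized_model_filepath = '<MODEL NAME>_optimized'
```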

Notice
  • *_inference.onnx is the ONNX model coming directly out of the exporter, before any graph transformations.
  • *_inference_optimized.onnx is the inference graph after graph transformations have been applied (dumping must be enabled, as described above, to get this file).
  • *_training.onnx is the training graph built on top of the *_inference_optimized.onnx graph.
  • *_optimized.onnx is the final optimized training graph, i.e. the actual graph executed by the execution engine.

Common performance problems

  • Excessive memcpy nodes
    Action: Look for 'Memcpy' nodes in *_optimized.onnx (see the sketch below).
    • If the CUDA kernel for an op is missing, you will commonly see that node sandwiched between MemcpyToHost and MemcpyFromHost nodes.
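
A minimal sketch of that check, assuming the onnx Python package is installed; '<MODEL NAME>_optimized.onnx' is a placeholder for whatever file name was configured above.

```python
# Count and locate Memcpy nodes in the dumped optimized training graph.
# '<MODEL NAME>_optimized.onnx' is a placeholder for the file dumped above.
import onnx
from collections import Counter

model = onnx.load('<MODEL NAME>_optimized.onnx')

memcpy_nodes = [n for n in model.graph.node
                if n.op_type in ('MemcpyToHost', 'MemcpyFromHost')]
print(Counter(n.op_type for n in memcpy_nodes))

# Print the inputs/outputs of each Memcpy node; the op sandwiched between a
# MemcpyToHost/MemcpyFromHost pair is usually the one missing a CUDA kernel.
for n in memcpy_nodes:
    print(n.op_type, list(n.input), '->', list(n.output))
```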

Profiling Tools

  • nvprof

    • try running with and without --print-gpu-summary
    • try --profile-child-processes
    • Action: profile a training run
  • Visual Profiler UI

    • Use the ruler to measure a time span
    • Identify the top hitters among kernels
    • Compare two sets of profiling results to identify the performance gap
    • Can you identify the start/end of a train_step from the timeline view?
  • torch profiler (see the sketch after this list)

  • Linux perf
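
For the torch profiler item above, a minimal sketch using torch.profiler (available since PyTorch 1.8); `model`, `loss_fn`, `optimizer`, and `train_loader` are placeholders for your own training loop.

```python
# Minimal sketch: profile a handful of training steps with torch.profiler.
# `model`, `loss_fn`, `optimizer`, and `train_loader` are placeholders.
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for step, (inputs, labels) in enumerate(train_loader):
        if step >= 10:           # a few steps are enough for a first look
            break
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Top CUDA-time consumers; compare this table across runs to spot regressions.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```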