
Performance Investigation


Preparation Step

  • Make sure you are using a RelWithDebInfo build. (A Debug build is significantly slower and shouldn't be used for benchmarking or performance investigation.)
  • Turn on logging by setting self._verbosity = Verbosity.INFO (or even Verbosity.VERBOSE) in ortmodule.py.
  • Turn on model dumping by setting self._save_onnx = True and self._save_onnx_prefix = '<MODEL NAME>' in ortmodule.py.
  • Turn on dumping of the optimized graph by adding session_options.optimized_model_filepath = '<MODEL NAME>_optimized' in ortmodule.py (the sketch below collects these edits in one place).
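
A minimal sketch of the edits above, collected in one place. This is not a standalone script: these lines go inside ortmodule.py, the attribute names reflect ortmodule.py as of this page's writing (later versions may expose a different configuration API), and '<MODEL NAME>' is a placeholder for whatever prefix you choose.

```python
# Sketch of the ortmodule.py edits described above (not a standalone script).
# Attribute names are as of this page's writing; '<MODEL NAME>' is a placeholder.

# Raise log verbosity so graph construction and execution details are printed.
self._verbosity = Verbosity.INFO        # or Verbosity.VERBOSE for even more detail

# Dump the exported / transformed / training ONNX graphs to disk.
self._save_onnx = True
self._save_onnx_prefix = '<MODEL NAME>'

# Dump the final optimized training graph that the execution engine actually runs.
session_options.optimized_model_filepath = '<MODEL NAME>_optimized'
```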

Notice
  • *_inference.onnx is the ONNX model coming directly out of the exporter, before any graph transformations.
  • *_inference_optimized.onnx is the inference graph after graph transformations have been applied (dumping must be enabled, as described above, to get this file).
  • *_training.onnx is the training graph built on top of the *_inference_optimized.onnx graph.
  • *_optimized.onnx is the final optimized training graph, i.e. the actual graph executed by the execution engine.

Common performance problems

  • Excessive memcpy nodes
    Action: Look for 'Memcpy' nodes in *_optimized.onnx (see the sketch below).
    • If the CUDA kernel for an op is missing, you will commonly see that node sandwiched between MemcpyToHost and MemcpyFromHost nodes.
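
A minimal sketch of that check, assuming the onnx Python package is installed; '<MODEL NAME>_optimized.onnx' is a placeholder for whatever file name was configured above.

```python
# Count and locate Memcpy nodes in the dumped optimized training graph.
# '<MODEL NAME>_optimized.onnx' is a placeholder for the file dumped above.
import onnx
from collections import Counter

model = onnx.load('<MODEL NAME>_optimized.onnx')

memcpy_nodes = [n for n in model.graph.node
                if n.op_type in ('MemcpyToHost', 'MemcpyFromHost')]
print(Counter(n.op_type for n in memcpy_nodes))

# Print the inputs/outputs of each Memcpy node; the op sandwiched between a
# MemcpyToHost/MemcpyFromHost pair is usually the one missing a CUDA kernel.
for n in memcpy_nodes:
    print(n.op_type, list(n.input), '->', list(n.output))
```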

Profiling Tools

  • nvprof

    • try running with and without --print-gpu-summary
    • try --profile-child-processes
    • Action: profile a training run
  • Visual Profiler UI

    • Use the ruler to measure a time span
    • Identify the top hitters among kernels
    • Compare two sets of profiling results to identify the performance gap
    • Can you identify the start/end of a train_step from the timeline view?
  • torch profiler (see the sketch after this list)

  • Linux perf
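
For the torch profiler item above, a minimal sketch using torch.profiler (available since PyTorch 1.8); `model`, `loss_fn`, `optimizer`, and `train_loader` are placeholders for your own training loop.

```python
# Minimal sketch: profile a handful of training steps with torch.profiler.
# `model`, `loss_fn`, `optimizer`, and `train_loader` are placeholders.
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for step, (inputs, labels) in enumerate(train_loader):
        if step >= 10:           # a few steps are enough for a first look
            break
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Top CUDA-time consumers; compare this table across runs to spot regressions.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```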