[xla] hlo_computation: compact instructions' vector on Cleanup()
tl;dr: this gives a 1.26x compilation time speedup for a large, dense model in XLA:GPU.

The largest perf leaf seen in profiles of a large, dense model is related to computing the post order. Surprisingly, it is not the DFS itself that's most expensive; rather, most of the time is spent scanning through HloComputation::Instructions() to identify DFS roots. This scan becomes expensive as instructions are removed because the vector holding HloInstructionInfo (introduced in cl/600130708 || 247280ab727) is not shrunk as it flows through the pipeline, forcing us to walk through many deleted "tombstone" entries.

Here is the histogram of the number of tombstones encountered during post order computations for this model:

```
[        1 - 1,536,345) ****************************** (1,300,248)
[1,536,345 - 3,072,690)                                (2)
[3,072,690 - 4,609,034)                                (364)
[4,609,034 - 6,145,378)                                (10,443)
```

To ameliorate this, this CL shrinks the vector periodically, so far only between passes. This is done by running compaction on the vector during HloComputation::Cleanup(), which is called after every pass. The cost of compaction is made proportional to the number of deleted entries by swapping, if needed, each tombstone with the rightmost (within the vector) non-deleted entry.

This brings the number of observed tombstones down significantly:

```
[      1 -   327,699) ****************************** (937,541)
[327,699 -   655,396)                                (308)
[655,396 -   983,094)                                (0)
[983,094 - 1,310,792)                                (1)
```

Note: we could further improve compaction by calling Cleanup() from some passes, instead of just between passes. However, that would not yield a significant gain; at least for this model, scanning the instructions' vector now takes ~1% of total time (vs. ~17% before).

PiperOrigin-RevId: 619057964
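For illustration, here is a minimal sketch of the swap-with-rightmost compaction idea described above. It uses a plain `std::vector<int>` with a sentinel value standing in for deleted entries; `kTombstone` and `CompactTombstones` are hypothetical names, not XLA APIs, and the real HloComputation code operates on HloInstructionInfo entries rather than ints.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sentinel standing in for a deleted ("tombstone") entry. Illustrative only.
constexpr int kTombstone = -1;

// Compacts the vector in place. Each tombstone found while scanning from the
// left is swapped with the rightmost live entry, so the total work done is
// proportional to the number of deleted entries, not the vector's size.
void CompactTombstones(std::vector<int>& v) {
  int right = static_cast<int>(v.size()) - 1;
  for (int left = 0; left < right; ++left) {
    if (v[left] != kTombstone) continue;
    // Walk `right` inward until it points at a live entry.
    while (right > left && v[right] == kTombstone) --right;
    if (right <= left) break;  // Everything from `left` on is a tombstone.
    std::swap(v[left], v[right]);
  }
  // Trim the trailing run of tombstones so later scans never see them.
  while (!v.empty() && v.back() == kTombstone) v.pop_back();
}
```

Note that the swap does not preserve the relative order of the surviving entries; this sketch assumes, as the swap-based scheme implies, that callers do not rely on that order.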