[xla] hlo_computation: compact instructions' vector on Cleanup()
tl;dr: this gives a 1.26x compilation time speedup for a large, dense model in XLA:GPU.

The largest perf leaf seen in profiles of a large, dense model is related to computing the post order. Surprisingly, it is not the DFS itself that's most expensive; rather, most of the time is spent scanning through HloComputation::Instructions() to identify DFS roots. This scan becomes expensive as instructions are removed because the vector holding HloInstructionInfo (introduced in cl/600130708 || 247280ab727) is not shrunk as it flows through the pipeline, forcing us to walk through many deleted "tombstone" entries.

Here is the histogram of the number of tombstones encountered during post order computations for this model:

```
[        1 - 1,536,345) ****************************** (1,300,248)
[1,536,345 - 3,072,690)                                (2)
[3,072,690 - 4,609,034)                                (364)
[4,609,034 - 6,145,378)                                (10,443)
```

To ameliorate this, this CL shrinks the vector periodically, so far only between passes. This is done by running compaction on the vector during HloComputation::Cleanup(), which is called after every pass. The cost of compaction is made proportional to the number of deleted entries by swapping, if needed, each tombstone with the rightmost (within the vector) non-deleted entry.

This brings the number of observed tombstones down significantly:

```
[      1 -   327,699) ****************************** (937,541)
[327,699 -   655,396)                                (308)
[655,396 -   983,094)                                (0)
[983,094 - 1,310,792)                                (1)
```

Note: we could further improve compaction by calling Cleanup() from some passes, instead of just between passes. However, that would not yield a significant gain; at least for this model, scanning the instructions' vector now takes ~1% of total time (vs. ~17% before).

PiperOrigin-RevId: 619057964
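For illustration, here is a minimal sketch of the swap-with-rightmost compaction idea described above. It uses a plain `std::vector<int>` with a sentinel value standing in for deleted entries; `kTombstone` and `CompactTombstones` are hypothetical names, not XLA APIs, and the real HloComputation code operates on HloInstructionInfo entries rather than ints.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sentinel standing in for a deleted ("tombstone") entry. Illustrative only.
constexpr int kTombstone = -1;

// Compacts the vector in place. Each tombstone found while scanning from the
// left is swapped with the rightmost live entry, so the total work done is
// proportional to the number of deleted entries, not the vector's size.
void CompactTombstones(std::vector<int>& v) {
  int right = static_cast<int>(v.size()) - 1;
  for (int left = 0; left < right; ++left) {
    if (v[left] != kTombstone) continue;
    // Walk `right` inward until it points at a live entry.
    while (right > left && v[right] == kTombstone) --right;
    if (right <= left) break;  // Everything from `left` on is a tombstone.
    std::swap(v[left], v[right]);
  }
  // Trim the trailing run of tombstones so later scans never see them.
  while (!v.empty() && v.back() == kTombstone) v.pop_back();
}
```

Note that the swap does not preserve the relative order of the surviving entries; this sketch assumes, as the swap-based scheme implies, that callers do not rely on that order.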