[xla] hlo_computation: compact instructions' vector on Cleanup()
tl;dr: this gives a 1.26x compilation time speedup for a large, dense
model in XLA:GPU.

The largest perf leaf seen in profiles of a large, dense model
is related to computing the post order. Surprisingly, it is not
the DFS itself that is most expensive; rather, most of the time is
spent on scanning through HloComputation::Instructions() to identify
DFS roots.

The reason this scan becomes expensive as instructions are removed
is that the vector holding HloInstructionInfo (introduced in
cl/600130708 || 247280ab727)
is not shrunk as it flows through the pipeline, forcing us to
walk through many deleted "tombstone" entries. Here is the
histogram of the number of tombstones encountered during post order
computations for this model:

```
[        1 - 1,536,345) ****************************** (1,300,248)
[1,536,345 - 3,072,690)  (2)
[3,072,690 - 4,609,034)  (364)
[4,609,034 - 6,145,378)  (10,443)
```
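
To illustrate why these counts matter, here is a minimal sketch of such a scan. `Slot`, `ScanStats`, and `ScanInstructions` are hypothetical names, not XLA's API; the sketch only assumes that a deleted entry is represented by a null pointer. Every tombstone still costs one loop iteration even though it contributes nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for HloInstructionInfo: a null `inst` is a tombstone.
struct Slot {
  const void* inst;
};

struct ScanStats {
  std::size_t live;
  std::size_t tombstones;
};

// Walks the whole vector, as a scan for DFS roots would. The loop runs once
// per slot, so its cost grows with live entries *plus* tombstones.
ScanStats ScanInstructions(const std::vector<Slot>& slots) {
  ScanStats stats{0, 0};
  for (const Slot& slot : slots) {
    if (slot.inst == nullptr) {
      ++stats.tombstones;  // Wasted iteration: nothing to visit here.
    } else {
      ++stats.live;
    }
  }
  assert(stats.live + stats.tombstones == slots.size());
  return stats;
}
```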

To ameliorate this, this CL shrinks the vector periodically,
so far only between passes. This is done by running compaction
on the vector during HloComputation::Cleanup(), which is called
after every pass. The cost of compaction is made proportional to
the number of deleted entries by swapping--if needed--each tombstone
with the rightmost (within the vector) non-deleted entry.
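
The compaction scheme described above can be sketched in isolation as follows. This is an index-based simplification under stated assumptions, not the code in this CL: `Entry`, `Compact`, and `index_of` are hypothetical names, and the caller is assumed to supply the indices of all tombstones.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for HloInstructionInfo: a null `inst` is a tombstone.
struct Entry {
  int* inst;
};

// Compacts `entries` in place. `tombstones` holds the index of every dead
// slot; `index_of` is the reverse mapping from live pointer to slot index.
// The work done is proportional to tombstones.size(), not entries.size().
void Compact(std::vector<Entry>& entries,
             const std::vector<std::size_t>& tombstones,
             std::unordered_map<int*, std::size_t>& index_of) {
  if (tombstones.empty()) return;
  assert(tombstones.size() <= entries.size());

  const std::size_t live = entries.size() - tombstones.size();
  std::size_t right = entries.size();  // One past the next candidate.
  for (std::size_t hole : tombstones) {
    assert(entries[hole].inst == nullptr);
    // Holes at or beyond `live` are dropped by the final resize.
    if (hole >= live) continue;

    // Find the rightmost live entry not yet moved. One always exists in
    // [live, size): each hole below `live` displaces exactly one of them.
    do {
      --right;
    } while (entries[right].inst == nullptr);

    // Overwrite the hole and update the reverse mapping.
    entries[hole] = entries[right];
    entries[right].inst = nullptr;
    index_of[entries[hole].inst] = hole;
  }
  entries.resize(live);
}
```

Note that this kind of compaction does not preserve the relative order of the surviving entries, since live entries are moved from the tail into earlier holes.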

This brings the number of seen tombstones down significantly:

```
[        1 -   327,699) ****************************** (937,541)
[  327,699 -   655,396)  (308)
[  655,396 -   983,094)  (0)
[  983,094 - 1,310,792)  (1)
```

Note: we could further improve compaction by calling Cleanup()
from some passes, instead of just between passes. However, that
would not yield a significant gain; at least for this model,
scanning the instructions' vector now takes ~1% of total time
(vs. ~17% before).
PiperOrigin-RevId: 619057964
cota authored and copybara-github committed Apr 4, 2024
1 parent 55cdde9 commit b7a78fd
Showing 2 changed files with 63 additions and 11 deletions.
59 changes: 57 additions & 2 deletions xla/hlo/ir/hlo_computation.cc
@@ -167,10 +167,10 @@ HloComputation::~HloComputation() {
     CHECK(async_start_->async_wrapped_computation() == this);
     async_start_->ClearCalledComputations();
   }
+  Cleanup();
   for (const auto& i : instructions_) {
     delete i.inst();
   }
-  Cleanup();
 }

 void HloComputation::SetInstruction(HloInstruction* instruction,
@@ -472,10 +472,65 @@ Status HloComputation::RemoveInstructionImpl(HloInstruction* instruction,
   info->inst_ =
       nullptr;  // Leave a hole: this is no longer part of "instructions()"
   instruction_indices_.erase(inst_it);
-  instruction->index_in_parent_ = ~0u;
+  DCHECK_EQ(instructions_.size() - to_be_deleted_.size(),
+            instruction_indices_.size())
+      << "instructions_.size(): " << instructions_.size()
+      << ", to_be_deleted_.size(): " << to_be_deleted_.size();
   return OkStatus();
 }

+void HloComputation::Cleanup() {
+  if (to_be_deleted_.empty()) return;
+
+  // Given that there are instructions to be deleted, there must be at least one
+  // instruction not marked for deletion. Otherwise we have deleted *all*
+  // instructions, which is probably a bug.
+  DCHECK(!instruction_indices_.empty());
+
+  // Replacement, i.e. the rightmost "unmarked" (a.k.a. not marked for deletion)
+  // entry in the vector.
+  HloInstructionInfo* replacement = &instructions_.back();
+  for (HloInstruction* marked_instruction : to_be_deleted_) {
+    int marked_index = marked_instruction->index_in_parent_;
+    HloInstructionInfo* marked = &instructions_[marked_index];
+    DCHECK(marked->inst() == nullptr);
+
+    delete marked_instruction;
+
+    // Find the first unmarked entry to the left of 'replacement', if needed.
+    while (replacement >= instructions_.data() &&
+           replacement->inst() == nullptr) {
+      --replacement;
+    }
+    DCHECK_GE(replacement, instructions_.data());
+
+    // Nothing to do if 'marked' is already to the right of 'replacement'.
+    if (marked > replacement) continue;
+
+    // Replace the marked entry with the unmarked one.
+    HloInstruction* unmarked_instruction = replacement->inst();
+    int unmarked_index = marked_index;
+    // Small optimization: instead of std::swap(), just overwrite *marked. This
+    // requires us to also decrement 'replacement' to avoid reusing the
+    // unmarked entry we just copied.
+    *marked = *replacement;
+    --replacement;
+
+    // Update reverse mapping.
+    auto it = instruction_indices_.find(unmarked_instruction);
+    DCHECK(it != instruction_indices_.end());
+    it->second = unmarked_index;
+    unmarked_instruction->index_in_parent_ = unmarked_index;
+  }
+
+  DCHECK_EQ(instructions_.size() - to_be_deleted_.size(),
+            instruction_indices_.size())
+      << "instructions_.size(): " << instructions_.size()
+      << ", to_be_deleted_.size(): " << to_be_deleted_.size();
+  to_be_deleted_.clear();
+  instructions_.resize(instruction_indices_.size());
+}
+
 void HloComputation::set_root_instruction(HloInstruction* new_root_instruction,
                                           bool accept_different_shape) {
   // The shape of the root (ignoring layout) is an invariant of the computation
15 changes: 6 additions & 9 deletions xla/hlo/ir/hlo_computation.h
@@ -841,16 +841,13 @@ class HloComputation {
     return execution_thread_ == HloInstruction::kMainExecutionThread;
   }

-  // Deallocate instructions that are marked by "RemoveInstruction". The two
-  // stage clean up process is designed such that HloPass can have stable
-  // internal pointers to HloInstructions while we create and remove
+  // Deallocates instructions that are marked by "RemoveInstruction" and
+  // compacts the instructions_ vector by removing the deleted instructions'
+  // entries (a.k.a. tombstones).
+  // This two-stage clean up process is designed such that HloPass can have
+  // stable internal pointers to HloInstructions while we create and remove
   // HloInstructions in a pass.
-  void Cleanup() {
-    for (HloInstruction* it : to_be_deleted_) {
-      delete it;
-    }
-    to_be_deleted_.clear();
-  }
+  void Cleanup();

   // Returns true if a given instruction is marked dead in this computation.
   bool IsMarkedAsDead(const HloInstruction* inst);
