[BOLT][AArch64] Introduce SPE mode in BasicAggregation #120741

paschalis-mpeis · 2024-12-20T14:58:41Z

BOLT gains the ability to process Arm SPE data using the BasicAggregation format.

Example usage is:

perf2bolt -p perf.data -o perf.boltdata --nl --spe BINARY

New branch data and compatibility:

Since Linux 6.13, perf reports SPE branches as pairs (PC → TGT), where:

PC:
- is the source branch, which may be taken or not-taken.
- must not be filtered: because of how SPE operates and what it can collect, any filtering on the
  PC (e.g., keeping only the taken branches) would lose data that
  BOLT cannot later infer.
TGT:
- is the target address of the destination block.
- is the new information that perf can now report.

DataAggregator processes each branch pair by creating two basic samples, one for PC and one for TGT.
All other event types have the ADDR field set to 0x0, and a single sample
is created for each of them.
Such events can be either SPE or non-SPE, for example l1d-access and cycles respectively.

The format of the input perf entries is:

PID   EVENT-TYPE   ADDR   IP
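
To make the one-or-two-samples rule concrete, here is a minimal standalone sketch. It is not the code in this patch: `BasicSample` and `parseSpeLine` are hypothetical names, and the PID filtering and address adjustment that the real parser performs are left out.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for BOLT's PerfBasicSample; not the real type.
struct BasicSample {
  std::string Event;
  uint64_t PC;
};

// Split one "PID EVENT-TYPE ADDR IP" line into one or two basic samples.
// Returns an empty vector when the line does not match that layout.
std::vector<BasicSample> parseSpeLine(const std::string &Line) {
  std::istringstream In(Line);
  long PID = 0; // parsed but unused here; the real parser filters on it
  std::string Event;
  uint64_t Addr = 0, IP = 0;
  // PID is decimal, ADDR and IP are hexadecimal without a 0x prefix.
  if (!(In >> std::dec >> PID >> Event >> std::hex >> Addr >> IP))
    return {};
  (void)PID;

  std::vector<BasicSample> Samples;
  Samples.push_back({Event, IP}); // the sampled instruction (branch source)
  if (Addr != 0)                  // non-zero ADDR: an SPE branch target
    Samples.push_back({Event, Addr});
  return Samples;
}
```

A line such as `1234 instructions: a002 a001` yields two samples (0xa001 and 0xa002), while `1234 cycles:u: 0 b001` yields one.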

When in SPE mode, if:

- the host is not AArch64, BOLT will exit with a relevant message
- the ADDR field is unavailable, BOLT will exit with a relevant message
- no branch pairs were recorded, BOLT will present a warning
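
The three outcomes above, plus the `--nl` requirement enforced in the diff, can be summarized as a small decision function. This is only a sketch: `SpeModeInputs`, `Diagnostic`, and `checkSpeMode` are hypothetical names, and the real checks are spread across `llvm-bolt.cpp` and `DataAggregator.cpp`. Note that the diff tests the architecture of the input binary's `BinaryContext`, which the sketch models as `TargetIsAArch64`. The message strings are copied from the diff.

```cpp
#include <string>

// Hypothetical inputs; field names are illustrative, not BOLT options.
struct SpeModeInputs {
  bool BasicAggregation;   // --nl was passed
  bool TargetIsAArch64;    // the input binary is AArch64
  bool AddrFieldAvailable; // perf script could print the 'addr' field
  unsigned BranchPairs;    // entries seen with a non-zero ADDR
};

enum class Severity { Ok, Warning, Error };

struct Diagnostic {
  Severity Level;
  std::string Message;
};

// Apply the checks in order; the first failing one decides the result.
Diagnostic checkSpeMode(const SpeModeInputs &In) {
  if (!In.TargetIsAArch64)
    return {Severity::Error, "BOLT-ERROR: -spe is available only on AArch64."};
  if (!In.BasicAggregation)
    return {Severity::Error,
            "PERF2BOLT-ERROR: Arm SPE mode is combined only with "
            "BasicAggregation."};
  if (!In.AddrFieldAvailable)
    return {Severity::Error,
            "PERF2BOLT-ERROR: perf data are incompatible for Arm SPE mode "
            "consumption. ADDR attribute is unset."};
  if (In.BranchPairs == 0)
    return {Severity::Warning, "PERF2BOLT-WARNING: no SPE branches found"};
  return {Severity::Ok, ""};
}
```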

Examples of generating profiling data for the SPE mode:

Profiles can be captured with perf on AArch64 machines with SPE enabled.
They can be combined with other events, SPE or not.
In the future we might restrict processing to just the branch packets.

Capture only SPE branch data events:

perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY

Add more filters and some jitter, and specify a count to trade off overhead against profile quality:

perf record -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/u' -c 10007 -- BINARY

Capture any SPE events:

perf record -e 'arm_spe_0//u' -- BINARY

Capture any SPE events and cycles:

perf record -e 'arm_spe_0//u' -e cycles:u -- BINARY

llvmbot · 2024-12-20T14:59:19Z

@llvm/pr-subscribers-bolt

Author: Paschalis Mpeis (paschalis-mpeis)

Changes

BOLT gains the ability to process branch target information generated by
Arm SPE data, using the BasicAggregation format.

Example usage is:

perf2bolt -p perf.data -o perf.boltdata --nl --spe BINARY

New branch data and compatibility:

SPE branch entries in perf data contain a branch pair (IP -> ADDR)
for the source and destination branches. DataAggregator processes those
by creating two basic samples. Any other event types will have ADDR
field set to 0x0. For those a single sample will be created. Such
events can be either SPE or non-SPE, like l1d-access and cycles
respectively.

The format of the input perf entries is:

PID   EVENT-TYPE   ADDR   IP

When in SPE mode, if:

- the host is not AArch64, BOLT will exit with a relevant message
- the ADDR field is unavailable, BOLT will exit with a relevant message
- no branch pairs were recorded, BOLT will present a warning

Examples of generating profiling data for the SPE mode:

Profiles can be captured with perf on AArch64 machines with SPE enabled.
They can be combined with other events, SPE or not.

Capture only SPE branch data events:

perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY

Capture any SPE events:

perf record -e 'arm_spe_0//u' -- BINARY

Capture any SPE events and cycles:

perf record -e 'arm_spe_0//u' -e cycles:u -- BINARY

Add more filters and some jitter, and specify a count to trade off overhead against profile quality:

perf record -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/u' -c 10007 -- BINARY

Full diff: https://github.com/llvm/llvm-project/pull/120741.diff

7 Files Affected:

(modified) bolt/include/bolt/Profile/DataAggregator.h (+14)
(modified) bolt/lib/Profile/DataAggregator.cpp (+132-11)
(added) bolt/test/perf2bolt/AArch64/perf2bolt-spe.test (+14)
(added) bolt/test/perf2bolt/X86/perf2bolt-spe.test (+9)
(modified) bolt/tools/driver/llvm-bolt.cpp (+9)
(modified) bolt/unittests/Profile/CMakeLists.txt (+14)
(added) bolt/unittests/Profile/PerfSpeEvents.cpp (+173)

diff --git a/bolt/include/bolt/Profile/DataAggregator.h b/bolt/include/bolt/Profile/DataAggregator.h
index 320623cfa15af1..be6e0fbd6347a0 100644
--- a/bolt/include/bolt/Profile/DataAggregator.h
+++ b/bolt/include/bolt/Profile/DataAggregator.h
@@ -78,6 +78,8 @@ class DataAggregator : public DataReader {
   static bool checkPerfDataMagic(StringRef FileName);
 
 private:
+  friend struct PerfSpeEventsTestHelper;
+
   struct PerfBranchSample {
     SmallVector<LBREntry, 32> LBR;
     uint64_t PC;
@@ -294,6 +296,15 @@ class DataAggregator : public DataReader {
   /// and a PC
   ErrorOr<PerfBasicSample> parseBasicSample();
 
+  /// Parse an Arm SPE entry into the non-lbr format by generating two basic
+  /// samples. The format of an input SPE entry is:
+  /// ```
+  /// PID   EVENT-TYPE   ADDR   IP
+  /// ```
+  /// SPE branch events will have 'ADDR' set to a branch target address while
+  /// other perf or SPE events will have it set to zero.
+  ErrorOr<std::pair<PerfBasicSample,PerfBasicSample>> parseSpeAsBasicSamples();
+
   /// Parse a single perf sample containing a PID associated with an IP and
   /// address.
   ErrorOr<PerfMemSample> parseMemSample();
@@ -343,6 +354,9 @@ class DataAggregator : public DataReader {
   /// Process non-LBR events.
   void processBasicEvents();
 
+  /// Parse Arm SPE events into the non-LBR format.
+  std::error_code parseSpeAsBasicEvents();
+
   /// Parse the full output generated by perf script to report memory events.
   std::error_code parseMemEvents();
 
diff --git a/bolt/lib/Profile/DataAggregator.cpp b/bolt/lib/Profile/DataAggregator.cpp
index 2b02086e3e0c99..7038ca5b1452ab 100644
--- a/bolt/lib/Profile/DataAggregator.cpp
+++ b/bolt/lib/Profile/DataAggregator.cpp
@@ -49,6 +49,13 @@ static cl::opt<bool>
                      cl::desc("aggregate basic samples (without LBR info)"),
                      cl::cat(AggregatorCategory));
 
+cl::opt<bool> ArmSPE(
+    "spe",
+    cl::desc(
+        "Enable Arm SPE mode. Used in conjunction with no-lbr mode, ie `--spe "
+        "--nl`"),
+    cl::cat(AggregatorCategory));
+
 static cl::opt<std::string>
     ITraceAggregation("itrace",
                       cl::desc("Generate LBR info with perf itrace argument"),
@@ -180,11 +187,19 @@ void DataAggregator::start() {
 
   findPerfExecutable();
 
-  if (opts::BasicAggregation) {
-    launchPerfProcess("events without LBR",
-                      MainEventsPPI,
+  if (opts::ArmSPE) {
+    if (!opts::BasicAggregation) {
+      errs() << "PERF2BOLT-ERROR: Arm SPE mode is combined only with "
+                "BasicAggregation.\n";
+      exit(1);
+    }
+    launchPerfProcess("branch events with SPE", MainEventsPPI,
+                      "script -F pid,event,ip,addr --itrace=i1i",
+                      /*Wait = */ false);
+  } else if (opts::BasicAggregation) {
+    launchPerfProcess("events without LBR", MainEventsPPI,
                       "script -F pid,event,ip",
-                      /*Wait = */false);
+                      /*Wait = */ false);
   } else if (!opts::ITraceAggregation.empty()) {
     std::string ItracePerfScriptArgs = llvm::formatv(
         "script -F pid,ip,brstack --itrace={0}", opts::ITraceAggregation);
@@ -192,10 +207,9 @@ void DataAggregator::start() {
                       ItracePerfScriptArgs.c_str(),
                       /*Wait = */ false);
   } else {
-    launchPerfProcess("branch events",
-                      MainEventsPPI,
+    launchPerfProcess("branch events", MainEventsPPI,
                       "script -F pid,ip,brstack",
-                      /*Wait = */false);
+                      /*Wait = */ false);
   }
 
   // Note: we launch script for mem events regardless of the option, as the
@@ -531,14 +545,20 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
               "not read one from input binary\n";
   }
 
-  auto ErrorCallback = [](int ReturnCode, StringRef ErrBuf) {
+  const Regex NoData("Samples for '.*' event do not have ADDR attribute set. "
+                     "Cannot print 'addr' field.");
+
+  auto ErrorCallback = [&NoData](int ReturnCode, StringRef ErrBuf) {
+    if (opts::ArmSPE && NoData.match(ErrBuf)) {
+      errs() << "PERF2BOLT-ERROR: perf data are incompatible for Arm SPE mode "
+                "consumption. ADDR attribute is unset.\n";
+      exit(1);
+    }
     errs() << "PERF-ERROR: return code " << ReturnCode << "\n" << ErrBuf;
     exit(1);
   };
 
   auto MemEventsErrorCallback = [&](int ReturnCode, StringRef ErrBuf) {
-    Regex NoData("Samples for '.*' event do not have ADDR attribute set. "
-                 "Cannot print 'addr' field.");
     if (!NoData.match(ErrBuf))
       ErrorCallback(ReturnCode, ErrBuf);
   };
@@ -579,7 +599,8 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
     exit(0);
   }
 
-  if ((!opts::BasicAggregation && parseBranchEvents()) ||
+  if (((!opts::BasicAggregation && !opts::ArmSPE) && parseBranchEvents()) ||
+      (opts::BasicAggregation && opts::ArmSPE && parseSpeAsBasicEvents()) ||
       (opts::BasicAggregation && parseBasicEvents()))
     errs() << "PERF2BOLT: failed to parse samples\n";
 
@@ -1226,6 +1247,66 @@ ErrorOr<DataAggregator::PerfBasicSample> DataAggregator::parseBasicSample() {
   return PerfBasicSample{Event.get(), Address};
 }
 
+ErrorOr<
+    std::pair<DataAggregator::PerfBasicSample, DataAggregator::PerfBasicSample>>
+DataAggregator::parseSpeAsBasicSamples() {
+  while (checkAndConsumeFS()) {
+  }
+
+  ErrorOr<int64_t> PIDRes = parseNumberField(FieldSeparator, true);
+  if (std::error_code EC = PIDRes.getError())
+    return EC;
+
+  constexpr PerfBasicSample EmptySample = PerfBasicSample{StringRef(), 0};
+  auto MMapInfoIter = BinaryMMapInfo.find(*PIDRes);
+  if (MMapInfoIter == BinaryMMapInfo.end()) {
+    consumeRestOfLine();
+    return std::make_pair(EmptySample, EmptySample);
+  }
+
+  while (checkAndConsumeFS()) {
+  }
+
+  ErrorOr<StringRef> Event = parseString(FieldSeparator);
+  if (std::error_code EC = Event.getError())
+    return EC;
+
+  while (checkAndConsumeFS()) {
+  }
+
+  ErrorOr<uint64_t> AddrResTo = parseHexField(FieldSeparator);
+  if (std::error_code EC = AddrResTo.getError())
+    return EC;
+  consumeAllRemainingFS();
+
+  ErrorOr<uint64_t> AddrResFrom = parseHexField(FieldSeparator, true);
+  if (std::error_code EC = AddrResFrom.getError())
+    return EC;
+
+  if (!checkAndConsumeNewLine()) {
+    reportError("expected end of line");
+    return make_error_code(llvm::errc::io_error);
+  }
+
+  auto genBasicSample = [&](uint64_t Address) {
+    // When fed with non SPE branch events the target address will be null.
+    // This is expected and ignored.
+    if (Address == 0x0)
+      return EmptySample;
+
+    if (!BC->HasFixedLoadAddress)
+      adjustAddress(Address, MMapInfoIter->second);
+    return PerfBasicSample{Event.get(), Address};
+  };
+
+  // Show more meaningful event names on boltdata.
+  if (Event->str() == "instructions:")
+    Event = *AddrResTo != 0x0 ? "branch-spe:" : "instruction-spe:";
+
+  return std::make_pair(genBasicSample(*AddrResFrom),
+                        genBasicSample(*AddrResTo));
+}
+
 ErrorOr<DataAggregator::PerfMemSample> DataAggregator::parseMemSample() {
   PerfMemSample Res{0, 0};
 
@@ -1703,6 +1784,46 @@ std::error_code DataAggregator::parseBasicEvents() {
   return std::error_code();
 }
 
+std::error_code DataAggregator::parseSpeAsBasicEvents() {
+  outs() << "PERF2BOLT: parsing SPE data as basic events (no LBR)...\n";
+  NamedRegionTimer T("parseSPEBasic", "Parsing SPE as basic events",
+                     TimerGroupName, TimerGroupDesc, opts::TimeAggregator);
+  uint64_t NumSpeBranchSamples = 0;
+
+  // Convert entries to one or two basic samples, depending on whether there is
+  // branch target information.
+  while (hasData()) {
+    auto SamplePair = parseSpeAsBasicSamples();
+    if (std::error_code EC = SamplePair.getError())
+      return EC;
+
+    auto registerSample = [this](const PerfBasicSample *Sample) {
+      if (!Sample->PC)
+        return;
+
+      if (BinaryFunction *BF = getBinaryFunctionContainingAddress(Sample->PC))
+        BF->setHasProfileAvailable();
+
+      ++BasicSamples[Sample->PC];
+      EventNames.insert(Sample->EventName);
+    };
+
+    if (SamplePair->first.PC != 0x0 && SamplePair->second.PC != 0x0)
+      ++NumSpeBranchSamples;
+
+    registerSample(&SamplePair->first);
+    registerSample(&SamplePair->second);
+  }
+
+  if (NumSpeBranchSamples == 0)
+    errs() << "PERF2BOLT-WARNING: no SPE branches found\n";
+  else
+    outs() << "PERF2BOLT: found " << NumSpeBranchSamples
+           << " SPE branch sample pairs.\n";
+
+  return std::error_code();
+}
+
 void DataAggregator::processBasicEvents() {
   outs() << "PERF2BOLT: processing basic events (without LBR)...\n";
   NamedRegionTimer T("processBasic", "Processing basic events", TimerGroupName,
diff --git a/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test b/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
new file mode 100644
index 00000000000000..d7cea7ff769b8e
--- /dev/null
+++ b/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
@@ -0,0 +1,14 @@
+## Check that Arm SPE mode is available on AArch64 with BasicAggregation.
+
+REQUIRES: system-linux,perf,target=aarch64{{.*}}
+
+RUN: %clang %cflags %p/../../Inputs/asm_foo.s %p/../../Inputs/asm_main.c -o %t.exe
+RUN: touch %t.empty.perf.data
+RUN: perf2bolt -p %t.empty.perf.data -o %t.perf.boltdata --nl --spe --pa %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-SPE-NO-LBR
+
+CHECK-SPE-NO-LBR: PERF2BOLT: Starting data aggregation job
+
+RUN: perf record -e cycles -q -o %t.perf.data -- %t.exe
+RUN: not perf2bolt -p %t.perf.data -o %t.perf.boltdata --spe %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-SPE-LBR
+
+CHECK-SPE-LBR: PERF2BOLT-ERROR: Arm SPE mode is combined only with BasicAggregation.
diff --git a/bolt/test/perf2bolt/X86/perf2bolt-spe.test b/bolt/test/perf2bolt/X86/perf2bolt-spe.test
new file mode 100644
index 00000000000000..f31c17f411137d
--- /dev/null
+++ b/bolt/test/perf2bolt/X86/perf2bolt-spe.test
@@ -0,0 +1,9 @@
+## Check that Arm SPE mode is unavailable on X86.
+
+REQUIRES: system-linux,x86_64-linux
+
+RUN: %clang %cflags %p/../../Inputs/asm_foo.s %p/../../Inputs/asm_main.c -o %t.exe
+RUN: touch %t.empty.perf.data
+RUN: not perf2bolt -p %t.empty.perf.data -o %t.perf.boltdata --nl --spe --pa %t.exe 2>&1 | FileCheck %s
+
+CHECK: BOLT-ERROR: -spe is available only on AArch64.
diff --git a/bolt/tools/driver/llvm-bolt.cpp b/bolt/tools/driver/llvm-bolt.cpp
index efa06cd68cb997..60b813f6f858d7 100644
--- a/bolt/tools/driver/llvm-bolt.cpp
+++ b/bolt/tools/driver/llvm-bolt.cpp
@@ -51,6 +51,8 @@ static cl::opt<std::string> InputFilename(cl::Positional,
                                           cl::Required, cl::cat(BoltCategory),
                                           cl::sub(cl::SubCommand::getAll()));
 
+extern cl::opt<bool> ArmSPE;
+
 static cl::opt<std::string>
 InputDataFilename("data",
   cl::desc("<data file>"),
@@ -245,6 +247,13 @@ int main(int argc, char **argv) {
       if (Error E = RIOrErr.takeError())
         report_error(opts::InputFilename, std::move(E));
       RewriteInstance &RI = *RIOrErr.get();
+
+      if (opts::AggregateOnly && !RI.getBinaryContext().isAArch64() &&
+          opts::ArmSPE == 1) {
+        errs() << "BOLT-ERROR: -spe is available only on AArch64.\n";
+        exit(1);
+      }
+
       if (!opts::PerfData.empty()) {
         if (!opts::AggregateOnly) {
           errs() << ToolName
diff --git a/bolt/unittests/Profile/CMakeLists.txt b/bolt/unittests/Profile/CMakeLists.txt
index e0aa0926b49c03..ce01c6c4b949ee 100644
--- a/bolt/unittests/Profile/CMakeLists.txt
+++ b/bolt/unittests/Profile/CMakeLists.txt
@@ -1,11 +1,25 @@
+set(LLVM_LINK_COMPONENTS
+  DebugInfoDWARF
+  Object
+  ${LLVM_TARGETS_TO_BUILD}
+  )
+
 add_bolt_unittest(ProfileTests
   DataAggregator.cpp
+  PerfSpeEvents.cpp
 
   DISABLE_LLVM_LINK_LLVM_DYLIB
   )
 
 target_link_libraries(ProfileTests
   PRIVATE
+  LLVMBOLTCore
   LLVMBOLTProfile
+  LLVMTargetParser
+  LLVMTestingSupport
   )
 
+foreach (tgt ${BOLT_TARGETS_TO_BUILD})
+  string(TOUPPER "${tgt}" upper)
+  target_compile_definitions(ProfileTests PRIVATE "${upper}_AVAILABLE")
+endforeach()
diff --git a/bolt/unittests/Profile/PerfSpeEvents.cpp b/bolt/unittests/Profile/PerfSpeEvents.cpp
new file mode 100644
index 00000000000000..807a3bb1e07f40
--- /dev/null
+++ b/bolt/unittests/Profile/PerfSpeEvents.cpp
@@ -0,0 +1,173 @@
+//===- bolt/unittests/Profile/PerfSpeEvents.cpp ---------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifdef AARCH64_AVAILABLE
+
+#include "bolt/Core/BinaryContext.h"
+#include "bolt/Profile/DataAggregator.h"
+#include "llvm/BinaryFormat/ELF.h"
+#include "llvm/DebugInfo/DWARF/DWARFContext.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/TargetSelect.h"
+#include "gtest/gtest.h"
+
+using namespace llvm;
+using namespace llvm::bolt;
+using namespace llvm::object;
+using namespace llvm::ELF;
+
+namespace opts {
+extern cl::opt<std::string> ReadPerfEvents;
+} // namespace opts
+
+namespace llvm {
+namespace bolt {
+
+/// Perform checks on perf SPE branch events combined with other SPE or perf
+/// events.
+struct PerfSpeEventsTestHelper : public testing::Test {
+  void SetUp() override {
+    initalizeLLVM();
+    prepareElf();
+    initializeBOLT();
+  }
+
+protected:
+  void initalizeLLVM() {
+    llvm::InitializeAllTargetInfos();
+    llvm::InitializeAllTargetMCs();
+    llvm::InitializeAllAsmParsers();
+    llvm::InitializeAllDisassemblers();
+    llvm::InitializeAllTargets();
+    llvm::InitializeAllAsmPrinters();
+  }
+
+  void prepareElf() {
+    memcpy(ElfBuf, "\177ELF", 4);
+    ELF64LE::Ehdr *EHdr = reinterpret_cast<typename ELF64LE::Ehdr *>(ElfBuf);
+    EHdr->e_ident[llvm::ELF::EI_CLASS] = llvm::ELF::ELFCLASS64;
+    EHdr->e_ident[llvm::ELF::EI_DATA] = llvm::ELF::ELFDATA2LSB;
+    EHdr->e_machine = llvm::ELF::EM_AARCH64;
+    MemoryBufferRef Source(StringRef(ElfBuf, sizeof(ElfBuf)), "ELF");
+    ObjFile = cantFail(ObjectFile::createObjectFile(Source));
+  }
+
+  void initializeBOLT() {
+    Relocation::Arch = ObjFile->makeTriple().getArch();
+    BC = cantFail(BinaryContext::createBinaryContext(
+        ObjFile->makeTriple(), std::make_shared<orc::SymbolStringPool>(),
+        ObjFile->getFileName(), nullptr, /*IsPIC*/ false,
+        DWARFContext::create(*ObjFile.get()), {llvm::outs(), llvm::errs()}));
+    ASSERT_FALSE(!BC);
+  }
+
+  char ElfBuf[sizeof(typename ELF64LE::Ehdr)] = {};
+  std::unique_ptr<ObjectFile> ObjFile;
+  std::unique_ptr<BinaryContext> BC;
+
+  /// Return true when the expected \p SampleSize profile data are generated and
+  /// contain all the \p ExpectedEventNames.
+  bool checkEvents(uint64_t PID, size_t SampleSize,
+                   const StringSet<> &ExpectedEventNames) {
+    DataAggregator DA("<pseudo input>");
+    DA.ParsingBuf = opts::ReadPerfEvents;
+    DA.BC = BC.get();
+    DataAggregator::MMapInfo MMap;
+    DA.BinaryMMapInfo.insert(std::make_pair(PID, MMap));
+
+    DA.parseSpeAsBasicEvents();
+
+    for (auto &EE : ExpectedEventNames)
+      if (!DA.EventNames.contains(EE.first()))
+        return false;
+
+    return SampleSize == DA.BasicSamples.size();
+  }
+};
+
+} // namespace bolt
+} // namespace llvm
+
+// Check that DataAggregator can parseSpeAsBasicEvents for branch events when
+// combined with other event types.
+
+TEST_F(PerfSpeEventsTestHelper, SpeBranches) {
+  // Check perf input with SPE branch events.
+  // Example collection command:
+  // ```
+  // perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY
+  // ```
+
+  opts::ReadPerfEvents =
+      "1234          instructions:              a002    a001\n"
+      "1234          instructions:              b002    b001\n"
+      "1234          instructions:              c002    c001\n"
+      "1234          instructions:              d002    d001\n"
+      "1234          instructions:              e002    e001\n";
+
+  EXPECT_TRUE(checkEvents(1234, 10, {"branch-spe:"}));
+}
+
+TEST_F(PerfSpeEventsTestHelper, SpeBranchesAndCycles) {
+  // Check perf input with SPE branch events and cycles.
+  // Example collection command:
+  // ```
+  // perf record -e cycles:u -e 'arm_spe_0/branch_filter=1/u' -- BINARY
+  // ```
+
+  opts::ReadPerfEvents =
+      "1234          instructions:              a002    a001\n"
+      "1234              cycles:u:                 0    b001\n"
+      "1234              cycles:u:                 0    c001\n"
+      "1234          instructions:              d002    d001\n"
+      "1234          instructions:              e002    e001\n";
+
+  EXPECT_TRUE(checkEvents(1234, 8, {"branch-spe:", "cycles:u:"}));
+}
+
+TEST_F(PerfSpeEventsTestHelper, SpeAnyEventAndCycles) {
+  // Check perf input with any SPE event type and cycles.
+  // Example collection command:
+  // ```
+  // perf record -e cycles:u -e 'arm_spe_0//u' -- BINARY
+  // ```
+
+  opts::ReadPerfEvents =
+      "1234              cycles:u:                0     a001\n"
+      "1234              cycles:u:                0     b001\n"
+      "1234          instructions:                0     c001\n"
+      "1234          instructions:                0     d001\n"
+      "1234          instructions:              e002    e001\n";
+
+  EXPECT_TRUE(
+      checkEvents(1234, 6, {"cycles:u:", "instruction-spe:", "branch-spe:"}));
+}
+
+TEST_F(PerfSpeEventsTestHelper, SpeNoBranchPairsRecorded) {
+  // Check perf input that has no SPE branch pairs recorded.
+  // Example collection command:
+  // ```
+  // perf record -e cycles:u -e 'arm_spe_0/load_filter=1,branch_filter=0/u' --
+  // BINARY
+  // ```
+
+  testing::internal::CaptureStderr();
+  opts::ReadPerfEvents =
+      "1234          instructions:                 0    a001\n"
+      "1234              cycles:u:                 0    b001\n"
+      "1234          instructions:                 0    c001\n"
+      "1234              cycles:u:                 0    d001\n"
+      "1234          instructions:                 0    e001\n";
+
+  EXPECT_TRUE(checkEvents(1234, 5, {"instruction-spe:", "cycles:u:"}));
+
+  std::string Stderr = testing::internal::GetCapturedStderr();
+  EXPECT_EQ(Stderr, "PERF2BOLT-WARNING: no SPE branches found\n");
+}
+
+#endif

github-actions · 2024-12-20T15:01:57Z

✅ With the latest revision this PR passed the C/C++ code formatter.

paschalis-mpeis · 2024-12-20T16:43:24Z

This PR is an implementation of the (4a) approach of:

[AArch64] BOLT does not support SPE branch data #115333

We did some limited, quick testing and there was no clear winner between the two approaches, but the --spe flag is introduced in a way to accommodate both.

I believe @kaadam had some work on (4b)? Maybe at some point we could additionally have that merged, so the community can test on a wider set of apps/workloads. I believe there won't be dramatic performance changes.

Please give SPE a try along with this patch and report any feedback. To check if SPE is available on your machine, see point (3) on the issue. Let us know if more information is needed on how to enable or use SPE!

yota9

Thanks for your amazing job!

bolt/lib/Profile/DataAggregator.cpp

bolt/tools/driver/llvm-bolt.cpp

bolt/test/perf2bolt/AArch64/perf2bolt-spe.test

bolt/lib/Profile/DataAggregator.cpp

paschalis-mpeis · 2025-01-15T15:28:20Z

Hey @yota9,

Thanks a lot for your review!

I addressed your comments except this one (left a comment there).
Please have another look and let me know of any further changes.

bolt/lib/Profile/DataAggregator.cpp

aaupov · 2025-01-16T23:26:37Z

Hi Paschalis, thank you for working on this.
The benefit that SPE has over IP sampling is the edge frequency information. So instead of creating two basic (IP) samples we should create branch samples (LBR) with stack depth one. Branch samples are later attached to CFG edges. This should improve the resulting performance when using SPE profiling.
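
The difference between the two representations can be sketched as follows. This is an illustration only, with hypothetical names (`SpePair`, `asBasicSamples`, `asBranchSamples`); it does not use BOLT's data structures. Counting addresses discards which source led to which target, whereas counting (from, to) pairs keeps the edge frequency.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical record for one SPE branch entry (source PC, target address).
struct SpePair {
  uint64_t From;
  uint64_t To;
};

// Two basic (IP) samples per entry: only per-address hit counts survive.
std::map<uint64_t, uint64_t> asBasicSamples(const std::vector<SpePair> &Rs) {
  std::map<uint64_t, uint64_t> Counts;
  for (const SpePair &R : Rs) {
    ++Counts[R.From];
    ++Counts[R.To];
  }
  return Counts;
}

// One branch sample per entry: the (From, To) edge and its frequency survive.
std::map<std::pair<uint64_t, uint64_t>, uint64_t>
asBranchSamples(const std::vector<SpePair> &Rs) {
  std::map<std::pair<uint64_t, uint64_t>, uint64_t> Edges;
  for (const SpePair &R : Rs)
    ++Edges[{R.From, R.To}];
  return Edges;
}
```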

aaupov · 2025-01-16T23:42:42Z

> This PR is an implementation of the (4a) approach of:
>
> [AArch64] BOLT does not support SPE branch data #115333
>
> We did some limited, quick testing and there was no clear winner between the two approaches, but the --spe flag is introduced in a way to accommodate both.

Missed this comment. Am I reading it right that you didn't see a perf difference between registering SPE as two basic events or one branch event? In this case, can you please try -infer-fall-throughs option with the latter?

bolt/tools/driver/llvm-bolt.cpp

paschalis-mpeis · 2025-01-17T18:05:40Z

Hey Amir and Maks,

Thank you for taking a look at this!

> Am I reading it right that you didn't see a perf difference between registering SPE as two basic events or one branch event? In this case, can you please try -infer-fall-throughs option with the latter?

Correct, in some preliminary internal tests we found both approaches to be close to each other.

Thanks for your suggestion to use -infer-fall-throughs. I thought LBR mode was inferring fall-through branches by default. But it looks like this has to be manually specified?

Let me share my understanding of the LBR format to see if I got this right:
Each LBR event gets a contiguous stack of taken branches. Any other branches that lie in between them are known to be fall-throughs, which BOLT can infer. E.g., if we have:

$\bf\textsf{\color{blue}TK1}$ -> $\textsf{\color{blue}TK2}$ -> $\textsf{\color{blue}TK3}$, then BOLT can propagate CFG hotness to:
$\textsf{\color{blue}TK1}$ -> FT1a -> FT1b .. -> $\textsf{\color{blue}TK2}$ -> FT2a -> FT2b .. -> $\textsf{\color{blue}TK3}$
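
A minimal sketch of that inference, under stated assumptions: the stack is given oldest branch first, and `TakenBranch`, `FallThroughRange`, and `inferFallThroughs` are hypothetical names rather than BOLT APIs. Between two consecutive taken branches, execution must have run sequentially from the first branch's target to the second branch's source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types; BOLT's own LBR handling is more involved.
struct TakenBranch {
  uint64_t From; // address of the taken branch
  uint64_t To;   // address it jumped to
};

struct FallThroughRange {
  uint64_t Start; // target of the earlier branch
  uint64_t End;   // source of the next taken branch
};

// Stack is assumed to be in chronological order (oldest branch first).
std::vector<FallThroughRange>
inferFallThroughs(const std::vector<TakenBranch> &Stack) {
  std::vector<FallThroughRange> Ranges;
  for (std::size_t I = 0; I + 1 < Stack.size(); ++I) {
    uint64_t Start = Stack[I].To;
    uint64_t End = Stack[I + 1].From;
    // Skip pairs that cannot be a straight-line run (e.g. a broken trace).
    if (Start <= End)
      Ranges.push_back({Start, End});
  }
  return Ranges;
}
```

With a stack of depth one, which is all a single SPE record provides, the loop body never runs and no range is produced.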

SPE, on the other hand, is a statistical sampling method, meaning the collected packets are not captured contiguously. Each pair comes from a packet that looks like:

.  00000040:         PC 0xAB0 el2 ns=1
.  00000049:         PAD
.  00000053:         B COND
.  00000055:         EV RETIRED NOT-TAKEN
.  0000005a:         LAT 7 ISSUE
.  0000005d:         LAT 8 TOT
.  00000060:         TGT 0xAB4 el2 ns=1
.  00000069:         PAD
.  00000077:         TS 1234

(note: you can inspect native SPE packets w/ perf script -D)

From this example we have 0xAB0 -> 0xAB4 (a src/tgt pair), where 0xAB0 is a branch that was NOT-TAKEN.
The tgt 0xAB4 is a target address of some block (i.e., not a branch). We have no information whether the branch of that target block will be taken or not. Therefore, my understanding is that we cannot infer any branches in-between src/tgt. And I believe that is why we found the two approaches to be close to each other.
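
For illustration, the src/tgt pair and the taken flag can be pulled out of such a decoded dump with a few lines of text processing. This sketch assumes the exact line layout of the excerpt above (a leading dot, an offset, then the packet name), which may differ between perf versions. `SpeBranchRecord` and `parsePacketDump` are hypothetical names, and treating the absence of NOT-TAKEN as "taken" is an assumption.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical summary of one decoded SPE branch record.
struct SpeBranchRecord {
  uint64_t PC = 0;
  uint64_t Target = 0;
  bool Taken = true; // assumption: no NOT-TAKEN flag means the branch was taken
  bool Mispredicted = false;
};

// Scan the dump line by line and pick up the PC, TGT and EV packets.
// std::stoull throws on a malformed address; callers may want to catch that.
SpeBranchRecord parsePacketDump(const std::string &Dump) {
  SpeBranchRecord R;
  std::istringstream Lines(Dump);
  std::string Line;
  while (std::getline(Lines, Line)) {
    std::istringstream Tokens(Line);
    std::string Dot, Offset, Kind;
    if (!(Tokens >> Dot >> Offset >> Kind))
      continue;
    if (Kind == "PC" || Kind == "TGT") {
      std::string Hex;
      if (!(Tokens >> Hex))
        continue;
      uint64_t Value = std::stoull(Hex, nullptr, 16); // accepts a 0x prefix
      (Kind == "PC" ? R.PC : R.Target) = Value;
    } else if (Kind == "EV") {
      std::string Flag;
      while (Tokens >> Flag) {
        if (Flag == "NOT-TAKEN")
          R.Taken = false;
        else if (Flag == "MISPRED")
          R.Mispredicted = true;
      }
    }
  }
  return R;
}
```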

Please do share your thoughts on this.

Do you think there are any other benefits when using the LBR format? It can additionally utilize prediction information (miss/hit), but we haven't found this to be that beneficial for the quite-limited SPE branch data (when compared to LBR traces).

aaupov · 2025-01-17T19:02:55Z

@paschalis-mpeis is there a way to configure SPE to only collect taken branches? My impression was that it's possible, e.g. based on this: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/introduction-to-statistical-profiling-support-in-streamline

> Event packets, which provide important information about each sampled instruction.
> This information includes:
> ...
> Was a mis-predicted or not-taken branch

But I couldn't find any info regarding configuring perf filter to collect it.

aaupov · 2025-01-17T19:05:45Z

> Let me share my understanding of the LBR format to see if I got this right:
> Each LBR event gets a contiguous stack of taken branches. Any other branches that lie in between them are known to be fall-throughs, which BOLT can infer. E.g., if we have:

Right, with taken branch stacks, we automatically "infer" fall-throughs between entries, and that becomes part of profile data that gets attached.

With SPE, if we're able to distinguish taken branches from not taken, I think we can similarly make that part of profile data so won't need infer-fall-through. If we can't distinguish them, but can filter by taken branches only, then we'd need to use infer-fall-throughs to assign fallthrough counts after taken branch counters are attached to a CFG.

paschalis-mpeis · 2025-01-20T14:44:05Z

> is there a way to configure SPE to only collect taken branches?

What I believe you are asking here is to configure SPE to get us a pair of $\bf\textsf{\color{blue}SRC}$ -> $\textsf{\color{blue}TGT}$, where $\bf\textsf{\color{blue}SRC}$ and $\bf\textsf{\color{blue}TGT}$ are two taken branches captured in sequential order and are not necessarily directly linked in the CFG. In other words, make SPE act as an LBR-like buffer with a branch stack depth of 1.

I don't think that is possible. SPE performs periodic, non-contiguous capture of event packets, in our case branches. Please consider the example below:

.  000007c0:  PC 0xAFF el2 ns=1
.  000007c9:  PAD
.  000007d3:  B COND
.  000007d5:  EV RETIRED NOT-TAKEN MISPRED
.  000007da:  LAT 12 ISSUE
.  000007dd:  LAT 13 TOT
.  000007e0:  TGT 0xB03 el2 ns=1
.  000007e9:  PAD
.  000007f7:  TS 12345

PC:
- is the instruction that was captured, in our case the source branch
- we can know whether the branch at PC was taken or not, and if it was a prediction miss/hit
- this aligns with the blogpost you've provided
TGT:
- is the target address of the landing block, ie where execution will continue next
- SPE has no information whether the next branch will be taken or not
- we cannot configure SPE to capture two taken branches in sequential program order, which are not necessarily directly linked in the CFG
PBT: optional HW feature pointing to the previous block (FEAT_SPE_PBT)
- it is a statistical profiling of the Previous Branch Target
- TMU there is no known HW implementation of this optional feature.

I could re-word some points in the PR/patch to make the above more clear.

(@mikewilliams-arm feel free to correct me if I missed anything)

paschalis-mpeis · 2025-01-20T14:56:03Z

> With SPE, if we're able to distinguish taken branches from not taken, I think we can similarly make that part of profile data so won't need infer-fall-through. If we can't distinguish them, but can filter by taken branches only, then we'd need to use infer-fall-throughs to assign fallthrough counts after taken branch counters are attached to a CFG.

Currently, there is no such information, but we could expose it with follow-up patches on perf/linux. Please note that if we filter out any non-taken branches, then we'll exclude information we cannot later infer.

Given the SPE limitations I've explained in previous comments, will this additional taken/not-taken information (or the infer flag) help propagate additional CFG hotness data?
Let's assume we captured the below entries:

Branch1 (taken) -> Block1
Branch2 (fallthrough) -> Block2

Regardless of whether the source branch was FT or Taken, we still don't know what will happen in Block1 and Block2 in terms of branching. I think BOLT will not be able to propagate information past these blocks, unless those are part of some extended basic block (EBB). In that case, it depends on how BOLT deals with EBBs and whether it lacks any support for them in BasicAggregation (thus giving an advantage to the LBR-format)?

mikewilliams-arm · 2025-01-20T17:34:46Z

> is there a way to configure SPE to only collect taken branches?

For the avoidance of doubt, and benefit of anyone finding this and reading it out of context, you can configure SPE to collect only taken branches, but only from FEAT_SPEv1p2. That's a relatively new feature in the field. From looking at the kernel sources, you need to check for /sys/devices/arm_spe_0/format/inv_event_filter and the syntax would be something like perf record -e arm_spe_0/branch_filter=1,inv_event_filter=64/. (I might be wrong - I don't have access to such a system.)

You can always do this filtering post-hoc in software. perf record -e arm_spe_0/branch_filter=1/ should work on all SPE implementations, and according to Google's AI, 60% of branches are taken, so it's about a 66% overhead to store all the not taken branches and filter them out. perf script --inject=b used to do a poor job of preserving all the branch information through the injected events, making this harder to do. I believe that is something being looked into, if it's not already addressed.
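The overhead estimate above follows from simple arithmetic: for every taken branch kept, (1 - p) / p not-taken branches are stored and later discarded, where p is the fraction of branches that are taken. A minimal sketch, with a hypothetical helper name and the 60% figure taken from the comment above rather than from any measurement:

```python
def not_taken_overhead(taken_fraction: float) -> float:
    """Extra records stored per taken-branch record when nothing is filtered.

    With a fraction p of taken branches, the remaining (1 - p) are
    not-taken records that a post-hoc filter throws away.
    """
    if not 0.0 < taken_fraction <= 1.0:
        raise ValueError("taken_fraction must be in (0, 1]")
    return (1.0 - taken_fraction) / taken_fraction


if __name__ == "__main__":
    # 0.4 / 0.6 is two thirds, matching the rough figure quoted above.
    print(f"{not_taken_overhead(0.60):.1%}")
```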

However, even so, each sampled branch is exactly that - a single sampled branch. It does not collect sequences of branches other than through the aforementioned optional PBT extension. So, you can only infer that where you came from and where you branched to were executed.

paschalis-mpeis · 2025-01-21T09:32:57Z

Great, thanks a lot Michael for filling in with details!

Indeed the differences are subtle. I've answered a slightly different question, which I've now refined as it wasn't fully correct:

What I believe you are asking here is to configure SPE to get us a pair of SRC -> TGT, where SRC and TGT are two taken branches captured in sequential order and are not necessarily directly linked in the CFG.

Whether we filter out the non-taken branches at the HW collection level (i.e., with the inv_event_filter interface) or in post-processing SW, the information loss holds:

Please note that if we filter out any non-taken branches, then we'll exclude information we cannot later infer.

And this is because we'll end up with all the taken branch pairs that have direct links in the CFG.
In other words, we cannot get two branches that are indirect ancestors in the CFG, which would have left opportunities for inferring FTs.

ilinpv · 2025-02-03T20:43:19Z

bolt/lib/Profile/DataAggregator.cpp

+      ++NumSpeBranchSamples;
+
+    registerSample(&SamplePair->first);
+    registerSample(&SamplePair->second);


Am I correct in understanding that this is the case where we have a sample for a branch SRC -> TGT which may or may not have been taken? However, we increase the hotness of both the SRC and TGT nodes in any case, always registering samples for both nodes and not taking into account the ratio of samples in which this branch was taken versus not taken?

paschalis-mpeis · 2025-02-05

Hey Pavel,

Reading this back, you are concerned that storing samples on TGT for branches that are NOT-TAKEN might increase hotness in a block that shouldn't receive it. Correct?

That should not be a concern: regardless of whether a branch is taken or not, the reported TGT is what was architecturally executed. In other words, NOT-TAKEN (or its absence) characterizes what happened in the source branch (PC), while TGT will always point to the path we end up taking.

So, for fall-through SPE packets, the TGT address would always be the next address after PC (i.e., 0xA00 + 4, where 4 is the instruction size on AArch64):

PC 0xA00 B COND EV RETIRED NOT-TAKEN TGT 0xA04

For taken branches, the TGT can be at a distance further than just 4:

PC 0xA00 B COND EV RETIRED TGT 0xBBB
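The relationship between PC, TGT and the NOT-TAKEN event in the two mock packets above can be sketched as follows. The packet text format and the helper names are illustrative only (they mirror the mock lines in this comment, not real perf or SPE decoder output), and a taken branch whose target happens to be the next instruction would look like a fall-through under the address check alone.

```python
AARCH64_INSN_SIZE = 4  # bytes; A64 instructions are fixed-width


def parse_mock_packet(line: str) -> dict:
    """Parse a mock line such as 'PC 0xA00 B COND EV RETIRED TGT 0xBBB'."""
    tokens = line.split()
    pc = int(tokens[tokens.index("PC") + 1], 16)
    tgt = int(tokens[tokens.index("TGT") + 1], 16)
    return {"pc": pc, "tgt": tgt, "not_taken": "NOT-TAKEN" in tokens}


def is_fall_through(packet: dict) -> bool:
    """True when TGT is the instruction immediately following PC."""
    return packet["tgt"] == packet["pc"] + AARCH64_INSN_SIZE


if __name__ == "__main__":
    for text in (
        "PC 0xA00 B COND EV RETIRED NOT-TAKEN TGT 0xA04",
        "PC 0xA00 B COND EV RETIRED TGT 0xBBB",
    ):
        pkt = parse_mock_packet(text)
        # In both cases TGT is the address that was actually executed next.
        print(hex(pkt["pc"]), "->", hex(pkt["tgt"]),
              "fall-through" if is_fall_through(pkt) else "taken")
```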

In my previous examples I was using mock addresses for PC/TGT, so I've updated any relevant examples to avoid confusion.

ilinpv · 2025-02-10

Right, thank you @paschalis-mpeis for clarifying the taken/not-taken information and updating the examples. @aaupov @maksfb would you like any additional explanations regarding SPE packets? Generally speaking, SPE provides event-based sampling for branches and doesn't have enough information to create a trace of N>1 branches and infer fall-throughs. We are aiming to add BRBE (Branch Record Buffer Extension) support for this in BOLT and provide a branch stack trace like LBR with it.

kaadam · 2025-02-17

Hi, thanks Paschalis for your example.
Maybe it's worth highlighting that the not-taken event only relates to conditional instructions (conditional branch or compare-and-branch); it indicates that the instruction failed its condition code check, that's it. Since TGT (as you mentioned) "will always point to the path we end up taking", the presence of the not-taken event type is not relevant to us in this case; we will always get the 'taken paths'. In theory this branch information supports our optimization, and BOLT will be able to rely on it.

paschalis-mpeis · 2025-02-17

Correct, thanks Adam. This is irrelevant to any unconditional branching (including call/ret).
Skipping 'non-taken' conditional branches is the optimization LBR/BRBE can do, as that can be inferred in post-processing.

paschalis-mpeis · 2025-02-12T20:47:28Z

Just adding that we are in the process of upstreaming 'brstack' support for SPE, which would handle the PBT feature nicely for us and would sit nicely within the BOLT sources. The latest revision of these patches can also be found here:

https://github.com/Leo-Yan/linux/tree/perf_arm_spe_branch_flags_v2

Once upstreamed, we can adapt the patch to work for the LBR-format (cc: @kaadam).

BOLT gains the ability to process branch target information generated by Arm SPE data, using the `BasicAggregation` format.

Example usage is:

```bash
perf2bolt -p perf.data -o perf.boltdata --nl --spe BINARY
```

New branch data and compatibility:
---
SPE branch entries in perf data contain a branch pair (`IP` -> `ADDR`) for the source and destination branches. DataAggregator processes those by creating two basic samples.

Any other event types will have `ADDR` field set to `0x0`. For those a single sample will be created. Such events can be either SPE or non-SPE, like `l1d-access` and `cycles` respectively.

The format of the input perf entries is:

```
PID EVENT-TYPE ADDR IP
```

When on SPE mode and:
- host is not `AArch64`, BOLT will exit with a relevant message
- `ADDR` field is unavailable, BOLT will exit with a relevant message
- no branch pairs were recorded, BOLT will present a warning

Examples of generating profiling data for the SPE mode:
---
Profiles can be captured with perf on AArch64 machines with SPE enabled. They can be combined with other events, SPE or not.

Capture only SPE branch data events:

```bash
perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY
```

Capture any SPE events:

```bash
perf record -e 'arm_spe_0//u' -- BINARY
```

Capture any SPE events and cycles:

```bash
perf record -e 'arm_spe_0//u' -e cycles:u -- BINARY
```

More filters, jitter, and specify count to control overheads/quality:

```bash
perf record -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/u' -c 10007 -- BINARY
```
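The sample-creation rule in the description above (two basic samples for an SPE branch pair, one sample when `ADDR` is `0x0`) can be sketched roughly as below. This is not the DataAggregator implementation: the function name is hypothetical, and the exact text layout of each entry (whitespace-separated fields, hexadecimal addresses, event names without spaces) is an assumption made for illustration.

```python
from collections import Counter


def aggregate_basic_samples(lines):
    """Count basic samples per address from 'PID EVENT-TYPE ADDR IP' entries.

    Returns (samples, branch_pairs), where samples maps an address to its
    sample count and branch_pairs is the number of entries with ADDR != 0.
    """
    samples = Counter()
    branch_pairs = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip entries that do not match the assumed layout
        _pid, _event, addr_text, ip_text = fields
        addr = int(addr_text, 16)
        ip = int(ip_text, 16)
        samples[ip] += 1  # every entry yields a sample at IP
        if addr != 0:
            samples[addr] += 1  # branch pair: second sample at the target
            branch_pairs += 1
    return samples, branch_pairs


if __name__ == "__main__":
    entries = [
        "1234 branch 0xa04 0xa00",  # mock SPE branch pair (IP -> ADDR)
        "1234 cycles 0x0 0xb00",    # mock non-branch event, ADDR is 0x0
    ]
    samples, pairs = aggregate_basic_samples(entries)
    if pairs == 0:
        print("warning: no branch pairs were recorded")
    for address, count in sorted(samples.items()):
        print(hex(address), count)
```

The `pairs == 0` check mirrors the warning case listed above; the two exit conditions (non-AArch64 host, unavailable `ADDR` field) are not modelled here.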

paschalis-mpeis · 2025-02-27T13:09:27Z

Forced push to rebase to latest main to address conflicts (PCs/IPs were removed from the LBR samples).

Will be proceeding soon with an LBR patch, which for now will be stacked on top of this PR (cc: @kaadam).

paschalis-mpeis requested review from aaupov, maksfb, rafaelauler, ayermolo, dcci and yota9 as code owners December 20, 2024 14:58

llvmbot added the BOLT label Dec 20, 2024

yota9 reviewed Dec 20, 2024

paschalis-mpeis commented Dec 20, 2024

bolt/lib/Profile/DataAggregator.cpp (outdated)

paschalis-mpeis requested a review from yota9 January 15, 2025 15:29

maksfb reviewed Jan 16, 2025

bolt/lib/Profile/DataAggregator.cpp

maksfb reviewed Jan 17, 2025

bolt/tools/driver/llvm-bolt.cpp (outdated)

ilinpv reviewed Feb 3, 2025

paschalis-mpeis mentioned this pull request Feb 17, 2025

[AArch64] BOLT does not support SPE branch data #115333

Open

paschalis-mpeis and others added 2 commits February 27, 2025 11:55

clang-format fix

0782911

paschalis-mpeis added 2 commits February 27, 2025 11:55

Addressing reviewers (1)

e74d2ae

Addressing reviewers (2)

47a986d

paschalis-mpeis force-pushed the users/paschalis-mpeis/bolt-spe-mode branch from 10f7219 to 47a986d Compare February 27, 2025 13:05
