[BOLT][AArch64] Introduce SPE mode in BasicAggregation #120741

Open
wants to merge 4 commits into base: main


paschalis-mpeis
Member

@paschalis-mpeis paschalis-mpeis commented Dec 20, 2024

BOLT gains the ability to process Arm SPE data using the BasicAggregation format.

Example usage is:

perf2bolt -p perf.data -o perf.boltdata --nl --spe BINARY

New branch data and compatibility:

Since Linux 6.13, perf reports branch pairs (PC -> TGT) for SPE, where:

  • PC:
    • is the source branch, which may be taken or not taken.
    • Because of how SPE operates and what it can collect, any filtering on the
      PC (i.e., to consider only the taken branches) would result in data loss that
      BOLT cannot later infer.
  • TGT:
    • is the target address of the destination block.
    • this is the new information that perf can now report.

DataAggregator processes each branch pair by creating two basic samples, one
for PC and one for TGT. Any other event type has its ADDR field set to 0x0, and
a single sample is created for it.
Such events can be either SPE or non-SPE, like l1d-access and cycles respectively.

The format of the input perf entries is:

PID   EVENT-TYPE   ADDR   IP
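
The one-or-two-samples rule for this entry format can be sketched as below. This is a simplified, standalone illustration and not the code in the patch: `BasicSample` and `splitSpeEntry` are hypothetical names, and the sketch assumes whitespace-separated fields with ADDR and IP printed as hexadecimal without a `0x` prefix.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for BOLT's PerfBasicSample; not the real type.
struct BasicSample {
  std::string Event;
  uint64_t PC;
};

// Split one `PID EVENT-TYPE ADDR IP` line into zero, one, or two samples.
// Assumption: ADDR and IP are hexadecimal, PID is decimal.
std::vector<BasicSample> splitSpeEntry(const std::string &Line) {
  std::istringstream In(Line);
  uint64_t Pid = 0, Addr = 0, Ip = 0;
  std::string Event;
  if (!(In >> std::dec >> Pid >> Event >> std::hex >> Addr >> Ip))
    return {}; // Malformed line: produce nothing.

  std::vector<BasicSample> Samples;
  if (Ip != 0)
    Samples.push_back({Event, Ip}); // Source of the pair, or the only sample.
  if (Addr != 0)
    Samples.push_back({Event, Addr}); // Target; non-zero only for SPE branches.
  return Samples;
}
```

An SPE branch entry such as `1234 instructions: a002 a001` yields two samples (source first, then target), while `1234 cycles:u: 0 b001` yields one.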

In SPE mode, when:

  • host is not AArch64, BOLT will exit with a relevant message
  • ADDR field is unavailable, BOLT will exit with a relevant message
  • no branch pairs were recorded, BOLT will present a warning

Examples of generating profiling data for the SPE mode:

Profiles can be captured with perf on AArch64 machines with SPE enabled.
They can be combined with other events, SPE or not.
In the future we might restrict processing to just the branch packets.

Capture only SPE branch data events:

perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY

Use more filters, add some jitter, and specify a count to control overhead/quality:

perf record -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/u' -c 10007 -- BINARY

Capture any SPE events:

perf record -e 'arm_spe_0//u' -- BINARY

Capture any SPE events and cycles:

perf record -e 'arm_spe_0//u' -e cycles:u -- BINARY

@llvmbot
Member

llvmbot commented Dec 20, 2024

@llvm/pr-subscribers-bolt

Author: Paschalis Mpeis (paschalis-mpeis)

Changes

BOLT gains the ability to process branch target information generated by
Arm SPE data, using the BasicAggregation format.

Example usage is:

perf2bolt -p perf.data -o perf.boltdata --nl --spe BINARY

New branch data and compatibility:

SPE branch entries in perf data contain a branch pair (IP -> ADDR)
for the source and destination branches. DataAggregator processes those
by creating two basic samples. Any other event types will have ADDR
field set to 0x0. For those a single sample will be created. Such
events can be either SPE or non-SPE, like l1d-access and cycles
respectively.

The format of the input perf entries is:

PID   EVENT-TYPE   ADDR   IP

In SPE mode, when:

  • host is not AArch64, BOLT will exit with a relevant message
  • ADDR field is unavailable, BOLT will exit with a relevant message
  • no branch pairs were recorded, BOLT will present a warning

Examples of generating profiling data for the SPE mode:

Profiles can be captured with perf on AArch64 machines with SPE enabled.
They can be combined with other events, SPE or not.

Capture only SPE branch data events:

perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY

Capture any SPE events:

perf record -e 'arm_spe_0//u' -- BINARY

Capture any SPE events and cycles:

perf record -e 'arm_spe_0//u' -e cycles:u -- BINARY

Use more filters, add some jitter, and specify a count to control overhead/quality:

perf record -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/u' -c 10007 -- BINARY

Full diff: https://github.com/llvm/llvm-project/pull/120741.diff

7 Files Affected:

  • (modified) bolt/include/bolt/Profile/DataAggregator.h (+14)
  • (modified) bolt/lib/Profile/DataAggregator.cpp (+132-11)
  • (added) bolt/test/perf2bolt/AArch64/perf2bolt-spe.test (+14)
  • (added) bolt/test/perf2bolt/X86/perf2bolt-spe.test (+9)
  • (modified) bolt/tools/driver/llvm-bolt.cpp (+9)
  • (modified) bolt/unittests/Profile/CMakeLists.txt (+14)
  • (added) bolt/unittests/Profile/PerfSpeEvents.cpp (+173)
diff --git a/bolt/include/bolt/Profile/DataAggregator.h b/bolt/include/bolt/Profile/DataAggregator.h
index 320623cfa15af1..be6e0fbd6347a0 100644
--- a/bolt/include/bolt/Profile/DataAggregator.h
+++ b/bolt/include/bolt/Profile/DataAggregator.h
@@ -78,6 +78,8 @@ class DataAggregator : public DataReader {
   static bool checkPerfDataMagic(StringRef FileName);
 
 private:
+  friend struct PerfSpeEventsTestHelper;
+
   struct PerfBranchSample {
     SmallVector<LBREntry, 32> LBR;
     uint64_t PC;
@@ -294,6 +296,15 @@ class DataAggregator : public DataReader {
   /// and a PC
   ErrorOr<PerfBasicSample> parseBasicSample();
 
+  /// Parse an Arm SPE entry into the non-lbr format by generating two basic
+  /// samples. The format of an input SPE entry is:
+  /// ```
+  /// PID   EVENT-TYPE   ADDR   IP
+  /// ```
+  /// SPE branch events will have 'ADDR' set to a branch target address while
+  /// other perf or SPE events will have it set to zero.
+  ErrorOr<std::pair<PerfBasicSample,PerfBasicSample>> parseSpeAsBasicSamples();
+
   /// Parse a single perf sample containing a PID associated with an IP and
   /// address.
   ErrorOr<PerfMemSample> parseMemSample();
@@ -343,6 +354,9 @@ class DataAggregator : public DataReader {
   /// Process non-LBR events.
   void processBasicEvents();
 
+  /// Parse Arm SPE events into the non-LBR format.
+  std::error_code parseSpeAsBasicEvents();
+
   /// Parse the full output generated by perf script to report memory events.
   std::error_code parseMemEvents();
 
diff --git a/bolt/lib/Profile/DataAggregator.cpp b/bolt/lib/Profile/DataAggregator.cpp
index 2b02086e3e0c99..7038ca5b1452ab 100644
--- a/bolt/lib/Profile/DataAggregator.cpp
+++ b/bolt/lib/Profile/DataAggregator.cpp
@@ -49,6 +49,13 @@ static cl::opt<bool>
                      cl::desc("aggregate basic samples (without LBR info)"),
                      cl::cat(AggregatorCategory));
 
+cl::opt<bool> ArmSPE(
+    "spe",
+    cl::desc(
+        "Enable Arm SPE mode. Used in conjunction with no-lbr mode, ie `--spe "
+        "--nl`"),
+    cl::cat(AggregatorCategory));
+
 static cl::opt<std::string>
     ITraceAggregation("itrace",
                       cl::desc("Generate LBR info with perf itrace argument"),
@@ -180,11 +187,19 @@ void DataAggregator::start() {
 
   findPerfExecutable();
 
-  if (opts::BasicAggregation) {
-    launchPerfProcess("events without LBR",
-                      MainEventsPPI,
+  if (opts::ArmSPE) {
+    if (!opts::BasicAggregation) {
+      errs() << "PERF2BOLT-ERROR: Arm SPE mode is combined only with "
+                "BasicAggregation.\n";
+      exit(1);
+    }
+    launchPerfProcess("branch events with SPE", MainEventsPPI,
+                      "script -F pid,event,ip,addr --itrace=i1i",
+                      /*Wait = */ false);
+  } else if (opts::BasicAggregation) {
+    launchPerfProcess("events without LBR", MainEventsPPI,
                       "script -F pid,event,ip",
-                      /*Wait = */false);
+                      /*Wait = */ false);
   } else if (!opts::ITraceAggregation.empty()) {
     std::string ItracePerfScriptArgs = llvm::formatv(
         "script -F pid,ip,brstack --itrace={0}", opts::ITraceAggregation);
@@ -192,10 +207,9 @@ void DataAggregator::start() {
                       ItracePerfScriptArgs.c_str(),
                       /*Wait = */ false);
   } else {
-    launchPerfProcess("branch events",
-                      MainEventsPPI,
+    launchPerfProcess("branch events", MainEventsPPI,
                       "script -F pid,ip,brstack",
-                      /*Wait = */false);
+                      /*Wait = */ false);
   }
 
   // Note: we launch script for mem events regardless of the option, as the
@@ -531,14 +545,20 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
               "not read one from input binary\n";
   }
 
-  auto ErrorCallback = [](int ReturnCode, StringRef ErrBuf) {
+  const Regex NoData("Samples for '.*' event do not have ADDR attribute set. "
+                     "Cannot print 'addr' field.");
+
+  auto ErrorCallback = [&NoData](int ReturnCode, StringRef ErrBuf) {
+    if (opts::ArmSPE && NoData.match(ErrBuf)) {
+      errs() << "PERF2BOLT-ERROR: perf data are incompatible for Arm SPE mode "
+                "consumption. ADDR attribute is unset.\n";
+      exit(1);
+    }
     errs() << "PERF-ERROR: return code " << ReturnCode << "\n" << ErrBuf;
     exit(1);
   };
 
   auto MemEventsErrorCallback = [&](int ReturnCode, StringRef ErrBuf) {
-    Regex NoData("Samples for '.*' event do not have ADDR attribute set. "
-                 "Cannot print 'addr' field.");
     if (!NoData.match(ErrBuf))
       ErrorCallback(ReturnCode, ErrBuf);
   };
@@ -579,7 +599,8 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
     exit(0);
   }
 
-  if ((!opts::BasicAggregation && parseBranchEvents()) ||
+  if (((!opts::BasicAggregation && !opts::ArmSPE) && parseBranchEvents()) ||
+      (opts::BasicAggregation && opts::ArmSPE && parseSpeAsBasicEvents()) ||
       (opts::BasicAggregation && parseBasicEvents()))
     errs() << "PERF2BOLT: failed to parse samples\n";
 
@@ -1226,6 +1247,66 @@ ErrorOr<DataAggregator::PerfBasicSample> DataAggregator::parseBasicSample() {
   return PerfBasicSample{Event.get(), Address};
 }
 
+ErrorOr<
+    std::pair<DataAggregator::PerfBasicSample, DataAggregator::PerfBasicSample>>
+DataAggregator::parseSpeAsBasicSamples() {
+  while (checkAndConsumeFS()) {
+  }
+
+  ErrorOr<int64_t> PIDRes = parseNumberField(FieldSeparator, true);
+  if (std::error_code EC = PIDRes.getError())
+    return EC;
+
+  constexpr PerfBasicSample EmptySample = PerfBasicSample{StringRef(), 0};
+  auto MMapInfoIter = BinaryMMapInfo.find(*PIDRes);
+  if (MMapInfoIter == BinaryMMapInfo.end()) {
+    consumeRestOfLine();
+    return std::make_pair(EmptySample, EmptySample);
+  }
+
+  while (checkAndConsumeFS()) {
+  }
+
+  ErrorOr<StringRef> Event = parseString(FieldSeparator);
+  if (std::error_code EC = Event.getError())
+    return EC;
+
+  while (checkAndConsumeFS()) {
+  }
+
+  ErrorOr<uint64_t> AddrResTo = parseHexField(FieldSeparator);
+  if (std::error_code EC = AddrResTo.getError())
+    return EC;
+  consumeAllRemainingFS();
+
+  ErrorOr<uint64_t> AddrResFrom = parseHexField(FieldSeparator, true);
+  if (std::error_code EC = AddrResFrom.getError())
+    return EC;
+
+  if (!checkAndConsumeNewLine()) {
+    reportError("expected end of line");
+    return make_error_code(llvm::errc::io_error);
+  }
+
+  auto genBasicSample = [&](uint64_t Address) {
+    // When fed with non SPE branch events the target address will be null.
+    // This is expected and ignored.
+    if (Address == 0x0)
+      return EmptySample;
+
+    if (!BC->HasFixedLoadAddress)
+      adjustAddress(Address, MMapInfoIter->second);
+    return PerfBasicSample{Event.get(), Address};
+  };
+
+  // Show more meaningful event names on boltdata.
+  if (Event->str() == "instructions:")
+    Event = *AddrResTo != 0x0 ? "branch-spe:" : "instruction-spe:";
+
+  return std::make_pair(genBasicSample(*AddrResFrom),
+                        genBasicSample(*AddrResTo));
+}
+
 ErrorOr<DataAggregator::PerfMemSample> DataAggregator::parseMemSample() {
   PerfMemSample Res{0, 0};
 
@@ -1703,6 +1784,46 @@ std::error_code DataAggregator::parseBasicEvents() {
   return std::error_code();
 }
 
+std::error_code DataAggregator::parseSpeAsBasicEvents() {
+  outs() << "PERF2BOLT: parsing SPE data as basic events (no LBR)...\n";
+  NamedRegionTimer T("parseSPEBasic", "Parsing SPE as basic events",
+                     TimerGroupName, TimerGroupDesc, opts::TimeAggregator);
+  uint64_t NumSpeBranchSamples = 0;
+
+  // Convert entries to one or two basic samples, depending on whether there is
+  // branch target information.
+  while (hasData()) {
+    auto SamplePair = parseSpeAsBasicSamples();
+    if (std::error_code EC = SamplePair.getError())
+      return EC;
+
+    auto registerSample = [this](const PerfBasicSample *Sample) {
+      if (!Sample->PC)
+        return;
+
+      if (BinaryFunction *BF = getBinaryFunctionContainingAddress(Sample->PC))
+        BF->setHasProfileAvailable();
+
+      ++BasicSamples[Sample->PC];
+      EventNames.insert(Sample->EventName);
+    };
+
+    if (SamplePair->first.PC != 0x0 && SamplePair->second.PC != 0x0)
+      ++NumSpeBranchSamples;
+
+    registerSample(&SamplePair->first);
+    registerSample(&SamplePair->second);
+  }
+
+  if (NumSpeBranchSamples == 0)
+    errs() << "PERF2BOLT-WARNING: no SPE branches found\n";
+  else
+    outs() << "PERF2BOLT: found " << NumSpeBranchSamples
+           << " SPE branch sample pairs.\n";
+
+  return std::error_code();
+}
+
 void DataAggregator::processBasicEvents() {
   outs() << "PERF2BOLT: processing basic events (without LBR)...\n";
   NamedRegionTimer T("processBasic", "Processing basic events", TimerGroupName,
diff --git a/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test b/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
new file mode 100644
index 00000000000000..d7cea7ff769b8e
--- /dev/null
+++ b/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
@@ -0,0 +1,14 @@
+## Check that Arm SPE mode is available on AArch64 with BasicAggregation.
+
+REQUIRES: system-linux,perf,target=aarch64{{.*}}
+
+RUN: %clang %cflags %p/../../Inputs/asm_foo.s %p/../../Inputs/asm_main.c -o %t.exe
+RUN: touch %t.empty.perf.data
+RUN: perf2bolt -p %t.empty.perf.data -o %t.perf.boltdata --nl --spe --pa %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-SPE-NO-LBR
+
+CHECK-SPE-NO-LBR: PERF2BOLT: Starting data aggregation job
+
+RUN: perf record -e cycles -q -o %t.perf.data -- %t.exe
+RUN: not perf2bolt -p %t.perf.data -o %t.perf.boltdata --spe %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-SPE-LBR
+
+CHECK-SPE-LBR: PERF2BOLT-ERROR: Arm SPE mode is combined only with BasicAggregation.
diff --git a/bolt/test/perf2bolt/X86/perf2bolt-spe.test b/bolt/test/perf2bolt/X86/perf2bolt-spe.test
new file mode 100644
index 00000000000000..f31c17f411137d
--- /dev/null
+++ b/bolt/test/perf2bolt/X86/perf2bolt-spe.test
@@ -0,0 +1,9 @@
+## Check that Arm SPE mode is unavailable on X86.
+
+REQUIRES: system-linux,x86_64-linux
+
+RUN: %clang %cflags %p/../../Inputs/asm_foo.s %p/../../Inputs/asm_main.c -o %t.exe
+RUN: touch %t.empty.perf.data
+RUN: not perf2bolt -p %t.empty.perf.data -o %t.perf.boltdata --nl --spe --pa %t.exe 2>&1 | FileCheck %s
+
+CHECK: BOLT-ERROR: -spe is available only on AArch64.
diff --git a/bolt/tools/driver/llvm-bolt.cpp b/bolt/tools/driver/llvm-bolt.cpp
index efa06cd68cb997..60b813f6f858d7 100644
--- a/bolt/tools/driver/llvm-bolt.cpp
+++ b/bolt/tools/driver/llvm-bolt.cpp
@@ -51,6 +51,8 @@ static cl::opt<std::string> InputFilename(cl::Positional,
                                           cl::Required, cl::cat(BoltCategory),
                                           cl::sub(cl::SubCommand::getAll()));
 
+extern cl::opt<bool> ArmSPE;
+
 static cl::opt<std::string>
 InputDataFilename("data",
   cl::desc("<data file>"),
@@ -245,6 +247,13 @@ int main(int argc, char **argv) {
       if (Error E = RIOrErr.takeError())
         report_error(opts::InputFilename, std::move(E));
       RewriteInstance &RI = *RIOrErr.get();
+
+      if (opts::AggregateOnly && !RI.getBinaryContext().isAArch64() &&
+          opts::ArmSPE == 1) {
+        errs() << "BOLT-ERROR: -spe is available only on AArch64.\n";
+        exit(1);
+      }
+
       if (!opts::PerfData.empty()) {
         if (!opts::AggregateOnly) {
           errs() << ToolName
diff --git a/bolt/unittests/Profile/CMakeLists.txt b/bolt/unittests/Profile/CMakeLists.txt
index e0aa0926b49c03..ce01c6c4b949ee 100644
--- a/bolt/unittests/Profile/CMakeLists.txt
+++ b/bolt/unittests/Profile/CMakeLists.txt
@@ -1,11 +1,25 @@
+set(LLVM_LINK_COMPONENTS
+  DebugInfoDWARF
+  Object
+  ${LLVM_TARGETS_TO_BUILD}
+  )
+
 add_bolt_unittest(ProfileTests
   DataAggregator.cpp
+  PerfSpeEvents.cpp
 
   DISABLE_LLVM_LINK_LLVM_DYLIB
   )
 
 target_link_libraries(ProfileTests
   PRIVATE
+  LLVMBOLTCore
   LLVMBOLTProfile
+  LLVMTargetParser
+  LLVMTestingSupport
   )
 
+foreach (tgt ${BOLT_TARGETS_TO_BUILD})
+  string(TOUPPER "${tgt}" upper)
+  target_compile_definitions(ProfileTests PRIVATE "${upper}_AVAILABLE")
+endforeach()
diff --git a/bolt/unittests/Profile/PerfSpeEvents.cpp b/bolt/unittests/Profile/PerfSpeEvents.cpp
new file mode 100644
index 00000000000000..807a3bb1e07f40
--- /dev/null
+++ b/bolt/unittests/Profile/PerfSpeEvents.cpp
@@ -0,0 +1,173 @@
+//===- bolt/unittests/Profile/PerfSpeEvents.cpp ---------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifdef AARCH64_AVAILABLE
+
+#include "bolt/Core/BinaryContext.h"
+#include "bolt/Profile/DataAggregator.h"
+#include "llvm/BinaryFormat/ELF.h"
+#include "llvm/DebugInfo/DWARF/DWARFContext.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/TargetSelect.h"
+#include "gtest/gtest.h"
+
+using namespace llvm;
+using namespace llvm::bolt;
+using namespace llvm::object;
+using namespace llvm::ELF;
+
+namespace opts {
+extern cl::opt<std::string> ReadPerfEvents;
+} // namespace opts
+
+namespace llvm {
+namespace bolt {
+
+/// Perform checks on perf SPE branch events combined with other SPE or perf
+/// events.
+struct PerfSpeEventsTestHelper : public testing::Test {
+  void SetUp() override {
+    initializeLLVM();
+    prepareElf();
+    initializeBOLT();
+  }
+
+protected:
+  void initializeLLVM() {
+    llvm::InitializeAllTargetInfos();
+    llvm::InitializeAllTargetMCs();
+    llvm::InitializeAllAsmParsers();
+    llvm::InitializeAllDisassemblers();
+    llvm::InitializeAllTargets();
+    llvm::InitializeAllAsmPrinters();
+  }
+
+  void prepareElf() {
+    memcpy(ElfBuf, "\177ELF", 4);
+    ELF64LE::Ehdr *EHdr = reinterpret_cast<typename ELF64LE::Ehdr *>(ElfBuf);
+    EHdr->e_ident[llvm::ELF::EI_CLASS] = llvm::ELF::ELFCLASS64;
+    EHdr->e_ident[llvm::ELF::EI_DATA] = llvm::ELF::ELFDATA2LSB;
+    EHdr->e_machine = llvm::ELF::EM_AARCH64;
+    MemoryBufferRef Source(StringRef(ElfBuf, sizeof(ElfBuf)), "ELF");
+    ObjFile = cantFail(ObjectFile::createObjectFile(Source));
+  }
+
+  void initializeBOLT() {
+    Relocation::Arch = ObjFile->makeTriple().getArch();
+    BC = cantFail(BinaryContext::createBinaryContext(
+        ObjFile->makeTriple(), std::make_shared<orc::SymbolStringPool>(),
+        ObjFile->getFileName(), nullptr, /*IsPIC*/ false,
+        DWARFContext::create(*ObjFile.get()), {llvm::outs(), llvm::errs()}));
+    ASSERT_FALSE(!BC);
+  }
+
+  char ElfBuf[sizeof(typename ELF64LE::Ehdr)] = {};
+  std::unique_ptr<ObjectFile> ObjFile;
+  std::unique_ptr<BinaryContext> BC;
+
+  /// Return true when the expected \p SampleSize profile data are generated and
+  /// contain all the \p ExpectedEventNames.
+  bool checkEvents(uint64_t PID, size_t SampleSize,
+                   const StringSet<> &ExpectedEventNames) {
+    DataAggregator DA("<pseudo input>");
+    DA.ParsingBuf = opts::ReadPerfEvents;
+    DA.BC = BC.get();
+    DataAggregator::MMapInfo MMap;
+    DA.BinaryMMapInfo.insert(std::make_pair(PID, MMap));
+
+    DA.parseSpeAsBasicEvents();
+
+    for (auto &EE : ExpectedEventNames)
+      if (!DA.EventNames.contains(EE.first()))
+        return false;
+
+    return SampleSize == DA.BasicSamples.size();
+  }
+};
+
+} // namespace bolt
+} // namespace llvm
+
+// Check that DataAggregator can parseSpeAsBasicEvents for branch events when
+// combined with other event types.
+
+TEST_F(PerfSpeEventsTestHelper, SpeBranches) {
+  // Check perf input with SPE branch events.
+  // Example collection command:
+  // ```
+  // perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY
+  // ```
+
+  opts::ReadPerfEvents =
+      "1234          instructions:              a002    a001\n"
+      "1234          instructions:              b002    b001\n"
+      "1234          instructions:              c002    c001\n"
+      "1234          instructions:              d002    d001\n"
+      "1234          instructions:              e002    e001\n";
+
+  EXPECT_TRUE(checkEvents(1234, 10, {"branch-spe:"}));
+}
+
+TEST_F(PerfSpeEventsTestHelper, SpeBranchesAndCycles) {
+  // Check perf input with SPE branch events and cycles.
+  // Example collection command:
+  // ```
+  // perf record -e cycles:u -e 'arm_spe_0/branch_filter=1/u' -- BINARY
+  // ```
+
+  opts::ReadPerfEvents =
+      "1234          instructions:              a002    a001\n"
+      "1234              cycles:u:                 0    b001\n"
+      "1234              cycles:u:                 0    c001\n"
+      "1234          instructions:              d002    d001\n"
+      "1234          instructions:              e002    e001\n";
+
+  EXPECT_TRUE(checkEvents(1234, 8, {"branch-spe:", "cycles:u:"}));
+}
+
+TEST_F(PerfSpeEventsTestHelper, SpeAnyEventAndCycles) {
+  // Check perf input with any SPE event type and cycles.
+  // Example collection command:
+  // ```
+  // perf record -e cycles:u -e 'arm_spe_0//u' -- BINARY
+  // ```
+
+  opts::ReadPerfEvents =
+      "1234              cycles:u:                0     a001\n"
+      "1234              cycles:u:                0     b001\n"
+      "1234          instructions:                0     c001\n"
+      "1234          instructions:                0     d001\n"
+      "1234          instructions:              e002    e001\n";
+
+  EXPECT_TRUE(
+      checkEvents(1234, 6, {"cycles:u:", "instruction-spe:", "branch-spe:"}));
+}
+
+TEST_F(PerfSpeEventsTestHelper, SpeNoBranchPairsRecorded) {
+  // Check perf input that has no SPE branch pairs recorded.
+  // Example collection command:
+  // ```
+  // perf record -e cycles:u -e 'arm_spe_0/load_filter=1,branch_filter=0/u' --
+  // BINARY
+  // ```
+
+  testing::internal::CaptureStderr();
+  opts::ReadPerfEvents =
+      "1234          instructions:                 0    a001\n"
+      "1234              cycles:u:                 0    b001\n"
+      "1234          instructions:                 0    c001\n"
+      "1234              cycles:u:                 0    d001\n"
+      "1234          instructions:                 0    e001\n";
+
+  EXPECT_TRUE(checkEvents(1234, 5, {"instruction-spe:", "cycles:u:"}));
+
+  std::string Stderr = testing::internal::GetCapturedStderr();
+  EXPECT_EQ(Stderr, "PERF2BOLT-WARNING: no SPE branches found\n");
+}
+
+#endif


github-actions bot commented Dec 20, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@paschalis-mpeis
Member Author

This PR is an implementation of the (4a) approach of:

We did some limited, quick testing and there was no clear winner between the two approaches, but the --spe flag is introduced in a way to accommodate both.

I believe @kaadam had some work on (4b)? Maybe at some point we could additionally have that merged, so the community can test on a wider set of apps/workloads. I believe there won't be dramatic performance changes.

Please give SPE a try along with this patch and report any feedback. To check if SPE is available on your machine, see point (3) on the issue. Let us know if more information is needed on how to enable or use SPE!

Member

@yota9 yota9 left a comment


Thanks for your amazing job!

@paschalis-mpeis
Member Author

Hey @yota9,

Thanks a lot for your review!

I addressed your comments except this one (left a comment there).
Please have another look and let me know of any further changes.

@paschalis-mpeis paschalis-mpeis requested a review from yota9 January 15, 2025 15:29
@aaupov
Contributor

aaupov commented Jan 16, 2025

Hi Paschalis, thank you for working on this.
The benefit that SPE has over IP sampling is the edge frequency information. So instead of creating two basic (IP) samples we should create branch samples (LBR) with stack depth one. Branch samples are later attached to CFG edges. This should improve the resulting performance when using SPE profiling.
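
To make the distinction concrete, the two ways of registering an SPE pair can be sketched as follows. This is an illustration under stated assumptions, not BOLT code: `SpePair`, `AddressCounts`, `EdgeCounts`, and both `record...` functions are hypothetical names. The first variant keeps only per-address hit counts, while the second keeps the source and target together so the count can be attached to a CFG edge.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical representation of one SPE branch pair (source -> target).
struct SpePair {
  uint64_t Src;
  uint64_t Tgt;
};

using AddressCounts = std::map<uint64_t, uint64_t>;
using EdgeCounts = std::map<std::pair<uint64_t, uint64_t>, uint64_t>;

// Two basic (IP) samples: the link between Src and Tgt is lost.
void recordAsBasicSamples(const SpePair &P, AddressCounts &Counts) {
  ++Counts[P.Src];
  ++Counts[P.Tgt];
}

// One branch sample with stack depth one: the edge itself is counted.
void recordAsBranchSample(const SpePair &P, EdgeCounts &Counts) {
  ++Counts[std::make_pair(P.Src, P.Tgt)];
}
```

With two pairs that share a source but differ in target, the address map only shows that the source was hit twice, whereas the edge map keeps one count per distinct edge.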

@aaupov
Contributor

aaupov commented Jan 16, 2025

This PR is an implementation of the (4a) approach of:

We did some limited, quick testing and there was no clear winner between the two approaches, but the --spe flag is introduced in a way to accommodate both.

Missed this comment. Am I reading it right that you didn't see a perf difference between registering SPE as two basic events or one branch event? In this case, can you please try -infer-fall-throughs option with the latter?

@paschalis-mpeis
Member Author

paschalis-mpeis commented Jan 17, 2025

Hey Amir and Maks,

Thank you for taking a look at this!

Am I reading it right that you didn't see a perf difference between registering SPE as two basic events or one branch event? In this case, can you please try -infer-fall-throughs option with the latter?

Correct, in some preliminary internal tests we found both approaches to be close to each other.

Thanks for your suggestion to use -infer-fall-throughs. I thought LBR mode was inferring fall-through branches by default. But it looks like this has to be manually specified?

Let me share my understanding on the LBR format to see if I got this right:
Each LBR event gets a contiguous stack of taken branches. Any other branches that may lie in between them are known to be fall-throughs, which BOLT can infer. E.g., if we have:

  • TK1 -> TK2 -> TK3, then BOLT can propagate CFG hotness to:
  • TK1 -> FT1a -> FT1b .. -> TK2 -> FT2a -> FT2b .. -> TK3
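
A minimal sketch of that inference, assuming the stack is given oldest entry first: between two consecutive taken branches, everything from the first branch's target up to the second branch's source must have executed sequentially. The names `TakenBranch`, `FallThroughRange`, and `inferFallThroughs` are hypothetical and do not correspond to BOLT's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One taken branch in a branch stack (hypothetical type).
struct TakenBranch {
  uint64_t From; // Address of the branch instruction.
  uint64_t To;   // Address execution continued at.
};

// Address range known to have executed without a taken branch.
struct FallThroughRange {
  uint64_t Begin;
  uint64_t End;
};

// Assumption: Stack is ordered oldest first.
std::vector<FallThroughRange>
inferFallThroughs(const std::vector<TakenBranch> &Stack) {
  std::vector<FallThroughRange> Ranges;
  for (std::size_t I = 1; I < Stack.size(); ++I) {
    uint64_t Begin = Stack[I - 1].To;
    uint64_t End = Stack[I].From;
    // A range going backwards cannot be sequential execution; skip it.
    if (Begin <= End)
      Ranges.push_back({Begin, End});
  }
  return Ranges;
}
```

A stack of three taken branches yields two fall-through ranges. A stack holding a single entry, which is the most an isolated sample can provide, yields none.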

SPE, on the other hand, is a statistical sampling method, meaning the collected packets are not captured contiguously. Each pair comes from a packet that looks like:

.  00000040:         PC 0xAB0 el2 ns=1
.  00000049:         PAD
.  00000053:         B COND
.  00000055:         EV RETIRED NOT-TAKEN
.  0000005a:         LAT 7 ISSUE
.  0000005d:         LAT 8 TOT
.  00000060:         TGT 0xAB4 el2 ns=1
.  00000069:         PAD
.  00000077:         TS 1234

(note: you can inspect native SPE packets w/ perf script -D)

From this example we have 0xAB0 -> 0xAB4 (a src/tgt pair), where 0xAB0 is a branch that was NOT-TAKEN.
The tgt 0xAB4 is the target address of some block (i.e., not a branch). We have no information on whether the branch of that target block will be taken or not. Therefore, my understanding is that we cannot infer any branches in between src/tgt. And I believe that is why we found the two approaches to be close to each other.

Please do share your thoughts on this.

Do you think there are any other benefits when using the LBR format? It can additionally utilize prediction information (miss/hit), but we haven't found this to be that beneficial for the quite-limited SPE branch data (when compared to LBR traces).

@aaupov
Contributor

aaupov commented Jan 17, 2025

@paschalis-mpeis is there a way to configure SPE to only collect taken branches? My impression was that it's possible, e.g. based on this: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/introduction-to-statistical-profiling-support-in-streamline

Event packets, which provide important information about each sampled instruction.
This information includes:
...
Was a mis-predicted or not-taken branch

But I couldn't find any info regarding configuring perf filter to collect it.

@aaupov
Contributor

aaupov commented Jan 17, 2025

Let me share my understanding on the LBR format to see if I got this right:
Each LBR event gets a contiguous stack of taken branches. Any other branches that may lie in between them are known to be fall-throughs, which BOLT can infer. E.g., if we have:

Right, with taken branch stacks, we automatically "infer" fall-throughs between entries, and that becomes part of profile data that gets attached.

With SPE, if we're able to distinguish taken branches from not-taken ones, I think we can similarly make that part of the profile data, so we won't need infer-fall-throughs. If we can't distinguish them, but can filter by taken branches only, then we'd need to use infer-fall-throughs to assign fall-through counts after taken branch counters are attached to a CFG.

@paschalis-mpeis
Member Author

paschalis-mpeis commented Jan 20, 2025

is there a way to configure SPE to only collect taken branches?

What I believe you are asking here is to configure SPE to get us a pair of SRC -> TGT, where SRC and TGT are two taken branches captured in sequential order and are not necessarily directly linked in the CFG. In other words, make SPE act as an LBR-like buffer with a branch stack depth of 1.

I don't think that is possible. SPE does a periodic, non-contiguous capture of event packets, in our case branches. Please consider the example below:

.  000007c0:  PC 0xAFF el2 ns=1
.  000007c9:  PAD
.  000007d3:  B COND
.  000007d5:  EV RETIRED NOT-TAKEN MISPRED
.  000007da:  LAT 12 ISSUE
.  000007dd:  LAT 13 TOT
.  000007e0:  TGT 0xB03 el2 ns=1
.  000007e9:  PAD
.  000007f7:  TS 12345
  • PC:
    • is the instruction that was captured, in our case the source branch
    • we can know whether the branch at PC was taken or not, and if it was a prediction miss/hit
    • this aligns with the blogpost you've provided
  • TGT:
    • is the target address of the landing block, ie where execution will continue next
    • SPE has no information whether the next branch will be taken or not
    • we cannot configure SPE to capture two taken branches in sequential program order, which are not necessarily directly linked in the CFG
  • PBT: optional HW feature pointing to the previous block (FEAT_SPE_PBT)
    • it is a statistical profiling of the Previous Branch Target
    • To my understanding, there is no known HW implementation of this optional feature.
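
As a rough illustration of what the PC side does expose, the `EV` line of a dumped packet can be turned into two flags. This sketch assumes the textual form shown in the example above, and assumes that a sampled branch without a `NOT-TAKEN` token was taken. `BranchOutcome` and `decodeEventLine` are hypothetical names, not part of perf or BOLT.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Outcome of the sampled branch at PC (hypothetical type).
struct BranchOutcome {
  bool Taken;
  bool Mispredicted;
};

// Scan the whitespace-separated tokens of an EV line from `perf script -D`.
// Assumption: absence of NOT-TAKEN means the branch was taken.
BranchOutcome decodeEventLine(const std::string &EvLine) {
  std::istringstream In(EvLine);
  BranchOutcome Out{true, false};
  std::string Tok;
  while (In >> Tok) {
    if (Tok == "NOT-TAKEN")
      Out.Taken = false;
    else if (Tok == "MISPRED")
      Out.Mispredicted = true;
  }
  return Out;
}
```

Nothing in the record says whether the next branch, the one that ends the block starting at TGT, was taken, which is the limitation the list above describes.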

I could re-word some points in the PR/patch to make the above more clear.

(@mikewilliams-arm feel free to correct me if I missed anything)

@paschalis-mpeis
Member Author

paschalis-mpeis commented Jan 20, 2025

With SPE, if we're able to distinguish taken branches from not-taken ones, I think we can similarly make that part of the profile data, so we won't need infer-fall-throughs. If we can't distinguish them, but can filter by taken branches only, then we'd need to use infer-fall-throughs to assign fall-through counts after taken branch counters are attached to a CFG.

Currently, there is no such information, but we could expose it with more follow-up patches on perf/linux. Please note that if we filter out any non-taken branches, then we'll exclude information we cannot later infer.

Given the SPE limitations I've explained in previous comments, will this taken/not-taken additional information (or the infer flag) help propagating additional CFG hotness data?
Let's assume we captured the below entries:

  • Branch1 (taken) -> Block1
  • Branch2 (fallthrough) -> Block2

Regardless of whether the source branch was FT or Taken, we still don't know what will happen in Block1 and Block2 in terms of branching. I think BOLT will not be able to propagate information past these blocks, unless those are part of some extended basic block (EBB). In that case, it depends on how BOLT deals with EBBs and whether it lacks any support for them in BasicAggregation (thus giving an advantage to the LBR-format)?

@mikewilliams-arm

is there a way to configure SPE to only collect taken branches?

For the avoidance of doubt, and benefit of anyone finding this and reading it out of context, you can configure SPE to collect only taken branches, but only from FEAT_SPEv1p2. That's a relatively new feature in the field. From looking at the kernel sources, you need to check for /sys/devices/arm_spe_0/format/inv_event_filter and the syntax would be something like perf record -e arm_spe_0/branch_filter=1,inv_event_filter=64/. (I might be wrong - I don't have access to such a system.)

You can always do this filtering post-hoc in software. perf record -e arm_spe_0/branch_filter=1/ should work on all SPE implementations, and according to Google's AI, 60% of branches are taken, so it's about a 66% overhead to store all the not taken branches and filter them out. perf script --inject=b used to do a poor job of preserving all the branch information through the injected events, making this harder to do. I believe that is something being looked into, if it's not already addressed.
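
The overhead figure can be checked with simple arithmetic. This is only a sketch that assumes the quoted taken fraction of $p = 0.6$ (not a measured value): for every taken branch kept, $(1 - p)/p$ not-taken branches are stored and later discarded.

$$\text{overhead} = \frac{1 - p}{p} = \frac{0.4}{0.6} \approx 0.67$$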

However, even so, each sampled branch is exactly that - a single sampled branch. It does not collect sequences of branches other than through the aforementioned optional PBT extension. So, you can only infer that where you came from and where you branched to were executed.

@paschalis-mpeis
Member Author

paschalis-mpeis commented Jan 21, 2025

Great, thanks a lot Michael for filling in with details!

Indeed the differences are subtle. I've answered a slightly different question, which I've now refined as it wasn't fully correct:

What I believe you are asking here is to configure SPE to get us a pair of SRC -> TGT, where SRC and TGT are two taken branches captured in sequential order and are not necessarily directly linked in the CFG.

Whether we filter out the non-taken branches at the HW collection level (i.e., with the inv_event_filter interface) or in post-processing SW, the information loss holds:

Please note that if we filter out any non-taken branches, then we'll exclude information we cannot later infer.

And this is because we'll end up with all the taken branch pairs that have direct links in the CFG.
In other words, we cannot get two branches that are indirect ancestors in the CFG, which would have left opportunities for inferring FTs.

++NumSpeBranchSamples;

registerSample(&SamplePair->first);
registerSample(&SamplePair->second);
Contributor


Am I correct in understanding that this is the case where we have a sample for a branch SRC -> TGT which may or may not have been taken? And that we increase the hotness of both the SRC and TGT nodes in either case, always registering samples for both nodes without taking into account the ratio of samples where this branch was taken versus not taken?

Member Author


Hey Pavel,

Reading this back, you are concerned that storing samples on the TGT of branches that are NOT-TAKEN might increase hotness in a block where it shouldn't. Correct?

That should not be a concern: regardless of whether a branch is taken or not, the reported TGT is what was architecturally executed. In other words, NOT-TAKEN (or its absence) characterizes what happened in the source branch (PC), while TGT will always point to the path we end up taking.

So, for fall-through SPE packets, the TGT address is always the next address after PC (i.e., 0xA00 + 4, where 4 is the instruction size in AArch64):

PC 0xA00
B COND
EV RETIRED NOT-TAKEN
TGT 0xA04

For taken branches, the TGT can be at a distance further than just 4:

PC 0xA00
B COND
EV RETIRED
TGT 0xBBB

In my previous examples I was using mock addresses for PC/TGT, so I've updated any relevant examples to avoid confusion.
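
The two packets above can be told apart from the addresses alone. The following is a minimal sketch, not code from this patch: `SpeBranch` and `isFallThrough` are hypothetical names, and the check is only a heuristic, since a taken branch whose target happens to be the next instruction would look identical.

```cpp
#include <cstdint>

// Hypothetical record holding the two addresses of one SPE branch sample.
struct SpeBranch {
  uint64_t PC;  // address of the sampled (source) branch
  uint64_t TGT; // address where execution continued
};

// AArch64 instructions have a fixed size of 4 bytes.
constexpr uint64_t AArch64InstrSize = 4;

// Heuristic: if the target is the very next instruction, treat the sample
// as a fall-through (e.g. PC 0xA00 -> TGT 0xA04).
constexpr bool isFallThrough(const SpeBranch &B) {
  return B.TGT == B.PC + AArch64InstrSize;
}
```

In both cases the TGT block was executed, so registering a sample for it is valid either way.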

Contributor


Right, thank you @paschalis-mpeis for clarifying the taken/not-taken information and updating the examples. @aaupov @maksfb would you like any additional explanations regarding SPE packets? Generally speaking, SPE provides event-based sampling for branches and doesn't have enough information to create a trace of N>1 branches and infer fall-throughs. We are aiming to add BRBE (Branch Record Buffer Extension) support for this in BOLT and provide an LBR-like branch stack trace with it.

Contributor

@kaadam kaadam Feb 17, 2025


Hi, thanks Paschalis for your example.
Maybe it's worth highlighting that the not-taken event only relates to conditional instructions (conditional branch or compare-and-branch): it tells us that the instruction failed its condition code check, that's it. Since TGT (as you mentioned) "will always point to the path we end up taking", the presence of the not-taken event type is not relevant to us in this case; we will always get the 'taken paths'. In theory this branch information supports our optimization, and BOLT will be able to rely on it.

Member Author


Correct, thanks Adam. This is irrelevant to any unconditional branching (including call/ret).
Skipping 'non-taken' conditional branches is the optimization LBR/BRBE can do, as that can be inferred in post-processing.
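
To illustrate why branch-stack formats can skip not-taken branches, here is a rough sketch of fall-through inference over a trace of consecutive taken branches. It is not BOLT's implementation; `TakenBranch` and `inferFallThroughs` are hypothetical names, and the trace is assumed to be ordered oldest first. Everything between the target of one taken branch and the source of the next must have executed sequentially, so any conditional branch in that range was not taken.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical entry of a branch-stack trace: one taken branch.
struct TakenBranch {
  uint64_t From; // address of the branch instruction
  uint64_t To;   // address it jumped to
};

// Returns the [start, end] address ranges that were executed sequentially
// between consecutive taken branches. Assumes Trace is in chronological order.
std::vector<std::pair<uint64_t, uint64_t>>
inferFallThroughs(const std::vector<TakenBranch> &Trace) {
  std::vector<std::pair<uint64_t, uint64_t>> Ranges;
  for (std::size_t I = 1; I < Trace.size(); ++I) {
    uint64_t Start = Trace[I - 1].To; // where the previous branch landed
    uint64_t End = Trace[I].From;     // where the next taken branch sits
    if (Start <= End)                 // skip inconsistent pairs
      Ranges.emplace_back(Start, End);
  }
  return Ranges;
}
```

A single SPE sample has no predecessor or successor entry, which is why the same inference is not available in this mode.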

@paschalis-mpeis
Member Author

paschalis-mpeis commented Feb 12, 2025

Just adding that we are in the process of upstreaming 'brstack' support for SPE, which would handle the PBT feature nicely for us, and would sit nicely within bolt sources. The latest revision of these patches could also be found here:

Once upstreamed, we can adapt the patch to work for the LBR-format (cc: @kaadam).

paschalis-mpeis and others added 2 commits February 27, 2025 11:55
BOLT gains the ability to process branch target information generated by
Arm SPE data, using the `BasicAggregation` format.

Example usage is:
```bash
perf2bolt -p perf.data -o perf.boltdata --nl --spe BINARY
```

New branch data and compatibility:
---
SPE branch entries in perf data contain a branch pair (`IP` -> `ADDR`)
for the source and destination branches. DataAggregator processes those
by creating two basic samples. Any other event types will have `ADDR`
field set to `0x0`. For those a single sample will be created. Such
events can be either SPE or non-SPE, like `l1d-access` and `cycles`
respectively.

The format of the input perf entries is:
```
PID   EVENT-TYPE   ADDR   IP
```

When on SPE mode and:
- host is not `AArch64`, BOLT will exit with a relevant message
- `ADDR` field is unavailable, BOLT will exit with a relevant message
- no branch pairs were recorded, BOLT will present a warning
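
The sample-creation rule described above can be sketched as follows. This is an illustration only, not the code in `DataAggregator`: `PerfEntry`, `parseEntry` and `samplesFor` are hypothetical names, and the input is assumed to be whitespace-separated with hexadecimal addresses.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical view of one input line: PID EVENT-TYPE ADDR IP.
struct PerfEntry {
  int PID;
  std::string Event;
  uint64_t Addr; // branch target, or 0x0 for non-branch events
  uint64_t IP;   // sampled instruction (source branch for SPE branches)
};

// Parses one line; returns false if any field is missing or malformed.
bool parseEntry(const std::string &Line, PerfEntry &E) {
  std::istringstream In(Line);
  return static_cast<bool>(In >> std::dec >> E.PID >> E.Event >> std::hex >>
                           E.Addr >> E.IP);
}

// A branch pair yields two basic samples (source and target);
// any other event yields a single sample at IP.
std::vector<uint64_t> samplesFor(const PerfEntry &E) {
  if (E.Addr == 0)
    return {E.IP};
  return {E.IP, E.Addr};
}
```

For example, a line such as `1234 branch a04 a00` would produce samples at `0xa00` and `0xa04`, while `1234 cycles 0 b00` would produce one sample at `0xb00`.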

Examples of generating profiling data for the SPE mode:
---
Profiles can be captured with perf on AArch64 machines with SPE enabled.
They can be combined with other events, SPE or not.

Capture only SPE branch data events:
```bash
perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY
```

Capture any SPE events:
```bash
perf record -e 'arm_spe_0//u' -- BINARY
```

Capture any SPE events and cycles:
```bash
perf record -e 'arm_spe_0//u' -e cycles:u -- BINARY
```

Use more filters, jitter, and an explicit sample count to control overhead/quality:
```bash
perf record -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/u' -c 10007 -- BINARY
```
@paschalis-mpeis paschalis-mpeis force-pushed the users/paschalis-mpeis/bolt-spe-mode branch from 10f7219 to 47a986d Compare February 27, 2025 13:05
@paschalis-mpeis
Member Author

Force-pushed to rebase onto the latest main and resolve conflicts (PCs/IPs were removed from the LBR samples).

Will be proceeding soon with an LBR patch, which for now will be stacked on top of this PR (cc: @kaadam).


8 participants