[BOLT][AArch64] Introduce SPE mode in BasicAggregation #120741
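@llvm/pr-subscribers-bolt

Author: Paschalis Mpeis (paschalis-mpeis)

Changes

BOLT gains the ability to process branch target information generated by Arm SPE data, using the `BasicAggregation` format.

Example usage is:
perf2bolt -p perf.data -o perf.boltdata --nl --spe BINARY

New branch data and compatibility:

SPE branch entries in perf data contain a branch pair (`IP` -> `ADDR`) for the source and destination branches. DataAggregator processes those by creating two basic samples. Any other event types will have the `ADDR` field set to `0x0`. For those a single sample will be created. Such events can be either SPE or non-SPE, like `l1d-access` and `cycles` respectively.

The format of the input perf entries is:
PID EVENT-TYPE ADDR IP

When on SPE mode and:
- host is not `AArch64`, BOLT will exit with a relevant message
- `ADDR` field is unavailable, BOLT will exit with a relevant message
- no branch pairs were recorded, BOLT will present a warning

Examples of generating profiling data for the SPE mode:

Profiles can be captured with perf on AArch64 machines with SPE enabled.

Capture only SPE branch data events:
perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY

Capture any SPE events:
perf record -e 'arm_spe_0//u' -- BINARY

Capture any SPE events and cycles:
perf record -e 'arm_spe_0//u' -e cycles:u -- BINARY

Use more filters, jitter, and a sample count to control overheads/quality:
perf record -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/u' -c 10007 -- BINARY

Full diff: https://github.com/llvm/llvm-project/pull/120741.diff

7 Files Affected: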
diff --git a/bolt/include/bolt/Profile/DataAggregator.h b/bolt/include/bolt/Profile/DataAggregator.h
index 320623cfa15af1..be6e0fbd6347a0 100644
--- a/bolt/include/bolt/Profile/DataAggregator.h
+++ b/bolt/include/bolt/Profile/DataAggregator.h
@@ -78,6 +78,8 @@ class DataAggregator : public DataReader {
static bool checkPerfDataMagic(StringRef FileName);
private:
+ friend struct PerfSpeEventsTestHelper;
+
struct PerfBranchSample {
SmallVector<LBREntry, 32> LBR;
uint64_t PC;
@@ -294,6 +296,15 @@ class DataAggregator : public DataReader {
/// and a PC
ErrorOr<PerfBasicSample> parseBasicSample();
+ /// Parse an Arm SPE entry into the non-lbr format by generating two basic
+ /// samples. The format of an input SPE entry is:
+ /// ```
+ /// PID EVENT-TYPE ADDR IP
+ /// ```
+ /// SPE branch events will have 'ADDR' set to a branch target address while
+ /// other perf or SPE events will have it set to zero.
+ ErrorOr<std::pair<PerfBasicSample,PerfBasicSample>> parseSpeAsBasicSamples();
+
/// Parse a single perf sample containing a PID associated with an IP and
/// address.
ErrorOr<PerfMemSample> parseMemSample();
@@ -343,6 +354,9 @@ class DataAggregator : public DataReader {
/// Process non-LBR events.
void processBasicEvents();
+ /// Parse Arm SPE events into the non-LBR format.
+ std::error_code parseSpeAsBasicEvents();
+
/// Parse the full output generated by perf script to report memory events.
std::error_code parseMemEvents();
diff --git a/bolt/lib/Profile/DataAggregator.cpp b/bolt/lib/Profile/DataAggregator.cpp
index 2b02086e3e0c99..7038ca5b1452ab 100644
--- a/bolt/lib/Profile/DataAggregator.cpp
+++ b/bolt/lib/Profile/DataAggregator.cpp
@@ -49,6 +49,13 @@ static cl::opt<bool>
cl::desc("aggregate basic samples (without LBR info)"),
cl::cat(AggregatorCategory));
+cl::opt<bool> ArmSPE(
+ "spe",
+ cl::desc(
+ "Enable Arm SPE mode. Used in conjunction with no-lbr mode, i.e. `--spe "
+ "--nl`"),
+ cl::cat(AggregatorCategory));
+
static cl::opt<std::string>
ITraceAggregation("itrace",
cl::desc("Generate LBR info with perf itrace argument"),
@@ -180,11 +187,19 @@ void DataAggregator::start() {
findPerfExecutable();
- if (opts::BasicAggregation) {
- launchPerfProcess("events without LBR",
- MainEventsPPI,
+ if (opts::ArmSPE) {
+ if (!opts::BasicAggregation) {
+ errs() << "PERF2BOLT-ERROR: Arm SPE mode is combined only with "
+ "BasicAggregation.\n";
+ exit(1);
+ }
+ launchPerfProcess("branch events with SPE", MainEventsPPI,
+ "script -F pid,event,ip,addr --itrace=i1i",
+ /*Wait = */ false);
+ } else if (opts::BasicAggregation) {
+ launchPerfProcess("events without LBR", MainEventsPPI,
"script -F pid,event,ip",
- /*Wait = */false);
+ /*Wait = */ false);
} else if (!opts::ITraceAggregation.empty()) {
std::string ItracePerfScriptArgs = llvm::formatv(
"script -F pid,ip,brstack --itrace={0}", opts::ITraceAggregation);
@@ -192,10 +207,9 @@ void DataAggregator::start() {
ItracePerfScriptArgs.c_str(),
/*Wait = */ false);
} else {
- launchPerfProcess("branch events",
- MainEventsPPI,
+ launchPerfProcess("branch events", MainEventsPPI,
"script -F pid,ip,brstack",
- /*Wait = */false);
+ /*Wait = */ false);
}
// Note: we launch script for mem events regardless of the option, as the
@@ -531,14 +545,20 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
"not read one from input binary\n";
}
- auto ErrorCallback = [](int ReturnCode, StringRef ErrBuf) {
+ const Regex NoData("Samples for '.*' event do not have ADDR attribute set. "
+ "Cannot print 'addr' field.");
+
+ auto ErrorCallback = [&NoData](int ReturnCode, StringRef ErrBuf) {
+ if (opts::ArmSPE && NoData.match(ErrBuf)) {
+ errs() << "PERF2BOLT-ERROR: perf data are incompatible for Arm SPE mode "
+ "consumption. ADDR attribute is unset.\n";
+ exit(1);
+ }
errs() << "PERF-ERROR: return code " << ReturnCode << "\n" << ErrBuf;
exit(1);
};
auto MemEventsErrorCallback = [&](int ReturnCode, StringRef ErrBuf) {
- Regex NoData("Samples for '.*' event do not have ADDR attribute set. "
- "Cannot print 'addr' field.");
if (!NoData.match(ErrBuf))
ErrorCallback(ReturnCode, ErrBuf);
};
@@ -579,7 +599,8 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
exit(0);
}
- if ((!opts::BasicAggregation && parseBranchEvents()) ||
+ if (((!opts::BasicAggregation && !opts::ArmSPE) && parseBranchEvents()) ||
+ (opts::BasicAggregation && opts::ArmSPE && parseSpeAsBasicEvents()) ||
(opts::BasicAggregation && parseBasicEvents()))
errs() << "PERF2BOLT: failed to parse samples\n";
@@ -1226,6 +1247,66 @@ ErrorOr<DataAggregator::PerfBasicSample> DataAggregator::parseBasicSample() {
return PerfBasicSample{Event.get(), Address};
}
+ErrorOr<
+ std::pair<DataAggregator::PerfBasicSample, DataAggregator::PerfBasicSample>>
+DataAggregator::parseSpeAsBasicSamples() {
+ while (checkAndConsumeFS()) {
+ }
+
+ ErrorOr<int64_t> PIDRes = parseNumberField(FieldSeparator, true);
+ if (std::error_code EC = PIDRes.getError())
+ return EC;
+
+ constexpr PerfBasicSample EmptySample = PerfBasicSample{StringRef(), 0};
+ auto MMapInfoIter = BinaryMMapInfo.find(*PIDRes);
+ if (MMapInfoIter == BinaryMMapInfo.end()) {
+ consumeRestOfLine();
+ return std::make_pair(EmptySample, EmptySample);
+ }
+
+ while (checkAndConsumeFS()) {
+ }
+
+ ErrorOr<StringRef> Event = parseString(FieldSeparator);
+ if (std::error_code EC = Event.getError())
+ return EC;
+
+ while (checkAndConsumeFS()) {
+ }
+
+ ErrorOr<uint64_t> AddrResTo = parseHexField(FieldSeparator);
+ if (std::error_code EC = AddrResTo.getError())
+ return EC;
+ consumeAllRemainingFS();
+
+ ErrorOr<uint64_t> AddrResFrom = parseHexField(FieldSeparator, true);
+ if (std::error_code EC = AddrResFrom.getError())
+ return EC;
+
+ if (!checkAndConsumeNewLine()) {
+ reportError("expected end of line");
+ return make_error_code(llvm::errc::io_error);
+ }
+
+ auto genBasicSample = [&](uint64_t Address) {
+ // When fed with non SPE branch events the target address will be null.
+ // This is expected and ignored.
+ if (Address == 0x0)
+ return EmptySample;
+
+ if (!BC->HasFixedLoadAddress)
+ adjustAddress(Address, MMapInfoIter->second);
+ return PerfBasicSample{Event.get(), Address};
+ };
+
+ // Show more meaningful event names on boltdata.
+ if (Event->str() == "instructions:")
+ Event = *AddrResTo != 0x0 ? "branch-spe:" : "instruction-spe:";
+
+ return std::make_pair(genBasicSample(*AddrResFrom),
+ genBasicSample(*AddrResTo));
+}
+
ErrorOr<DataAggregator::PerfMemSample> DataAggregator::parseMemSample() {
PerfMemSample Res{0, 0};
@@ -1703,6 +1784,46 @@ std::error_code DataAggregator::parseBasicEvents() {
return std::error_code();
}
+std::error_code DataAggregator::parseSpeAsBasicEvents() {
+ outs() << "PERF2BOLT: parsing SPE data as basic events (no LBR)...\n";
+ NamedRegionTimer T("parseSPEBasic", "Parsing SPE as basic events",
+ TimerGroupName, TimerGroupDesc, opts::TimeAggregator);
+ uint64_t NumSpeBranchSamples = 0;
+
+ // Convert entries to one or two basic samples, depending on whether there is
+ // branch target information.
+ while (hasData()) {
+ auto SamplePair = parseSpeAsBasicSamples();
+ if (std::error_code EC = SamplePair.getError())
+ return EC;
+
+ auto registerSample = [this](const PerfBasicSample *Sample) {
+ if (!Sample->PC)
+ return;
+
+ if (BinaryFunction *BF = getBinaryFunctionContainingAddress(Sample->PC))
+ BF->setHasProfileAvailable();
+
+ ++BasicSamples[Sample->PC];
+ EventNames.insert(Sample->EventName);
+ };
+
+ if (SamplePair->first.PC != 0x0 && SamplePair->second.PC != 0x0)
+ ++NumSpeBranchSamples;
+
+ registerSample(&SamplePair->first);
+ registerSample(&SamplePair->second);
+ }
+
+ if (NumSpeBranchSamples == 0)
+ errs() << "PERF2BOLT-WARNING: no SPE branches found\n";
+ else
+ outs() << "PERF2BOLT: found " << NumSpeBranchSamples
+ << " SPE branch sample pairs.\n";
+
+ return std::error_code();
+}
+
void DataAggregator::processBasicEvents() {
outs() << "PERF2BOLT: processing basic events (without LBR)...\n";
NamedRegionTimer T("processBasic", "Processing basic events", TimerGroupName,
diff --git a/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test b/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
new file mode 100644
index 00000000000000..d7cea7ff769b8e
--- /dev/null
+++ b/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
@@ -0,0 +1,14 @@
+## Check that Arm SPE mode is available on AArch64 with BasicAggregation.
+
+REQUIRES: system-linux,perf,target=aarch64{{.*}}
+
+RUN: %clang %cflags %p/../../Inputs/asm_foo.s %p/../../Inputs/asm_main.c -o %t.exe
+RUN: touch %t.empty.perf.data
+RUN: perf2bolt -p %t.empty.perf.data -o %t.perf.boltdata --nl --spe --pa %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-SPE-NO-LBR
+
+CHECK-SPE-NO-LBR: PERF2BOLT: Starting data aggregation job
+
+RUN: perf record -e cycles -q -o %t.perf.data -- %t.exe
+RUN: not perf2bolt -p %t.perf.data -o %t.perf.boltdata --spe %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-SPE-LBR
+
+CHECK-SPE-LBR: PERF2BOLT-ERROR: Arm SPE mode is combined only with BasicAggregation.
diff --git a/bolt/test/perf2bolt/X86/perf2bolt-spe.test b/bolt/test/perf2bolt/X86/perf2bolt-spe.test
new file mode 100644
index 00000000000000..f31c17f411137d
--- /dev/null
+++ b/bolt/test/perf2bolt/X86/perf2bolt-spe.test
@@ -0,0 +1,9 @@
+## Check that Arm SPE mode is unavailable on X86.
+
+REQUIRES: system-linux,x86_64-linux
+
+RUN: %clang %cflags %p/../../Inputs/asm_foo.s %p/../../Inputs/asm_main.c -o %t.exe
+RUN: touch %t.empty.perf.data
+RUN: not perf2bolt -p %t.empty.perf.data -o %t.perf.boltdata --nl --spe --pa %t.exe 2>&1 | FileCheck %s
+
+CHECK: BOLT-ERROR: -spe is available only on AArch64.
diff --git a/bolt/tools/driver/llvm-bolt.cpp b/bolt/tools/driver/llvm-bolt.cpp
index efa06cd68cb997..60b813f6f858d7 100644
--- a/bolt/tools/driver/llvm-bolt.cpp
+++ b/bolt/tools/driver/llvm-bolt.cpp
@@ -51,6 +51,8 @@ static cl::opt<std::string> InputFilename(cl::Positional,
cl::Required, cl::cat(BoltCategory),
cl::sub(cl::SubCommand::getAll()));
+extern cl::opt<bool> ArmSPE;
+
static cl::opt<std::string>
InputDataFilename("data",
cl::desc("<data file>"),
@@ -245,6 +247,13 @@ int main(int argc, char **argv) {
if (Error E = RIOrErr.takeError())
report_error(opts::InputFilename, std::move(E));
RewriteInstance &RI = *RIOrErr.get();
+
+ if (opts::AggregateOnly && !RI.getBinaryContext().isAArch64() &&
+ opts::ArmSPE == 1) {
+ errs() << "BOLT-ERROR: -spe is available only on AArch64.\n";
+ exit(1);
+ }
+
if (!opts::PerfData.empty()) {
if (!opts::AggregateOnly) {
errs() << ToolName
diff --git a/bolt/unittests/Profile/CMakeLists.txt b/bolt/unittests/Profile/CMakeLists.txt
index e0aa0926b49c03..ce01c6c4b949ee 100644
--- a/bolt/unittests/Profile/CMakeLists.txt
+++ b/bolt/unittests/Profile/CMakeLists.txt
@@ -1,11 +1,25 @@
+set(LLVM_LINK_COMPONENTS
+ DebugInfoDWARF
+ Object
+ ${LLVM_TARGETS_TO_BUILD}
+ )
+
add_bolt_unittest(ProfileTests
DataAggregator.cpp
+ PerfSpeEvents.cpp
DISABLE_LLVM_LINK_LLVM_DYLIB
)
target_link_libraries(ProfileTests
PRIVATE
+ LLVMBOLTCore
LLVMBOLTProfile
+ LLVMTargetParser
+ LLVMTestingSupport
)
+foreach (tgt ${BOLT_TARGETS_TO_BUILD})
+ string(TOUPPER "${tgt}" upper)
+ target_compile_definitions(ProfileTests PRIVATE "${upper}_AVAILABLE")
+endforeach()
diff --git a/bolt/unittests/Profile/PerfSpeEvents.cpp b/bolt/unittests/Profile/PerfSpeEvents.cpp
new file mode 100644
index 00000000000000..807a3bb1e07f40
--- /dev/null
+++ b/bolt/unittests/Profile/PerfSpeEvents.cpp
@@ -0,0 +1,173 @@
+//===- bolt/unittests/Profile/PerfSpeEvents.cpp ---------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifdef AARCH64_AVAILABLE
+
+#include "bolt/Core/BinaryContext.h"
+#include "bolt/Profile/DataAggregator.h"
+#include "llvm/BinaryFormat/ELF.h"
+#include "llvm/DebugInfo/DWARF/DWARFContext.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/TargetSelect.h"
+#include "gtest/gtest.h"
+
+using namespace llvm;
+using namespace llvm::bolt;
+using namespace llvm::object;
+using namespace llvm::ELF;
+
+namespace opts {
+extern cl::opt<std::string> ReadPerfEvents;
+} // namespace opts
+
+namespace llvm {
+namespace bolt {
+
+/// Perform checks on perf SPE branch events combined with other SPE or perf
+/// events.
+struct PerfSpeEventsTestHelper : public testing::Test {
+ void SetUp() override {
+ initializeLLVM();
+ prepareElf();
+ initializeBOLT();
+ }
+
+protected:
+ void initializeLLVM() {
+ llvm::InitializeAllTargetInfos();
+ llvm::InitializeAllTargetMCs();
+ llvm::InitializeAllAsmParsers();
+ llvm::InitializeAllDisassemblers();
+ llvm::InitializeAllTargets();
+ llvm::InitializeAllAsmPrinters();
+ }
+
+ void prepareElf() {
+ memcpy(ElfBuf, "\177ELF", 4);
+ ELF64LE::Ehdr *EHdr = reinterpret_cast<typename ELF64LE::Ehdr *>(ElfBuf);
+ EHdr->e_ident[llvm::ELF::EI_CLASS] = llvm::ELF::ELFCLASS64;
+ EHdr->e_ident[llvm::ELF::EI_DATA] = llvm::ELF::ELFDATA2LSB;
+ EHdr->e_machine = llvm::ELF::EM_AARCH64;
+ MemoryBufferRef Source(StringRef(ElfBuf, sizeof(ElfBuf)), "ELF");
+ ObjFile = cantFail(ObjectFile::createObjectFile(Source));
+ }
+
+ void initializeBOLT() {
+ Relocation::Arch = ObjFile->makeTriple().getArch();
+ BC = cantFail(BinaryContext::createBinaryContext(
+ ObjFile->makeTriple(), std::make_shared<orc::SymbolStringPool>(),
+ ObjFile->getFileName(), nullptr, /*IsPIC*/ false,
+ DWARFContext::create(*ObjFile.get()), {llvm::outs(), llvm::errs()}));
+ ASSERT_FALSE(!BC);
+ }
+
+ char ElfBuf[sizeof(typename ELF64LE::Ehdr)] = {};
+ std::unique_ptr<ObjectFile> ObjFile;
+ std::unique_ptr<BinaryContext> BC;
+
+ /// Return true when the expected \p SampleSize profile data are generated and
+ /// contain all the \p ExpectedEventNames.
+ bool checkEvents(uint64_t PID, size_t SampleSize,
+ const StringSet<> &ExpectedEventNames) {
+ DataAggregator DA("<pseudo input>");
+ DA.ParsingBuf = opts::ReadPerfEvents;
+ DA.BC = BC.get();
+ DataAggregator::MMapInfo MMap;
+ DA.BinaryMMapInfo.insert(std::make_pair(PID, MMap));
+
+ DA.parseSpeAsBasicEvents();
+
+ for (auto &EE : ExpectedEventNames)
+ if (!DA.EventNames.contains(EE.first()))
+ return false;
+
+ return SampleSize == DA.BasicSamples.size();
+ }
+};
+
+} // namespace bolt
+} // namespace llvm
+
+// Check that DataAggregator can parseSpeAsBasicEvents for branch events when
+// combined with other event types.
+
+TEST_F(PerfSpeEventsTestHelper, SpeBranches) {
+ // Check perf input with SPE branch events.
+ // Example collection command:
+ // ```
+ // perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY
+ // ```
+
+ opts::ReadPerfEvents =
+ "1234 instructions: a002 a001\n"
+ "1234 instructions: b002 b001\n"
+ "1234 instructions: c002 c001\n"
+ "1234 instructions: d002 d001\n"
+ "1234 instructions: e002 e001\n";
+
+ EXPECT_TRUE(checkEvents(1234, 10, {"branch-spe:"}));
+}
+
+TEST_F(PerfSpeEventsTestHelper, SpeBranchesAndCycles) {
+ // Check perf input with SPE branch events and cycles.
+ // Example collection command:
+ // ```
+ // perf record -e cycles:u -e 'arm_spe_0/branch_filter=1/u' -- BINARY
+ // ```
+
+ opts::ReadPerfEvents =
+ "1234 instructions: a002 a001\n"
+ "1234 cycles:u: 0 b001\n"
+ "1234 cycles:u: 0 c001\n"
+ "1234 instructions: d002 d001\n"
+ "1234 instructions: e002 e001\n";
+
+ EXPECT_TRUE(checkEvents(1234, 8, {"branch-spe:", "cycles:u:"}));
+}
+
+TEST_F(PerfSpeEventsTestHelper, SpeAnyEventAndCycles) {
+ // Check perf input with any SPE event type and cycles.
+ // Example collection command:
+ // ```
+ // perf record -e cycles:u -e 'arm_spe_0//u' -- BINARY
+ // ```
+
+ opts::ReadPerfEvents =
+ "1234 cycles:u: 0 a001\n"
+ "1234 cycles:u: 0 b001\n"
+ "1234 instructions: 0 c001\n"
+ "1234 instructions: 0 d001\n"
+ "1234 instructions: e002 e001\n";
+
+ EXPECT_TRUE(
+ checkEvents(1234, 6, {"cycles:u:", "instruction-spe:", "branch-spe:"}));
+}
+
+TEST_F(PerfSpeEventsTestHelper, SpeNoBranchPairsRecorded) {
+ // Check perf input that has no SPE branch pairs recorded.
+ // Example collection command:
+ // ```
+ // perf record -e cycles:u -e 'arm_spe_0/load_filter=1,branch_filter=0/u' --
+ // BINARY
+ // ```
+
+ testing::internal::CaptureStderr();
+ opts::ReadPerfEvents =
+ "1234 instructions: 0 a001\n"
+ "1234 cycles:u: 0 b001\n"
+ "1234 instructions: 0 c001\n"
+ "1234 cycles:u: 0 d001\n"
+ "1234 instructions: 0 e001\n";
+
+ EXPECT_TRUE(checkEvents(1234, 5, {"instruction-spe:", "cycles:u:"}));
+
+ std::string Stderr = testing::internal::GetCapturedStderr();
+ EXPECT_EQ(Stderr, "PERF2BOLT-WARNING: no SPE branches found\n");
+}
+
+#endif
✅ With the latest revision this PR passed the C/C++ code formatter.
This PR is an implementation of the (4a) approach of:

We did some limited, quick testing and there was no clear winner between the two approaches, but the

I believe @kaadam had some work on (4b)? Maybe at some point we could additionally have that merged, and the community can test on a wider set of apps/workloads. I believe there won't be dramatic performance changes.

Please give SPE a try along with this patch and report any feedback. To check if SPE is available on your machine, see point (3) on the issue. Let us know if more information is needed on how to enable or use SPE!
Thanks for your amazing job!
Hi Paschalis, thank you for working on this.
Missed this comment. Am I reading it right that you didn't see a perf difference between registering SPE as two basic events or one branch event? In this case, can you please try
Hey Amir and Maks, Thank you for taking a look at this!
Correct, in some preliminary internal tests we found both approaches to be close to each other. Thanks for your suggestion to use Let me share my understanding on the LBR format to see if I got this right:
SPE on the other hand is a statistical sampling method, meaning all collected packets are not captured contiguously. Each pair comes from a packet that looks like:
(note: you can inspect native SPE packets w/ From this example we have

Please do share your thoughts on this. Do you think there are any other benefits when using the LBR format? It can additionally utilize prediction information (miss/hit), but we haven't found this to be that beneficial for the quite-limited SPE branch data (when compared to LBR traces).
@paschalis-mpeis is there a way to configure SPE to only collect taken branches? My impression was that it's possible, e.g. based on this: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/introduction-to-statistical-profiling-support-in-streamline
But I couldn't find any info regarding configuring perf filter to collect it.
Right, with taken branch stacks, we automatically "infer" fall-throughs between entries, and that becomes part of profile data that gets attached. With SPE, if we're able to distinguish taken branches from not taken, I think we can similarly make that part of profile data so won't need
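To make the fall-through inference mentioned above concrete, here is a minimal sketch. It is not BOLT's implementation: `BranchEntry`, `FallThrough`, and `inferFallThroughs` are hypothetical names, and the trace is assumed to be ordered oldest-first and to contain only taken branches. Under those assumptions, the code between one branch's target and the next branch's source must have executed sequentially.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only; not BOLT's LBREntry.
struct BranchEntry {
  uint64_t From; // address of the taken branch
  uint64_t To;   // address the branch jumped to
};

struct FallThrough {
  uint64_t Start; // first address executed sequentially
  uint64_t End;   // address of the next taken branch
};

// Assumes Trace holds contiguous taken branches in execution order
// (oldest first). Between two consecutive entries, execution ran
// straight from the earlier target to the later source.
std::vector<FallThrough>
inferFallThroughs(const std::vector<BranchEntry> &Trace) {
  std::vector<FallThrough> Result;
  for (std::size_t I = 1; I < Trace.size(); ++I) {
    const uint64_t Start = Trace[I - 1].To;
    const uint64_t End = Trace[I].From;
    // Skip pairs that cannot form a forward sequential range.
    if (Start <= End)
      Result.push_back({Start, End});
  }
  return Result;
}
```

A single SPE sample carries one `PC`/`TGT` pair with no neighbouring entries, so this loop has nothing to pair up, which is the limitation discussed in this thread.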
What I believe you are asking here is to configure SPE to get us a pair of I don't think that is possible. SPE does some periodic, non-contiguous, capture of events packets, in our case branches. Please consider the example below:
I could re-word some points in the PR/patch to make the above more clear. (@mikewilliams-arm feel free to correct me if I missed anything)
Currently, there is no such information but we could expose it with more follow-up patches on perf/linux. Please note that if we filter-out any non-taken branches, then we'll exclude information we cannot later infer. Given the SPE limitations I've explained in previous comments, will this taken/not-taken additional information (or the infer flag) help propagating additional CFG hotness data?
Regardless of whether the source branch was FT or Taken, we still don't know what will happen in
For the avoidance of doubt, and benefit of anyone finding this and reading it out of context, you can configure SPE to collect only taken branches, but only from FEAT_SPEv1p2. That's a relatively new feature in the field. From looking at the kernel sources, you need to check for

You can always do this filtering post-hoc in software. However, even so, each sampled branch is exactly that - a single sampled branch. It does not collect sequences of branches other than through the aforementioned optional PBT extension. So, you can only infer that where you came from and where you branched to were executed.
Great, thanks a lot Michael for filling in with details! Indeed the differences are subtle. I've answered a slightly different question, which I've now refined as it wasn't fully correct:
Whether we filter-out the non-taken branches at the HW collection level (i.e., with the
And this is because we'll end up with all the taken branch pairs that have direct links in the CFG.
++NumSpeBranchSamples;

registerSample(&SamplePair->first);
registerSample(&SamplePair->second);
Am I correct in understanding that this is the case where we have a sample for a branch SRC -> TGT which was or was not taken? However, we increase the hotness of the SRC and TGT nodes in either case, always registering samples for both nodes and not taking into account the ratio of samples with this branch taken and not taken?
Hey Pavel,
Reading this back, you are concerned that storing samples on the `TGT` of branches that are NOT-TAKEN might increase hotness in a block where it shouldn't. Correct?
That should not be a concern: regardless of whether a branch is taken or not, the reported `TGT` is what was architecturally executed. In other words, `NOT-TAKEN` (or its absence) characterizes what happened in the source branch (`PC`), while `TGT` will always point to the path we end up taking.
So, for fall-through SPE packets, the `TGT` address would always be the next address after `PC` (i.e., `0xA00` + `4`, where 4 is the instruction size in AArch64):
PC 0xA00
B COND
EV RETIRED NOT-TAKEN
TGT 0xA04
For taken branches, the `TGT` can be at a distance further than just `4`:
PC 0xA00
B COND
EV RETIRED
TGT 0xBBB
In my previous examples I was using mock addresses for PC/TGT, so I've updated any relevant examples to avoid confusion.
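The relationship between `PC` and `TGT` in the two packets above can be sketched in code. This is only an illustration: `SpeBranch`, `looksLikeFallThrough`, and `keepLikelyTaken` are hypothetical names, not part of the patch, and the check is a heuristic, because a taken branch whose target happens to be the next instruction would look identical.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record for one SPE branch sample (PC -> TGT).
struct SpeBranch {
  uint64_t PC;  // address of the sampled branch
  uint64_t TGT; // address that was architecturally executed next
};

// AArch64 instructions are 4 bytes wide, so a not-taken conditional
// branch reports TGT == PC + 4.
constexpr uint64_t AArch64InstrSize = 4;

inline bool looksLikeFallThrough(const SpeBranch &B) {
  return B.TGT == B.PC + AArch64InstrSize;
}

// Post-hoc software filter: keep samples whose target is not simply
// the next sequential instruction.
std::vector<SpeBranch> keepLikelyTaken(const std::vector<SpeBranch> &In) {
  std::vector<SpeBranch> Out;
  for (const SpeBranch &B : In)
    if (!looksLikeFallThrough(B))
      Out.push_back(B);
  return Out;
}
```

With the example packets, `{0xA00, 0xA04}` is classified as a fall-through and `{0xA00, 0xBBB}` is kept. The patch itself does not filter: it registers a sample for both addresses in either case.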
Right, thank you @paschalis-mpeis for clarifying about taken/not taken information and updating examples. @aaupov @maksfb would you like any additional explanations regarding SPE packets? Generally speaking SPE is providing event based sampling for branches and doesn't have enough information to create trace of N>1 branches and inferring fall throughs. We are aiming to add BRBE (Branch Record Buffer Extension) support for this in BOLT and provide branch stack trace like LBR with it.
Hi, thanks Paschalis for your example.
Maybe it's worth highlighting that the not-taken event only relates to conditional instructions (conditional branch or compare-and-branch); it tells us that the instruction failed its condition code check, that's it. Since `TGT` (as you mentioned) "will always point to the path we end up taking", the presence of the not-taken event type is not relevant to us in this case; accordingly, we will always get the 'taken paths'. Theoretically this branch information supports our optimization, and BOLT will be able to rely on it.
Correct, thanks Adam. This is irrelevant to any unconditional branching (including call/ret).
Skipping 'non-taken' conditional branches is the optimization LBR/BRBE can do, as that can be inferred in post-processing.
Just adding that we are in the process of upstreaming 'brstack' support for SPE, which would handle the

Once upstreamed, we can adapt the patch to work for the LBR-format (cc: @kaadam).
BOLT gains the ability to process branch target information generated by
Arm SPE data, using the `BasicAggregation` format.

Example usage is:
```bash
perf2bolt -p perf.data -o perf.boltdata --nl --spe BINARY
```

New branch data and compatibility:
---
SPE branch entries in perf data contain a branch pair (`IP` -> `ADDR`)
for the source and destination branches. DataAggregator processes those
by creating two basic samples. Any other event types will have the `ADDR`
field set to `0x0`. For those a single sample will be created. Such
events can be either SPE or non-SPE, like `l1d-access` and `cycles`
respectively.

The format of the input perf entries is:
```
PID EVENT-TYPE ADDR IP
```

When on SPE mode and:
- host is not `AArch64`, BOLT will exit with a relevant message
- `ADDR` field is unavailable, BOLT will exit with a relevant message
- no branch pairs were recorded, BOLT will present a warning

Examples of generating profiling data for the SPE mode:
---
Profiles can be captured with perf on AArch64 machines with SPE enabled.
They can be combined with other events, SPE or not.

Capture only SPE branch data events:
```bash
perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY
```

Capture any SPE events:
```bash
perf record -e 'arm_spe_0//u' -- BINARY
```

Capture any SPE events and cycles:
```bash
perf record -e 'arm_spe_0//u' -e cycles:u -- BINARY
```

More filters, jitter, and specify count to control overheads/quality:
```bash
perf record -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/u' -c 10007 -- BINARY
```
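As a rough illustration of the entry format and sample generation described in the commit message, the sketch below parses `PID EVENT-TYPE ADDR IP` lines with the standard library only. It is a simplified stand-in and not the patch's code: `BasicSample`, `parseSpeLine`, and `countSamples` are hypothetical names, and the PID lookup and address adjustment performed by `DataAggregator` are omitted.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for BOLT's PerfBasicSample (illustration only).
struct BasicSample {
  std::string EventName;
  uint64_t PC;
};

// Parse one "PID EVENT-TYPE ADDR IP" entry into a (source, target) pair.
// A sample whose PC is 0 stands for "no sample". Returns false on
// malformed input.
bool parseSpeLine(const std::string &Line,
                  std::pair<BasicSample, BasicSample> &Out) {
  std::istringstream In(Line);
  uint64_t PID = 0, Addr = 0, IP = 0;
  std::string Event;
  if (!(In >> std::dec >> PID >> Event >> std::hex >> Addr >> IP))
    return false;
  (void)PID; // Filtering by PID is omitted in this sketch.

  // Show a more meaningful name for the generic SPE event.
  if (Event == "instructions:")
    Event = Addr != 0 ? "branch-spe:" : "instruction-spe:";

  Out.first = BasicSample{std::string(), 0};
  Out.second = BasicSample{std::string(), 0};
  if (IP != 0)
    Out.first = BasicSample{Event, IP}; // source of the branch, or plain IP
  if (Addr != 0)
    Out.second = BasicSample{Event, Addr}; // branch target, when present
  return true;
}

// Count one hit per non-empty sample, keyed by address.
std::map<uint64_t, uint64_t>
countSamples(const std::vector<std::string> &Lines) {
  std::map<uint64_t, uint64_t> Counts;
  for (const std::string &Line : Lines) {
    std::pair<BasicSample, BasicSample> P;
    if (!parseSpeLine(Line, P))
      continue;
    if (P.first.PC)
      ++Counts[P.first.PC];
    if (P.second.PC)
      ++Counts[P.second.PC];
  }
  return Counts;
}
```

Feeding it the five entries from the `SpeBranchesAndCycles` unit test (three branch entries and two `cycles:u:` entries) yields eight distinct addresses, matching the sample count that test expects.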
10f7219 to 47a986d
Force-pushed to rebase onto the latest main and address conflicts (PCs/IPs were removed from the LBR samples). Will be proceeding soon with an LBR patch, which for now will be stacked on top of this PR (cc: @kaadam).
BOLT gains the ability to process Arm SPE data using the `BasicAggregation` format.

Example usage is:
perf2bolt -p perf.data -o perf.boltdata --nl --spe BINARY

New branch data and compatibility:
`perf` since Linux 6.13 reports for SPE branch pairs (`PC` → `TGT`) where:
- `PC`: `PC` (i.e., to consider only the taken branches) would result in a data loss that BOLT cannot later infer.
- `TGT`: `perf` can now report.

DataAggregator processes this information by creating two basic samples.
Any other event types will have the `ADDR` field set to `0x0`. For those a single sample will be created.
Such events can be either SPE or non-SPE, like `l1d-access` and `cycles` respectively.

The format of the input perf entries is:
PID EVENT-TYPE ADDR IP

When on SPE mode and:
- host is not `AArch64`, BOLT will exit with a relevant message
- `ADDR` field is unavailable, BOLT will exit with a relevant message

Examples of generating profiling data for the SPE mode:
Profiles can be captured with perf on AArch64 machines with SPE enabled.
They can be combined with other events, SPE or not.
In the future we might restrict processing to just the branch packets.

Capture only SPE branch data events:
perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY

Use more filters, some jitter, and a sample count to control overheads/quality:
perf record -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/u' -c 10007 -- BINARY

Capture any SPE events:
perf record -e 'arm_spe_0//u' -- BINARY

Capture any SPE events and cycles:
perf record -e 'arm_spe_0//u' -e cycles:u -- BINARY