C++ links: performance tools
- ankerl::nanobench: a platform independent microbenchmarking library for C++11/14/17/20
- benchmark (Google)
- Celero
- hayai: the C++ benchmarking framework
- HdrHistogram: A High Dynamic Range (HDR) Histogram
- https://hdrhistogram.github.io/HdrHistogram/
- https://github.com/HdrHistogram/HdrHistogram_c
- Understanding Latency and Response Time: Pitfalls and Key Lessons
- How Not to Measure Latency
- geiger: A micro benchmark library in C++ that supports hardware performance counters
- moodycamel::microbench
- Nonius: A C++ micro-benchmarking framework
- picobench: A micro microbenchmarking library for C++11 in a single header file
- ubench.h: single header benchmark framework for C and C++
- Micro benchmarking libraries for C++
- Performance Analysis and Tuning on Modern CPUs
- Systems Benchmarking Crimes - Gernot Heiser - https://www.cse.unsw.edu.au/~gernot/benchmarking-crimes.html
- Intel Memory Latency Checker (MLC)
- a tool used to measure memory latencies and bandwidth, and how they change with increasing load on the system
- https://www.intel.com/software/mlc
- Memory Bandwidth Benchmark
- MBW determines the "copy" memory bandwidth available to userspace programs. Its simplistic approach models that of real applications. It is not tuned to extremes and it is not aware of hardware architecture, just like your average software package.
- https://github.com/raas/mbw
- Multichase: a pointer chaser benchmark
- Multiload - a superset of multichase which runs latency, memory bandwidth, and loaded-latency
- https://github.com/google/multichase
- pmbw: Parallel Memory Bandwidth Benchmark / Measurement
- a set of assembler routines to measure the parallel memory (cache and RAM) bandwidth of modern multi-core machines
- http://panthema.net/2013/pmbw/
- https://github.com/bingmann/pmbw
- Spatter: Benchmark for measuring the performance of sparse and irregular memory access
- https://github.com/hpcgarage/spatter
- Spatter: A Tool for Evaluating Gather / Scatter Performance
- STREAM: Sustainable Memory Bandwidth in High Performance Computers
- http://www.cs.virginia.edu/stream/
- STREAM benchmark - https://github.com/jeffhammond/STREAM
- NUMA-STREAM - https://github.com/larsbergstrom/NUMA-STREAM
- BabelStream: STREAM, for lots of devices written in many programming models
- TheBandwidthBenchmark
- https://github.com/RRZE-HPC/TheBandwidthBenchmark
- https://hpc-wiki.info/hpc/Micro_benchmarking#The_Bandwidth_Benchmark
- a collection of simple streaming kernels; besides its micro-benchmark functionality, it also serves as a blueprint for other micro-benchmark applications; contains C modules for aligned data allocation, querying and controlling affinity settings, and accurate timing
- tinymembench: simple benchmark for memory throughput and latency
- Bytehound: a memory profiler for Linux
- Heaptrack - A Heap Memory Profiler for Linux
- How to Write a Heap Memory Profiler
- CppCon 2019; Milian Wolff
- https://www.youtube.com/watch?v=YB0QoWI-g8E
- Slides & code: https://github.com/milianw/how-to-write-a-memory-profiler
- MALT & NUMAPROF: Memory Profiling for HPC Applications
- NUMAPROF: a NUMA memory profiler based on Pintool to track remote memory accesses
- MALT: a MALloc Tracker to find where and how you made your memory allocations in C/C++/Fortran applications
- https://memtt.github.io/malt/
- https://github.com/memtt/malt
- MALT: A Malloc Tracker
- International Workshop on Software Engineering for Parallel Systems (SEPS) 2017
- Sébastien Valat, Andres S. Charif-Rubial, William Jalby
- paper: https://memtt.github.io/malt/downloads/2017-seps-malt.pdf
- slides: https://svalat.github.io/docs/2017-10-MALT-SEPS17.pdf
- FOSDEM 2019; Sébastien Valat
- Memoro: A Detailed Heap Profiler
- https://epfl-vlsc.github.io/memoro/
- https://github.com/epfl-vlsc/memoro
- Detailed Heap Profiling
- International Symposium on Memory Management (ISMM) 2018
- Stuart Byma, Jim Larus
- https://dl.acm.org/citation.cfm?id=3210564
- Memoro: Scaling an LLVM-based Heap profiler
- 2019 LLVM Developers’ Meeting; Thierry Treyer
- https://www.youtube.com/watch?v=fm47XsATelI
- https://llvm.org/devmtg/2019-10/slides/Treyer-Memoro.pdf
- Memory Profiling
- 2024; Denis Bakhvalov
- Introduction, Memory Usage Case Study (Heaptrack), Memory Footprint with Intel SDE, Memory Footprint Case Study, Data Locality and Reuse Distances
- https://easyperf.net/blog/2024/02/12/Memory-Profiling-Part1
- memory-profiler: A memory profiler for Linux
- memtrail: A LD_PRELOAD based memory profiler and leak detector for Linux
- memusage - profile memory usage of a program
- mstat: measure memory usage of a program over time (Linux)
- fine-grained, cgroup-based tool for profiling memory usage over time of a process tree
- https://github.com/bpowers/mstat
- MTuner - a C/C++ memory profiler and memory leak finder for Windows, PlayStation 4, PlayStation 3, etc.
- Object Introspection: Dynamic C++ Object Profiling
- enables on-demand, hierarchical profiling of objects in arbitrary C & C++ programs with no recompilation
- https://github.com/facebookexperimental/object-introspection
- https://facebookexperimental.github.io/object-introspection/
- Object Introspection: A C++ Memory Profiler
- CppCon 2023
- Jonathan Haslam & Aditya Sarwade
- https://www.youtube.com/watch?v=6IlTs8YRne0
- https://github.com/CppCon/CppCon2023/blob/main/Presentations/object_introspection_cppcon.pdf
- PerfMemPlus: A Tool for Automatic Discovery of Memory Performance Problems
- Tool for memory performance analysis based on Linux perf.
- https://github.com/helchr/perfMemPlus
- PerfMemPlus: A Tool for Automatic Discovery of Memory Performance Problems
- ISC 2019
- Christian Helm, Kenjiro Taura
- https://doi.org/10.1007/978-3-030-20656-7_11
- On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency
- HPC Asia 2020
- Christian Helm, Kenjiro Taura
- https://doi.org/10.1145/3368474.3368476
- Poireau: a sampling allocation debugger
- Typegrind
- a type preserving heap profiler for C++ - collects memory allocation information with type information
- https://typegrind.github.io/
- https://github.com/typegrind/typegrind
- Valgrind
- http://valgrind.org/
- DHAT: a dynamic heap analysis tool - http://valgrind.org/docs/manual/dh-manual.html
- Massif: a heap profiler - http://valgrind.org/docs/manual/ms-manual.html
- Tools for microarchitectural benchmarking
- AnICA: Analyzing Inconsistencies in Microarchitectural Code Analyzers
- OOPSLA 2022
- Fabian Ritter, Sebastian Hack
- https://dl.acm.org/doi/10.1145/3563288
- https://arxiv.org/abs/2209.05994
- https://github.com/cdl-saarland/AnICA
- https://compilers.cs.uni-saarland.de/projects/anica/
- asmbench: A Benchmark Toolkit for Assembly Instructions Using the LLVM JIT
- https://github.com/RRZE-HPC/asmbench
- OoO Instruction Benchmarking Framework on the Back of Dragons
- 2018 SC18 ACM SRC Poster
- J. Hammer, G. Hager, G. Wellein
- https://sc18.supercomputing.org/proceedings/src_poster/src_poster_pages/spost115.html
- BHive: A Benchmark Suite and Measurement Framework for Validating x86-64 Basic Block Performance Models
- IISWC 2019
- Yishen Chen, Ajay Brahmakshatriya, Charith Mendis, Alex Renda, Eric Atkinson, Ondrej Sykora, Saman Amarasinghe, Michael Carbin
- http://groups.csail.mit.edu/commit/papers/19/ithemal-measurement.pdf
- https://github.com/ithemal/bhive
- GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation
- 2022 IEEE International Symposium on Workload Characterization (IISWC)
- Ondrej Sykora, Phitchaya Mangpo Phothilimthana, Charith Mendis, Amir Yazdanbakhsh
- https://arxiv.org/abs/2210.03894
- Gematria - machine learning for machine code
- Contains sources of Gematria, a framework for machine learning on machine code. It includes implementations of the GRANITE model and the Ithemal hierarchical LSTM model for learning inverse throughput of basic blocks.
- https://github.com/google/gematria
- Intel Architecture Code Analyzer (IACA)
- ibench: Measure instruction latency and throughput
- Ithemal: Instruction THroughput Estimator using MAchine Learning
- https://github.com/psg-mit/Ithemal
- Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
- ICML 2019
- Charith Mendis, Alex Renda, Saman Amarasinghe, Michael Carbin
- https://arxiv.org/abs/1808.07412
- http://proceedings.mlr.press/v97/mendis19a.html
- llvm-exegesis – LLVM Machine Instruction Benchmark
- https://llvm.org/docs/CommandGuide/llvm-exegesis.html
- https://github.com/llvm/llvm-project/tree/main/llvm/tools/llvm-exegesis
- Static Performance Analysis with LLVM
- 2018 European LLVM Developers Meeting
- C. Courbet, O. Sykora, G. Chatelet, B. De Backer
- https://youtu.be/XinMk-t8N-w
- http://llvm.org/devmtg/2018-04/slides/Courbet-Static%20Performance%20Analysis%20with%20LLVM.pdf
- Measuring x86 instruction latencies with LLVM
- 2018 European LLVM Developers Meeting
- G. Chatelet, C. Courbet, B. De Backer, O. Sykora
- https://youtu.be/ex_C27OoApI
- http://llvm.org/devmtg/2018-04/slides/Chatelet-Measuring%20x86%20instruction%20latencies%20with%20LLVM.pdf
- llvm-mca - LLVM Machine Code Analyzer
- https://llvm.org/docs/CommandGuide/llvm-mca.html
- https://github.com/llvm/llvm-project/tree/main/llvm/tools/llvm-mca
- Understanding the performance of code using LLVM's Machine Code Analyzer (llvm-mca)
- 2018 LLVM Developers’ Meeting; Andrea Di Biagio & Matt Davis
- https://www.youtube.com/watch?v=Ku2D8bjEGXk
- MC Ruler: Seamless llvm-mca CMake integration
- DiffTune: Optimizing CPU Simulator Parameters with Learned Differentiable Surrogates
- MICRO 2020
- Alex Renda, Yishen Chen, Charith Mendis, Michael Carbin
- https://arxiv.org/abs/2010.04017
- https://github.com/ithemal/DiffTune
- https://www.youtube.com/watch?v=7sN2YsqgPLY
- microarchitecturometer: Measures microarchitectural details
- Measures microarchitectural details such as ROB size.
- https://github.com/Veedrac/microarchitecturometer
- nanoBench: A tool for running small microbenchmarks on recent Intel and AMD x86 CPUs
- used for running the microbenchmarks for obtaining the latency, throughput, and port usage data available on http://uops.info
- https://github.com/andreas-abel/nanoBench
- https://uops.info/
- nanoBench Cache Analyzer
- Automatic Generation of Models of Microarchitectures
- 2020 PhD Dissertation; Andreas Abel
- https://d-nb.info/1212853466/34
- https://dx.doi.org/10.22028/D291-31299
- uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures
- ASPLOS 2019
- Andreas Abel, Jan Reineke
- https://arxiv.org/abs/1810.04610
- nanoBench: A Low-Overhead Tool for Running Microbenchmarks on x86 Systems
- 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
- Andreas Abel, Jan Reineke
- https://arxiv.org/abs/1911.03282
- https://www.youtube.com/watch?v=DEN4UteY4Sg
- Open Power/Performance Analysis Tool (OPPAT)
- a cross-OS, cross-architecture Power and Performance Analysis Tool
- cross-OS: supports Windows ETW trace files and Linux/Android perf/trace-cmd trace files
- cross-architecture: supports Intel and ARM chips hardware events (using perf and/or PCM)
- https://patinnc.github.io/
- https://github.com/patinnc/oppat
- OSACA: Open Source Architecture Code Analyzer
- https://github.com/RRZE-HPC/osaca
- https://hpc.fau.de/research/tools/
- Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures
- Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) 2018
- Jan Laukemann, Julian Hammer, Johannes Hofmann, Georg Hager, Gerhard Wellein
- https://arxiv.org/abs/1809.00912
- Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels
- arXiv 2019
- Jan Laukemann, Julian Hammer, Georg Hager, Gerhard Wellein
- https://arxiv.org/abs/1910.00214
- https://github.com/RRZE-HPC/OSACA-CP-2019
- Cross-Architecture Automatic Critical Path Detection For In-Core Performance Analysis
- 2020 Master's Thesis; Jan Laukemann
- https://hpc.fau.de/files/2020/02/Masterarbeit_JL-_final.pdf
- Measurements and reproducibility instructions
- PALMED: Throughput Characterization for Superscalar Architectures
- CGO 2022
- Nicolas Derumigny, Fabian Gruber, Théophile Bastian, Christophe Guillon, Guillaume Iooss, Louis-Noël Pouchet, Fabrice Rastello
- https://palmed.corse.inria.fr
- https://arxiv.org/abs/2012.11473
- https://www.youtube.com/watch?v=xxeVsu1hvsk
- https://gitlab.inria.fr/nderumig/palmed
- "PALMED is a framework aiming to automatically provide a precise performance model for a processor. It works by inferring a port mapping of a given processor and deducing the resource usage of each relevant instruction."
- PMEvo: Portable Inference of Port Mappings for Out-of-Order Processors by Evolutionary Optimization
- PLDI 2020
- Fabian Ritter, Sebastian Hack
- https://compilers.cs.uni-saarland.de/papers/ritter_pmevo_pldi20.pdf
- https://compilers.cs.uni-saarland.de/projects/portmap/
- https://github.com/cdl-saarland/pmevo-artifact
- timing-harness: Harness for profiling arbitrary basic blocks.
- uarch-bench: A benchmark for low-level CPU micro-architectural features
- uiCA - The uops.info Code Analyzer
- https://uica.uops.info/
- https://github.com/andreas-abel/uiCA
- Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures
- 2021
- Andreas Abel, Jan Reineke
- https://arxiv.org/abs/2107.14210
- Facile: Fast, Accurate, and Interpretable Basic-Block Throughput Prediction
- IEEE International Symposium on Workload Characterization (IISWC) 2023
- Andreas Abel, Shrey Sharma, Jan Reineke
- https://arxiv.org/abs/2310.13212
- BOLT: Binary Optimization and Layout Tool
- A Linux command-line utility used for optimizing performance of binaries
- https://github.com/facebookincubator/BOLT
- Accelerate large-scale applications with BOLT
- Building Binary Optimizer with LLVM
- 2016 EuroLLVM Developers' Meeting; Maksim Panchenko
- https://llvm.org/devmtg/2016-03/Presentations/BOLT_EuroLLVM_2016.pdf
- https://www.youtube.com/watch?v=gw3iDO3By5Y
- BOLT: A Practical Binary Optimizer for Data Centers and Beyond
- Maksim Panchenko, Rafael Auler, Bill Nell, Guilherme Ottoni
- https://arxiv.org/abs/1807.06735
- MAQAO (Modular Assembly Quality Analyzer and Optimizer)
- 0x.Tools: Always-on Profiling for Production Systems (Linux)
- Agner Fog's test programs for measuring clock cycles and performance monitoring
- BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more
- https://iovisor.github.io/bcc/
- https://github.com/iovisor/bcc
- https://github.com/iovisor/bpf-docs
- http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html
- http://www.brendangregg.com/blog/2016-03-28/linux-bpf-bcc-road-ahead-2016.html
- http://www.brendangregg.com/blog/2016-06-14/ubuntu-xenial-bcc-bpf.html
- https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/
- Caliper: A Performance Analysis Toolbox in a Library
- an instrumentation and performance profiling library
- https://github.com/LLNL/Caliper
- SPOT: a web-based visualization for ubiquitous performance data
- Automating Application Performance Analysis with Caliper, SPOT, and Hatchet
- 2021 ECP Annual Meeting
- David Boehme, Matthew LeGendre, Abhinav Bhatele, Olga Pearce
- https://www.youtube.com/watch?v=p8gjA6rbpvo
- https://www.exascaleproject.org/event/perfanalysis/
- Coz: Finding Code that Counts with Causal Profiling
- https://github.com/plasma-umass/coz/
- Charlie Curtsinger, Emery Berger
- SOSP 2015
- ;login: 41(2) (2016)
- Performance Matters - Strange Loop 2019; Emery Berger
- Coz vs. Sampling Profilers
- easy_profiler: Lightweight cross-platform profiler library for C++
- gperftools (originally Google Performance Tools)
- "The fastest malloc we’ve seen; works particularly well with threads and STL. Also: thread-friendly heap-checker, heap-profiler, and cpu-profiler."
- https://github.com/gperftools/gperftools
- gprofng: The Next-Generation GNU Profiling Tool
- NHR PerfLab Seminar 2022; Ruud van der Pas
- HawkTracer
- a highly portable, low-overhead, configurable profiling tool for getting performance metrics from low-end devices
- Linux, Windows, macOS; C & C++ library; Python & Rust wrappers
- https://www.hawktracer.org/
- https://github.com/amzn/hawktracer
- Low-end platform profiling with HawkTracer profiler
- FOSDEM 2020; Marcin Kolny
- https://fosdem.org/2020/schedule/event/debugging_hawktrace/
- Hotspot - the Linux perf GUI for performance analysis
- Likwid: Performance monitoring and benchmarking suite
- magic-trace: collects and displays high-resolution traces of what a process is doing
- "use it like perf: point it to a process and off it goes. The key difference from perf is that instead of sampling call stacks throughout time, magic-trace uses Intel Processor Trace to snapshot a ring buffer of all control flow leading up to a chosen point in time. Then, you can explore an interactive timeline of what happened."
- https://magic-trace.org/
- https://github.com/janestreet/magic-trace
- microprofile: an embeddable profiler
- not-perf: A sampling CPU profiler for Linux
- Optick: C++ Profiler For Games
- Palanteer: visual Python and C++ profiler
- perf
- perf-tools - https://github.com/brendangregg/perf-tools
- perf_events: The Unofficial Linux Perf Events Web-Page
- perfmon2 - http://perfmon2.sourceforge.net/
- "Perfmon2 aims to be a portable interface across all modern processors. It is designed to give full access to a given PMU and all the corresponding hardware performance counters. Typically the PMU hardware implementations use a different number of registers, counters with different length and possibly other unique features, a complexity that the software has to cope with. Although processors have different PMU implementations, they usually use configurations registers and data registers. Perfmon2 provides a uniform abstract model of these registers and exports read/write operations accordingly."
- Perfetto - System profiling, app tracing, and trace analysis
- Performance instrumentation and tracing for Android, Linux, and Chrome
- https://github.com/google/perfetto
- https://perfetto.dev/
- https://perfetto.dev/docs/
- Performance Application Programming Interface (PAPI)
- http://icl.cs.utk.edu/papi/
- http://icl.cs.utk.edu/projects/papi/wiki/Main_Page
- http://www.drdobbs.com/tools/performance-monitoring-with-papi/184406109
- papi-wrapper (C++ library) - https://github.com/sean-chester/papi-wrapper
- libpapipp: A C++ wrapper around libpapi - https://github.com/david-grs/papipp
- pmu tools: Intel PMU profiling tools
- https://github.com/andikleen/pmu-tools
- https://github.com/andikleen/pmu-tools/wiki/toplev-manual
- pmu-tools part I - introduction, ocperf - http://halobates.de/blog/p/245
- pmu-tools part II - toplev - http://halobates.de/blog/p/262
- Processor Counter Monitor (PCM)
- https://github.com/opcm/pcm
- Intel Performance Counter Monitor (PCM) - http://www.intel.com/software/pcm
- Profilerpedia: A map of the Software Profiling Ecosystem
- Remotery: Single C file, Realtime CPU/GPU Profiler with Remote Web Viewer
- sysdig
- timemory: Timing + Memory + Hardware Counter Utilities for C / C++ / CUDA / Python
- Linux, macOS, Windows
- https://github.com/NERSC/timemory
- Timemory: Modular Performance Analysis for HPC
- ISC 2020
- Jonathan R. Madsen, Muaaz G. Awan, Hugo Brunie, Jack Deslippe, Rahulkumar Gayatri, Leonid Oliker, Yunsong Wang, Charlene Yang, Samuel Williams
- https://doi.org/10.1007/978-3-030-50743-5_22
- Tracy Profiler
- Tracy is a real time, nanosecond resolution frame profiler that can be used for remote or embedded telemetry of your application. It can profile CPU (C++, Lua), GPU (OpenGL, Vulkan) and memory. It also can display locks held by threads and their interactions with each other.
- https://bitbucket.org/wolfpld/tracy
- Introduction to the Tracy profiler - https://www.youtube.com/watch?v=fB5B46lbapc
- mpiP: A light-weight MPI profiler
- cpuprofilify: Converts output of various profiling/sampling tools to the .cpuprofile format so it can be loaded into Chrome DevTools.
- gprof2dot
- "Python script to convert the output from many profilers into a dot graph."
- https://github.com/jrfonseca/gprof2dot
- Event Tracing for Windows (ETW) / Windows Performance Toolkit – Xperf
- WindowsPerf: (Linux perf inspired) Windows on Arm performance profiling tool
- libcpucycles: library to count CPU cycles
- Supports counters for amd64 (both PMC and TSC), arm32, arm64 (both PMC and VCT), mips64, ppc32, ppc64, riscv32, riscv64, sparc64, and x86, plus automatic fallbacks to various OS-level timing mechanisms.
- https://cpucycles.cr.yp.to
- low-overhead-timers: Very low-overhead timer/counter interfaces for C on Intel 64 processors
- https://github.com/jdmccalpin/low-overhead-timers
- Comments on timing short code sections on Intel processors
- plf::nanotimer
- A simple C++03/11/etc timer class for ~microsecond-precision cross-platform benchmarking. The implementation is as limited and as simple as possible to create the lowest amount of overhead.
- https://github.com/mattreecebentley/plf_nanotimer
- Flame Graphs
- http://www.brendangregg.com/flamegraphs.html
- http://queue.acm.org/detail.cfm?id=2927301
- Memory Leak (and Growth) Flame Graphs
- FlameScope: a visualization tool for exploring different time ranges as Flame Graphs
- GOoDA (Generic Optimization Data Analyzer): PMU event data analysis package
- GOoDA is a PMU event data analysis package consisting of predefined data collection scripts that run perf record in a sensible manner, analyze the data with a cycle accounting methodology, and create the tables and dot/svg files needed for the gooda-visualizer package.
- https://github.com/David-Levinthal/gooda
- Hatchet: Graph-indexed Pandas DataFrames for analyzing hierarchical performance data
- pprof - a tool for visualization and analysis of profiling data