Evaluate using Link-Time Optimization (LTO), Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) #203
zamazan4ik started this conversation in Ideas
Hi!
Recently I checked the improvements from optimizations like LTO, PGO and PLO (mostly with LLVM BOLT) on multiple projects. The results are available here. According to those tests, these optimizations can deliver better performance in many cases for CPU-intensive applications, so I think trying to optimize this project with them is an interesting idea.
I already did some benchmarks and want to share my results here.
Test environment
`main` branch on commit `2ec0920632645bd69460f82b6e4be7a4a6e110e1`
Benchmark
For benchmark purposes, I used several scenarios: the built-in benches and a manually-crafted workload (quite naive, to be honest). All PGO and PLO optimizations are done with `cargo-pgo`.
- Built-in benchmarks: the Release run is done with `cargo bench`, the PGO instrumentation phase with `cargo pgo bench`, and the PGO-optimized benches with `cargo pgo optimize bench`.
- Manually-crafted workload: I use the pattern `agrind -f logs_small.txt '* | json | p90(response_ms)'`. `logs_small.txt` is generated with `test_files/gen_logs.py` and is ~42 MiB in size. `taskset -c 0` is used to reduce the influence of the OS scheduler on the results.
- All tests are done on the same machine, multiple times (with `hyperfine`), with the same background "noise" (as much as I can guarantee, of course) - the results are consistent across runs. A sample measurement command is shown right after this list.
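For concreteness, a single measurement of the manual workload can be launched roughly like this (the binary path and the `--warmup` value are illustrative, not the exact options from my runs):

```bash
# Pin the benchmarked command to CPU core 0 and let hyperfine handle repetitions and statistics.
hyperfine --warmup 3 \
  "taskset -c 0 ./target/release/agrind -f logs_small.txt '* | json | p90(response_ms)'"
```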
The LTO build is done by adding the following lines to the `Cargo.toml`:
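A typical snippet for this looks as follows (the exact settings can vary; `codegen-units = 1` is a common companion option rather than a strict requirement):

```toml
[profile.release]
lto = true          # "fat" LTO; lto = "thin" is a cheaper, faster-to-compile alternative
codegen-units = 1   # optional: fewer codegen units give LTO more room for cross-crate inlining
```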
Results
Let's start with the built-in benchmarks:
According to these benchmarks, PGO brings measurable improvements.
Let's continue with manual workloads. First, I decided to test the performance improvements from enabling LTO - I found that it was disabled in this project several years ago. The results:
where:

- `agrind_release` - the Release build
- `agrind_release_lto` - the Release + LTO build

According to the tests above, I see measurable improvements from LTO in both performance and binary size (see the binary size comparison below).
However, my tests for the same scenario with PGO and PLO didn't show improvements. I guess that with workloads that spend more time on actual log processing (i.e. a more CPU-intensive workload), PGO and PLO could help as well, but more testing is required. A rough recipe for profiling on a custom workload is shown below.
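For anyone who wants to try this on their own workload, the `cargo-pgo` flow looks roughly like the sketch below. The training command is just my naive scenario, and the exact binary location depends on your target triple, so treat it as a sketch rather than the exact commands from my runs.

```bash
# One-time setup: install the cargo-pgo helper.
cargo install cargo-pgo

# 1. Build an instrumented binary (cargo-pgo passes -Cprofile-generate under the hood).
cargo pgo build

# 2. Train it on a representative workload so that the hot paths get profiled.
#    (the path below assumes an x86_64 Linux host; adjust the target triple as needed)
./target/x86_64-unknown-linux-gnu/release/agrind -f logs_small.txt '* | json | p90(response_ms)'

# 3. Rebuild with the collected profiles applied.
cargo pgo optimize

# For the PLO step, cargo-pgo also wraps LLVM BOLT; see `cargo pgo bolt --help`.
```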
For anyone interested in binary sizes, I collected some statistics too (without stripping debug symbols):
Further steps
I can suggest the following action points:

- Perform more PGO and PLO benchmarks on representative, CPU-intensive workloads.
- Provide an easier way (e.g. a build option or a documentation note) to build `agrind` with PGO, so that users can optimize `agrind` according to their workloads. A sketch of the underlying flow is shown right after this list.
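As an illustration of what such a build option or documentation note could wrap, here is a minimal sketch of the underlying rustc PGO flow (the directory name is arbitrary, and `llvm-profdata` comes from rustup's `llvm-tools-preview` component or a matching LLVM installation):

```bash
# 1. Build an instrumented binary that writes profile data at runtime.
RUSTFLAGS="-Cprofile-generate=/tmp/agrind-pgo" cargo build --release

# 2. Run it on a workload that is representative for the user.
./target/release/agrind -f logs_small.txt '* | json | p90(response_ms)'

# 3. Merge the raw profiles into a single file.
llvm-profdata merge -o /tmp/agrind-pgo/merged.profdata /tmp/agrind-pgo

# 4. Rebuild, letting rustc use the collected profile.
RUSTFLAGS="-Cprofile-use=/tmp/agrind-pgo/merged.profdata" cargo build --release
```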
Here are some examples of how PGO optimization is integrated into other projects (e.g. as a dedicated flag in a project's configure script). Regarding LLVM BOLT integration, I can share some links as well.
I would be happy to answer your questions about all the optimizations above.