Evaluate using Link-Time Optimization (LTO), Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) #203
zamazan4ik started this conversation in Ideas
Hi!
Recently I checked the improvements from optimizations like LTO, PGO and PLO (mostly with LLVM BOLT) on multiple projects. The results are available here. According to those tests, these optimizations can deliver better performance in many cases for CPU-intensive applications, so I think trying to optimize this project with them is an interesting idea.
I already did some benchmarks and want to share my results here.
Test environment
`main` branch on commit `2ec0920632645bd69460f82b6e4be7a4a6e110e1`
Benchmark
For benchmark purposes, I used several scenarios: the built-in benches and a manually-crafted workload (quite naive, to be honest). All PGO and PLO optimizations are done with `cargo-pgo`.
- Built-in benchmarks: the Release run is done with `cargo bench`, the PGO instrumentation phase with `cargo pgo bench`, and the PGO-optimized benches with `cargo pgo optimize bench`.
- Manually-crafted workload: I use the pattern `agrind -f logs_small.txt '* | json | p90(response_ms)'`. `logs_small.txt` is generated with `test_files/gen_logs.py` and is ~42 MiB in size. `taskset -c 0` is used to reduce the influence of the OS scheduler on the results.
- All tests are done on the same machine, multiple times (with `hyperfine`), with the same background "noise" (as much as I can guarantee, of course) - the results are consistent across runs. A sample measurement command is shown right after this list.
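For concreteness, a single measurement of the manual workload can be launched roughly like this (the binary path and the `--warmup` value are illustrative, not the exact options from my runs):

```bash
# Pin the benchmarked command to CPU core 0 and let hyperfine handle repetitions and statistics.
hyperfine --warmup 3 \
  "taskset -c 0 ./target/release/agrind -f logs_small.txt '* | json | p90(response_ms)'"
```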
The LTO build is done by adding the following lines to the `Cargo.toml`:
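A typical snippet for this looks as follows (the exact settings can vary; `codegen-units = 1` is a common companion option rather than a strict requirement):

```toml
[profile.release]
lto = true          # "fat" LTO; lto = "thin" is a cheaper, faster-to-compile alternative
codegen-units = 1   # optional: fewer codegen units give LTO more room for cross-crate inlining
```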
Results
Let's start with the built-in benchmarks:
According to these benchmarks, PGO brings measurable improvements.
Let's continue with manual workloads. First, I decided to test the performance improvements from enabling LTO - I found that it was disabled in this project several years ago. The results:
where:

- `agrind_release` - the Release build
- `agrind_release_lto` - the Release + LTO build

According to the tests above, I see measurable improvements from LTO in both performance and binary size (see the binary size comparison below).
However, my tests for the same scenario with PGO and PLO didn't show improvements. I guess that with workloads that spend more time on actual log processing (i.e. a more CPU-intensive workload), PGO and PLO could help as well, but more testing is required. A rough recipe for profiling on a custom workload is shown below.
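For anyone who wants to try this on their own workload, the `cargo-pgo` flow looks roughly like the sketch below. The training command is just my naive scenario, and the exact binary location depends on your target triple, so treat it as a sketch rather than the exact commands from my runs.

```bash
# One-time setup: install the cargo-pgo helper.
cargo install cargo-pgo

# 1. Build an instrumented binary (cargo-pgo passes -Cprofile-generate under the hood).
cargo pgo build

# 2. Train it on a representative workload so that the hot paths get profiled.
#    (the path below assumes an x86_64 Linux host; adjust the target triple as needed)
./target/x86_64-unknown-linux-gnu/release/agrind -f logs_small.txt '* | json | p90(response_ms)'

# 3. Rebuild with the collected profiles applied.
cargo pgo optimize

# For the PLO step, cargo-pgo also wraps LLVM BOLT; see `cargo pgo bolt --help`.
```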
For anyone interested in binary sizes, I collected some statistics too (without stripping debug symbols):
Further steps
I can suggest the following action points:

- Perform more PGO and PLO benchmarks on representative, CPU-intensive workloads.
- Provide an easier way (e.g. a build option or a documentation note) to build `agrind` with PGO, so that users can optimize `agrind` according to their workloads. A sketch of the underlying flow is shown right after this list.
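As an illustration of what such a build option or documentation note could wrap, here is a minimal sketch of the underlying rustc PGO flow (the directory name is arbitrary, and `llvm-profdata` comes from rustup's `llvm-tools-preview` component or a matching LLVM installation):

```bash
# 1. Build an instrumented binary that writes profile data at runtime.
RUSTFLAGS="-Cprofile-generate=/tmp/agrind-pgo" cargo build --release

# 2. Run it on a workload that is representative for the user.
./target/release/agrind -f logs_small.txt '* | json | p90(response_ms)'

# 3. Merge the raw profiles into a single file.
llvm-profdata merge -o /tmp/agrind-pgo/merged.profdata /tmp/agrind-pgo

# 4. Rebuild, letting rustc use the collected profile.
RUSTFLAGS="-Cprofile-use=/tmp/agrind-pgo/merged.profdata" cargo build --release
```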
Here are some examples of how PGO optimization is integrated into other projects (e.g. as a dedicated flag in a project's configure script). Regarding LLVM BOLT integration, I can share some links as well.
I would be happy to answer your questions about all the optimizations above.