-
Notifications
You must be signed in to change notification settings - Fork 14
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
121 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,121 @@ | ||
--- | ||
title: Why Is the Futhark Compiler so Bad at Inlining? | ||
description: Picking your battles. | ||
--- | ||
|
||
Quick answer: because it was not necessary to do better for the | ||
research we wanted to do. The remainder of this post is an elaboration | ||
on this answer, and how it may change in the future. | ||
|
||
*Inlining* is an optimisation by which a call to a procedure is | ||
replaced by the body of the procedure. The advantage is twofold: | ||
|
||
1. Elimination of procedure call overhead, usually pushing and popping | ||
values from the stack. | ||
|
||
2. The procedure body may be optimised in the context of which it | ||
occurs. For example, if one of the actual arguments is a constant, | ||
inlining may provide opportunities for [constant | ||
folding](https://compileroptimizations.com/category/constant_folding.htm). | ||
|
||
Except in the case of very small functions, advantage (2) is the | ||
important one. Inlining is a so-called *enabling optimisation* that | ||
does not itself provide much benefit, but can allow other | ||
optimisations to apply. Indeed, it is perhaps the most important of | ||
all enabling optimisations, as most compilers are not good at | ||
optimising across procedure boundaries. | ||
|
||
The main disadvantage of inlining is that it makes the program larger. | ||
This not just has the obvious detrimental effect on compile times, but | ||
may also inhibit runtime performance as the increased code size causes | ||
pressure on the instruction cache. | ||
|
||
Whether or not to inline a function is not at all obvious, and the | ||
consequences of doing the wrong thing can be quite substantial, as | ||
crucial optimisations fail to apply. In practice, compilers use | ||
complicated and opaque heuristics, combined with programmer-provided | ||
annotations, to decide when to inline. | ||
|
||
When we [started the Futhark | ||
project](2021-12-19-past-and-present.html), long before it was | ||
[intended to be | ||
useful](https://futhark-lang.org/examples.html#projects-using-futhark), | ||
we wanted to study two main program transformations: | ||
[fusion](https://compileroptimizations.com/category/loop_fusion.htm) | ||
and [flattening](2019-02-18-futhark-at-ppopp.html). Both of these are | ||
large scale restructuring of program control flow, and they cannot | ||
easily cross function boundaries. While interprocedural variants might | ||
eventually merit study, this was not our initial goal. As a result, we | ||
picked a very simple inlining strategy: *inline every function | ||
application*. This allowed us to conduct our research without having | ||
to face the thorny problem of when to inline. | ||
|
||
Interestingly, because Futhark programs tend to be rather small, this | ||
aggressive inlining policy turned out to work fine for most programs. | ||
While we did eventually add [some attributes for controlling | ||
inlining](https://futhark.readthedocs.io/en/latest/language-reference.html#declaration-attributes), | ||
they were used very rarely. One reason is that they have quite sharp | ||
edges: the GPU backends in particular simply do not support calls of | ||
certain functions in certain positions, in particular, any function | ||
that allocates memory within a GPU kernel must be inlined in order for | ||
[memory expansion](https://futhark-lang.org/publications/ifl22.pdf) to | ||
take place, and this is not really a code generation detail that is | ||
exposed in the source language, so it is difficult to reason about. | ||
|
||
However, inlining everything is clearly not going to cut it as Futhark | ||
becomes useful for larger programs. As a result, I [spent some | ||
time](https://github.com/diku-dk/futhark/pull/1857) earlier this year | ||
on slightly refining our inlining strategy. Let me temper | ||
expectations: it is still *extremely* aggressive. Specifically, we now | ||
inline: | ||
|
||
1. Any function or application specifically marked with the | ||
`#[inline]` attribute. | ||
|
||
2. Any function that is only applied once. | ||
|
||
3. Any application of a function that creates arrays or contains any | ||
parallel operations, *if* that application is itself inside a | ||
parallel operation. | ||
|
||
4. Any function that is used inside an | ||
[AD](https://futhark-lang.org/docs/prelude/doc/prelude/ad.html) | ||
operator (because our AD transformation cannot handle applications | ||
yet). | ||
|
||
5. As is tradition, any function that satisfies a [strange and opaque | ||
heuristic](https://github.com/diku-dk/futhark/blob/71aa80cfaf6c93fc32055204a6ecb0e4e865833d/src/Futhark/Optimise/InliningDeadFun.hs#L114-L123) | ||
that essentially compares the number of parameters and size of the | ||
return type with the size of the function body. | ||
|
||
At a high level, the two kinds of functions that make it through the | ||
gauntlet without getting inlined tend to be in one of two categories: | ||
|
||
1. Large top-level functions that constitute significant subprograms. | ||
|
||
2. Scalar leaf functions, typically those that implement some | ||
nontrivial (but sequential) formula. | ||
|
||
While not inlining the functions in category 1 can sometimes cost us | ||
opportunities for fusion, their impact on compilation time is | ||
substantial. The functions in category 2 are usually not subject to | ||
the [optimisations that the Futhark compiler is particularly good | ||
at](https://futhark-lang.org/blog/2022-04-04-futhark-is-a-low-level-language.html#what-futhark-is-and-is-not-about), | ||
and so not inlining them is typically fine. Also, since the code | ||
generated by Futhark is consumed by downstream optimising compilers, | ||
such functions may still end up being inlined by compilers that have | ||
a better idea of how to micro-optimise for a specific machine. | ||
|
||
While we do not systematically track compile times for the Futhark | ||
compiler (yes, yes, I know), I did do some ad-hoc measurements when I | ||
refined our inlining strategy. A medium-sized program such as | ||
[heston32](https://github.com/diku-dk/futhark-benchmarks/blob/master/misc/heston/heston32.fut) | ||
had its compile time cut in half, with no measurable impact on | ||
run-time performance. Generally, most programs saw no significant | ||
change in their run-time behaviour. | ||
|
||
In the future, I would like to investigate interprocedural fusion. Our | ||
fusion algorithm is specified as a graph transformation, and it seems | ||
reasonable to treat function application as just another fusible node, | ||
with its fusion properties determined by its body, and fusion | ||
implicitly performing inlining when appropriate. |