---
title: Why Is the Futhark Compiler so Bad at Inlining?
description: Picking your battles.
---

Quick answer: because it was not necessary to do better for the
research we wanted to do. The remainder of this post elaborates on
this answer, and on how the situation may change in the future.

*Inlining* is an optimisation by which a call to a procedure is
replaced by the body of the procedure. The advantage is twofold:

1. Elimination of procedure call overhead, usually pushing and popping
values from the stack.

2. The procedure body may be optimised in the context in which it
occurs. For example, if one of the actual arguments is a constant,
inlining may provide opportunities for [constant
folding](https://compileroptimizations.com/category/constant_folding.htm),
as sketched just after this list.

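As a concrete (and entirely invented) illustration, consider a small
Futhark function applied to constant arguments. After inlining `axpy`
at its call site, the constants become visible to the simplifier, and
constant folding reduces the body to almost nothing:

```futhark
-- A small function and a call site where two arguments are constants.
def axpy (a: f64) (x: f64) (y: f64) : f64 = a * x + y

def before (ys: []f64) : []f64 = map (axpy 0 1) ys

-- After inlining, the simplifier sees '0 * 1 + y' in each iteration,
-- which constant folding reduces to just 'y':
def after (ys: []f64) : []f64 = map (\y -> y) ys
```
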
Except in the case of very small functions, advantage (2) is the
important one. Inlining is a so-called *enabling optimisation* that
does not itself provide much benefit, but can allow other
optimisations to apply. Indeed, it is perhaps the most important of
all enabling optimisations, as most compilers are not good at
optimising across procedure boundaries.

The main disadvantage of inlining is that it makes the program larger.
This not only has an obvious detrimental effect on compile times, but
it may also hurt run-time performance, as the increased code size puts
pressure on the instruction cache.

Whether or not to inline a function is not at all obvious, and the
consequences of doing the wrong thing can be quite substantial, as
crucial optimisations fail to apply. In practice, compilers use
complicated and opaque heuristics, combined with programmer-provided
annotations, to decide when to inline.

When we [started the Futhark
project](2021-12-19-past-and-present.html), long before it was
[intended to be
useful](https://futhark-lang.org/examples.html#projects-using-futhark),
we wanted to study two main program transformations:
[fusion](https://compileroptimizations.com/category/loop_fusion.htm)
and [flattening](2019-02-18-futhark-at-ppopp.html). Both of these are
large-scale restructurings of program control flow, and they cannot
easily cross function boundaries. While interprocedural variants might
eventually merit study, this was not our initial goal. As a result, we
picked a very simple inlining strategy: *inline every function
application*. This allowed us to conduct our research without having
to face the thorny problem of when to inline.

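To make the connection to fusion concrete, consider this invented
example. As long as the call to `step` remains, the fusion engine sees
an opaque application rather than a producer; once `step` is inlined,
the two `map`s fuse into a single pass over the array:

```futhark
def step (xs: []f64) : []f64 = map (\x -> x + 1) xs

-- Before inlining, this is 'map (\x -> x * 2) (step xs)', which cannot
-- fuse across the call. After inlining 'step' it becomes
-- 'map (\x -> x * 2) (map (\x -> x + 1) xs)', which fuses into one map.
def pipeline (xs: []f64) : []f64 = map (\x -> x * 2) (step xs)
```
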
Interestingly, because Futhark programs tend to be rather small, this
aggressive inlining policy turned out to work fine for most programs.
While we did eventually add [some attributes for controlling
inlining](https://futhark.readthedocs.io/en/latest/language-reference.html#declaration-attributes),
they were used very rarely. One reason is that they have quite sharp
edges: the GPU backends in particular simply do not support calls of
certain functions in certain positions. Most notably, any function
that allocates memory within a GPU kernel must be inlined in order for
[memory expansion](https://futhark-lang.org/publications/ifl22.pdf) to
take place. Since this code generation detail is not exposed in the
source language, it is difficult to reason about.

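For reference, using the attributes looks roughly like this (the
functions are made up for illustration, and `#[noinline]` is the
counterpart listed in the same section of the manual). It is
`#[noinline]` that has the sharp edges: if a function like `smooth`
ends up applied in a position the GPU backends cannot support, the
compiler may simply reject the program:

```futhark
-- Force inlining of 'clamp' at every application.
#[inline]
def clamp (lo: f64) (hi: f64) (x: f64) : f64 = f64.max lo (f64.min hi x)

-- Ask the compiler to keep 'smooth' as a real function, which may not
-- be possible in all positions on the GPU backends.
#[noinline]
def smooth (xs: []f64) : []f64 = map2 (\a b -> (a + b) / 2) xs (rotate 1 xs)
```
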
However, inlining everything is clearly not going to cut it as Futhark
becomes useful for larger programs. As a result, I [spent some
time](https://github.com/diku-dk/futhark/pull/1857) earlier this year
on slightly refining our inlining strategy. Let me temper
expectations: it is still *extremely* aggressive. Specifically, we now
inline:

1. Any function or application specifically marked with the
`#[inline]` attribute.

2. Any function that is only applied once.

3. Any application of a function that creates arrays or contains any
parallel operations, *if* that application is itself inside a
parallel operation (see the sketch after this list).

4. Any function that is used inside an
[AD](https://futhark-lang.org/docs/prelude/doc/prelude/ad.html)
operator (because our AD transformation cannot handle applications
yet).

5. As is tradition, any function that satisfies a [strange and opaque
heuristic](https://github.com/diku-dk/futhark/blob/71aa80cfaf6c93fc32055204a6ecb0e4e865833d/src/Futhark/Optimise/InliningDeadFun.hs#L114-L123)
that essentially compares the number of parameters and size of the
return type with the size of the function body.

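To make rules 1 and 3 a little less abstract, here is an invented
sketch of the kind of code they apply to:

```futhark
-- Rule 1: inlined at every application because of the attribute (and a
-- function this small would be caught by the rule 5 heuristic anyway).
#[inline]
def sq (x: f64) : f64 = x * x

-- Rule 3: 'normalise' contains parallel operations, so its application
-- inside the 'map' in 'normalise_rows' is inlined, letting flattening
-- and fusion see the whole nested computation.
def normalise (xs: []f64) : []f64 =
  let s = f64.sum xs
  in map (\x -> x / s) xs

def normalise_rows (rows: [][]f64) : [][]f64 =
  map normalise rows
```
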
At a high level, the functions that make it through the gauntlet
without getting inlined tend to fall into one of two categories:

1. Large top-level functions that constitute significant subprograms.

2. Scalar leaf functions, typically those that implement some
nontrivial (but sequential) formula.

While not inlining the functions in category 1 can sometimes cost us
opportunities for fusion, inlining them would have a substantial
impact on compilation time. The functions in category 2 are usually not subject to
the [optimisations that the Futhark compiler is particularly good
at](https://futhark-lang.org/blog/2022-04-04-futhark-is-a-low-level-language.html#what-futhark-is-and-is-not-about),
and so not inlining them is typically fine. Also, since the code
generated by Futhark is consumed by downstream optimising compilers,
such functions may still end up being inlined by compilers that have
a better idea of how to micro-optimise for a specific machine.

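As an invented example of a category 2 function, think of something
like the following: a purely sequential scalar formula where Futhark's
own optimisations have little to contribute, but which the downstream
compiler is free to inline and micro-optimise at each call site:

```futhark
-- An arbitrary scalar formula (illustrative only): no arrays and no
-- parallelism, so none of the optimisations Futhark is good at apply,
-- and keeping it out-of-line mostly just keeps the generated code small.
def leaf (x: f64) : f64 =
  let t = f64.abs x
  in x / (1 + t + 0.5 * t * t + t * t * t / 6)
```
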
While we do not systematically track compile times for the Futhark
compiler (yes, yes, I know), I did do some ad-hoc measurements when I
refined our inlining strategy. A medium-sized program such as
[heston32](https://github.com/diku-dk/futhark-benchmarks/blob/master/misc/heston/heston32.fut)
had its compile time cut in half, with no measurable impact on
run-time performance. Generally, most programs saw no significant
change in their run-time behaviour.

In the future, I would like to investigate interprocedural fusion. Our
fusion algorithm is specified as a graph transformation, and it seems
reasonable to treat function application as just another fusible node,
with its fusion properties determined by its body, and fusion
implicitly performing inlining when appropriate.
