---
title: Why Is the Futhark Compiler so Bad at Inlining?
description: Picking your battles.
---

Quick answer: because it was not necessary to do better for the
research we wanted to do. The remainder of this post elaborates on
this answer, and on how the situation may change in the future.

*Inlining* is an optimisation by which a call to a procedure is
replaced by the body of the procedure. The advantage is twofold:

1. Elimination of procedure call overhead, usually pushing and popping
values from the stack.

2. The procedure body may be optimised in the context in which it
occurs. For example, if one of the actual arguments is a constant,
inlining may provide opportunities for [constant
folding](https://compileroptimizations.com/category/constant_folding.htm),
as sketched just after this list.

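As a concrete (and entirely invented) illustration, consider a small
Futhark function applied to constant arguments. After inlining `axpy`
at its call site, the constants become visible to the simplifier, and
constant folding reduces the body to almost nothing:

```futhark
-- A small function and a call site where two arguments are constants.
def axpy (a: f64) (x: f64) (y: f64) : f64 = a * x + y

def before (ys: []f64) : []f64 = map (axpy 0 1) ys

-- After inlining, the simplifier sees '0 * 1 + y' in each iteration,
-- which constant folding reduces to just 'y':
def after (ys: []f64) : []f64 = map (\y -> y) ys
```
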
Except in the case of very small functions, advantage (2) is the
important one. Inlining is a so-called *enabling optimisation* that
does not itself provide much benefit, but can allow other
optimisations to apply. Indeed, it is perhaps the most important of
all enabling optimisations, as most compilers are not good at
optimising across procedure boundaries.

The main disadvantage of inlining is that it makes the program larger.
This not only has an obvious detrimental effect on compile times, but
it may also hurt run-time performance, as the increased code size puts
pressure on the instruction cache.

Whether or not to inline a function is not at all obvious, and the
consequences of doing the wrong thing can be quite substantial, as
crucial optimisations fail to apply. In practice, compilers use
complicated and opaque heuristics, combined with programmer-provided
annotations, to decide when to inline.

When we [started the Futhark
project](2021-12-19-past-and-present.html), long before it was
[intended to be
useful](https://futhark-lang.org/examples.html#projects-using-futhark),
we wanted to study two main program transformations:
[fusion](https://compileroptimizations.com/category/loop_fusion.htm)
and [flattening](2019-02-18-futhark-at-ppopp.html). Both of these are
large-scale restructurings of program control flow, and they cannot
easily cross function boundaries. While interprocedural variants might
eventually merit study, this was not our initial goal. As a result, we
picked a very simple inlining strategy: *inline every function
application*. This allowed us to conduct our research without having
to face the thorny problem of when to inline.

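To make the connection to fusion concrete, consider this invented
example. As long as the call to `step` remains, the fusion engine sees
an opaque application rather than a producer; once `step` is inlined,
the two `map`s fuse into a single pass over the array:

```futhark
def step (xs: []f64) : []f64 = map (\x -> x + 1) xs

-- Before inlining, this is 'map (\x -> x * 2) (step xs)', which cannot
-- fuse across the call. After inlining 'step' it becomes
-- 'map (\x -> x * 2) (map (\x -> x + 1) xs)', which fuses into one map.
def pipeline (xs: []f64) : []f64 = map (\x -> x * 2) (step xs)
```
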
Interestingly, because Futhark programs tend to be rather small, this
aggressive inlining policy turned out to work fine for most programs.
While we did eventually add [some attributes for controlling
inlining](https://futhark.readthedocs.io/en/latest/language-reference.html#declaration-attributes),
they were used very rarely. One reason is that they have quite sharp
edges: the GPU backends in particular simply do not support calls of
certain functions in certain positions. Most notably, any function
that allocates memory within a GPU kernel must be inlined in order for
[memory expansion](https://futhark-lang.org/publications/ifl22.pdf) to
take place. Since this code generation detail is not exposed in the
source language, it is difficult to reason about.

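For reference, using the attributes looks roughly like this (the
functions are made up for illustration, and `#[noinline]` is the
counterpart listed in the same section of the manual). It is
`#[noinline]` that has the sharp edges: if a function like `smooth`
ends up applied in a position the GPU backends cannot support, the
compiler may simply reject the program:

```futhark
-- Force inlining of 'clamp' at every application.
#[inline]
def clamp (lo: f64) (hi: f64) (x: f64) : f64 = f64.max lo (f64.min hi x)

-- Ask the compiler to keep 'smooth' as a real function, which may not
-- be possible in all positions on the GPU backends.
#[noinline]
def smooth (xs: []f64) : []f64 = map2 (\a b -> (a + b) / 2) xs (rotate 1 xs)
```
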
However, inlining everything is clearly not going to cut it as Futhark
becomes useful for larger programs. As a result, I [spent some
time](https://github.com/diku-dk/futhark/pull/1857) earlier this year
on slightly refining our inlining strategy. Let me temper
expectations: it is still *extremely* aggressive. Specifically, we now
inline:

1. Any function or application specifically marked with the
`#[inline]` attribute.

2. Any function that is only applied once.

3. Any application of a function that creates arrays or contains any
parallel operations, *if* that application is itself inside a
parallel operation (see the sketch after this list).

4. Any function that is used inside an
[AD](https://futhark-lang.org/docs/prelude/doc/prelude/ad.html)
operator (because our AD transformation cannot handle applications
yet).

5. As is tradition, any function that satisfies a [strange and opaque
heuristic](https://github.com/diku-dk/futhark/blob/71aa80cfaf6c93fc32055204a6ecb0e4e865833d/src/Futhark/Optimise/InliningDeadFun.hs#L114-L123)
that essentially compares the number of parameters and size of the
return type with the size of the function body.

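To make rules 1 and 3 a little less abstract, here is an invented
sketch of the kind of code they apply to:

```futhark
-- Rule 1: inlined at every application because of the attribute (and a
-- function this small would be caught by the rule 5 heuristic anyway).
#[inline]
def sq (x: f64) : f64 = x * x

-- Rule 3: 'normalise' contains parallel operations, so its application
-- inside the 'map' in 'normalise_rows' is inlined, letting flattening
-- and fusion see the whole nested computation.
def normalise (xs: []f64) : []f64 =
  let s = f64.sum xs
  in map (\x -> x / s) xs

def normalise_rows (rows: [][]f64) : [][]f64 =
  map normalise rows
```
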
At a high level, the functions that make it through the gauntlet
without getting inlined tend to fall into one of two categories:

1. Large top-level functions that constitute significant subprograms.

2. Scalar leaf functions, typically those that implement some
nontrivial (but sequential) formula.

While not inlining the functions in category 1 can sometimes cost us
opportunities for fusion, inlining them would have a substantial
impact on compilation time. The functions in category 2 are usually not subject to
the [optimisations that the Futhark compiler is particularly good
at](https://futhark-lang.org/blog/2022-04-04-futhark-is-a-low-level-language.html#what-futhark-is-and-is-not-about),
and so not inlining them is typically fine. Also, since the code
generated by Futhark is consumed by downstream optimising compilers,
such functions may still end up being inlined by compilers that have
a better idea of how to micro-optimise for a specific machine.

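As an invented example of a category 2 function, think of something
like the following: a purely sequential scalar formula where Futhark's
own optimisations have little to contribute, but which the downstream
compiler is free to inline and micro-optimise at each call site:

```futhark
-- An arbitrary scalar formula (illustrative only): no arrays and no
-- parallelism, so none of the optimisations Futhark is good at apply,
-- and keeping it out-of-line mostly just keeps the generated code small.
def leaf (x: f64) : f64 =
  let t = f64.abs x
  in x / (1 + t + 0.5 * t * t + t * t * t / 6)
```
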
While we do not systematically track compile times for the Futhark
compiler (yes, yes, I know), I did do some ad-hoc measurements when I
refined our inlining strategy. A medium-sized program such as
[heston32](https://github.com/diku-dk/futhark-benchmarks/blob/master/misc/heston/heston32.fut)
had its compile time cut in half, with no measurable impact on
run-time performance. Generally, most programs saw no significant
change in their run-time behaviour.

In the future, I would like to investigate interprocedural fusion. Our
fusion algorithm is specified as a graph transformation, and it seems
reasonable to treat function application as just another fusible node,
with its fusion properties determined by its body, and fusion
implicitly performing inlining when appropriate.
