Add precompile statements #203

Open · wants to merge 11 commits into master
Conversation

@lkdvos lkdvos commented Feb 27, 2025

This PR is an attempt to fix some compilation-time issues, as brought up by @amilsted.
The problematic functions are attached in this file.

Current status

(v5.1.3)

Code Loading + Precompilation

❯ julia --project -e '@time using TensorOperations'
  0.016875 seconds (32.41 k allocations: 2.502 MiB)
  
❯ julia --project -e '@time Base.compilecache(Base.identify_package("TensorOperations"))'
  0.660362 seconds (18.10 k allocations: 1.061 MiB, 0.71% compilation time)

TTFX

❯ julia --project -e 'include("to_dense.jl"); @time to_dense_ncon(M, L, R)'
 17.228799 seconds (90.38 M allocations: 4.563 GiB, 2.30% gc time, 99.99% compilation time)
 
❯ julia --project -e 'include("to_dense.jl"); @time to_dense_tensor(M, L, R)'
 16.379553 seconds (80.78 M allocations: 4.093 GiB, 2.19% gc time, 100.00% compilation time)

Performance

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_ncon(M, L, R)'
  112.833 μs (825 allocations: 437.98 KiB)

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_tensor(M, L, R)'
  65.416 μs (56 allocations: 408.83 KiB)

Precompilation

After investigating, there are several possible improvements.
Given that there do not seem to be any type-instability issues, and I couldn't spot any obvious problems, I figured that this is partially just because we really do want to specialize quite a bit, assuming that TensorOperations is an important performance bottleneck.
Therefore, I started out by simply adding precompile statements (results below):
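As an illustration, here is a hedged sketch of what explicit precompile statements look like in general; the function and signatures below are simplified placeholders, not the actual statements added in this PR:

module ExampleModule

foo(A::AbstractMatrix, B::AbstractMatrix) = A * B  # stand-in for a library function

# Ask the compiler to cache native code for these concrete signatures while the
# module itself is being precompiled.
for T in (Float64, ComplexF64)
    precompile(foo, (Matrix{T}, Matrix{T}))
end

end # module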

This PR: adding precompilation

Code Loading + Precompilation

❯ julia --project -e '@time using TensorOperations'
  0.823528 seconds (689.87 k allocations: 68.223 MiB, 7.65% gc time)
  
❯ julia --project -e '@time Base.compilecache(Base.identify_package("TensorOperations"))'
137.465629 seconds (17.85 k allocations: 1.048 MiB, 0.00% compilation time)

TTFX

❯ julia --project -e 'include("to_dense.jl"); @time to_dense_ncon(M, L, R)'
  0.963772 seconds (8.88 M allocations: 444.266 MiB, 4.24% gc time, 99.87% compilation time)
 
❯ julia --project -e 'include("to_dense.jl"); @time to_dense_tensor(M, L, R)'
  0.059043 seconds (264.73 k allocations: 13.633 MiB, 99.53% compilation time)

Performance

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_ncon(M, L, R)'
  114.541 μs (825 allocations: 437.98 KiB)

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_tensor(M, L, R)'
  60.792 μs (56 allocations: 408.83 KiB)

The good: TTFX is negligible and the performance seems unaffected!
The bad: precompilation time now takes ~140 seconds...

Standardization

Because this package often forms the performance-critical part of larger codes, this might be acceptable, but we can improve further by standardizing some of the types in StridedViews, see this PR.
The precompile time is then down to ~40 seconds, while TTFX is still good and performance is mostly unchanged. Interestingly though, the StridedView approach does require an additional allocation (reshape seems to cause this?), and I'm not sure whether we can get rid of it.

This PR + StridedViews standardizing parenttype

Code Loading + Precompilation

❯ julia --project -e '@time using TensorOperations'
  0.641844 seconds (223.34 k allocations: 19.020 MiB, 6.27% gc time)
  
❯ julia --project -e '@time Base.compilecache(Base.identify_package("TensorOperations"))'
 41.829745 seconds (17.85 k allocations: 1.048 MiB, 0.01% compilation time)

TTFX

❯ julia --project -e 'include("to_dense.jl"); @time to_dense_ncon(M, L, R)'
  1.001807 seconds (8.89 M allocations: 444.702 MiB, 4.45% gc time, 99.81% compilation time)
 
❯ julia --project -e 'include("to_dense.jl"); @time to_dense_tensor(M, L, R)'
  0.064136 seconds (270.55 k allocations: 13.911 MiB, 6.60% gc time, 99.63% compilation time)

Performance

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_ncon(M, L, R)'
  115.417 μs (844 allocations: 438.23 KiB)

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_tensor(M, L, R)'
  61.250 μs (75 allocations: 409.08 KiB)

As an interesting side note, the StridedViews standardization on its own, without the precompile statements from this PR, already cuts the TTFX considerably:

only StridedViews standardizing parenttype

Code Loading + Precompilation

❯ julia --project -e '@time using TensorOperations'
  0.017374 seconds (34.04 k allocations: 2.602 MiB)
  
❯ julia --project -e '@time Base.compilecache(Base.identify_package("TensorOperations"))'
  0.775875 seconds (18.09 k allocations: 1.060 MiB, 0.57% compilation time)

TTFX

❯ julia --project -e 'include("to_dense.jl"); @time to_dense_ncon(M, L, R)'
  9.672027 seconds (40.09 M allocations: 1.996 GiB, 1.98% gc time, 99.98% compilation time)
 
❯ julia --project -e 'include("to_dense.jl"); @time to_dense_tensor(M, L, R)'
  8.857590 seconds (31.01 M allocations: 1.552 GiB, 1.64% gc time, 100.00% compilation time)

Performance

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_ncon(M, L, R)'
  114.333 μs (844 allocations: 438.23 KiB)

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_tensor(M, L, R)'
  63.042 μs (75 allocations: 409.08 KiB)

Jutho commented Feb 27, 2025

I am certainly OK with this. There used to be precompile statements, but at some point I removed them because the precompilation-tools ecosystem was evolving rapidly and I couldn't keep up.

lkdvos commented Feb 28, 2025

I've updated the implementation and added a section to the documentation page to explain what is going on. In particular, we now follow the same convention as PrecompileTools for disabling precompilation, and the precompile settings can be tweaked locally by people who want this.
If everything turns green, I would say we can merge this and tag a new version. I actually think this might warrant a v5.2.0 release; what do you think?
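For reference, a hedged sketch of how such a local opt-out typically works via Preferences.jl; the preference name "precompile_workload" is an assumption here, so check the documentation section added in this PR for the actual key:

using Preferences, TensorOperations
# hypothetical preference key; restart Julia afterwards so the package is
# precompiled again with the new setting
set_preferences!(TensorOperations, "precompile_workload" => false; force=true)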

@amilsted

Thanks @lkdvos, this looks like it helps quite a bit. There's still a lot of compilation going on, but I understand you want to avoid runtime dispatch and allocating vectors for indices and such.

A general comment: there is quite a deep call stack of functions that pass Index2Tuple arguments from level to level. Because these args are tuples, most of this stack (tensorcontract, stridedtensorcontract, blascontract, ...) gets recompiled for every shape combination in a pairwise contraction! Is it worth allowing these arguments to be arrays, for example when they are passed from ncon, until we are deeper in the stack? It seems kind of nuts that in some of the MPS code I am looking at we have >200 specializations of _blas_contract! (taking ~8 s to compile), according to SnoopCompile tools. Maybe that really does need specializing (to avoid index-array allocations), but the higher-level tensorcontract! methods also have 10-100 specializations (some of them costing almost 10 s of cumulative compile time) and account for a negligible portion of the runtime. Many of these are thin wrappers, like the ones that specialize on the conj args (but do nothing with the tuples). It seems like these could (optionally) pass vectors along that later get converted to tuples for specialization at a lower level, as sketched below.

Also, have you thought about allowing these tuples to be arrays more generally, and using something like Bumper to allocate the vectors that come out of the index-manipulating functions?
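To illustrate the suggestion about passing vectors through the thin wrappers, a hedged sketch with hypothetical function names (not the actual TensorOperations internals):

# Keep the index lists as vectors through the thin wrappers, and only convert to
# a tuple at a function barrier right before the specialized kernel.
specialized_kernel(p::NTuple{N,Int}) where {N} = sum(p)  # stand-in for _blas_contract!

function thin_wrapper(p::Vector{Int})
    # one compiled instance of thin_wrapper covers all index lists; the Tuple(p)
    # call is a dynamic dispatch into the kernel, which still fully specializes
    return specialized_kernel(Tuple(p))
end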

lkdvos commented Feb 28, 2025

@amilsted, does this include the changes from the StridedViews PR? I was hoping that that part is what mitigates the expensive compilations of _blas_contract!, while the tensorcontract! specializations don't actually matter too much.
I have to admit that I don't have much feel for what causes the largest amount of compilation time. I somehow figured that the higher-up callers are thin wrappers, so compiling these functions should not be that expensive, while the heavy-duty lower-level functions would be the expensive ones to compile (which is probably also what we want, because these determine the performance?).

I don't think the allocations of the index arrays are really the problem. I think one of the main reasons we chose tuples instead of Vectors of indices, and convert to tuples as soon as possible, is that this allows type-stable calls of the following form:

tensorcontract(A, pA, conjA, B, pB, conjB, pAB, ...) -> C

Here a Vector would actually make the result Array{T,N} where {N} instead, whereas with tuples N = numind(pAB) is available at compile time.
Our philosophy was to convert everything to tuples as high up the chain as possible to avoid having to recompile the entire chain, with the idea that the Tuple way is the most performant. (This is also why @tensor does not have this problem: it immediately has Tuples.)
ncon is a bit annoying in that regard, since there none of this is known at compile time, and therefore we pay the price of extra compilation without the benefit of type stability.
Obviously, this might have been a mistake; it's not that easy to say, and I just wanted to give my reasoning.
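A minimal hedged illustration of that inference difference, using a hypothetical helper rather than the actual TensorOperations code:

# With a tuple permutation the output rank is a compile-time constant, so the
# return type infers as a concrete Array{T,N}; with a Vector it does not.
alloc_output(::Type{T}, p::NTuple{N,Int}) where {T,N} = Array{T,N}(undef, ntuple(_ -> 1, N))
alloc_output(::Type{T}, p::Vector{Int}) where {T} = Array{T,length(p)}(undef, ntuple(_ -> 1, length(p)))

Base.return_types(alloc_output, (Type{Float64}, NTuple{3,Int}))  # Array{Float64, 3}
Base.return_types(alloc_output, (Type{Float64}, Vector{Int}))    # Array{Float64} (rank unknown)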

One thing we could also consider experimenting with is simply putting @nospecialize annotations on the pA, pB, etc. arguments and investigating how that affects performance. It does tend to be a bit of a mess to start battling the compiler heuristics like this, but I do agree that the current compilation time is a bit excessive.
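For concreteness, a hedged sketch of what such an annotation looks like; the wrapper below is a hypothetical stand-in, not an actual TensorOperations method:

function permute_indices_wrapper(A, @nospecialize(pA::Tuple))
    # @nospecialize asks the compiler to compile a single instance of this wrapper
    # for all tuple types of pA, instead of one instance per index permutation;
    # callees that receive pA can still specialize on its concrete type.
    return permutedims(A, (pA[1]..., pA[2]...))
end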

@amilsted

@lkdvos This is with the StridedViews PR. Together with this PR, it improves the MPO contraction case we looked at significantly, but I'm still seeing a lot of compilation for code that does more MPS stuff (sweeping algorithms). I should say that I have measured inference time rather than LLVM compile time; I don't know why so much time is being spent on inference. Perhaps there's a lot of constant propagation happening that blows up the number of cases?

Jutho commented Mar 1, 2025

@amilsted, I won't have any time to look at this before Thursday at the earliest, but if you could share some of these SnoopCompile outputs or scripts, that would be a great help in analyzing whether there are any obvious improvements up for grabs.

amilsted commented Mar 3, 2025

Ok, it looks like part of it was that the other code I looked at was using the Bumper allocator by default. Any chance of getting those methods precompiled too? :)

amilsted commented Mar 4, 2025

I will see if I can usefully share some SnoopCompile outputs for what remains. There does indeed seem to be a lot of constant propagation happening (lots of orange-gold in plots like this). Attempting to disable it at the ncon-caller level doesn't seem to get rid of it.

amilsted commented Mar 4, 2025

[flame graph screenshot]

The golden-orange columns are TensorOps/Strided methods.

lkdvos commented Mar 4, 2025

I'll look into the Bumper thing; it's a little more annoying, since that becomes a combinatoric mess of what needs to be compiled: Bumper returns an UnsafeArray from UnsafeArrays.jl.
Effectively, we'd have to precompile methods for all combinations of Array and UnsafeArray as inputs and outputs, because I'm not sure I can "normalize" those types the way I did in the StridedViews change. Perhaps we can map both Array and UnsafeArray onto Memory in 1.11+, but I don't know how feasible that is.
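To make that combinatorics concrete, a hedged sketch with a hypothetical three-argument kernel (the real TensorOperations signatures involve more arguments, eltypes, and ranks):

using UnsafeArrays: UnsafeArray  # assumes UnsafeArrays.jl is available

kernel!(C, A, B) = (C .= A .+ B; C)  # stand-in for a contraction kernel

# Every argument slot can independently be an Array or an UnsafeArray, giving
# 2^3 = 8 precompile targets per eltype and rank, versus 1 if the parent type
# could be normalized as in the StridedViews change.
const ARRAYS = (Array{Float64,2}, UnsafeArray{Float64,2})
for TC in ARRAYS, TA in ARRAYS, TB in ARRAYS
    precompile(kernel!, (TC, TA, TB))
end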

amilsted commented Mar 4, 2025

Yeah, fair enough. Perhaps we should just not default to using Bumper...
