Add precompile statements #203

Open · wants to merge 11 commits into master
Conversation

@lkdvos lkdvos commented Feb 27, 2025

This PR is an attempt to fix some compilation-time issues, as brought up by @amilsted.
The problematic functions are attached in this file.

Current status

(v5.1.3)

Code Loading + Precompilation

❯ julia --project -e '@time using TensorOperations'
  0.016875 seconds (32.41 k allocations: 2.502 MiB)
  
❯ julia --project -e '@time Base.compilecache(Base.identify_package("TensorOperations"))'
  0.660362 seconds (18.10 k allocations: 1.061 MiB, 0.71% compilation time)

TTFX

❯ julia --project -e 'include("to_dense.jl"); @time to_dense_ncon(M, L, R)'
 17.228799 seconds (90.38 M allocations: 4.563 GiB, 2.30% gc time, 99.99% compilation time)
 
❯ julia --project -e 'include("to_dense.jl"); @time to_dense_tensor(M, L, R)'
 16.379553 seconds (80.78 M allocations: 4.093 GiB, 2.19% gc time, 100.00% compilation time)

Performance

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_ncon(M, L, R)'
  112.833 μs (825 allocations: 437.98 KiB)

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_tensor(M, L, R)'
  65.416 μs (56 allocations: 408.83 KiB)

Precompilation

After investigating, there are several possible improvements.
Given that there do not seem to be any type-instability issues, and I couldn't spot any obvious problems, I figured that this is partially just because we really do want to specialize quite a bit, assuming that TensorOperations is an important performance bottleneck.
Therefore, I started out by simply adding precompile statements (results below):
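As an illustration, here is a hedged sketch of what explicit precompile statements look like in general; the function and signatures below are simplified placeholders, not the actual statements added in this PR:

module ExampleModule

foo(A::AbstractMatrix, B::AbstractMatrix) = A * B  # stand-in for a library function

# Ask the compiler to cache native code for these concrete signatures while the
# module itself is being precompiled.
for T in (Float64, ComplexF64)
    precompile(foo, (Matrix{T}, Matrix{T}))
end

end # module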

This PR: adding precompilation

Code Loading + Precompilation

❯ julia --project -e '@time using TensorOperations'
  0.823528 seconds (689.87 k allocations: 68.223 MiB, 7.65% gc time)
  
❯ julia --project -e '@time Base.compilecache(Base.identify_package("TensorOperations"))'
137.465629 seconds (17.85 k allocations: 1.048 MiB, 0.00% compilation time)

TTFX

❯ julia --project -e 'include("to_dense.jl"); @time to_dense_ncon(M, L, R)'
  0.963772 seconds (8.88 M allocations: 444.266 MiB, 4.24% gc time, 99.87% compilation time)
 
❯ julia --project -e 'include("to_dense.jl"); @time to_dense_tensor(M, L, R)'
  0.059043 seconds (264.73 k allocations: 13.633 MiB, 99.53% compilation time)

Performance

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_ncon(M, L, R)'
  114.541 μs (825 allocations: 437.98 KiB)

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_tensor(M, L, R)'
  60.792 μs (56 allocations: 408.83 KiB)

The good: TTFX is negligible and the performance seems unaffected!
The bad: precompilation time now takes ~140 seconds...

Standardization

Because this package often forms the performance-critical part of larger codes, this might be acceptable, but we can improve further by standardizing some of the types in StridedViews, see this PR.
The precompile time is then down to ~40 seconds, while TTFX is still good and performance is mostly unchanged. Interestingly though, the StridedView approach does require an additional allocation (reshape seems to cause this?), and I'm not sure whether we can get rid of it.

This PR + StridedViews standardizing parenttype

Code Loading + Precompilation

❯ julia --project -e '@time using TensorOperations'
  0.641844 seconds (223.34 k allocations: 19.020 MiB, 6.27% gc time)
  
❯ julia --project -e '@time Base.compilecache(Base.identify_package("TensorOperations"))'
 41.829745 seconds (17.85 k allocations: 1.048 MiB, 0.01% compilation time)

TTFX

❯ julia --project -e 'include("to_dense.jl"); @time to_dense_ncon(M, L, R)'
  1.001807 seconds (8.89 M allocations: 444.702 MiB, 4.45% gc time, 99.81% compilation time)
 
❯ julia --project -e 'include("to_dense.jl"); @time to_dense_tensor(M, L, R)'
  0.064136 seconds (270.55 k allocations: 13.911 MiB, 6.60% gc time, 99.63% compilation time)

Performance

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_ncon(M, L, R)'
  115.417 μs (844 allocations: 438.23 KiB)

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_tensor(M, L, R)'
  61.250 μs (75 allocations: 409.08 KiB)

As an interesting side note, the StridedViews standardization on its own, without the precompile statements from this PR, already cuts the TTFX considerably:

only StridedViews standardizing parenttype

Code Loading + Precompilation

❯ julia --project -e '@time using TensorOperations'
  0.017374 seconds (34.04 k allocations: 2.602 MiB)
  
❯ julia --project -e '@time Base.compilecache(Base.identify_package("TensorOperations"))'
  0.775875 seconds (18.09 k allocations: 1.060 MiB, 0.57% compilation time)

TTFX

❯ julia --project -e 'include("to_dense.jl"); @time to_dense_ncon(M, L, R)'
  9.672027 seconds (40.09 M allocations: 1.996 GiB, 1.98% gc time, 99.98% compilation time)
 
❯ julia --project -e 'include("to_dense.jl"); @time to_dense_tensor(M, L, R)'
  8.857590 seconds (31.01 M allocations: 1.552 GiB, 1.64% gc time, 100.00% compilation time)

Performance

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_ncon(M, L, R)'
  114.333 μs (844 allocations: 438.23 KiB)

❯ julia --project -e 'include("to_dense.jl"); using BenchmarkTools; @btime to_dense_tensor(M, L, R)'
  63.042 μs (75 allocations: 409.08 KiB)

Jutho commented Feb 27, 2025

I am certainly OK with this. There used to be precompile statements, but at some point I removed them because the precompilation-tools ecosystem was evolving rapidly and I couldn't keep up.

lkdvos commented Feb 28, 2025

I've updated the implementation and added a section to the documentation page to explain what is going on. In particular, we now follow the same convention as PrecompileTools for disabling precompilation, and the precompile settings can be tweaked locally by people who want this.
If everything turns green, I would say we can merge this and tag a new version. I actually think this might warrant a v5.2.0 release; what do you think?
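For reference, a hedged sketch of how such a local opt-out typically works via Preferences.jl; the preference name "precompile_workload" is an assumption here, so check the documentation section added in this PR for the actual key:

using Preferences, TensorOperations
# hypothetical preference key; restart Julia afterwards so the package is
# precompiled again with the new setting
set_preferences!(TensorOperations, "precompile_workload" => false; force=true)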

@amilsted

Thanks @lkdvos, this looks like it helps quite a bit. There's still a lot of compilation going on, but I understand you want to avoid runtime dispatch and allocating vectors for indices and such.

A general comment: there is quite a deep call stack of functions that pass Index2Tuple arguments from level to level. Because these args are tuples, most of this stack (tensorcontract, stridedtensorcontract, blascontract, ...) gets recompiled for every shape combination in a pairwise contraction! Is it worth allowing these arguments to be arrays, for example when they are passed from ncon, until we are deeper in the stack? It seems kind of nuts that in some of the MPS code I am looking at we have >200 specializations of _blas_contract! (taking ~8 s to compile), according to SnoopCompile tools. Maybe that really does need specializing (to avoid index-array allocations), but the higher-level tensorcontract! methods also have 10-100 specializations (some of them costing almost 10 s of cumulative compile time) and account for a negligible portion of the runtime. Many of these are thin wrappers, like the ones that specialize on the conj args (but do nothing with the tuples). It seems like these could (optionally) pass vectors along that later get converted to tuples for specialization at a lower level, as sketched below.

Also, have you thought about allowing these tuples to be arrays more generally, and using something like Bumper to allocate the vectors that come out of the index-manipulating functions?
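To illustrate the suggestion about passing vectors through the thin wrappers, a hedged sketch with hypothetical function names (not the actual TensorOperations internals):

# Keep the index lists as vectors through the thin wrappers, and only convert to
# a tuple at a function barrier right before the specialized kernel.
specialized_kernel(p::NTuple{N,Int}) where {N} = sum(p)  # stand-in for _blas_contract!

function thin_wrapper(p::Vector{Int})
    # one compiled instance of thin_wrapper covers all index lists; the Tuple(p)
    # call is a dynamic dispatch into the kernel, which still fully specializes
    return specialized_kernel(Tuple(p))
end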

lkdvos commented Feb 28, 2025

@amilsted, does this include the changes from the StridedViews PR? I was hoping that that part is what mitigates the expensive compilations of _blas_contract!, while the tensorcontract! specializations don't actually matter too much.
I have to admit that I don't have much feel for what causes the largest amount of compilation time. I somehow figured that the higher-up callers are thin wrappers, so compiling these functions should not be that expensive, while the heavy-duty lower-level functions would be the expensive ones to compile (which is probably also what we want, because these determine the performance?).

I don't think the allocations of the index arrays are really the problem. I think one of the main reasons we chose tuples instead of Vectors of indices, and convert to tuples as soon as possible, is that this allows type-stable calls of the following form:

tensorcontract(A, pA, conjA, B, pB, conjB, pAB, ...) -> C

Here a Vector would actually make the result Array{T,N} where {N} instead, whereas with tuples N = numind(pAB) is available at compile time.
Our philosophy was to convert everything to tuples as high up the chain as possible to avoid having to recompile the entire chain, with the idea that the Tuple way is the most performant. (This is also why @tensor does not have this problem: it immediately has Tuples.)
ncon is a bit annoying in that regard, since there none of this is known at compile time, and therefore we pay the price of extra compilation without the benefit of type stability.
Obviously, this might have been a mistake; it's not that easy to say, and I just wanted to give my reasoning.
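A minimal hedged illustration of that inference difference, using a hypothetical helper rather than the actual TensorOperations code:

# With a tuple permutation the output rank is a compile-time constant, so the
# return type infers as a concrete Array{T,N}; with a Vector it does not.
alloc_output(::Type{T}, p::NTuple{N,Int}) where {T,N} = Array{T,N}(undef, ntuple(_ -> 1, N))
alloc_output(::Type{T}, p::Vector{Int}) where {T} = Array{T,length(p)}(undef, ntuple(_ -> 1, length(p)))

Base.return_types(alloc_output, (Type{Float64}, NTuple{3,Int}))  # Array{Float64, 3}
Base.return_types(alloc_output, (Type{Float64}, Vector{Int}))    # Array{Float64} (rank unknown)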

One thing we could also consider experimenting with is simply putting @nospecialize annotations on the pA, pB, etc. arguments and investigating how that affects performance. It does tend to be a bit of a mess to start battling the compiler heuristics like this, but I do agree that the current compilation time is a bit excessive.
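For concreteness, a hedged sketch of what such an annotation looks like; the wrapper below is a hypothetical stand-in, not an actual TensorOperations method:

function permute_indices_wrapper(A, @nospecialize(pA::Tuple))
    # @nospecialize asks the compiler to compile a single instance of this wrapper
    # for all tuple types of pA, instead of one instance per index permutation;
    # callees that receive pA can still specialize on its concrete type.
    return permutedims(A, (pA[1]..., pA[2]...))
end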

@amilsted

@lkdvos This is with the StridedViews PR. Together with this PR, it improves the MPO contraction case we looked at significantly, but I'm still seeing a lot of compilation for code that does more MPS stuff (sweeping algorithms). I should say that I have measured inference time rather than LLVM compile time; I don't know why so much time is being spent on inference. Perhaps there's a lot of constant propagation happening that blows up the number of cases?

Jutho commented Mar 1, 2025

@amilsted, I won't have any time to look at this before Thursday at the earliest, but if you could share some of these SnoopCompile outputs or scripts, that would be a great help in analyzing whether there are any obvious improvements up for grabs.

amilsted commented Mar 3, 2025

Ok, it looks like part of it was that the other code I looked at was using the Bumper allocator by default. Any chance of getting those methods precompiled too? :)

amilsted commented Mar 4, 2025

I will see if I can usefully share some SnoopCompile outputs for what remains. There does indeed seem to be a lot of constant propagation happening (lots of orange-gold in plots like this). Attempting to disable it at the ncon-caller level doesn't seem to get rid of it.

amilsted commented Mar 4, 2025

[flame graph screenshot]

The golden-orange columns are TensorOps/Strided methods.

lkdvos commented Mar 4, 2025

I'll look into the Bumper thing; it's a little more annoying, since that becomes a combinatoric mess of what needs to be compiled: Bumper returns an UnsafeArray from UnsafeArrays.jl.
Effectively, we'd have to precompile methods for all combinations of Array and UnsafeArray as inputs and outputs, because I'm not sure I can "normalize" those types the way I did in the StridedViews change. Perhaps we can map both Array and UnsafeArray onto Memory in 1.11+, but I don't know how feasible that is.
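To make that combinatorics concrete, a hedged sketch with a hypothetical three-argument kernel (the real TensorOperations signatures involve more arguments, eltypes, and ranks):

using UnsafeArrays: UnsafeArray  # assumes UnsafeArrays.jl is available

kernel!(C, A, B) = (C .= A .+ B; C)  # stand-in for a contraction kernel

# Every argument slot can independently be an Array or an UnsafeArray, giving
# 2^3 = 8 precompile targets per eltype and rank, versus 1 if the parent type
# could be normalized as in the StridedViews change.
const ARRAYS = (Array{Float64,2}, UnsafeArray{Float64,2})
for TC in ARRAYS, TA in ARRAYS, TB in ARRAYS
    precompile(kernel!, (TC, TA, TB))
end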

amilsted commented Mar 4, 2025

Yeah, fair enough. Perhaps we should just not default to using Bumper...
