Add precompile statements #203
Conversation
new precompilation approach (adc7686 to 5b7fc77)
I am certainly OK with this. There used to be precompile statements present, but at some point I removed them, as the PrecompileTools ecosystem was evolving rapidly and I couldn't keep up.
I've updated the implementation and added a section to the documentation page to explain what is going on. In particular, we now follow the same convention as …
Thanks @lkdvos, this looks like it helps quite a bit. There's still a lot of compilation going on, but I understand you want to avoid runtime dispatch and allocating vectors for indices and such.

A general comment: there is quite a deep call stack of functions that pass these tuples around. Also, have you thought about allowing these tuples to be arrays more generally, and using something like Bumper to allocate the vectors that come from index-manipulating functions?
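As a rough illustration of the Bumper suggestion, a minimal sketch (the helper `permuted_sizes_sum` is made up for illustration, not TensorOperations API):

```julia
using Bumper

# Hypothetical helper: fill a bump-allocated scratch vector from a permuted
# size tuple, instead of heap-allocating a temporary Vector.
function permuted_sizes_sum(sz::NTuple{N,Int}, perm::NTuple{N,Int}) where {N}
    @no_escape begin
        buf = @alloc(Int, N)      # freed automatically at the end of the block
        for i in 1:N
            buf[i] = sz[perm[i]]
        end
        sum(buf)                  # only the scalar result escapes the block
    end
end

permuted_sizes_sum((2, 3, 4), (3, 1, 2))  # == 9
```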
@amilsted, does this include the changes from the StridedViews PR? I was hoping that it is that part that mitigates the expensive compilations of …

I don't think the allocations of the index arrays are really the problem. I think one of the main reasons we chose tuples instead of vectors is the signature of functions like

```
tensorcontract(A, pA, conjA, B, pB, conjB, pAB, ...) -> C
```

Here a … One thing we could consider experimenting with as well is to simply put …
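To make the tuples-versus-vectors point concrete, a toy sketch (not the actual tensorcontract internals): with tuple permutations the output rank is a compile-time constant, so the result type is fully inferable, whereas a `Vector{Int}` permutation leaves the rank unknown to inference.

```julia
using Test

# With an NTuple permutation, the output rank N is a type parameter:
output_size(szA::Tuple, pA::NTuple{N,Int}) where {N} =
    ntuple(i -> szA[pA[i]], Val(N))            # inferred as NTuple{N,Int}

# With a Vector permutation, the length (hence the rank) is runtime data:
output_size_vec(szA::Tuple, pA::Vector{Int}) =
    Tuple(szA[i] for i in pA)                  # inferred only as Tuple{Vararg{Int}}

@inferred output_size((2, 3, 4), (3, 1))       # passes: NTuple{2,Int}
# @inferred output_size_vec((2, 3, 4), [3, 1]) # would throw: return type not concrete
```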
@lkdvos This is with the StridedViews PR. With this PR, it improves the MPO contraction case we looked at significantly, but I'm still seeing a lot of compilation in code that does more MPS stuff (sweeping algorithms). I should say that I have measured inference time rather than LLVM compile time; I don't know why so much time is being spent on inference. Perhaps there's a lot of constant propagation happening that blows up the number of cases?
@amilsted, I won't have any time to look at this before Thursday at the earliest, but if you could share some of these SnoopCompile outputs or scripts, that would be a great help in analyzing whether there are any obvious improvements up for grabs.
OK, it looks like part of it was that the other code I looked at was using the Bumper allocator by default. Any chance of getting those methods precompiled too? :)
I will see if I can usefully share some SnoopCompile outputs for what remains. There does indeed seem to be a lot of constant propagation happening (lots of orange-gold in plots like this). Attempting to disable it at the ncon-caller level doesn't seem to get rid of it.
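For reference, a sketch of how such inference timings and flame graphs can be collected (assuming a recent SnoopCompile where the macro is called `@snoop_inference`; the workload is illustrative):

```julia
using TensorOperations, SnoopCompileCore

tinf = @snoop_inference begin
    # stand-in workload; substitute the MPS sweeping code here
    A = randn(2, 3, 4); B = randn(4, 3, 2)
    @tensor C[a, b] := A[a, i, j] * B[j, i, b]
end

using SnoopCompile, ProfileView
flatten(tinf)[end-9:end]            # the ten costliest inference entries
ProfileView.view(flamegraph(tinf))  # const-prop shows up as the orange-gold frames
```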
I'll look into the Bumper thing; it's a little more annoying, since that becomes a combinatorial mess of what needs to be compiled: …
Yeah, fair enough. Perhaps we should just not default to using Bumper...
This PR is an attempt to fix some compilation time issues, as brought up by @amilsted.
The problem is the functions attached in this file.
Current status (v5.1.3)

(collapsed benchmark outputs: Code Loading + Precompilation, TTFX, Performance, Precompilation)
After investigating, there are several possible improvements.
Given that there do not seem to be any type-instability issues, and I couldn't spot any obvious problems, I figured this is partially just because we really do want to specialize quite a bit, on the assumption that TensorOperations is an important performance bottleneck.
Therefore, I started out by simply adding precompile statements:
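A minimal sketch of what that can look like with PrecompileTools.jl (the workload below is illustrative, not the PR's actual list of statements):

```julia
# In src/TensorOperations.jl:
using PrecompileTools

@setup_workload begin
    A = randn(Float64, 2, 2, 2)
    B = randn(Float64, 2, 2, 2)
    # Everything compiled while running this block is cached in the pkgimage:
    @compile_workload begin
        @tensor C[a, d] := A[a, b, c] * B[c, b, d]
        @tensor D[a, b, c] := 2 * A[a, b, c] + B[c, b, a]
    end
end
```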
This PR: adding precompilation

(collapsed benchmark outputs: Code Loading + Precompilation, TTFX, Performance)
The good: TTFX is negligible and the performance seems unaffected!
The bad: precompilation time now takes ~140 seconds...
Standardization
Because this package often forms a performance-critical part of downstream code, this might be acceptable, but we can improve by standardizing some of the types in StridedViews; see this PR.
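My reading of that standardization, as a conceptual sketch (`FlatView` is a stand-in type, not the actual StridedViews code): store every dense parent as a flat `Vector{T}`, so the view type no longer depends on the wrapper it came from and far fewer method instances need compiling.

```julia
# Conceptual sketch of "standardizing parenttype":
struct FlatView{T,N}
    parent::Vector{T}          # always a Vector{T}, whatever array came in
    size::NTuple{N,Int}
    strides::NTuple{N,Int}
    offset::Int
end

# The reshape to a flat vector is presumably the extra allocation mentioned
# below: it creates a new (small) Array header sharing the same data.
FlatView(A::Array{T,N}) where {T,N} =
    FlatView{T,N}(reshape(A, length(A)), size(A), strides(A), 0)

fv = FlatView(randn(4, 4, 4))
typeof(fv.parent)  # Vector{Float64}, independent of the original array type
```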
The precompile time is now at ~40 seconds, while TTFX is still good and performance is mostly unchanged. Interestingly though, the StridedView approach does require an additional allocation (reshape seems to cause this?), which I'm not sure we can get rid of.

This PR + StridedViews standardizing parenttype

(collapsed benchmark outputs: Code Loading + Precompilation, TTFX, Performance)
As an interesting side note, without the precompilation statements from this PR but with the StridedViews standardization, the TTFX is already cut considerably:
only StridedViews standardizing parenttype

(collapsed benchmark outputs: Code Loading + Precompilation, TTFX, Performance)