RMSD with multicore and timeout functionality #113
Conversation
…tionality Note: MyPy and Flake8 don't pass (yet)
Thanks @Jnelen for the nice contribution! This is just a first quick pass (it's late...), I'll have a deeper look tomorrow.
@Jnelen sorry, I clicked "Update Branch" without thinking too much. You'll have to pull before pushing further changes, otherwise you will get an error. Sorry about that!
General cleanup and improvements Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Make rmsd_timeout a private function MyPy still complains when comparing num_workers with os.cpu_count()
It now supports mol-mol, mol-list, and list-list as input. In the case of mol-list, all calculations are performed separately to take advantage of multiple cores.
Please test the suggested changes. I'm a bit short in time these days so I did not test them myself, sorry.
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Refactor to prmsdwrapper Some improvements to typing
For the `Value` ctype I picked a regular, single-precision float. However, we could also use a double-precision float if you think this is more appropriate!
I just suppressed the 2 remaining MyPy errors. One of them should hopefully get fixed by MyPy (see the earlier issue I referred to); the other is the problem that `os.cpu_count()` can return `None`. Functionally I think it's basically how we want it, but perhaps some parts of the implementation can still be cleaned up? Besides that, I suppose some tests should be added as well? The timeout feature should be relatively easy to test (by using one of the "problematic" compounds that made me start looking into this in the first place). But I'm not sure how to properly test the multicore feature? By the way, there is no rush in getting this finished quickly, so please only review it if you actually have the time!
Yes, we need tests. =)
Yes. I think the worst one you reported is a great candidate.
I think GitHub Actions have access to multiple cores, but I should double check. In any case, the test would be on correctness. Eventually we could also add a benchmark test, but it's not a priority IMO. If you have some results from your work (where you see an improvement), you can just post the timings here for future reference. But not a priority, the important thing is that it's correct.
Thanks. You are doing great work, so I don't want it to stall. ;)
spyrmsd/rmsd.py (Outdated)

```python
# Cast the molecules to lists if they aren't already
if not isinstance(molrefs, list):
    molrefs = [molrefs]
if not isinstance(mols, list):
    mols = [mols]

# Match the length of the molref
if len(molrefs) == 1 and len(molrefs) < len(mols):
    molrefs = molrefs * len(mols)
```
What we want to do is to parallelise on graph isomorphisms, since computing the RMSD is rather fast with NumPy vectorisation. Therefore, I think the use case of multiple references each with multiple different poses (i.e. in the case of docking), is an important use case that is not covered, is it?
The current behaviour is like this: if a user inputs one molref and a list of mols (`[mol1, mol2, ...]`) to compare it to (so this could be different docked poses), multiple `rmsdwrapper` calculations are performed across different cores:

```python
rmsdwrapper(molref, mol1)  # core 1
rmsdwrapper(molref, mol2)  # core 2
...
```

So this is just an embarrassingly parallel approach using the "regular" `rmsdwrapper`; the graph isomorphism calculations aren't really being parallelized explicitly. Thinking more about it, I'm honestly not even sure how this interacts with the graph caching either (because of the multicore aspect).
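To make the pattern concrete, here is a minimal, self-contained sketch of that embarrassingly parallel approach. `pair_rmsd` is a hypothetical stand-in for a pairwise `rmsdwrapper(molref, mol)` call (so the example runs without spyrmsd), and a thread pool stands in for the process pool used by the actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def pair_rmsd(ref, mol):
    # Hypothetical stand-in for spyrmsd's pairwise rmsdwrapper(ref, mol);
    # a dummy "distance" keeps the sketch runnable without spyrmsd.
    return abs(ref - mol)

def embarrassingly_parallel_rmsd(molref, mols, num_workers=2):
    # One independent pairwise computation per pose, fanned out over workers:
    # conceptually rmsdwrapper(molref, mol1) on core 1, rmsdwrapper(molref,
    # mol2) on core 2, and so on. Results come back in input order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(lambda mol: pair_rmsd(molref, mol), mols))
```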
My implementation mostly had my own use case in mind, which I thought would generalize decently well. Basically, I wanted to accelerate calculating the RMSD of a few ligand poses from different docking methods after a virtual screening run. However, considering your insights, maybe my approach was/is too simplistic and needs to be improved. I do think this would introduce even more complexity, so we will need to think things through very carefully (so basically continue like we have been since the start, haha).
I added some very basic tests for the prmsdwrapper function (I basically copied the tests for the rmsdwrapper function and used the prmsdwrapper function instead). I also added a test for the timeout function. However, one thing I noticed is that for some reason this runs twice: once with a True and once with a False variable. When the tests were giving errors I got this:
I fixed the error of the assert, but the test still runs twice. I'm still confused, since there isn't a parametrize, or even an input variable. I couldn't immediately figure it out, but I assume it's getting this from some earlier statement? Of course there is still some work left improving the actual prmsdwrapper function, but I suppose some basic tests can also add sanity checks to ensure we're not breaking stuff.
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Thanks @Jnelen! I'm a bit swamped this week too, but I'll try to have a look and reply ASAP. Sorry for the delays.
Yes, this is a bit confusing. I think the two runs are coming from the following fixture, which is automatically applied to all the tests of the file (lines 11 to 15 in 08764e1):
It's used to trigger a separate code path deep inside the minimized RMSD calculation. So the two runs are normal and you should not worry about them, but it's great that you look this closely at what's happening!
No problem at all. Actually, I will probably also be quite inactive until Wednesday next week. I should still be able to respond to some comments, but I probably won't be able to change or test any new code until then. Just as a heads-up!
Ah, that explains it, thanks for the clarification!
I'm back from my break! Right before that I actually updated my use case (ESSENCE-Dock) to work with a timeout and multicore, but I ended up using Pebble since it handles both at the same time. For my specific use case I think it was the most elegant solution, and I use a Singularity image anyway, so I didn't mind adding an additional dependency.

However, this got me thinking about whether we should change the current multicore and timeout implementation. Maybe I had my (specific) use case too much in mind? It definitely is functional, and will work just fine to process a large number of molecules across different CPUs. However, the calculation of graph isomorphisms isn't currently accelerated; for that I assume we need to look into specific solutions for the different backends (since I assume they can take care of the multicore support)? I could also see how this could lead to problems with the current multicore implementation, since it will want to use multiple cores in each process. That's why I'm wondering if we should make larger changes to the current version.

Finally, as for a benchmark, I haven't run a proper one yet, but I'll try to run one later with the current implementation.
Hi, I finally got around to working on a simple benchmark, but I have started to notice quite a notable difference between running the calculations the first time versus subsequent times. Here is a concrete example for the same dataset:
Versus the second (and subsequent) time(s):
Do you know why this is? I'm not sure, but I think this is because of the loading of the molecules, and in subsequent runs these are maybe cached? Also note the discrepancy between the real time and the user time in the first run. I tried several different datasets, and this pattern seems to be very common. Do you know what could be happening? I will try to follow up later with some concrete "benchmarks" and more context!
Alright, I did some testing, and the performance of the parallel implementation is a lot worse than I thought it would be, to be honest. In the edge-case tests I was doing it seemed good, because even if some "problematic" molecules were present it would still finish in a reasonable amount of time, whereas with the "regular" implementation it would get killed or take a very long time. However, in more "normal" scenarios, the current prmsdwrapper seems to degrade the performance by a lot. I used some docking examples from DUD-E to test. This example is for HS90A, where I used the LeadFinder (LF) and Gnina (GN) results to calculate the RMSD for 4938 compounds. You can find the scripts I used here: benchmark_scripts.zip. Anyway, here are my numbers:
Something else I noticed is that error handling in the |
Thanks for your thorough review @RMeli! I tried to address most issues. I added an example test for the more realistic test case (with chunksize=2 and a timeout), but I made it skip, since in the current implementation we force the chunksize to be 1 when a timeout is set. However, I think this can still be interesting for debugging/development purposes, which is why I added it anyway (and made it skip so it doesn't give an error for now).
I agree it's interesting, but I think it would be better to leave it as a comment here in this PR instead of within the codebase. As I mentioned in the last comment, I completely forgot we went for exclusivity of the two features, so some of my comments were a bit confusing/confused. Apologies for the confusion.
Test for the hypothetical case where a timeout on chunks of size 2 makes the whole test fail. The test checks that the whole chunk is set to `np.nan`. Currently, specifying a timeout will automatically change the chunk size to 1.

```python
@pytest.mark.skip(
    reason="Chunksize is currently automatically set to 1 when a timeout is set"
)
def test_prmsdwrapper_mixed_timeout(muparfostat, benzene) -> None:
    muparfostat_mol = copy.deepcopy(muparfostat)
    benzene_mol = copy.deepcopy(benzene)

    lst_1 = [
        muparfostat_mol,
        benzene_mol,
        benzene_mol,
        benzene_mol,
        benzene_mol,
        muparfostat_mol,
    ]
    lst_2 = [
        muparfostat_mol,
        benzene_mol,
        benzene_mol,
        benzene_mol,
        benzene_mol,
        muparfostat_mol,
    ]

    # Currently we force the num_workers to be 1 when there is a timeout set,
    # but this could make it easier to debug things in the future
    RMSDlist = prmsdwrapper(lst_1, lst_2, timeout=1e-3, num_workers=1, chunksize=2)

    expectedResult = [np.nan, np.nan, 0, 0, np.nan, np.nan]
    np.testing.assert_allclose(RMSDlist, expectedResult, atol=1e-5)
```
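As a side note on the chunking semantics this test probes, a hypothetical helper like the one below (not part of the PR; names are illustrative) shows why a whole chunk gets `np.nan`: the timeout fires per chunk, not per pose, which is why the implementation forces the chunk size to 1 when a timeout is set.

```python
def chunked(items, chunksize):
    # Hypothetical helper: split the work list into fixed-size chunks.
    # A timeout applies to a whole chunk, so with chunksize=2 both poses
    # of a timed-out chunk would be reported as NaN together.
    return [items[i:i + chunksize] for i in range(0, len(items), chunksize)]

# The six poses in the test above form three chunks of two; a timeout on the
# first and last chunks (the ones containing the slow molecule) would give
# [nan, nan, 0, 0, nan, nan], matching expectedResult.
```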
I added some clarifications in the code! I saw we still had some conversations open, but they were mostly discussions about things that can be improved in the future.
Hi @Jnelen, I'm back looking at this. Really sorry for the massive delays here. When I run the test locally, |
The timeout functionality works great IMO:

```python
In [8]: p.prmsdwrapper(m, m, timeout=1)
/home/rmeli/git/oss/spyrmsd/spyrmsd/parallel.py:134: UserWarning: 1 compounds failed to process successfully and have been added as 'np.nan'. 0 compounds raised an error, 1 chunks timed out
  warnings.warn(
Out[8]: [nan]

In [9]: p.prmsdwrapper(m, [m, m, m, m, m], timeout=1)
/home/rmeli/git/oss/spyrmsd/spyrmsd/parallel.py:134: UserWarning: 5 compounds failed to process successfully and have been added as 'np.nan'. 0 compounds raised an error, 5 chunks timed out
  warnings.warn(
Out[9]: [nan, nan, nan, nan, nan]
```
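For reference, the per-computation timeout pattern behind this behaviour can be sketched in a self-contained way as follows. `run_with_timeout` and `slow_rmsd` are hypothetical illustrations (not the PR's API); a worker thread stands in for the separate process the actual implementation uses, since a process can be terminated whereas a hung thread can only be abandoned.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

def run_with_timeout(func, args, timeout):
    # Run func(*args) in a worker, returning NaN if it does not finish
    # within `timeout` seconds (mirroring prmsdwrapper's np.nan on timeout).
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(func, *args).result(timeout=timeout)
    except FuturesTimeout:
        return math.nan
    finally:
        pool.shutdown(wait=False)

def slow_rmsd(value):
    # Hypothetical stand-in for an RMSD call stuck in an isomorphism search.
    time.sleep(0.5)
    return value
```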
The open discussions are mostly on what to parallelize on. Currently, if I understand correctly, parallelization happens over different poses against the same reference. Essentially

```python
prmsdwrapper(m, [m1, m2])
```

becomes

```python
# In parallel
rmsdwrapper(m, m1)
rmsdwrapper(m, m2)
```

This approach doesn't seem particularly appealing to me, because you re-compute the graph isomorphism each time. Computing the graph isomorphism is by far the most expensive part.

**One reference, very many poses for the same reference**

In this case you have

```python
prmsdwrapper(m, [m1, m2, m3, m4, m5, m6, m7, ...])
```

and it should transform to something like

```python
# Serial, compute isomorphism once
isomorphism = None
rmsd1, isomorphism = _rmsd_isomorphic_core(m, m1, isomorphism)  # Isomorphisms are cached

# In parallel
rmsd2, _ = _rmsd_isomorphic_core(m, m2, isomorphism)  # Re-use isomorphisms
rmsd3, _ = _rmsd_isomorphic_core(m, m3, isomorphism)
rmsd4, _ = _rmsd_isomorphic_core(m, m4, isomorphism)
...
```

This might be beneficial in two ways: for very symmetric molecules the number of isomorphisms to go through might be high, and the number of molecules might be very high.

**Multiple references**

This would consist in parallelizing over multiple references (i.e. over graph isomorphisms):

```python
prmsdwrapper([m, n], [[m1, m2], [n1, n2]])
```

becomes

```python
# In parallel
prmsdwrapper(m, [m1, m2])
prmsdwrapper(n, [n1, n2])
```

I think this is extremely useful, but it should be reasonably straightforward for a user to implement (without timeout).

Which option do you think would be more useful? If the way you implemented it right now is already useful for you, could you please clarify how it is so? I'm sorry if I totally misunderstood something.

Apart from this design comment, which we discussed here and there above, everything else looks great. The timeout functionality alone is already extremely useful.
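A minimal runnable sketch of the one-reference/many-poses idea, with a hypothetical `_rmsd_core` standing in for spyrmsd's `_rmsd_isomorphic_core` (so the example runs on its own): pay for the isomorphism search once, then fan the remaining poses out with the cached result.

```python
from concurrent.futures import ThreadPoolExecutor

def _rmsd_core(ref, mol, isomorphism=None):
    # Hypothetical stand-in for _rmsd_isomorphic_core: returns (rmsd, iso).
    # The expensive isomorphism search only runs when no cached result is
    # passed in; the "RMSD" itself is a dummy value to keep this runnable.
    if isomorphism is None:
        isomorphism = [0, 1, 2]  # pretend this search was expensive
    return abs(ref - mol), isomorphism

def prmsd_single_reference(ref, mols, num_workers=4):
    # Serial first call: compute (and cache) the isomorphism once.
    first_rmsd, iso = _rmsd_core(ref, mols[0])
    # Parallel remaining calls: every worker re-uses the cached isomorphism.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        rest = pool.map(lambda mol: _rmsd_core(ref, mol, iso)[0], mols[1:])
    return [first_rmsd, *rest]
```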
Yes, this is because we also test the edge case where we hit the timeout timeframe. In those cases the test is stuck until the timeout hits, which slows down some of those tests. If you skip those tests, it should be just as quick, if not quicker, than the original tests!
Right! Ignore this brain fart of mine! ;)
Regarding the parallelization details, the current implementation is indeed quite "dumb" and just runs prmsdwrapper "independently" in parallel. The main use for me is the timeout feature, since I often process DrugBank with some consensus docking protocol. DrugBank contains quite a few molecules with many isomorphisms, blocking any other calculations. The fact that you can also run the calculations using multiple cores is a nice bonus as well. I already have my own implementation for this in my project, but I feel a general function for this could be useful. As you indicate, there are ways to improve it. Ideally, the calculation of the isomorphisms itself would be done multicore and also cached, followed by performing parallel RMSD calculations using the cached isomorphisms.
The timeout feature is really useful, so I'd really like to merge it. If you don't have time to work on this anymore, I'd suggest merging it as-is and leaving it undocumented so that it does not start to attract too many users. Then I'll implement the version re-using the isomorphism. How does this sound?
Sounds great to me! And apologies for not being able to fully finish it. Maybe in a couple of weeks I'll have some more time to have a look again, but I don't want to promise anything. So if you want to work on it, please go ahead!
No worries, it's totally understandable. It's great work and I want to get it in and improve it incrementally instead of letting it sit here for longer. (It's been dragging for long enough, really sorry about that!) I'll have a last look in the coming days and then merge.
Can you please update the |
How does this look? Please let me know if you want me to change the wording/something else! |
LGTM, thank you for all the work and for putting up with a very slow review. I'm happy to merge it as-is as an experimental feature.
The very last question: I looked at the CodeCov report and realised that the `TimeoutError` branch is not covered. However, timeout is tested. Do you know why this is the case?
Fix test so it works with the changes made in PR RMeli#118 Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Thank you @Jnelen for all the contributions (and also the very interesting observations during the long discussions). I'm sorry this took way longer than initially hoped!
No problem at all, it was really nice working with you and I also learned a lot along the way! Thanks for sticking with me, I will still be here if you would like my input on further improvements!
As we discussed in #105: here is my first shot at implementation of RMSD functions with multicore and timeout functionalities.
For now, I kept the `rmsd_timeout` function public, however we could also just make it private. This function supports a molref with multiple mols, something that the multicore function doesn't support directly (everything is done pairwise there). Although I guess there is an easy workaround for the user? Maybe it's even better to implement this ourselves: if only one mol is supplied as the ref mol, extend the list to match the length of the mol list (like this: `[refmol]*len(mols)`). This would also give each molecule a separate timeout.

Having the possibility of a mol list in the `rmsd_timeout` function also made me return the raw result of rmsdwrapper (a list) instead of the value (because otherwise only one value would be returned). So if we make it private, it could clean things up and also emulate the regular rmsdwrapper functionality better! (I just came up with this while writing this description; if you like it, I will implement it tomorrow!)

I also still need to add tests. I guess we could add a test for a timeout, a test for matching mols and molrefs lists, ... But multicore might be harder to test?
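The `[refmol]*len(mols)` idea can be sketched as a small helper; `broadcast_refs` is a hypothetical illustration, not the PR's API.

```python
def broadcast_refs(molrefs, mols):
    # If a single reference is supplied against a list of poses, repeat the
    # reference so every pose gets its own (reference, pose) pair, and, in
    # the timeout variant, its own separate timeout.
    if not isinstance(molrefs, list):
        molrefs = [molrefs]
    if not isinstance(mols, list):
        mols = [mols]
    if len(molrefs) == 1 and len(mols) > 1:
        molrefs = molrefs * len(mols)
    return molrefs, mols
```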
Any feedback welcome as always!
Note: I intentionally kept all the new functions and imports together at the bottom so it's easier to see what's added. This means Flake8 doesn't pass. MyPy also doesn't pass yet, but I'll get to fixing that tomorrow as well!