RMSD with multicore and timeout functionality #113
Parallelisation is performed over graph isomorphisms, so I don't think a single reference structure is particularly interesting to have. Is this for the timeout functionality only? If yes, it might be worth adding a note in the docstring explaining the two (non-exclusive) functionalities of this function, and that parallelisation only makes sense for multiple molecules.
Maybe it adds useless overhead, but it might be worth checking `num_workers` and raising a warning if a single molecule is passed with `num_workers > 1`? But maybe it's not worth it, because it would prevent setting `num_workers = None` by default.
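The warning check suggested above could be sketched roughly as follows. This is only an illustration of the idea, not spyrmsd's API: the helper name `check_workers` and its arguments are hypothetical, and the real function would have to fit around the existing `num_workers = None` default.

```python
import warnings


def check_workers(molecules, num_workers):
    # Hypothetical helper: warn when parallelism cannot help because there
    # is only a single molecule to distribute over the workers.
    # `num_workers=None` is deliberately left untouched so it can remain
    # the default, as discussed above.
    if num_workers is not None and num_workers > 1 and len(molecules) == 1:
        warnings.warn(
            "num_workers > 1 has no effect with a single molecule; "
            "parallelisation happens across molecules.",
            stacklevel=2,
        )
    return num_workers
```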
Yes, if there is only one structure it makes less sense to use this method, unless you want to use the timeout feature like you mentioned. Another scenario could be where you want to check just one structure against a lot of reference structures, which would also be executed in parallel. (For example, if you docked a structure 100 times and want to compare the top structure directly with all the other compounds, it would treat every RMSD calculation separately and run in parallel across multiple cores.) In this case, it would basically match the single input structure to all the reference molecules for you, and calculate each separately using `rmsdwrapper`. So `rmsdwrapper(mol, refmol1)`, `rmsdwrapper(mol, refmol2)`, ... is computed across all the available cores.
Yes, but in this case you would have to re-compute the graph isomorphism each time, which might be the most expensive part. I'm starting to wonder if these two cases (one reference and many, many corresponding structures, or multiple references with few corresponding structures) are different enough to distinguish? The first case would definitely benefit from computing the graph isomorphisms once and then parallelising over the RMSD calculations (although these are quite fast, so the overhead of parallelisation might be too much...).
What do you think? Do you have a feel for how large the parallel overhead is compared to just the RMSD calculation?
Yeah, so the parallel method is "stupid" in the sense that it re-computes the graph isomorphism each time (which I agree is not efficient). However, I think this would require some more changes (maybe cache it somehow so it can be reused?). Also, the graph isomorphism calculations aren't performed in a multicore manner either. In an ideal world, the isomorphisms would be calculated using all the assigned num_workers and be cached. After that, the RMSDs could be computed across all num_workers (single core each), reusing the cached isomorphisms until all calculations are finished. However, I don't have the bandwidth to work on that. The initial goal for me was to have a function with a `timeout` option so it can run more stably on libraries that have "problematic" molecules (like DrugBank).

Also, for your question about overhead: with pebble there still is some overhead, but compared to spawning the processes manually as I did in the first try, it looks to be relatively small (but of course not 0). I would say around a 10-20% overhead maybe? Here are the benchmarks I shared before:

- `rmsdwrapper` function: 12.00 s
- pebble `rmsdwrapper` with `num_worker=1`: 14.27 s
- pebble `rmsdwrapper` with `num_worker=2`: 7.69 s
- pebble `rmsdwrapper` with `num_worker=4`: 4.93 s

So using 2 cores already gives faster performance than 1 (which in my opinion already seems worth it, especially with the timeout feature).
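For reference, the overhead and speedup implied by these benchmark timings can be checked with a little arithmetic (the figures below are copied from the numbers above):

```python
# Timings from the benchmark above, in seconds.
serial = 12.00                               # plain rmsdwrapper, no pebble
pebble_timings = {1: 14.27, 2: 7.69, 4: 4.93}

# Overhead of the pebble path with a single worker vs. the plain function.
overhead = pebble_timings[1] / serial - 1    # ~0.19, consistent with the 10-20% estimate

# Speedup of each worker count relative to the plain serial call.
speedup = {n: serial / t for n, t in pebble_timings.items()}
```

The scaling from 2 to 4 workers (roughly 1.56x to 2.43x) is sub-linear, which fits the per-task overhead described in the comment.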
Yes, this is how a standard RMSD calculation already works, if you are passing a single reference and many structures to compare it with. The isomorphism is computed only once and cached:
spyrmsd/spyrmsd/rmsd.py
Lines 239 to 240 in 62c5df3
So one would have to use `_rmsd_isomorphic_core` in a similar way in order to implement parallelisation over RMSD calculations with a single graph isomorphism.

Totally understandable. And so that there are no misunderstandings: I'm not suggesting implementing both directly, but narrowing the scope to the one you are implementing (parallelisation over graph isomorphisms, plus timeout), and leaving the other, less interesting case (parallelisation over RMSD calculations with the same isomorphism) as a TODO.
Yes, I was wondering how the overhead compares with RMSD calculations only, instead of graph isomorphism + RMSD. The former is vectorised with `numpy`, so it should be very fast (and the overhead would have much more impact).
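The compute-the-isomorphism-once idea discussed in this thread can be sketched with plain `numpy`. Everything here is illustrative rather than spyrmsd's actual code: `rmsd_with_cached_mapping` is a hypothetical name, and `mapping` stands in for the atom correspondence that a graph-isomorphism search (e.g. via `_rmsd_isomorphic_core`) would produce once and cache:

```python
import numpy as np


def rmsd_with_cached_mapping(ref_coords, coords_list, mapping):
    # `mapping[i]` is the atom index in each candidate structure that matches
    # reference atom i -- the kind of result a graph-isomorphism search
    # returns. It is computed once, outside this function, and reused for
    # every RMSD evaluation below.
    ref = np.asarray(ref_coords, dtype=float)
    rmsds = []
    for coords in coords_list:
        matched = np.asarray(coords, dtype=float)[mapping]  # reorder via cached mapping
        diff = matched - ref
        rmsds.append(float(np.sqrt((diff ** 2).sum() / len(ref))))
    return rmsds
```

Because the per-structure work is just a reindex plus a vectorised norm, each evaluation is very fast, which is exactly why the parallelisation overhead would weigh much more heavily in this case.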