
RMSD with multicore and timeout functionality #113

Merged
30 commits merged into RMeli:develop from Jnelen:timeout on Jun 23, 2024
Commits (30)
8ecd2c2
First implementation of rmsd function with timeout and multicore func…
Jnelen Mar 11, 2024
4ecdbae
Merge branch 'develop' into timeout
RMeli Mar 11, 2024
56a1eae
Merge branch 'develop' into timeout
Jnelen Mar 12, 2024
52e1e21
Apply suggestions from code review
Jnelen Mar 12, 2024
1353fb5
Fix Flake8 problems
Jnelen Mar 12, 2024
0fe91a7
Slight improvements to _rmsd_timeout
Jnelen Mar 12, 2024
56a215b
Verbose implementation of improved functionality:
Jnelen Mar 12, 2024
8dcc22b
Transition from using a Queue to a shared Value
Jnelen Mar 13, 2024
1fe4f9a
Finish transition to multiprocessing Values
Jnelen Mar 13, 2024
7e0c311
Suppress MyPy errors at line 494 and 538
Jnelen Mar 14, 2024
713b833
Add basic tests for prmsdwrapper
Jnelen Mar 17, 2024
9491130
Apply suggestions from code review
Jnelen Mar 17, 2024
28c41a3
Initial version of the multicore rmsdwrapper with timeout feature usi…
Jnelen Apr 7, 2024
fa25924
Revert rmsd.py and test_rmsd.py
Jnelen Apr 8, 2024
0ff8b65
Merge latest develop commit
Jnelen Apr 9, 2024
dc1b0db
Add importskip if Pebble isn't installed
Jnelen Apr 9, 2024
1258956
Include Pebble installation for the RDKit CI tests
Jnelen Apr 10, 2024
54e28a1
Address a lot of the review comments
Jnelen Apr 11, 2024
ead245b
add chunksize check when timeout is set with prmsdwrapper
Jnelen Apr 14, 2024
cef2b0d
Add more tests for prmsdwrapper
Jnelen Apr 22, 2024
1a0527b
Simplify num_worker determination
Jnelen May 16, 2024
c0b0fa0
Merge branch 'develop' into timeout
RMeli May 28, 2024
20d1965
Incorporate suggested changes from review
Jnelen May 30, 2024
bca1889
Merge branch 'timeout' of https://github.com/Jnelen/spyrmsd into timeout
Jnelen May 30, 2024
5729f6b
Improve documentation regarding chunksize enforcement
Jnelen May 31, 2024
abe6353
Merge branch 'RMeli:develop' into timeout
Jnelen Jun 21, 2024
dbade96
Update the Changelog
Jnelen Jun 21, 2024
70765d5
Update tests/test_parallel.py
Jnelen Jun 21, 2024
73ed019
Fix formatting
Jnelen Jun 21, 2024
10a47be
More extensive testing
Jnelen Jun 23, 2024
5 changes: 3 additions & 2 deletions CHANGELOG.md
@@ -6,11 +6,12 @@
## Version 0.9.0

Date: XX/YY/ZZZ
Contributors: @RMeli
Contributors: @RMeli, @Jnelen

### Added

* `--version` CLI option [PR # | @RMeli]
* `--version` CLI option [PR #131 | @RMeli]
* `prmsdwrapper`, a first implementation of a multicore version of `rmsdwrapper`. It also supports a timeout functionality. [PR #113 | @Jnelen]

## Version 0.8.0

3 changes: 3 additions & 0 deletions devtools/conda-envs/spyrmsd-test-rdkit-all.yaml
@@ -15,6 +15,9 @@ dependencies:
- networkx>=2
- rustworkx

# Parallel
- pebble

# Chemistry
- rdkit

3 changes: 3 additions & 0 deletions devtools/conda-envs/spyrmsd-test-rdkit-nogt.yaml
@@ -14,6 +14,9 @@ dependencies:
- networkx>=2
- rustworkx

# Parallel
- pebble

# Chemistry
- rdkit

139 changes: 139 additions & 0 deletions spyrmsd/parallel.py
@@ -0,0 +1,139 @@
try:
from pebble import ProcessPool
except ImportError:
errmsg = (
"Parallel execution of SPyRMSD (`prmsdwrapper`) requires `pebble`."
"Please install `pebble`: https://github.com/noxdafox/pebble"
)
raise ImportError(errmsg)

import os
import warnings
from concurrent.futures import TimeoutError
from functools import partial
from typing import List, Optional, Union

import numpy as np

from spyrmsd import molecule
from spyrmsd.rmsd import rmsdwrapper


def prmsdwrapper(
molrefs: Union[molecule.Molecule, List[molecule.Molecule]],
mols: Union[molecule.Molecule, List[molecule.Molecule]],
symmetry: bool = True,
center: bool = False,
minimize: bool = False,
strip: bool = True,
cache: bool = True,
num_workers: Optional[int] = None,
timeout: Optional[float] = None,
chunksize: int = 1,
) -> List[float]:
"""
Compute RMSD between two molecules with a timeout.
Parameters
----------
molrefs: Union[molecule.Molecule, List[molecule.Molecule]]
Reference molecule
mols: Union[molecule.Molecule, List[molecule.Molecule]]
Molecules to compare to reference molecule
symmetry: bool, optional
Symmetry-corrected RMSD (using graph isomorphism)
center: bool, optional
Center molecules at origin
minimize: bool, optional
Minimised RMSD (using the quaternion polynomial method)
strip: bool, optional
Strip hydrogen atoms
cache: bool, optional
Cache graph isomorphisms
num_workers: int
Amount of processor to use for the parallel calculations
timeout: float, optional
After how many seconds to stop the RMSD calculations
chunksize: int, optional
How many molecules to handle per child process

Returns
-------
List[float]
RMSDs
"""

# Ensure the num_workers is less or equal than the max number of CPUs.
# Silencing MyPy since os.cpu_count() can return None
if num_workers is None:
num_workers = os.cpu_count()
num_workers = min(num_workers, os.cpu_count()) # type: ignore[type-var]

if chunksize > 1 and timeout is not None:
# When this is not enforced, it can lead to unexpected results (output list length not matching the input list for example).
# To ensure correctness we force the chunksize to be 1 to avoid potential correctness problems.
warnings.warn(
"When using the timeout feature, a chunksize of 1 is required. The chunksize is set to 1 automatically in order to continue the calculations"
)
chunksize = 1

# Cast the molecules to lists if they aren't already
if not isinstance(molrefs, list):
Owner (@RMeli):
Parallelisation is performed over graph isomorphisms, so I don't think a single reference structure is particularly interesting to have. Is this for the timeout functionality only? If yes, it might be worth adding a note in the docstring, explaining the two (non-exclusive) functionalities of this function and that parallelisation only makes sense for multiple molecules.

Maybe it adds useless overhead, but it might be worth checking num_workers and raising a warning if a single molecule is passed with num_workers > 1? But maybe it's not worth it, because it would prevent setting num_workers = None by default.

Contributor Author (@Jnelen):

Yes, if there is only one structure it makes less sense to use this method, unless you want to use the timeout feature like you mentioned. Another scenario could be where you want to check just one structure against a lot of reference structures, which would also be executed in parallel. (For example if you docked a structure 100 times, and want to compare the top structure directly with all the other compounds, it would treat every RMSD calculation separately and run in parallel across multiple cores). In this case, it would basically match the single input structure to all the reference molecules for you, and calculate it separately using rmsdwrapper. So rmsdwrapper(mol, refmol1), rmsdwrapper(mol, refmol2), ... is computed across all the available cores.

Owner (@RMeli):

> Another scenario could be where you want to check just one structure against a lot of reference structures, which would also be executed in parallel.

Yes, but in this case you would have to re-compute the graph isomorphism each time, which might be the most expensive thing. I'm starting to wonder if these two cases (one reference and many corresponding structures, or multiple references with few corresponding structures) are different enough to distinguish them? The first case would definitely benefit from computing the graph isomorphism once, and then parallelising over the RMSD calculations (although these are quite fast, so the overhead of parallelisation might be too much...).

Owner (@RMeli):

What do you think? Do you have a feel about how much is the parallel overhead compared to just the RMSD calculation?

Contributor Author (@Jnelen):

Yeah so the parallel method is "stupid" in the sense that it re-computes the graph isomorphism each time (which I agree is not efficient). However I think that for this there would need to be made some more changes (maybe cache it somehow so it can be reused?). Also the graph isomorphism calculations aren't performed in a multicore manner either. In an ideal world the isomorphisms would be calculated using all the assigned num_workers, and be cached. After that the RMSDs could be computed across all num_workers (single core) and reuse the cached isomorphisms until all calculations are finished. However I don't have the bandwidth to work on that. The initial goal for me was to have a function with timeout functionality so it can run more stably on libraries that have "problematic" molecules (like DrugBank).
Also for your question about overhead: with pebble there still is some overhead, but compared to spawning the processes manually as I did in the first try, it looks to be relatively small (but of course not 0). I would say around a 10-20% overhead maybe? Here are the benchmarks I shared before:

  • regular rmsdwrapper function: 12.00s
  • pebblermsdwrapper with num_worker=1: 14.27s
  • pebblermsdwrapper with num_worker=2: 7.69s
  • pebblermsdwrapper with num_worker=4: 4.93s

So using 2 cores already gives faster performance than 1 (which in my opinion already seems worth it, especially with the timeout feature).

Owner (@RMeli):

> However I think that for this there would need to be made some more changes (maybe cache it somehow so it can be reused?).

Yes, this is how a standard RMSD calculation already works, if you are passing a single reference and many structures to compare it with. The isomorphism is computed only once and cached:

spyrmsd/spyrmsd/rmsd.py, lines 239 to 240 in 62c5df3:
    cache: bool
        Cache graph isomorphisms
So one would have to use _rmsd_isomorphic_core in a similar way in order to implement parallelisation over RMSD calculations with a single graph isomorphism.

> However I don't have the bandwidth to work on that.

Totally understandable. And so that there are no misunderstandings: I'm not suggesting implementing both directly, but narrowing the scope of the one you are implementing (parallelisation on graph isomorphism, and timeout) and leaving the other, less interesting case (parallelisation over RMSD calculations with the same isomorphism) as a TODO.

> Also for your question about overhead: with pebble there still is some overhead, but compared to spawning the processes manually as I did in the first try, it looks to be relatively small (but of course not 0).

Yes, I was wondering how the overhead compares with RMSD calculations only, instead of graph isomorphism + RMSD. The former is vectorised with numpy, so it should be very fast (and the overhead would have much more impact).

molrefs = [molrefs]
if not isinstance(mols, list):
mols = [mols]

# Match the length of the molref
if len(molrefs) == 1 and len(molrefs) < len(mols):
molrefs = molrefs * len(mols)

# Ensure molrefs and mols have the same len
if not len(molrefs) == len(mols):
raise ValueError("The 'mols' and 'molrefs' lists have different lengths.")

results = []

timeoutCounter = 0
errorCounter = 0

with ProcessPool(max_workers=num_workers) as pool:
rsmd_partial = partial(
rmsdwrapper,
symmetry=symmetry,
center=center,
minimize=minimize,
strip=strip,
cache=cache,
)

future = pool.map(
rsmd_partial, molrefs, mols, timeout=timeout, chunksize=chunksize
)
iterator = future.result()

# See https://pebble.readthedocs.io/en/latest/#pools
while True:
try:
result = next(iterator)
results.append(result[0])
except StopIteration:
break
except TimeoutError:
timeoutCounter += 1

# Upon timeout, the whole chunk fails. To ensure the length and order of the output is maintained we add np.nan for the whole chunk
# More information regarding pebble error handling: https://github.com/noxdafox/pebble/issues/132#issuecomment-2105267462
results += [np.nan] * chunksize
except Exception:
errorCounter += 1
results.append(np.nan)

if timeoutCounter + errorCounter > 0:
# Calculate total number of np.nan
failedCompoundsTotal = np.count_nonzero(np.isnan(results))

warnings.warn(
f"{failedCompoundsTotal} compounds failed to process successfully and have been added as 'np.nan'."
+ f" {errorCounter} compounds raised an error, {timeoutCounter} chunks timed out"
)

return results
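For context, a minimal usage sketch of the new `prmsdwrapper` API introduced in `spyrmsd/parallel.py` above. The input file names and the `io.loadmol`/`io.loadallmols` loading helpers are illustrative assumptions (following the pattern used in `tests/conftest.py`), not part of this diff; only the `prmsdwrapper` call itself reflects the signature added by this PR.

```python
import numpy as np

from spyrmsd import io
from spyrmsd.parallel import prmsdwrapper

# Hypothetical inputs: one reference structure and many docked poses.
ref = io.loadmol("reference.sdf")    # assumed loading helper, as in tests/conftest.py
mols = io.loadallmols("poses.sdf")   # assumed helper for multi-molecule files

# Symmetry-corrected RMSDs computed on 4 worker processes. With the default
# chunksize of 1, any single calculation exceeding the 10 s timeout is
# abandoned and reported as np.nan (a warning summarises the failures).
rmsds = prmsdwrapper(ref, mols, num_workers=4, timeout=10.0)

n_failed = int(np.count_nonzero(np.isnan(rmsds)))
print(f"{len(rmsds)} RMSD values computed, {n_failed} timed out or errored")
```

Passing a single reference with a list of molecules broadcasts the reference over the list (see the length-matching logic above), so apart from `num_workers`, `timeout`, and `chunksize` the call mirrors the serial `rmsdwrapper` API.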
2 changes: 1 addition & 1 deletion spyrmsd/rmsd.py
@@ -307,7 +307,7 @@ def rmsdwrapper(
cache: bool = True,
) -> Any:
"""
Compute RMSD between two molecule.
Compute RMSD between two molecules.

Parameters
----------
6 changes: 6 additions & 0 deletions tests/conftest.py
@@ -106,6 +106,12 @@ def pyridine(molpath):
return Mol(mol, "pyridine", 11, 11, 5)


@pytest.fixture
def muparfostat(molpath):
mol = io.loadmol(os.path.join(molpath, "muparfostat.sdf"))
return Mol(mol, "muparfostat", 78, 80, 33)


@pytest.fixture(
params=[
# (name, n_atoms, n_bonds, n_h)
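A sketch of how the new `muparfostat` fixture might be exercised together with `prmsdwrapper`. This is a hypothetical test in the spirit of `tests/test_parallel.py`, not the actual test code from this PR; in particular, the `.mol` attribute is assumed to be the molecule field of the `Mol` wrapper used in `tests/conftest.py`.

```python
import numpy as np
import pytest

pytest.importorskip("pebble")  # spyrmsd.parallel requires pebble; skip the module otherwise

from spyrmsd.parallel import prmsdwrapper


def test_prmsdwrapper_timeout_muparfostat(muparfostat):
    # Hypothetical: '.mol' assumed to hold the spyrmsd Molecule of the fixture.
    mol = muparfostat.mol

    # An effectively zero timeout should abort the calculation and yield np.nan,
    # while the output list keeps the same length as the input.
    result = prmsdwrapper(mol, mol, timeout=1e-6)

    assert len(result) == 1
    assert np.isnan(result[0])
```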