-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cheaper matching #9
Comments
Great points. Pre-screening and granularity for the matching would both make a lot of sense. The latter especially would offer more interpretability to the metrics. The 4 metric types (validity, coverage, novelty, and uniqueness) are also easily transferred to the different categories (e.g. val, cov, nov, uniq for composition). For composition, Element Mover's Distance via I think for structure validity CDVAE just looked at whether the shortest distance between any two atoms is larger than 0.5 Angstroms. EDIT: after playing around a bit with the current implementation, I agree that it's prohibitively expensive to use EDIT: It took ~4 hours to run all the metrics across the 5 |
I use https://github.com/kjappelbaum/structuregraph-helpers/blob/main/src/structuregraph_helpers/hash.py to remove duplicates. This implementation will give you an upper bound on the matches. One could then find the "truly" matching ones by doing the "real" isomorphism check. |
On a somewhat related note, I could also try to reuse match matrices calculated for earlier folds in later ones. This would reduce the computational burden by maybe a factor of 2-3. Would kind of convolute the code though and probably require a lot of logic checks. |
To pull in some additional perspectives, I created this matsci post. |
Faster StructureMatching would be a dream. It's unclear whether a better strategy is to optimize the pymatgen code (I suspect it could probably be optimized significantly), or find an alternative method (e.g., structural fingerprinting, for a more heuristic-based approach). This has certainly been the limiting factor in some of my own projects. Last time I had chance to look at this myself I created this PR: materialsproject/pymatgen#1654 but was not sufficiently convinced I could improve -- or even, if I'm honest, where the largest bottlenecks are without doing additional profiling. |
I suspect someone confident in writing Cython and/or using numba, could write a much faster version, and it would reward careful attention. It should be noted the current pymatgen code does use some custom Cython however. Otherwise, I think if I needed a "good-enough" solution for a project, I might use one of the kernels offered by Finally, another strategy I've used for coarse graining, is to compare by spacegroup first when I want to structure match (e.g., one calculated with a fine and a looser simprec, and if either match...), but I'm not wholly convinced by this either unfortunately! It is fast though :) |
Actually -- some additional notes :)
|
@mkhorton really appreciate your comments!
This is really helpful feedback, as it seems like a common issue.
Profiling the function does seem like a good next step. Thanks for linking this!
I'm tempted to numba-fy
I'm glad you mentioned these. I think I've come across this, but it carries a lot more weight having a recommendation for it.
I think I'm with you there. Fast and relatively intuitive, but not as rigorous.
Great point. I think the rigor of
Will take a look! |
from AFLOW-XtalFinder manuscript:
Any comments on what they mean by this? EDIT: I found this from the manuscript:
Also, using |
from internal discussion, by @kyledmiller
CommentsStructureMatcher
For generative models, especially when symmetry isn't directly imposed, this seems like an important feature. Related: MaterSim/PyXtal#199. Forcing models to respect a given symmetry seems like a better route but a lot less straightforward. I don't think even CDVAE, which I consider state-of-the-art for crystal generative models, enforces symmetry explicitly.
Is the main issue here that it's computationally expensive?
Is the idea here that we get nice groupings of structures similar to how we can request all chemical formulas from a given chemical system? Is there any benefit in terms of speed-up or is the main benefit provenance and consistency?
For the current implementation of XtalFinder
This is really good to know. I don't think I would have caught that nuance immediately. Kind of goes back to @mkhorton's #9 (comment). |
I found this note about speed comparisons by digging into the XtalFinder manuscript. IIUC, XtalFinder will be a lot faster for the use-case of comparing large sets of structures with each other because pairwise comparisons are generally ~3x faster, many fewer computations are used for comparisons of large sets, and it has built-in facilities for parallelization. Naturally at the expense of non-matching for small symmetry-breaking distortions mentioned earlier.
Note about scalability:
Didn't realize that AFLOW is only directly available on MacOS and Linux. I've still been holding out developing packages that support Windows users despite much grief.. |
Yes, I was intrigued too at reading that -- but I haven't had time to review the code or test at scale, so I balance with cautious skepticism. Note the parallelization is neither here nor there, since StructureMatcher is trivially parallelized too, as is the pre-grouping by symmetry which we do regularly, so I'm not sure the comparison is actually appropriately made. To put it another way, if you just use |
I agree about it being trivially parallelized, e.g. via RayTune. Perhaps the easiest form of pre-filtering is via comparing reduced chemical formulas. IIUC, use of EDIT: looks like that's already taken care of in if not self._subset and self._comparator.get_hash(struct1.composition) != self._comparator.get_hash(
struct2.composition
):
return False |
@mkhorton were you referring to e.g. AverageKernel or e.g. Coulomb matrix? I figured you were referring to the options of EDIT: also noticing that |
Ah, I hadn't noticed this either.
I've tried |
Yep, I've been also using this in the past (and also tried to supplement with heuristic based on Kabsch RMSD, and density): https://github.com/kjappelbaum/structure_comp/blob/e903be460f9ed609814240a8b2b7c1702e232a48/structure_comp/remove_duplicates.py#L453. |
@kjappelbaum I think I'll try out the Weisfeiller-Lehman hashes. |
happy to help out! Do we have a good test suite for this? Shall we go with the one of the |
@kjappelbaum that would be great! Please do. I think you have access through the |
cool, currently I test on MOFs (that's my use case ;), see the tests here https://github.com/kjappelbaum/mofchecker/blob/4cae2a27cb9cd5e953b4509d7329af83baa5e31b/tests/test_hash.py#L20 I know of some edge cases there, would be good to find similar ones in the rest of the materials space. The edge cases I currently know of mostly boil down to structure graphs that are different due to the heuristics used to compute the structure graphs. |
For now, going with compositional and structural fingerprint-based comparisons with thresholds is relatively fast (~15 min for 100 generated structures and 5 folds, compared with ~4 hrs from earlier). Planning to address one last bottleneck by loading precalculated space group numbers for the training and test structures instead of recalculating repeatedly. |
Really interesting discussion here, thanks for bringing me in @sgbaird. I have a few comments and clarifications.
Yes, definitely flexibility can be a boon for generative models but I think the specificity/restrictiveness (isopointal grouping) of XtalFinder can be a strong advantage for use cases involving symmetry-dependent properties (topology-related phenomena, symmetry-breaking transitions, etc.).
I should also note here despite what I said above, XtalFinder's symmetry determination uses an adaptive tolerance (described in the AFLOW-SYM paper) which provides a small amount built-in flexibility for small distortions. This means that an isopointal structure grouping generated by XtalFinder might actually consist of structures which would yield different space groups if one used spglib with the same absolute tolerance across the whole group. However, it seems like XtalFinder decides on a uniform space group to assign these structures and essentially resymmetrizes them to fit in the isopointal grouping.
Here I was just thinking that a group of structures with different symmetries would not have a nice 1-1 correspondence between Wyckoff sites for every structure . Of course one could find a way around this, e.g., resymmetrizing every structure to the highest symmetry or using group-subgroup relations to connect each high-symmetry parent wyckoff site in high-symmetry structures to its corresponding lower-symmetry children sites in the lower-symmetry structures (probably could use spglib for this) and might be computationally expensive.
I think you're spot on here. It's for provenance and consistency, not speed-up. The former is important for AFLOW's prototype encyclopedia.
That speed-up is impressive. That's comparing to zero pre-screening, not even compositional, right? I'd be curious how much speed-up you get between compositional pre-screening and compositional+structural pre-screening. I'd also love to look at how you use the Weisfeiller-Lehman hashes. Is the structure->graph transformation code publicly available? I've been grappling with the question of how to determine 'bonded' neighbors lately. |
@kyledmiller thanks for clarifying! Really useful info.
Correct, StructureMatcher without pre-screening (note that the StructureMatcher method would be maybe 2x faster if I had used
@kjappelbaum gave an example of what that might look like for this repo. If the fingerprint method is sufficiently fast and robust/valid, we may not incorporate that here unless we switch back to |
pymatgen
's structure matcher is relatively slow. It might make sense to "pre-screen" (or at least give the option to do so) based on some cheaper heuristic.If you consider identity to be a match of structure graphs, one might implement such a cheaper via graph hashes (I did this for instance here https://github.com/kjappelbaum/mofchecker, because for MOF unit cells the
Structure Matcher
is painfully slow). However, there are no theoretical guarantees, you rely on the bond heuristic and the implementation is not 100% fool-proof at periodic boundaries (but this can be improved).Anyhow, It might also be interesting to have some finer-grained resolution on the matching:
matbench-genmetrics/src/matbench_genmetrics/core.py
Line 69 in 5f26ff3
The text was updated successfully, but these errors were encountered: