Clone the repo and install matbench_discovery
into your Python environment (--config-settings editable-mode=compat
helps the VS Code Python extension resolve matbench_discovery
imports):
git clone https://github.com/janosh/matbench-discovery --depth 1
pip install -e ./matbench-discovery --config-settings editable-mode=compat
There's also a PyPI package for faster installation if you don't need the latest code changes (unlikely if you're planning to submit a model since the benchmark is under active development).
When you access attributes of the DataFiles
class, it automatically downloads and caches the corresponding data files. For example:
from matbench_discovery.data import DataFiles, ase_atoms_from_zip
import pandas as pd
df_wbm = pd.read_csv(DataFiles.wbm_summary.path)
# confirm test set size
assert df_wbm.shape == (256_963, 18)
# available columns in WBM summary data
assert tuple(df_wbm) == (
    "material_id",
    "formula",
    "n_sites",
    "volume",
    "uncorrected_energy",
    "e_form_per_atom_wbm",
    "e_above_hull_wbm",
    "bandgap_pbe",
    "wyckoff_spglib_initial_structure",
    "uncorrected_energy_from_cse",
    "e_correction_per_atom_mp2020",
    "e_correction_per_atom_mp_legacy",
    "e_form_per_atom_uncorrected",
    "e_form_per_atom_mp2020_corrected",
    "e_above_hull_mp2020_corrected_ppd_mp",
    "site_stats_fingerprint_init_final_norm_diff",
    "wyckoff_spglib",
    "unique_prototype",
)
# WBM initial structures in pymatgen JSON format
df_init_structs = pd.read_json(DataFiles.wbm_initial_structures.path)
assert tuple(df_init_structs) == ("material_id", "formula_from_cse", "initial_structure")
# WBM initial structures as ASE Atoms
wbm_init_atoms = ase_atoms_from_zip(DataFiles.wbm_initial_atoms.path)
assert len(wbm_init_atoms) == 256_963
"wbm-summary" columns:
- formula: A compound's unreduced alphabetical formula
- n_sites: Number of sites in the structure's unit cell
- volume: Relaxed structure volume in cubic Angstrom
- uncorrected_energy: Raw VASP-computed energy
- e_form_per_atom_wbm: Original formation energy per atom from WBM paper
- e_above_hull_wbm: Original energy above the convex hull in eV/atom from WBM paper
- wyckoff_spglib: Aflow label strings built from spacegroup and Wyckoff positions of the DFT-relaxed structure as computed by spglib.
- wyckoff_spglib_initial_structure: Same as wyckoff_spglib but computed from the initial structure.
- bandgap_pbe: PBE-level DFT band gap from WBM paper
- uncorrected_energy_from_cse: Uncorrected DFT energy stored in ComputedStructureEntries. Should be the same as uncorrected_energy. There are 2 cases where the absolute difference reported in the summary file and in the computed structure entries exceeds 0.1 eV (wbm-2-3218, wbm-1-56320) which we attribute to rounding errors.
- e_form_per_atom_uncorrected: Uncorrected DFT formation energy per atom in eV/atom.
- e_form_per_atom_mp2020_corrected: Matbench Discovery takes these as ground truth for the formation energy. The result of applying the MP2020 energy corrections (latest correction scheme at time of release) to e_form_per_atom_uncorrected.
- e_correction_per_atom_mp2020: MaterialsProject2020Compatibility energy corrections in eV/atom.
- e_correction_per_atom_mp_legacy: Legacy MaterialsProjectCompatibility energy corrections in eV/atom. Having both old and new corrections allows updating predictions from older models like MEGNet that were trained on MP formation energies treated with the old correction scheme.
- e_above_hull_mp2020_corrected_ppd_mp: Energy above hull distances in eV/atom after applying the MP2020 correction scheme. The convex hull in question is the one spanned by all ~145k Materials Project ComputedStructureEntries. Matbench Discovery takes these as ground truth for material stability. Any value above 0 is assumed to be an unstable/metastable material.
- site_stats_fingerprint_init_final_norm_diff: The norm of the difference between the initial and final site fingerprints. This is a volume-independent measure of how much the structure changed during DFT relaxation. Uses the matminer SiteStatsFingerprint (v0.8.0).
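As a rough illustration of how two of these columns might be used, the sketch below labels materials as stable when their hull distance is at or below 0 eV/atom and flags rows where the two uncorrected energy columns disagree by more than 0.1 eV. The helper names, the is_stable column and the toy values are assumptions made for illustration; only the WBM column names come from the summary file.

```python
import pandas as pd

STABILITY_COL = "e_above_hull_mp2020_corrected_ppd_mp"


def add_stability_label(df: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    """Return a copy with a boolean 'is_stable' column (hull distance <= threshold)."""
    out = df.copy()
    out["is_stable"] = out[STABILITY_COL] <= threshold
    return out


def energy_mismatches(df: pd.DataFrame, tol: float = 0.1) -> pd.DataFrame:
    """Rows where the two uncorrected energy columns differ by more than tol (eV)."""
    diff = (df["uncorrected_energy"] - df["uncorrected_energy_from_cse"]).abs()
    return df[diff > tol]


# Toy stand-in for df_wbm with made-up values (NOT real WBM data)
df_toy = pd.DataFrame(
    {
        "material_id": ["toy-1", "toy-2", "toy-3"],
        "uncorrected_energy": [-10.0, -20.0, -30.0],
        "uncorrected_energy_from_cse": [-10.0, -20.5, -30.0],
        STABILITY_COL: [-0.02, 0.0, 0.15],
    }
)

df_labeled = add_stability_label(df_toy)
print(df_labeled["is_stable"].tolist())
print(energy_mismatches(df_toy)["material_id"].tolist())
```

The same two helpers should work unchanged on the real df_wbm loaded from DataFiles.wbm_summary.path, since they only rely on the column names listed above.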
You can download all Matbench Discovery data files from this Figshare article.
To train an interatomic potential, we recommend the MPtrj dataset, which was created to train CHGNet. Thanks to Bowen Deng for cleaning and releasing this dataset. It was built from the 2021.11.10 release of the Materials Project and is therefore a slightly smaller but valid subset of the 2022.10.28 MP release that constitutes our allowed training set.
To submit a new model to this benchmark and add it to our leaderboard, please create a pull request to the main
branch that includes at least these 3 required files:
- <yyyy-mm-dd>-<model_name>-preds.(json|csv).gz: Your model's energy predictions for all ~250k WBM compounds as compressed JSON or CSV. The recommended way to create this file is with pandas.DataFrame.to_{json|csv}("<yyyy-mm-dd>-<model_name>-preds.(json|csv).gz"). JSON is preferred over CSV if your model not only predicts energies (floats) but also objects like relaxed structures. See e.g. M3GNet and CHGNet test scripts. For machine learning force field (MLFF) submissions, please additionally upload the relaxed structures and forces from your model's geometry optimization to Figshare or a similar platform and include the download link in your PR description and the YAML metadata file. This upload should include:
  - The final relaxed structures (as ASE Atoms or pymatgen Structures)
  - Energies (eV), forces (eV/Å), stress (eV/ų) and volume (ų) at each relaxation step
  Recording the model-relaxed structures enables additional analysis of root mean squared displacement (RMSD) and symmetry breaking with respect to DFT-relaxed structures. Having the forces and stresses at each step also allows analyzing any pathological behavior for structures where relaxation failed or went haywire.
  Example of how to record these quantities for a single structure with ASE:
  from collections import defaultdict

  import pandas as pd
  from ase.atoms import Atoms
  from ase.optimize import FIRE
  from mace.calculators import mace_mp

  trajectory = defaultdict(list)
  batio3 = Atoms(
      "BaTiO3",
      scaled_positions=[
          (0, 0, 0), (0.5, 0.5, 0.5), (0.5, 0, 0.5), (0, 0.5, 0.5), (0.5, 0.5, 0)
      ],
      cell=[4] * 3,
  )
  batio3.calc = mace_mp(model_name="medium", default_dtype="float64")

  def callback() -> None:
      """Record energy, forces, stress and volume at each step."""
      trajectory["energy"] += [batio3.get_potential_energy()]
      trajectory["forces"] += [batio3.get_forces()]
      trajectory["stress"] += [batio3.get_stress()]
      trajectory["volume"] += [batio3.get_volume()]
      # Optionally save structure at each step (results in much larger files)
      trajectory["atoms"] += [batio3.copy()]

  opt = FIRE(batio3)
  opt.attach(callback)  # register callback
  opt.run(fmax=0.01, steps=500)  # optimize geometry

  df_traj = pd.DataFrame(trajectory)
  df_traj.index.name = "step"
  df_traj.to_csv("trajectory.csv.gz")  # Save final structure and trajectory data
- test_<model_name>.(py|ipynb): The Python script or Jupyter notebook that generated the energy predictions. Ideally, this file should have comments explaining at a high level what the code is doing and how the model works so others can understand and reproduce your results. If the model deployed on this benchmark was trained specifically for this purpose (i.e. if you wrote any training/fine-tuning code while preparing your PR), please also include it as train_<model_name>.(py|ipynb).
- <model_name>.yml: A file to record all relevant metadata of your algorithm like model name and version, authors, package requirements, links to publications, notes, etc. Here's a template:
  model_name: My new model # required (this must match the model's label which is the 3rd arg in the matbench_discovery.preds.Model enum)
  model_key: my-new-model # this should match the name of the YAML file and determines the URL /models/<model_key> on which details of the model are displayed on the website
  model_version: 1.0.0 # required
  matbench_discovery_version: 1.0 # required
  date_added: "2023-01-01" # required
  authors: # required (only name, other keys are optional)
    - name: John Doe
      affiliation: Some University, Some National Lab
      email: john-doe@uni.edu
      orcid: https://orcid.org/0000-xxxx-yyyy-zzzz
      url: lab.gov/john-doe
      corresponding: true
      role: Model & PR
    - name: Jane Doe
      affiliation: Some National Lab
      email: jane-doe@lab.gov
      url: uni.edu/jane-doe
      orcid: https://orcid.org/0000-xxxx-yyyy-zzzz
      role: Model
  repo: https://github.com/<user>/<repo> # required
  url: https://<model-docs-or-similar>.org
  doi: https://doi.org/10.5281/zenodo.0000000
  preprint: https://arxiv.org/abs/xxxx.xxxxx

  requirements: # strongly recommended
    torch: 1.13.0
    torch-geometric: 2.0.9
    ...

  training_set: [MPtrj] # list of keys from data/training-sets.yml

  notes: # notes can have any key, be multiline and support markdown.
    description: This is how my model works...
    steps: |
      Optional *free-form* [markdown](example.com) notes.

  metrics:
    discovery:
      pred_file: models/<model_dir>/<yyyy-mm-dd>-<model_name>-wbm-IS2RE.csv.gz # should contain the model's energy predictions for the WBM test set
      pred_col: e_form_per_atom_<model_name>
    geo_opt: # only applicable if the model performed structure relaxation
      pred_file: models/<model_dir>/<yyyy-mm-dd>-<model_name>-wbm-IS2RE.json.gz # should contain the model's relaxed structures as ASE Atoms or pymatgen Structures, and separate columns for material_id and energies/forces/stresses at each relaxation step
      pred_col: e_form_per_atom_<model_name>
  Arbitrary other keys can be added as needed. The above keys will be schema-validated with pre-commit (if installed) with errors for missing keys.
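Before opening a PR, it can help to check the metadata for missing required keys locally. The sketch below is not the repository's actual schema validation. It is a hypothetical helper, missing_required_keys, that only checks the keys marked "required" in the template above on an already-parsed metadata dict.

```python
# Keys marked "# required" in the YAML template (assumed to be the full required set)
REQUIRED_KEYS = (
    "model_name",
    "model_version",
    "matbench_discovery_version",
    "date_added",
    "authors",
    "repo",
)


def missing_required_keys(metadata: dict) -> list[str]:
    """Return the required keys absent from a parsed metadata dict."""
    missing = [key for key in REQUIRED_KEYS if key not in metadata]
    # Each author entry needs at least a name; other author keys are optional
    for idx, author in enumerate(metadata.get("authors", [])):
        if "name" not in author:
            missing.append(f"authors[{idx}].name")
    return missing


# Hypothetical metadata mirroring the template values
example_metadata = {
    "model_name": "My new model",
    "model_version": "1.0.0",
    "matbench_discovery_version": 1.0,
    "date_added": "2023-01-01",
    "authors": [{"name": "John Doe"}, {"name": "Jane Doe"}],
    "repo": "https://github.com/<user>/<repo>",
}

print(missing_required_keys(example_metadata))
```

If PyYAML is installed, the dict could be obtained with yaml.safe_load on the contents of <model_name>.yml. The pre-commit schema check remains the authoritative validation.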
Please see any of the subdirectories in models/
for example submissions. More detailed step-by-step instructions below.
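As a minimal sketch of how the compressed predictions file might be written with pandas, the example below saves a small DataFrame as CSV with gzip compression (pandas infers the compression from the .gz extension) and reads it back. The model name, date, material IDs and energy values are placeholders, and the e_form_per_atom_<model_name> column name follows the pred_col convention from the YAML template.

```python
import tempfile
from pathlib import Path

import pandas as pd

model_name = "my_model"  # hypothetical model name
date = "2023-01-01"  # placeholder date in yyyy-mm-dd format
pred_col = f"e_form_per_atom_{model_name}"

# Made-up predictions for two illustrative IDs; a real file covers all WBM compounds
df_preds = pd.DataFrame(
    {"material_id": ["wbm-1-1", "wbm-1-2"], pred_col: [-0.5, 0.125]}
).set_index("material_id")

# Written to a temporary directory here; in a submission this is models/<model_name>/
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / f"{date}-{model_name}-preds.csv.gz"
df_preds.to_csv(out_path)  # gzip compression inferred from the .gz suffix

# Round-trip check: read the file back with material_id as index
df_loaded = pd.read_csv(out_path, index_col="material_id")
print(df_loaded)
```

For models that also output relaxed structures, to_json with a .json.gz filename would be the analogous call.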
Step 1: Clone the repo
git clone https://github.com/janosh/matbench-discovery --depth 1
cd matbench-discovery
git checkout -b model-name-you-want-to-add
Tip: --depth 1 only clones the latest commit, not the full git history, which is faster if a repo contains large data files that changed over time.
Step 2: Add your model files
Create a new folder with mkdir models/<model_name> and place the above-listed files there. The file structure should look like this:
matbench-discovery-root
└── models
    └── <model_name>
        ├── <model_name>.yml
        ├── <yyyy-mm-dd>-<model_name>-preds.(json|csv).gz
        ├── test_<model_name>.py
        ├── readme.md  # optional
        └── train_<model_name>.py  # optional
You can include arbitrary other supporting files like metadata and model features (below 10MB total to keep git clone
time low) if they are needed to run the model or help others reproduce your results. For larger files, please upload to Figshare or similar and share the link in your PR description.
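To check the size guideline before committing, something like the following sketch could be used. dir_size_bytes is a hypothetical helper, the demo directory is a temporary stand-in for models/<model_name>, and reading "10MB" as 10 * 1024**2 bytes is an assumption.

```python
import tempfile
from pathlib import Path

SIZE_LIMIT_BYTES = 10 * 1024**2  # "10MB" interpreted as 10 MiB (assumption)


def dir_size_bytes(directory: Path) -> int:
    """Total size in bytes of all regular files below directory (recursive)."""
    return sum(path.stat().st_size for path in directory.rglob("*") if path.is_file())


# Temporary stand-in for models/<model_name> with two tiny placeholder files
model_dir = Path(tempfile.mkdtemp()) / "models" / "my_model"
model_dir.mkdir(parents=True)
(model_dir / "my_model.yml").write_bytes(b"model_name: My new model\n")
(model_dir / "test_my_model.py").write_bytes(b"print('hello')\n")

total = dir_size_bytes(model_dir)
print(f"{total} bytes, within limit: {total < SIZE_LIMIT_BYTES}")
```

Pointing model_dir at the real model folder gives a quick indication of whether larger files should go to Figshare instead.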
Step 3: Open a PR to the Matbench Discovery repo
Commit your files to the repo on a branch called <model_name>
and create a pull request (PR) to the Matbench Discovery repository.
git add models/<model_name>
git commit -m 'add <model_name> to Matbench Discovery leaderboard'
And you're done! Once tests pass and the PR is merged, your model will be added to the leaderboard! 🎉
Weights and Biases is a tool for logging training and test runs of ML models. It's free, (partly) open source and offers a special plan for academics. It auto-collects metadata like
- what hardware the model is running on
- and for how long,
- what the CPU, GPU and network utilization was over that period,
- the exact code in the script that launched the run, and
- which versions of dependencies were installed in the environment your model ran in.
This information can be useful for others looking to reproduce your results or compare their model to yours in terms of computational cost. We therefore strongly recommend tracking all runs that went into a model submission with WandB so that the runs can be copied over to our WandB project at https://wandb.ai/janosh/matbench-discovery for everyone to inspect. This also allows us to include your model in more detailed analysis (see the SI in the preprint).
Having problems? Please open an issue on GitHub. We're happy to help! 😊