
Multi species #724

Merged: 14 commits merged from multi_species into main on Jan 16, 2024
Conversation

JintaoWu98 (Member):
This branch adds support for cluster species that contain multiple individual species. To be more specific, ARC currently creates a separate Gaussian input file for each species it receives. With this new feature, we can combine all the species into one Gaussian input file, followed by the relevant post-processing steps.
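As a rough illustration of the idea (a sketch, not ARC's actual implementation — the helper name, the per-species input blocks, and the use of Gaussian's `--Link1--` multi-job separator are assumptions here):

```python
# Hypothetical sketch: combine several single-species Gaussian inputs into one
# multi-job input file. Gaussian chains independent jobs in a single file
# using the "--Link1--" separator between job sections.

def combine_species_inputs(species_blocks):
    """Join complete per-species Gaussian input blocks into one input string."""
    return '\n--Link1--\n'.join(block.strip() + '\n' for block in species_blocks)

block_1 = """%chk=spc1.chk
#P opt B3LYP/6-31G(d)

spc1

0 1
H 0.0 0.0 0.0
H 0.0 0.0 0.74
"""

block_2 = """%chk=spc2.chk
#P opt B3LYP/6-31G(d)

spc2

0 3
O 0.0 0.0 0.0
"""

combined = combine_species_inputs([block_1, block_2])
print(combined)
```

Each species keeps its own route section, charge, and multiplicity; the post-processing step then slices the single output file back into per-species results.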

codecov bot commented Dec 18, 2023

Codecov Report

Attention: 132 lines in your changes are missing coverage. Please review.

Comparison: base (3b46df6) 73.71% vs. head (4aa5282) 73.67%.

Files Patch % Lines
arc/scheduler.py 8.69% 121 Missing and 5 partials ⚠️
arc/plotter.py 94.44% 1 Missing and 1 partial ⚠️
arc/species/species.py 50.00% 1 Missing and 1 partial ⚠️
arc/job/adapter.py 83.33% 0 Missing and 1 partial ⚠️
arc/job/adapters/common.py 97.14% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #724      +/-   ##
==========================================
- Coverage   73.71%   73.67%   -0.04%     
==========================================
  Files          99       99              
  Lines       26992    27180     +188     
  Branches     5620     5687      +67     
==========================================
+ Hits        19896    20025     +129     
- Misses       5698     5757      +59     
  Partials     1398     1398              
Flag Coverage Δ
unittests 73.67% <51.47%> (-0.04%) ⬇️


alongd (Member) left a comment:
@JintaoWu98, thanks for this first major contribution to ARC!
The PR looks good and well thought out. I added comments, most of them on improving writing style, some with questions.

arc/scheduler.py Outdated
@@ -419,7 +422,10 @@ def __init__(self,
# conformers weren't asked for, assign initial_xyz
species.initial_xyz = species.conformers[0]
if species.label not in self.running_jobs:
self.running_jobs[species.label] = list() # initialize before running the first job
if not species.multi_species:
alongd (Member):
We can change these 4 lines into a more compact form:

self.running_jobs[species.label if not species.multi_species else species.multi_species] = list()
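A toy check of the suggested one-liner (using a stand-in class rather than ARC's actual species object, whose attributes are assumed here) shows it uses the cluster name as the dictionary key when multi_species is set, and the species label otherwise:

```python
# Stand-in for ARC's species object; attribute names follow the snippet above.
class Spc:
    def __init__(self, label, multi_species=None):
        self.label = label
        self.multi_species = multi_species  # cluster name, or None

running_jobs = {}
for species in [Spc('spc1'), Spc('spc2', multi_species='cluster_1')]:
    # The suggested compact form: key on the cluster name when present.
    running_jobs[species.label if not species.multi_species else species.multi_species] = list()

print(sorted(running_jobs))  # ['cluster_1', 'spc1']
```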

if self.run_multi_species == False:
input_dict['xyz'] = xyz_to_str(self.xyz)
else:
input_dict['xyz'] = list()
alongd (Member):
We can simplify these five lines as:

input_dict['xyz'] = [xyz_to_str(xyz) for xyz in self.xyz] if self.run_multi_species else xyz_to_str(self.xyz)

To be extra careful, we can also change the above else to elif isinstance(xyz, list):
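The suggested one-liner can be sanity-checked with a stub xyz_to_str (ARC's real function converts an xyz dict to a string; the stub below is an assumption for illustration):

```python
# Stub standing in for ARC's xyz_to_str converter.
def xyz_to_str(xyz):
    return f'<{xyz}>'

def convert(xyz, run_multi_species):
    # The suggested compact form from the comment above:
    return [xyz_to_str(x) for x in xyz] if run_multi_species else xyz_to_str(xyz)

print(convert('xyz1', False))           # <xyz1>
print(convert(['xyz1', 'xyz2'], True))  # ['<xyz1>', '<xyz2>']
```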

species = self.species_dict[label]
elif isinstance(label, list):
species = [spc for spc in self.species_list if spc.label in label]
run_multi_species = all([spc.multi_species is not None for spc in species]) if isinstance(species, list) else False
alongd (Member):
Perhaps we can strengthen this check by making sure all multi_species clusters are the same string:

 run_multi_species = all([spc.multi_species == spc.multi_species[0] for spc in species]) and species[0].multi_species is not None if isinstance(species, list) and len(species) else False

I hope it's not too complicated, but it should be more robust.

JintaoWu98 (Member, Author):
Thanks for the advice. I agree this idea can make sure all multi_species clusters are the same string, but I have two questions:
1. For the all function, should
all([spc.multi_species == spc.multi_species[0] for spc in species])
be the following instead:
all([spc.multi_species == species[0].multi_species for spc in species])
2. If we only focus on run_opt_job, where we have the following lines:

        if self.species_dict[label].multi_species:
            ##Some code##
            label = [species.label for species in self.species_list
                     if species.multi_species == self.species_dict[label].multi_species]
        self.run_job(label=label, 
                     xyz=self.species_dict[label].initial_xyz if isinstance(label, str) else None,
                     level_of_theory=self.opt_level,
                     job_type='opt', 
                     fine=fine)

The label argument has already been filtered to make sure all multi_species clusters are the same string, so I don't think we really need this augmented version here.
On the other hand, it would be necessary anywhere else run_job is called.
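Regarding the first question above, a quick check with plain strings (a standalone sketch, independent of ARC's objects) confirms that spc.multi_species == spc.multi_species[0] compares each cluster name with its own first character, which is only True for one-character names, whereas comparing against species[0].multi_species checks that all names match:

```python
# Minimal stand-in with only the attribute under discussion.
class Spc:
    def __init__(self, multi_species):
        self.multi_species = multi_species

species = [Spc('cluster_1'), Spc('cluster_1')]

# Compares each name with its own first character ('cluster_1' vs 'c') -> False:
check_a = all([spc.multi_species == spc.multi_species[0] for spc in species])

# Compares each name with the first species' name -> True:
check_b = all([spc.multi_species == species[0].multi_species for spc in species])

print(check_a, check_b)  # False True
```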

alongd (Member):
Was this addressed?

arc/scheduler.py Outdated
@@ -1419,12 +1438,12 @@ def spawn_post_opt_jobs(self,
return None

# Spawn IRC if requested and if relevant.
if self.job_types['irc'] and self.species_dict[label].is_ts:
if label in self.output and self.job_types['irc'] and self.species_dict[label].is_ts:
alongd (Member):
This is OK, but perhaps we can be clearer by writing if label in self.output.keys(), which is more "formal", though probably equivalent to what's written already. If we make the change, then we also need to modify the occurrences below.

arc/scheduler.py Outdated
@@ -2244,62 +2266,91 @@ def parse_opt_geo(self,
success = False
logger.debug(f'parsing opt geo for {job.job_name}')
if job.job_status[1]['status'] == 'done':
opt_xyz = parser.parse_xyz_from_file(path=job.local_path_to_xyz or job.local_path_to_output_file)
multi_species = any(spc.multi_species == label for spc in self.species_list)
species_labels = [spc.label for spc in self.species_list if spc.multi_species == label]
alongd (Member):
I would add at the end of this line: if multi_species else list() to avoid running the loop if multi_species is False

multiplicity = obj.multiplicity if species is None else species.multiplicity
number_of_radicals = obj.species[0].number_of_radicals if species is None else species.number_of_radicals
species_label = obj.species[0].label if species is None else species.label
if (multiplicity > 1 and obj.level.method_type != 'composite') \
alongd (Member):
We no longer need and obj.level.method_type != 'composite'; this is checked above, so we can assume here that it is not composite.

arc/scheduler.py Outdated
@@ -13,6 +13,7 @@
import numpy as np
from IPython.display import display
from typing import TYPE_CHECKING, List, Optional, Tuple, Union
from pathlib import Path
alongd (Member):
The import should be added in the commit where Path is being used (unless we change to using os.path.dirname())

JintaoWu98 (Member, Author):
Thanks, I think it is used below in make_multi_species_output_file

alongd (Member):
Was this comment addressed? If so, you can mark it as "resolved"

arc/scheduler.py Outdated
@@ -597,10 +598,15 @@ def schedule_jobs(self):
if not (job.job_id in self.server_job_ids and job.job_id not in self.completed_incore_jobs):
successful_server_termination = self.end_job(job=job, label=label, job_name=job_name)
if successful_server_termination:
multi_species = any(spc.multi_species == label for spc in self.species_list)
if multi_species:
self.multi_species_path_dict = self.make_multi_species_output_file(label=label, job=job)
alongd (Member):
Maybe we should call make_multi_species_output_file() inside parse_opt_geo()?
This place seems too generic or too high-level to be calling these functions.

JintaoWu98 (Member, Author):
The reason I put it here is that the output of make_multi_species_output_file is used by both parse_opt_geo and parse_opt_e_elect, as seen below. This also allows us to add more parsing steps between make_multi_species_output_file and delete_multi_species_output_file eventually, if needed.

alongd (Member):
Was this addressed?

@@ -605,8 +605,8 @@ def schedule_jobs(self):
if success:
self.parse_opt_e_elect(label=label, job=job)
self.spawn_post_opt_jobs(label=label, job_name=job_name)
# if multi_species:
# self.delete_multi_species_output_file(label=label, job=job)
if multi_species:
alongd (Member):
This should be squashed with the previous commit, where it was added as a comment.

Also, we need to consider where to call this. Maybe inside parse_opt_e_elect() as well?

arc/parser.py Outdated
except (LogError, NotImplementedError):
logger.warning(f'Could not read e_elect from {path}')
return e_elect_multi
# def parse_multi_species_e_elect(path: str,
alongd (Member):
This isn't final, right?

JintaoWu98 (Member, Author):
Sorry for the confusion; I have removed this part entirely now.

@JintaoWu98 JintaoWu98 force-pushed the multi_species branch 5 times, most recently from 99566f1 to 658eeea Compare December 21, 2023 10:02
calvinp0 (Member):
@JintaoWu98 @alongd

Hi guys, I just wanted to ask: what happens if one species in the multi-species Gaussian file fails due to a convergence issue, etc.? How do we troubleshoot it? Do we rerun all the species again, or will ARC be able to recognise that, for example, 1 of the 3 species in the input.gjf failed, and only troubleshoot that particular species?

JintaoWu98 commented Dec 21, 2023

@JintaoWu98 @alongd

Hi guys, I just wanted to ask: what happens if one species in the multi-species Gaussian file fails due to a convergence issue, etc.? How do we troubleshoot it? Do we rerun all the species again, or will ARC be able to recognise that, for example, 1 of the 3 species in the input.gjf failed, and only troubleshoot that particular species?

From my point of view, a failed species can be rerun individually instead of rerunning the whole group: during parsing we slice the big output file into individual files and investigate the status of each species, so when one species goes wrong, we can relaunch just that one. Though I am not sure whether this is currently implemented; could you please comment on that, @alongd? Thanks!
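The slicing idea can be sketched roughly as follows (a minimal sketch, not ARC's actual code — the function name, the 'Normal termination' marker, and the assumption that species appear in order are all hypothetical here):

```python
def slice_multi_species_output(text, species_labels):
    """Split a concatenated Gaussian log into per-species chunks.

    Assumes a 'Normal termination' line ends each species' section and that
    the species appear in the same order as ``species_labels``.
    """
    chunks, current = {}, []
    labels = iter(species_labels)
    for line in text.splitlines(keepends=True):
        current.append(line)
        if 'Normal termination' in line:
            chunks[next(labels)] = ''.join(current)
            current = []
    return chunks

# Toy two-species log:
log = (
    'SCF Done: E = -1.0\n Normal termination of Gaussian 16\n'
    'SCF Done: E = -2.0\n Normal termination of Gaussian 16\n'
)
parts = slice_multi_species_output(log, ['spc1', 'spc2'])
print(list(parts))  # ['spc1', 'spc2']
```

With per-species chunks like this, the status of each species can be checked independently, so only the failed one needs to be relaunched.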

@JintaoWu98 JintaoWu98 closed this Dec 21, 2023
@calvinp0 calvinp0 reopened this Dec 21, 2023
alongd commented Dec 21, 2023

I agree, troubleshooting specific species in the future is the way to go

@JintaoWu98 JintaoWu98 force-pushed the multi_species branch 2 times, most recently from 46022ca to e2ec635 Compare December 27, 2023 10:11
alongd (Member) left a comment:
Thanks @JintaoWu98 !

I added another round of comments, please take a look

@@ -375,7 +375,8 @@ def determine_job_array_parameters(self):
ARC will allocate, e.g., 8 workers, to simultaneously get processes (one by one) from the HDF5 bank
and execute them. On average, each worker in this example executes 125 jobs.
"""
if self.execution_type == 'incore':
if self.execution_type == 'incore' \
or self.species is not None and all([spc.multi_species is not None for spc in self.species]):
alongd (Member):
Can we use self.run_multi_species here instead of self.species is not None and all([spc.multi_species is not None for spc in self.species])?

@@ -545,7 +549,7 @@ def set_file_paths(self):

if self.server is not None:
# Parentheses don't play well in folder names:
species_name_remote = self.species_label.replace('(', '_').replace(')', '_')
species_name_remote = self.species_label.replace('(', '_').replace(')', '_') if isinstance(self.species_label, str) else ''
alongd (Member):
This will make species_name_remote an empty string in the multispecies case. Is that desirable?

@@ -245,8 +248,8 @@ def write_input_file(self) -> None:
if self.level.method_type in ['semiempirical', 'force_field']:
self.checkfile = None

if not is_restricted(self):
input_dict['restricted'] = 'u'
# if self.run_multi_species is False and not is_restricted(self):
alongd (Member):
Why did this commented-out code remain here? Can we remove it?

@@ -263,8 +271,10 @@ def write_input_file(self) -> None:
self.level.method = 'cbs-qb3'

# Job type specific options
# max_cycles 500
max_c = self.args['trsh'].split()[1] if 'max_cycles' in self.args['trsh'] else 100
alongd (Member):
Was this comment addressed?

if self.job_type in ['opt', 'conformers', 'optfreq', 'composite']:
keywords = ['ts', 'calcfc', 'noeigentest', 'maxcycles=100'] if self.is_ts else ['calcfc']
keywords = ['ts', 'calcfc', 'noeigentest', f'maxcycles={max_c}' + '}'] if self.is_ts else ['calcfc']
alongd (Member):
Why do we need the extra + '}'?
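For reference, the f-string already closes its own brace, so the extra + '}' leaves a stray closing brace in the keyword (a quick standalone check, outside ARC):

```python
# The f-string substitution alone produces the intended keyword;
# appending '}' adds a stray brace that is likely unintended.
max_c = '100'
print(f'maxcycles={max_c}')        # maxcycles=100
print(f'maxcycles={max_c}' + '}')  # maxcycles=100}
```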

arc/scheduler.py Outdated
@@ -13,6 +13,7 @@
import numpy as np
from IPython.display import display
from typing import TYPE_CHECKING, List, Optional, Tuple, Union
from pathlib import Path
alongd (Member):
Was this comment addressed? If so, you can mark it as "resolved"

arc/scheduler.py Outdated
@@ -597,10 +598,15 @@ def schedule_jobs(self):
if not (job.job_id in self.server_job_ids and job.job_id not in self.completed_incore_jobs):
successful_server_termination = self.end_job(job=job, label=label, job_name=job_name)
if successful_server_termination:
multi_species = any(spc.multi_species == label for spc in self.species_list)
if multi_species:
self.multi_species_path_dict = self.make_multi_species_output_file(label=label, job=job)
alongd (Member):
Was this addressed?

species = self.species_dict[label]
elif isinstance(label, list):
species = [spc for spc in self.species_list if spc.label in label]
run_multi_species = all([spc.multi_species is not None for spc in species]) if isinstance(species, list) else False
alongd (Member):
Was this addressed?

arc/scheduler.py Outdated
@@ -3503,6 +3615,74 @@ def save_restart_dict(self):
logger.debug(f'Dumping restart dictionary:\n{self.restart_dict}')
save_yaml_file(path=self.restart_path, content=self.restart_dict)

def make_multi_species_output_file(self,
alongd (Member):
Can you try relocating this function from Scheduler to plotter?
Also please add a test for it

JintaoWu98 (Member, Author) left a comment:
Please let me know if there are any other comments. Thanks!


arc/parser.py Outdated
software: Optional[str] = 'gaussian',
) -> Optional[Dict[str, tuple]]:
"""
Parse the electronic energy from an opt job output file.
JintaoWu98 (Member, Author):
Thanks, and sorry for the confusion: in the current version of the code I removed parse_multi_species_e_elect entirely. The original code had two goals.

The first was to slice the big cluster output file, which is now handled by Scheduler.make_multi_species_output_file.

The other was to parse e_elect, which is handled by Scheduler.parse_opt_e_elect:

if multi_species:
    for spc in self.species_list:
        if spc.multi_species == label:
            multi_species_opt_xyzs[spc.label] = parser.parse_xyz_from_file(path=self.multi_species_path_dict[spc.label])

The new version of the branch doesn't need to modify parser.py at all.

arc/parser.py Outdated

Args:
path (str): The ESS log file to parse from.
zpe_scale_factor (float): The ZPE scaling factor, used only for composite methods in Gaussian via Arkane.
JintaoWu98 (Member, Author):
Same as the above comment.

arc/parser.py Outdated
zpe_scale_factor (float): The ZPE scaling factor, used only for composite methods in Gaussian via Arkane.
software (str, optional): The ESS.

Returns: Optional[float]
JintaoWu98 (Member, Author):
Same as the above comment.

arc/parser.py Outdated
output_file.write(software + '\n' + line + '\n')
e_elect_multi[species_name_list[i]] = parse_e_elect(path=output_file_path)
os.remove(output_file_path)
except (LogError, NotImplementedError):
JintaoWu98 (Member, Author):
Thanks. Same as the previous comment about removing this entire block; this is now handled by parse_e_elect in Scheduler.parse_opt_e_elect directly.




arc/scheduler.py Outdated
@@ -596,6 +596,7 @@ def schedule_jobs(self):
if successful_server_termination:
success = self.parse_opt_geo(label=label, job=job)
if success:
self.parse_opt_e_elect(label=label, job=job)
JintaoWu98 (Member, Author):
Using if not self.job_types['sp']: seems to lead to functional test failures for both Kinetics and Thermo, so for now I'm keeping the if success.

alongd (Member) left a comment:

Thanks for this contribution, @JintaoWu98 !
Looking good. Please go ahead and squash the fixup commits

alongd and others added 3 commits January 13, 2024 16:31
make_multi_species_output_file is created to first slice the big cluster file.

delete_multi_species_output_file is created to delete the temporary generated files.
In addition, define a new method "post_opt_geo_work" for post-processing
JintaoWu98 and others added 11 commits January 13, 2024 21:18
make_multi_species_output_file is created to first slice the big cluster file.

delete_multi_species_output_file is created to delete the temporary generated files.
In case of given initial xyzs, we can now prevent them from running multiple times
Currently only implemented for Gaussian

Modify the gaussian.py for multiple species, including rewriting the template
In addition, add a few method types leading to restricted cases


fixup! multi_species "is_restricted" in common.py and common_test.py
@alongd
Copy link
Member

alongd commented Jan 15, 2024

@JintaoWu98, is this PR ready to be merged? Can you run a multispecies project using the updated branch to confirm it works as expected?

JintaoWu98 (Member, Author):

@JintaoWu98, is this PR ready to be merged? Can you run a multispecies project using the updated branch to confirm it works as expected?

Hi @alongd, a multi-species project is working as expected. I think it is ready to be merged.

alongd (Member) left a comment:
Thanks for this contribution!

@alongd alongd merged commit 061cdc8 into main Jan 16, 2024
5 of 7 checks passed
@alongd alongd deleted the multi_species branch January 16, 2024 13:02