
Multi species #724

Merged: 14 commits merged from multi_species into main on Jan 16, 2024
Conversation

JintaoWu98 (Member):
This branch adds support for cluster species that contain multiple individual species. To be more specific, ARC currently creates a separate Gaussian input file for each species it receives. With this new feature, we can combine all the species into one Gaussian input file, followed by the relevant post-processing steps.
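As a rough illustration of the idea (a sketch, not ARC's actual implementation — the helper name, the per-species input blocks, and the use of Gaussian's `--Link1--` multi-job separator are assumptions here):

```python
# Hypothetical sketch: combine several single-species Gaussian inputs into one
# multi-job input file. Gaussian chains independent jobs in a single file
# using the "--Link1--" separator between job sections.

def combine_species_inputs(species_blocks):
    """Join complete per-species Gaussian input blocks into one input string."""
    return '\n--Link1--\n'.join(block.strip() + '\n' for block in species_blocks)

block_1 = """%chk=spc1.chk
#P opt B3LYP/6-31G(d)

spc1

0 1
H 0.0 0.0 0.0
H 0.0 0.0 0.74
"""

block_2 = """%chk=spc2.chk
#P opt B3LYP/6-31G(d)

spc2

0 3
O 0.0 0.0 0.0
"""

combined = combine_species_inputs([block_1, block_2])
print(combined)
```

Each species keeps its own route section, charge, and multiplicity; the post-processing step then slices the single output file back into per-species results.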

codecov bot commented Dec 18, 2023

Codecov Report

Attention: 132 lines in your changes are missing coverage. Please review.

Comparison: base (3b46df6) 73.71% vs. head (4aa5282) 73.67%.

Files Patch % Lines
arc/scheduler.py 8.69% 121 Missing and 5 partials ⚠️
arc/plotter.py 94.44% 1 Missing and 1 partial ⚠️
arc/species/species.py 50.00% 1 Missing and 1 partial ⚠️
arc/job/adapter.py 83.33% 0 Missing and 1 partial ⚠️
arc/job/adapters/common.py 97.14% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #724      +/-   ##
==========================================
- Coverage   73.71%   73.67%   -0.04%     
==========================================
  Files          99       99              
  Lines       26992    27180     +188     
  Branches     5620     5687      +67     
==========================================
+ Hits        19896    20025     +129     
- Misses       5698     5757      +59     
  Partials     1398     1398              
Flag Coverage Δ
unittests 73.67% <51.47%> (-0.04%) ⬇️


alongd (Member) left a comment:
@JintaoWu98, thanks for this first major contribution to ARC!
The PR looks good and well thought out. I added comments, most of them on improving writing style, some with questions.

arc/scheduler.py Outdated
@@ -419,7 +422,10 @@ def __init__(self,
# conformers weren't asked for, assign initial_xyz
species.initial_xyz = species.conformers[0]
if species.label not in self.running_jobs:
self.running_jobs[species.label] = list() # initialize before running the first job
if not species.multi_species:
alongd (Member):
We can change these 4 lines into a more compact form:

self.running_jobs[species.label if not species.multi_species else species.multi_species] = list()
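A toy check of the suggested one-liner (using a stand-in class rather than ARC's actual species object, whose attributes are assumed here) shows it uses the cluster name as the dictionary key when multi_species is set, and the species label otherwise:

```python
# Stand-in for ARC's species object; attribute names follow the snippet above.
class Spc:
    def __init__(self, label, multi_species=None):
        self.label = label
        self.multi_species = multi_species  # cluster name, or None

running_jobs = {}
for species in [Spc('spc1'), Spc('spc2', multi_species='cluster_1')]:
    # The suggested compact form: key on the cluster name when present.
    running_jobs[species.label if not species.multi_species else species.multi_species] = list()

print(sorted(running_jobs))  # ['cluster_1', 'spc1']
```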

if self.run_multi_species == False:
input_dict['xyz'] = xyz_to_str(self.xyz)
else:
input_dict['xyz'] = list()
alongd (Member):
We can simplify these five lines as:

input_dict['xyz'] = [xyz_to_str(xyz) for xyz in self.xyz] if self.run_multi_species else xyz_to_str(self.xyz)

To be extra careful, we can also change the above else to elif isinstance(xyz, list):
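The suggested one-liner can be sanity-checked with a stub xyz_to_str (ARC's real function converts an xyz dict to a string; the stub below is an assumption for illustration):

```python
# Stub standing in for ARC's xyz_to_str converter.
def xyz_to_str(xyz):
    return f'<{xyz}>'

def convert(xyz, run_multi_species):
    # The suggested compact form from the comment above:
    return [xyz_to_str(x) for x in xyz] if run_multi_species else xyz_to_str(xyz)

print(convert('xyz1', False))           # <xyz1>
print(convert(['xyz1', 'xyz2'], True))  # ['<xyz1>', '<xyz2>']
```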

species = self.species_dict[label]
elif isinstance(label, list):
species = [spc for spc in self.species_list if spc.label in label]
run_multi_species = all([spc.multi_species is not None for spc in species]) if isinstance(species, list) else False
alongd (Member):
Perhaps we can strengthen this check by making sure all multi_species clusters are the same string:

 run_multi_species = all([spc.multi_species == spc.multi_species[0] for spc in species]) and species[0].multi_species is not None if isinstance(species, list) and len(species) else False

I hope it's not too complicated, but it should be more robust.

JintaoWu98 (Member, Author):
Thanks for the advice. I agree this idea can make sure all multi_species clusters are the same string, but I have two questions:
1. For the all function, should
all([spc.multi_species == spc.multi_species[0] for spc in species])
be the following instead:
all([spc.multi_species == species[0].multi_species for spc in species])
2. If we only focus on run_opt_job, where we have the following lines:

        if self.species_dict[label].multi_species:
            ##Some code##
            label = [species.label for species in self.species_list
                     if species.multi_species == self.species_dict[label].multi_species]
        self.run_job(label=label, 
                     xyz=self.species_dict[label].initial_xyz if isinstance(label, str) else None,
                     level_of_theory=self.opt_level,
                     job_type='opt', 
                     fine=fine)

The label argument has already been filtered to make sure all multi_species clusters are the same string, so I don't think we really need this augmented version here.
On the other hand, it would be necessary anywhere else run_job is called.
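Regarding the first question above, a quick check with plain strings (a standalone sketch, independent of ARC's objects) confirms that spc.multi_species == spc.multi_species[0] compares each cluster name with its own first character, which is only True for one-character names, whereas comparing against species[0].multi_species checks that all names match:

```python
# Minimal stand-in with only the attribute under discussion.
class Spc:
    def __init__(self, multi_species):
        self.multi_species = multi_species

species = [Spc('cluster_1'), Spc('cluster_1')]

# Compares each name with its own first character ('cluster_1' vs 'c') -> False:
check_a = all([spc.multi_species == spc.multi_species[0] for spc in species])

# Compares each name with the first species' name -> True:
check_b = all([spc.multi_species == species[0].multi_species for spc in species])

print(check_a, check_b)  # False True
```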

alongd (Member):
Was this addressed?

arc/scheduler.py Outdated
@@ -1419,12 +1438,12 @@ def spawn_post_opt_jobs(self,
return None

# Spawn IRC if requested and if relevant.
if self.job_types['irc'] and self.species_dict[label].is_ts:
if label in self.output and self.job_types['irc'] and self.species_dict[label].is_ts:
alongd (Member):
This is OK, but perhaps we can be clearer by writing if label in self.output.keys(), which is more "formal", though probably equivalent to what's written already. If we make the change, then we also need to modify the occurrences below.

arc/scheduler.py Outdated
@@ -2244,62 +2266,91 @@ def parse_opt_geo(self,
success = False
logger.debug(f'parsing opt geo for {job.job_name}')
if job.job_status[1]['status'] == 'done':
opt_xyz = parser.parse_xyz_from_file(path=job.local_path_to_xyz or job.local_path_to_output_file)
multi_species = any(spc.multi_species == label for spc in self.species_list)
species_labels = [spc.label for spc in self.species_list if spc.multi_species == label]
alongd (Member):
I would add at the end of this line: if multi_species else list() to avoid running the loop if multi_species is False

multiplicity = obj.multiplicity if species is None else species.multiplicity
number_of_radicals = obj.species[0].number_of_radicals if species is None else species.number_of_radicals
species_label = obj.species[0].label if species is None else species.label
if (multiplicity > 1 and obj.level.method_type != 'composite') \
alongd (Member):
We no longer need and obj.level.method_type != 'composite'; this is checked above, so we can assume here that it is not composite.

arc/scheduler.py Outdated
@@ -13,6 +13,7 @@
import numpy as np
from IPython.display import display
from typing import TYPE_CHECKING, List, Optional, Tuple, Union
from pathlib import Path
alongd (Member):
The import should be added in the commit where Path is being used (unless we change to using os.path.dirname())

JintaoWu98 (Member, Author):
Thanks, I think it is used below in make_multi_species_output_file

alongd (Member):
Was this comment addressed? If so, you can mark it as "resolved"

arc/scheduler.py Outdated
@@ -597,10 +598,15 @@ def schedule_jobs(self):
if not (job.job_id in self.server_job_ids and job.job_id not in self.completed_incore_jobs):
successful_server_termination = self.end_job(job=job, label=label, job_name=job_name)
if successful_server_termination:
multi_species = any(spc.multi_species == label for spc in self.species_list)
if multi_species:
self.multi_species_path_dict = self.make_multi_species_output_file(label=label, job=job)
alongd (Member):
Maybe we should call make_multi_species_output_file() inside parse_opt_geo()?
This place seems too generic or too high-level to be calling these functions.

JintaoWu98 (Member, Author):
The reason I put it here is that the output of make_multi_species_output_file is used by both parse_opt_geo and parse_opt_e_elect, as seen below. This also allows us to add more parsing steps between make_multi_species_output_file and delete_multi_species_output_file eventually, if needed.

alongd (Member):
Was this addressed?

@@ -605,8 +605,8 @@ def schedule_jobs(self):
if success:
self.parse_opt_e_elect(label=label, job=job)
self.spawn_post_opt_jobs(label=label, job_name=job_name)
# if multi_species:
# self.delete_multi_species_output_file(label=label, job=job)
if multi_species:
alongd (Member):
This should be squashed with the previous commit, where it was added as a comment.

Also, we need to consider where to call this. Maybe inside parse_opt_e_elect() as well?

arc/parser.py Outdated
except (LogError, NotImplementedError):
logger.warning(f'Could not read e_elect from {path}')
return e_elect_multi
# def parse_multi_species_e_elect(path: str,
alongd (Member):
This isn't final, right?

JintaoWu98 (Member, Author):
Sorry for the confusion; I have removed this part entirely now.

@JintaoWu98 JintaoWu98 force-pushed the multi_species branch 5 times, most recently from 99566f1 to 658eeea Compare December 21, 2023 10:02
calvinp0 (Member):
@JintaoWu98 @alongd

Hi guys, I just wanted to ask: what happens if one species in the multi-species Gaussian file fails due to a convergence issue, etc.? How do we troubleshoot it? Do we rerun all the species again, or will ARC be able to recognise that, for example, 1 of the 3 species in the input.gjf failed, and only troubleshoot that particular species?

JintaoWu98 commented Dec 21, 2023

@JintaoWu98 @alongd

Hi guys, I just wanted to ask: what happens if one species in the multi-species Gaussian file fails due to a convergence issue, etc.? How do we troubleshoot it? Do we rerun all the species again, or will ARC be able to recognise that, for example, 1 of the 3 species in the input.gjf failed, and only troubleshoot that particular species?

From my point of view, a failed species can be rerun individually instead of rerunning the whole group: during parsing we slice the big output file into individual files and investigate the status of each species, so when one species goes wrong, we can relaunch just that one. Though I am not sure whether this is currently implemented; could you please comment on that, @alongd? Thanks!
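The slicing idea can be sketched roughly as follows (a minimal sketch, not ARC's actual code — the function name, the 'Normal termination' marker, and the assumption that species appear in order are all hypothetical here):

```python
def slice_multi_species_output(text, species_labels):
    """Split a concatenated Gaussian log into per-species chunks.

    Assumes a 'Normal termination' line ends each species' section and that
    the species appear in the same order as ``species_labels``.
    """
    chunks, current = {}, []
    labels = iter(species_labels)
    for line in text.splitlines(keepends=True):
        current.append(line)
        if 'Normal termination' in line:
            chunks[next(labels)] = ''.join(current)
            current = []
    return chunks

# Toy two-species log:
log = (
    'SCF Done: E = -1.0\n Normal termination of Gaussian 16\n'
    'SCF Done: E = -2.0\n Normal termination of Gaussian 16\n'
)
parts = slice_multi_species_output(log, ['spc1', 'spc2'])
print(list(parts))  # ['spc1', 'spc2']
```

With per-species chunks like this, the status of each species can be checked independently, so only the failed one needs to be relaunched.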

@JintaoWu98 JintaoWu98 closed this Dec 21, 2023
@calvinp0 calvinp0 reopened this Dec 21, 2023
alongd commented Dec 21, 2023

I agree, troubleshooting specific species in the future is the way to go

@JintaoWu98 JintaoWu98 force-pushed the multi_species branch 2 times, most recently from 46022ca to e2ec635 Compare December 27, 2023 10:11
alongd (Member) left a comment:
Thanks @JintaoWu98 !

I added another round of comments, please take a look

@@ -375,7 +375,8 @@ def determine_job_array_parameters(self):
ARC will allocate, e.g., 8 workers, to simultaneously get processes (one by one) from the HDF5 bank
and execute them. On average, each worker in this example executes 125 jobs.
"""
if self.execution_type == 'incore':
if self.execution_type == 'incore' \
or self.species is not None and all([spc.multi_species is not None for spc in self.species]):
alongd (Member):
Can we use self.run_multi_species here instead of self.species is not None and all([spc.multi_species is not None for spc in self.species])?

@@ -545,7 +549,7 @@ def set_file_paths(self):

if self.server is not None:
# Parentheses don't play well in folder names:
species_name_remote = self.species_label.replace('(', '_').replace(')', '_')
species_name_remote = self.species_label.replace('(', '_').replace(')', '_') if isinstance(self.species_label, str) else ''
alongd (Member):
This will make species_name_remote an empty string in the multispecies case. Is that desirable?

@@ -245,8 +248,8 @@ def write_input_file(self) -> None:
if self.level.method_type in ['semiempirical', 'force_field']:
self.checkfile = None

if not is_restricted(self):
input_dict['restricted'] = 'u'
# if self.run_multi_species is False and not is_restricted(self):
alongd (Member):
Why did this commented-out code remain here? Can we remove it?

@@ -263,8 +271,10 @@ def write_input_file(self) -> None:
self.level.method = 'cbs-qb3'

# Job type specific options
# max_cycles 500
max_c = self.args['trsh'].split()[1] if 'max_cycles' in self.args['trsh'] else 100
alongd (Member):
Was this comment addressed?

if self.job_type in ['opt', 'conformers', 'optfreq', 'composite']:
keywords = ['ts', 'calcfc', 'noeigentest', 'maxcycles=100'] if self.is_ts else ['calcfc']
keywords = ['ts', 'calcfc', 'noeigentest', f'maxcycles={max_c}' + '}'] if self.is_ts else ['calcfc']
alongd (Member):
Why do we need the extra + '}'?
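For reference, the f-string already closes its own brace, so the extra + '}' leaves a stray closing brace in the keyword (a quick standalone check, outside ARC):

```python
# The f-string substitution alone produces the intended keyword;
# appending '}' adds a stray brace that is likely unintended.
max_c = '100'
print(f'maxcycles={max_c}')        # maxcycles=100
print(f'maxcycles={max_c}' + '}')  # maxcycles=100}
```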

arc/scheduler.py Outdated
@@ -13,6 +13,7 @@
import numpy as np
from IPython.display import display
from typing import TYPE_CHECKING, List, Optional, Tuple, Union
from pathlib import Path
alongd (Member):
Was this comment addressed? If so, you can mark it as "resolved"

arc/scheduler.py Outdated
@@ -597,10 +598,15 @@ def schedule_jobs(self):
if not (job.job_id in self.server_job_ids and job.job_id not in self.completed_incore_jobs):
successful_server_termination = self.end_job(job=job, label=label, job_name=job_name)
if successful_server_termination:
multi_species = any(spc.multi_species == label for spc in self.species_list)
if multi_species:
self.multi_species_path_dict = self.make_multi_species_output_file(label=label, job=job)
alongd (Member):
Was this addressed?

species = self.species_dict[label]
elif isinstance(label, list):
species = [spc for spc in self.species_list if spc.label in label]
run_multi_species = all([spc.multi_species is not None for spc in species]) if isinstance(species, list) else False
alongd (Member):
Was this addressed?

arc/scheduler.py Outdated
@@ -3503,6 +3615,74 @@ def save_restart_dict(self):
logger.debug(f'Dumping restart dictionary:\n{self.restart_dict}')
save_yaml_file(path=self.restart_path, content=self.restart_dict)

def make_multi_species_output_file(self,
alongd (Member):
Can you try relocating this function from Scheduler to plotter?
Also please add a test for it

JintaoWu98 (Member, Author) left a comment:
Please let me know if there are any other comments. Thanks!


arc/parser.py Outdated
software: Optional[str] = 'gaussian',
) -> Optional[Dict[str, tuple]]:
"""
Parse the electronic energy from an opt job output file.
JintaoWu98 (Member, Author):
Thanks, and sorry for the confusion: in the current version of the code I removed parse_multi_species_e_elect entirely. The original code had two goals.

The first was to slice the big cluster output file, which is now handled by Scheduler.make_multi_species_output_file.

The other was to parse e_elect, which is handled by Scheduler.parse_opt_e_elect:

if multi_species:
    for spc in self.species_list:
        if spc.multi_species == label:
            multi_species_opt_xyzs[spc.label] = parser.parse_xyz_from_file(path=self.multi_species_path_dict[spc.label])

The new version of the branch doesn't need to modify parser.py at all.

arc/parser.py Outdated

Args:
path (str): The ESS log file to parse from.
zpe_scale_factor (float): The ZPE scaling factor, used only for composite methods in Gaussian via Arkane.
JintaoWu98 (Member, Author):
Same as the above comment.

arc/parser.py Outdated
zpe_scale_factor (float): The ZPE scaling factor, used only for composite methods in Gaussian via Arkane.
software (str, optional): The ESS.

Returns: Optional[float]
JintaoWu98 (Member, Author):
Same as the above comment.

arc/parser.py Outdated
output_file.write(software + '\n' + line + '\n')
e_elect_multi[species_name_list[i]] = parse_e_elect(path=output_file_path)
os.remove(output_file_path)
except (LogError, NotImplementedError):
JintaoWu98 (Member, Author):
Thanks. Same as the previous comment about removing this entire block; this is now handled by parse_e_elect in Scheduler.parse_opt_e_elect directly.




arc/scheduler.py Outdated
@@ -596,6 +596,7 @@ def schedule_jobs(self):
if successful_server_termination:
success = self.parse_opt_geo(label=label, job=job)
if success:
self.parse_opt_e_elect(label=label, job=job)
JintaoWu98 (Member, Author):
Using if not self.job_types['sp']: seems to lead to functional test failures for both Kinetics and Thermo, so for now I'm keeping the if success.

alongd (Member) left a comment:

Thanks for this contribution, @JintaoWu98 !
Looking good. Please go ahead and squash the fixup commits

alongd and others added 3 commits January 13, 2024 16:31
make_multi_species_output_file is created to first slice the big cluster file.

delete_multi_species_output_file is created to delete the temporary generated files.
In addition, define a new method "post_opt_geo_work" for post-processing
JintaoWu98 and others added 11 commits January 13, 2024 21:18
make_multi_species_output_file is created to first slice the big cluster file.

delete_multi_species_output_file is created to delete the temporary generated files.
In case of given initial xyzs, we can now prevent them from running multiple times
Currently only implemented for Gaussian

Modify the gaussian.py for multiple species, including rewriting the template
In addition, add a few method types leading to restricted cases


fixup! multi_species "is_restricted" in common.py and common_test.py
@alongd
Copy link
Member

alongd commented Jan 15, 2024

@JintaoWu98, is this PR ready to be merged? Can you run a multispecies project using the updated branch to confirm it works as expected?

JintaoWu98 (Member, Author):

@JintaoWu98, is this PR ready to be merged? Can you run a multispecies project using the updated branch to confirm it works as expected?

Hi @alongd, a multi-species project is working as expected. I think it is ready to be merged.

alongd (Member) left a comment:
Thanks for this contribution!

@alongd alongd merged commit 061cdc8 into main Jan 16, 2024
5 of 7 checks passed
@alongd alongd deleted the multi_species branch January 16, 2024 13:02