Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster Info #455

Open
wants to merge 10 commits into
base: dev
Choose a base branch
from
Open

Cluster Info #455

wants to merge 10 commits into from

Conversation

TimothyWillard
Copy link
Contributor

Describe your changes.

This pull request adds an info module to gempyor with:

  • Internal _get_info helper for pulling a configuration YAML given a category and class to parse into,
  • Public get_cluster_info for pulling the configuration associated with a particular HPC cluster, and
  • Internal _infer_cluster_from_fqdn to guess the HPC cluster from the FQDN.

New module/functionality include docstrings and unit tests. This PR also makes it easy to add future categories of generic info/config as needed.

Does this pull request make any user interface changes? If so please describe.

The user interface changes are contained in gempyor.info and only contains new functionality, no breaking changes.

Those are reflected in updates to the documentation in docstrings, but could be added to the GitBook if needed. However I think the docstrings are comprehensive and are useful for the target audience.

What does your pull request address? Tag relevant issues.

This pull request addresses GH-342, and was spun out of GH-394.

Draft solution to GH-342 with the ability to extend to other types of
generic config/info in the future. Add `gempyor.info` module with
exported function `get_cluster_info` which pulls and parses an
information yaml into a pydantic model.
Added specification for path modifications to the cluster info (namely
to add AWS cli on rockfish to path). Specification includes the path to
add, prepend vs append, and a flag to error if path is missing.
* Added `gempyor.info._infer_cluster_from_fqdn` internal utility.
* `gempyor.info.get_cluster_info` now accepts `None` for `name`.
* Minor testing fixes.
Fleshed out documentation for the `gempyor.info` module. Including
module level docs and docstrings for the individual classes. Fleshed out
examples.
Made the `ValueError` from `_infer_cluster_from_fqdn` optional, but
default to `True` for current behavior.
@TimothyWillard TimothyWillard added enhancement Request for improvement or addition of new feature(s). gempyor Concerns the Python core. batch Relating to batch processing. medium priority Medium priority. next release Marks a PR as a target to include in the next release. labels Jan 8, 2025
@TimothyWillard TimothyWillard linked an issue Jan 8, 2025 that may be closed by this pull request
Copy link
Contributor

@pearsonca pearsonca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems overall good, but I have a hard concern re including longleaf / rockfish specs in what's supposed to be a general community library.

I get that for our use, it's pretty ideal to have these here. But seems like the wrong answer for not-us users, and taking the easy short cut for us is maybe meaning we aren't thinking hard enough about what's the right answer for other users.

.github/workflows/gempyor-ci.yml Outdated Show resolved Hide resolved
Set the `$FLEPI_PATH` environment variable in the gempyor CI GitHub
action, because the internal `gempyor.info._get_info` function needs it
to know where to look for config YAMLs.
@TimothyWillard TimothyWillard force-pushed the feature/342/cluster-info branch from 89b289e to 86dbb75 Compare January 8, 2025 19:09
@TimothyWillard
Copy link
Contributor Author

I get that for our use, it's pretty ideal to have these here. But seems like the wrong answer for not-us users, and taking the easy short cut for us is maybe meaning we aren't thinking hard enough about what's the right answer for other users.

Good point. There are alternatives:

  1. Place these info YAMLs at a new location (say $FLEPI_INFO_PATH or similar), or
  2. Use the $PROJECT_PATH instead.

It's pretty easy to change the code to accommodate either use case so I have no qualms with either. Can also add in fallbacks like first check $FLEPI_INFO_PATH if set, then check $PROJECT_PATH if set, and finally check $FLEPI_PATH if set (or whatever sequence makes the most sense).

@pearsonca @jcblemai @MacdonaldJoshuaCaleb @saraloo @anjalika-nande what makes the most sense for y'all?

@pearsonca
Copy link
Contributor

I get that for our use, it's pretty ideal to have these here. But seems like the wrong answer for not-us users, and taking the easy short cut for us is maybe meaning we aren't thinking hard enough about what's the right answer for other users.

Good point. There are alternatives:

  1. Place these info YAMLs at a new location (say $FLEPI_INFO_PATH or similar), or
  2. Use the $PROJECT_PATH instead.

It's pretty easy to change the code to accommodate either use case so I have no qualms with either. Can also add in fallbacks like first check $FLEPI_INFO_PATH if set, then check $PROJECT_PATH if set, and finally check $FLEPI_PATH if set (or whatever sequence makes the most sense).

@pearsonca @jcblemai @MacdonaldJoshuaCaleb @saraloo @anjalika-nande what makes the most sense for y'all?

I kinda think project path? Each project should represent all the pertinent data for executing that project.

I could see a place for system / user "profile" like data - that seems like the next level version of this, as we're going to have think a bit more generically about handling various project-wide-defaults.

Copy link
Collaborator

@emprzy emprzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like including module-level documentation before import statements isn't something that has been done a lot in flepiMoP, even though it should be like this. Glad you are starting that here (though, I guess that's primarily because we don't have very much module-level documentation yet...)

@@ -0,0 +1,42 @@
$schema: https://json-schema.org/draft/2020-12/schema
$id: https://example.com/product.schema.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably copy paste error?

@saraloo
Copy link
Contributor

saraloo commented Jan 10, 2025

To me $FLEPI_INFO_PATH makes sense. Understand the project path being a self-contained 'able-to-run' repo, but I think this is more practical flepiMoP related run info?

@MacdonaldJoshuaCaleb
Copy link
Collaborator

I agree with Sara on this

@pearsonca
Copy link
Contributor

To me $FLEPI_INFO_PATH makes sense. Understand the project path being a self-contained 'able-to-run' repo, but I think this is more practical flepiMoP related run info?

To me, this should ultimately look roughly like the .Rprofile / .gitconfig / etc approach - look in the immediate place, then defer to some user site-wide file, then a system wide, etc.

@anjalika-nande
Copy link
Collaborator

To me $FLEPI_INFO_PATH makes sense. Understand the project path being a self-contained 'able-to-run' repo, but I think this is more practical flepiMoP related run info?

This seems more natural to me too!

@jcblemai
Copy link
Collaborator

This is very good work and addresses the past discussion on yaml cluster configuration.

I feel that it's helpful to have our rockfish and longleaf configuration inside the repository, as a living example for others to build their on their cluster and for us to have them all updated at the same time (and if flepiMoP is run on many more cluster by others, we can remove that at a later time).

But I agree with Carl that this might mean we forget other use cases, so we'd need to be careful. I'd be for the

look in the immediate place, then defer to some user site-wide file, then a system wide, etc.

(adding then look at the flepiMoP repo))

thing which is common for these tools, instead of another environment variable.

@TimothyWillard
Copy link
Contributor Author

Sounds like the consensus is:

  1. Look in the current directory (via os.getcwd()), then
  2. Check the $FLEPI_INFO_PATH directory if set, and then finally
  3. Check the $FLEPI_PATH.

Does that sound correct before I begin making these edits (thumbs up for yes)?

Updated the search paths to 1) current working directory, 2)
`$FLEPI_INFO_PATH`, and finally 3) `$FLEPI_PATH`. Per feedback:
#455 (comment).
@TimothyWillard TimothyWillard force-pushed the feature/342/cluster-info branch from c0b2c11 to c8c055c Compare January 24, 2025 19:08
Added a GitHub action workflow to convert the yamls in `info/` and
validate them against the schema file in the corresponding directory.
@TimothyWillard
Copy link
Contributor Author

This PR now implements the agreed upon search order in #455 (comment) and is ready for review again.

@TimothyWillard TimothyWillard added high priority High priority. and removed medium priority Medium priority. labels Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
batch Relating to batch processing. enhancement Request for improvement or addition of new feature(s). gempyor Concerns the Python core. high priority High priority. next release Marks a PR as a target to include in the next release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature request]: Generic Cluster Config
7 participants