-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Making coordinate variables searchable #212
Conversation
* Replaced a couple of try/excepts with .get in `src/access_nri_intake/catalog/translators.py` * Updated a misleading docstring
…nate variable searching & indexing
… searching for coordinate variables
likely to be necessary for passing around coordinates as well as data variables as we begin to make coordinates indexable - I think we'll begin to get confused about what belongs where.
dataclasses that are fed into a dataclass holding all the coordinate and data variables from a netCDF file. - Updated tests so that they are all passing: tests now expect to find coordinate variables from the netCDF files as well as data variables. - Some minor changes to make code more readable - changed long tuples to dataclasses & dictionaries where possible.
…due to use of '|' type union syntax - might be worth consideration? - Removed a couple of unused imports, cleaned up some comments
… search (forgot to test)
- Updated tests to better respect moving coordinate variables back into variables - Moved mypy setup stuff into to 'pre-commit-config.yaml'
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #212 +/- ##
=======================================
Coverage 97.17% 97.17%
=======================================
Files 9 9
Lines 672 672
=======================================
Hits 653 653
Misses 19 19 ☔ View full report in Codecov by Sentry. |
I've been through and linked a bunch of issues that I think the changes here resolve - about to verify. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but this PR doesn't actually do anything to read the coordinate variables itself (compared to the read in the main
version of the code currently does) - it's simply been prepped to handle the coordinates once intake-esm
gets updated?
for var in ds.data_vars: | ||
dvars = _DataVarInfo() | ||
|
||
for var in ds.variables: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New behaviour - scans all variables in dataset, including coordinate variables
Not precisely - it does additionally scan .nc files for coordinates (see above), but crucially also requires the esm_datastore object to do the same. If we have the same behaviour in the esm_datastore (defined in intake-esm) there this PR won't affect the variables we read. |
I've checked, and the changes in this PR don't resolve #117 or #192. In both instances, the issues specifically result from attempted concatenation of grid variables, rather than coordinate variables. Grid variables are stored as data variables in the netCDF files that hold them, but are don't depend on time & so the time variable is dropped before concatenation, which is the source of the 'no dimension to concatenate along' errors. I've changed this PR & the concomitant PR on intake-esm to draft status whilst I work out how closely related these issues are & whether we can neatly tie together all the changes we need. Edit: The obvious resolution to the grid variable problems mentioned above would be to retain all dimensions until the final dataset is produced - currently looking into that. Requesting time as a coordinate variable along with the desired grid also resolves the problems. |
Append attributes to the DataVarInfo object, if the attribute has a | ||
'long_name' key. | ||
|
||
TODO: Why do we need a long name key? seems important |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the code using the presence of 'long_name' as a proxy for "this is a real variable, not just additional info"?
ac97823
to
52c7cb0
Compare
I've split this into a series of stacked PR's: |
* Replaced a couple of Union[Type, None] with Optional[Type]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved as ready, although are we in a position to merge before Intake-ESM is updated?
I'll go back and triple check before I merge, but it should be the case that the catalogue will build exactly as before if we don't update intake-esm. |
The concomitant changes have been merged into intake-esm (link), so we should be close to ready to go with making this the default behaviour. |
Great 👍 |
Functionality changes in this branch:
source.builders.BaseBuilder.parse_access_ncfile
indexes all variables (for var in ds.data_vars
=>for var in ds.variables
), rather than just data variables, & tests have been updated to reflect additional data obtained.xr.combine_by_coords
.Refactoring
source.builders.BaseBuilder
to use a dataclass (source.utils._AccessNCFileInfo
) to pass around information about variables, rather than an outputs tuple._DataVarInfo
dataclass to hold attribute information from data variables, rather than appending them all into separate lists passed to the outputs tuple.Minor changes
Notes
intake-esm
The code changes in this branch should lead to catalogues being built with coordinate variables included in their variable flag: eg
will build a catalog which coordinate variables are searchable from. However, unless you're working with a version of intake-esm which also has coordinate variables searchable (alterations noted in Functionality changes), this will present two issues:
ocean_grid.nc
files, will not be indexed..to_dask()
will fail on coordinate indexing if a multi file dataset is obtained from the search, or return an empty dataset if a single file is returned.Issue 1. appears to be ESMDataSource._open_dataset thinking the .nc files are empty unless they contain coordinate variables & ignoring them.
Issue 2. is the result of the requested_variables used in intake-esm.
Neither of these is testable or fixable in access-nri-intake-catalog, as these issues are handled by intake-esm.
PR size
This PR contains a couple of code changes, as well as some other stuff (typing/pre-commit, etc). I'm happy to split this apart into separate pull requests if that makes things more tractable to review.