Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making coordinate variables searchable #212

Merged
merged 20 commits into from
Oct 18, 2024
Merged

Conversation

charles-turner-1
Copy link
Collaborator

@charles-turner-1 charles-turner-1 commented Oct 3, 2024

Functionality changes in this branch:

  • source.builders.BaseBuilder.parse_access_ncfile indexes all variables (for var in ds.data_vars => for var in ds.variables), rather than just data variables, & tests have been updated to reflect additional data obtained.
    • NOTE: This is dependent on changes to intake-esm proposed by @dougiesquire in isssue 660.
    • Coordinate variables are accessible through the variable flag in the access-nri intake catalogue, rather than a separate coord flag, as suggested in Making coordinate variables searchable #201 by @anton-seaice.
      • When searching for coordinate variables only, only coordinate variables are returned. No concatenation of assets is performed - only the first asset in the dataset is. This avoids crashing resulting from xr.combine_by_coords.
      • When searching for coordinate & data_variables, the data & coordinate variables searched, as well as coordinate variables that the searched data variables depend on.
      • Other searches will return all data & coordinate variables, as before.

Refactoring

  • Updated source.builders.BaseBuilder to use a dataclass (source.utils._AccessNCFileInfo) to pass around information about variables, rather than an outputs tuple.
  • Added a _DataVarInfo dataclass to hold attribute information from data variables, rather than appending them all into separate lists passed to the outputs tuple.

Minor changes

  • Add a bunch of type checking (type annotations, added mypy to pre-commit, etc)
  • Replaced clauses of the pattern
    try:
        x = x[key]
    except:
       pass
    with
    x = x.get(key,x)
    
  • A couple of minor docstring updates

Notes

intake-esm

The code changes in this branch should lead to catalogues being built with coordinate variables included in their variable flag: eg

from access_nri_intake.source.builders import AccessOm2Builder

builder = AccessOm2Builder(
    path="/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091"
)

builder.build()

CATALOG_PATH = ...

builder.save(
    directory='CATALOG_PATH',
    name="ACCESS-OM2_01deg_jra55v13_ryf9091", 
    description="...",
)

will build a catalog which coordinate variables are searchable from. However, unless you're working with a version of intake-esm which also has coordinate variables searchable (alterations noted in Functionality changes), this will present two issues:

  1. Files comprising only coordinate variables, eg. ocean_grid.nc files, will not be indexed.
  2. Searchs purely for a coordinate variable will fail when calling .to_dask() will fail on coordinate indexing if a multi file dataset is obtained from the search, or return an empty dataset if a single file is returned.

Issue 1. appears to be ESMDataSource._open_dataset thinking the .nc files are empty unless they contain coordinate variables & ignoring them.
Issue 2. is the result of the requested_variables used in intake-esm.

Neither of these is testable or fixable in access-nri-intake-catalog, as these issues are handled by intake-esm.

PR size

This PR contains a couple of code changes, as well as some other stuff (typing/pre-commit, etc). I'm happy to split this apart into separate pull requests if that makes things more tractable to review.

* Replaced a couple of try/excepts with .get in `src/access_nri_intake/catalog/translators.py`

* Updated a misleading docstring
likely to be necessary for passing around coordinates as well as
data variables as we begin to make coordinates indexable - I think
we'll begin to get confused about what belongs where.
   dataclasses that are fed into a dataclass holding all the
   coordinate and data variables from a netCDF file.
- Updated tests so that they are all passing: tests now expect
  to find coordinate variables from the netCDF files as well as
  data variables.
- Some minor changes to make code more readable - changed long
  tuples to dataclasses & dictionaries where possible.
…due to use of '|' type union syntax - might be worth consideration?

- Removed a couple of unused imports, cleaned up some comments
- Updated tests to better respect moving coordinate variables back into variables
- Moved mypy setup stuff into to 'pre-commit-config.yaml'
@charles-turner-1 charles-turner-1 linked an issue Oct 3, 2024 that may be closed by this pull request
Copy link

codecov bot commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.17%. Comparing base (017cd2d) to head (53d679c).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #212   +/-   ##
=======================================
  Coverage   97.17%   97.17%           
=======================================
  Files           9        9           
  Lines         672      672           
=======================================
  Hits          653      653           
  Misses         19       19           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@charles-turner-1
Copy link
Collaborator Author

I've been through and linked a bunch of issues that I think the changes here resolve - about to verify.

Copy link
Collaborator

@marc-white marc-white left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but this PR doesn't actually do anything to read the coordinate variables itself (compared to the read in the main version of the code currently does) - it's simply been prepped to handle the coordinates once intake-esm gets updated?

for var in ds.data_vars:
dvars = _DataVarInfo()

for var in ds.variables:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New behaviour - scans all variables in dataset, including coordinate variables

@charles-turner-1
Copy link
Collaborator Author

Not precisely - it does additionally scan .nc files for coordinates (see above), but crucially also requires the esm_datastore object to do the same. If we have the same behaviour in the esm_datastore (defined in intake-esm) there this PR won't affect the variables we read.

@charles-turner-1
Copy link
Collaborator Author

charles-turner-1 commented Oct 11, 2024

I've checked, and the changes in this PR don't resolve #117 or #192.

In both instances, the issues specifically result from attempted concatenation of grid variables, rather than coordinate variables. Grid variables are stored as data variables in the netCDF files that hold them, but are don't depend on time & so the time variable is dropped before concatenation, which is the source of the 'no dimension to concatenate along' errors.

I've changed this PR & the concomitant PR on intake-esm to draft status whilst I work out how closely related these issues are & whether we can neatly tie together all the changes we need.

Edit: The obvious resolution to the grid variable problems mentioned above would be to retain all dimensions until the final dataset is produced - currently looking into that.

Requesting time as a coordinate variable along with the desired grid also resolves the problems.

@charles-turner-1 charles-turner-1 marked this pull request as draft October 11, 2024 08:15
Append attributes to the DataVarInfo object, if the attribute has a
'long_name' key.

TODO: Why do we need a long name key? seems important
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the code using the presence of 'long_name' as a proxy for "this is a real variable, not just additional info"?

@charles-turner-1 charles-turner-1 changed the base branch from main to 186-return-type October 15, 2024 03:22
@charles-turner-1 charles-turner-1 marked this pull request as ready for review October 15, 2024 03:26
@charles-turner-1 charles-turner-1 marked this pull request as draft October 15, 2024 03:27
@charles-turner-1 charles-turner-1 force-pushed the 660-coordinate-variables branch from ac97823 to 52c7cb0 Compare October 15, 2024 03:32
@charles-turner-1 charles-turner-1 marked this pull request as ready for review October 15, 2024 03:35
@charles-turner-1
Copy link
Collaborator Author

I've split this into a series of stacked PR's:
#221: Adds typing, mypy type check to pre-commit. Merge into main, addresses #13.
#222: Refactors BaseBuilder.parse_access_ncfile to return a dataclass, rather than a tuple. Closes #186 and the PR merges into the branch introduced by #221
#212: (This PR) - changes data_vars => variables and updates the relevant tests to ensure we are catching all the correct coordinate variables. PR merges into the branch introduced by #222.

@charles-turner-1 charles-turner-1 linked an issue Oct 15, 2024 that may be closed by this pull request
@charles-turner-1 charles-turner-1 changed the base branch from 186-return-type to main October 17, 2024 23:24
* Replaced a couple of Union[Type, None] with Optional[Type]
Copy link
Collaborator

@marc-white marc-white left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved as ready, although are we in a position to merge before Intake-ESM is updated?

@charles-turner-1
Copy link
Collaborator Author

I'll go back and triple check before I merge, but it should be the case that the catalogue will build exactly as before if we don't update intake-esm.

@charles-turner-1 charles-turner-1 merged commit ab37010 into main Oct 18, 2024
15 checks passed
@charles-turner-1
Copy link
Collaborator Author

The concomitant changes have been merged into intake-esm (link), so we should be close to ready to go with making this the default behaviour.

@charles-turner-1 charles-turner-1 deleted the 660-coordinate-variables branch October 28, 2024 20:26
@rbeucher
Copy link
Member

Great 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants