
detect instruments on import of pld1 files #14

Merged
merged 15 commits into from
Oct 27, 2021

Conversation

callumrollo
Collaborator

First step to address Issue #3

This will be a pretty major rewrite. This commit improves seaexplorer.raw_to_rawnc so that it will detect the sensors. It assumes variables with a common prefix like FLNTU_ all belong to the same sensor. I think this is reasonable.
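The prefix-grouping idea might look roughly like this (a minimal sketch; the function and the example variable names are made up for illustration, not taken from pyglider or from real pld1 files):

```python
# Sketch of prefix-based sensor detection. The column names below are
# illustrative examples, not actual pld1 variable names.
def detect_sensors(columns):
    """Group variable names by the prefix before the first underscore."""
    sensors = {}
    for col in columns:
        prefix = col.split('_')[0]
        sensors.setdefault(prefix, []).append(col)
    return sensors

cols = ['NAV_LATITUDE', 'NAV_LONGITUDE', 'FLNTU_CHL_COUNT', 'FLNTU_TURB_COUNT']
print(detect_sensors(cols))
# {'NAV': ['NAV_LATITUDE', 'NAV_LONGITUDE'], 'FLNTU': ['FLNTU_CHL_COUNT', 'FLNTU_TURB_COUNT']}
```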

The only major deviation from the hardcoded version is that it treats NAV_ as just another sensor. Is this desired behaviour? I'm not 100% clear on how NAV is treated in the current code.

This is gonna majorly fail the CI, as it only handles the nc import stage; I'll add more commits to fix the downstream processing. This will be a bit of a monster PR but I'll try to keep it focused.

@jklymak should I add some tests? I'm not sure of the best way to make sure this doesn't break pre-existing capability.

@jklymak
Member

jklymak commented Oct 22, 2021

I think we can break things and rewrite the downstream processing, but what does this end up producing? Do you have an example netcdf you can share?

We may want to have a quick zoom at some point to plan out what this should look like. i.e. how do we get from random sensors to science data with units etc, flexibly, reproducibly, and with minimal pain for the "end user".

@callumrollo
Collaborator Author

I think the netcdf format you've made is very good. I'd like to keep that, but make the code flexible enough to handle other payloads. At the moment I envisage an end user supplying the raw data and some kind of metadata file, like your yaml, that specifies deployment info, sensor IDs etc. pyglider should then take these and produce L0 netcdf files ready for ingestion by DACs and distribution through something like ERDDAP. Does that make sense?

A zoom call would be great. Would help me get oriented on the project before I start breaking things! Are there any other interested parties?

I've just started at VOTO where I have responsibility for processing a wide array of SeaExplorer payloads. I've asked my manager what data I can share outside the organisation, we have lots of different sensor combos to process.

@callumrollo
Collaborator Author

This commit makes the netcdf merge sensor-agnostic. One more step along the way.

Currently testing this against the original dataset, and another SeaExplorer with legato, flntu, arod and ad2cp sensors.

@callumrollo callumrollo marked this pull request as draft October 22, 2021 14:58
@callumrollo
Collaborator Author

This commit makes the timeseries function sensor-agnostic. It depends on the glider having either a GPCTD or LEGATO sensor for temp/conductivity etc. If other CT sails become available for the SeaExplorer, this will need to be updated.
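That CT-sail check might be sketched like this (hypothetical helper; the sensor names are the two mentioned above, and the function is not part of the actual commit):

```python
# Hypothetical helper: pick the CT sensor from the detected sensor names.
# Only GPCTD and LEGATO are recognised, as discussed above.
def find_ct_sensor(sensors):
    for name in ('GPCTD', 'LEGATO'):
        if name in sensors:
            return name
    raise ValueError('No supported CT sensor (GPCTD or LEGATO) found')

print(find_ct_sensor(['NAV', 'LEGATO', 'FLNTU']))  # LEGATO
```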

@callumrollo
Collaborator Author

With these changes, I was able to process a dataset from a SeaExplorer with a completely different set of sensors. It requires changes to the deployment yaml, and the plotting is still hardcoded, but I think this is enough changes for one PR.

@callumrollo callumrollo marked this pull request as ready for review October 22, 2021 16:42
@richardsc
Collaborator

This is awesome. I'll try and test in the next day or two on some of my raw data.

I'd definitely be interested in a zoom to chat about directions, as I have increasing desire to have both the SX and slocum capabilities.

@callumrollo
Collaborator Author

Great! Let me know if you hit issues. To process all the way through, you'll need to add your sensors to the deployment yaml. Without those it should at least make the whole-mission per-sensor netcdfs though. I can send you the sample dataset I've been working with on Monday.

I'm free pretty much all next week for a zoom call. I'm on UK time, but happy to schedule it for my evening if that's easier for you two.

@callumrollo
Collaborator Author

Here are the raw and processed files from the SX dataset I've been working on. Pretty happy with it, but there's something weird going on with salinity https://callumrollo.com/nextcloud/index.php/s/e5xMFcFc42sLsn9

@jklymak
Member

jklymak commented Oct 25, 2021

@callumrollo If possible, please add the examples to the GitHub repo for testing?

@callumrollo
Collaborator Author

Sure thing, shall I duplicate the structure of example-seaexplorer? Something like:
pyglider
-- example-seaexplorer
-- example-voto-seaexplorer
------deploymentRealtime.yml
------realtime_rawnc
...

@jklymak
Member

jklymak commented Oct 25, 2021

I think that would be fine, or feel free to rename the original one as well, perhaps by serial number or something that indicates the sensor suite: seaexplorer-gpctd-anderaa-ecopuck isn't too bad.

@callumrollo callumrollo mentioned this pull request Oct 25, 2021
@callumrollo
Collaborator Author

Files added over in #18, feel free to pull that branch and try it out. It will require the changes made in this PR to run, though.

@callumrollo
Collaborator Author

Expected at least one of those checks to fail tbh!

I think this is ready for review now. It's a bit of a monster, but you should be able to just pull this branch, run the script /example-seaexplorer-legato-flntu-arod-ad2cp/process_deploymentRealTime.py and check the output netcdfs/figures make sense. @richardsc or @jklymak would you mind reviewing this?

@callumrollo callumrollo mentioned this pull request Oct 25, 2021
Member

@jklymak left a comment


Looks great. Maybe we should go back through and simplify the logic, because the separate sensor streams were a hack anyway, and not integral to the processing. Sorry, I should have noticed that earlier, but I haven't looked at this code for a while...

pldGPCTD = pldGPCTD.drop_vars(arod)
if pldGPCTD.indexes["time"].size > min_samples_in_file:
pldGPCTD.to_netcdf(fnout[:-3] + '_gpctd.nc', 'w',
# Detect instruments in dataset
Member


Ah, OK. This splitting was originally because I had trouble concatenating data sets in xarray. It turns out that there was some sort of locking problem. This was a dumb attempt to not have such monolithic netcdf files, which helped somewhat. However, with the lock fixed, I'm not sure it's needed any more.

I wonder if we can get rid of this hack, and just process without the prefix at all? That greatly simplifies all this code and we don't need to assume anything about the carrier instrument for each sensor.

Collaborator Author


That sounds like a good simplification. I think a new PR should be opened for it though; this one is already pretty big.

Member


Sure we can do it that way as well if you prefer...

os.remove(outpld[:-5] + f'_{sensor}.nc')
except:
pass
with xr.open_mfdataset(f'{indir}*pld*.{kind}.*{sensor}.nc', decode_times=False, parallel=False, lock=False,
Member


It was this lock=False which was the magic incantation to allow robust merging of files. I think there is some mode where the merge can fail, but I'd rather it fail than the whole process lock up, which is what it was doing before.

with xr.open_mfdataset(f'{indir}*pld*.{kind}.*{sensor}.nc', decode_times=False, parallel=False, lock=False,
preprocess=_sort) as pld:
print(pld)
if sensor == 'AROD':
Member


Ideally coarsening like this (which just made the file far smaller) would be a sensor yaml parameter?

Collaborator Author


That would be better, yeah. Maybe we should start a discussion on the format of the yaml? It will be a key control for automated processing and metadata inclusion.
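A config-driven version of that coarsening might look roughly like this (a sketch only; the `coarsen` key and the `sensor_config` dict are hypothetical, not part of the existing deployment yaml):

```python
import numpy as np
import xarray as xr

# Hypothetical per-sensor settings that could one day live in the
# deployment yaml; the 'coarsen' key is an assumption for illustration.
sensor_config = {'AROD': {'coarsen': 8}}

def maybe_coarsen(pld, sensor):
    """Coarsen a sensor dataset along time if the config asks for it."""
    factor = sensor_config.get(sensor, {}).get('coarsen', 1)
    if factor > 1:
        pld = pld.coarsen(time=factor, boundary='trim').mean()
    return pld

# 16 samples coarsened by a factor of 8 -> 2 samples
pld = xr.Dataset({'oxygen': ('time', np.arange(16.0))},
                 coords={'time': np.arange(16.0)})
out = maybe_coarsen(pld, 'AROD')
print(out.oxygen.values)  # [ 3.5 11.5]
```

A sensor with no `coarsen` entry would pass through untouched, which keeps the default behaviour for every other instrument.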

if sensor == 'AROD':
# this is the only one that's altered in the original code
pld = pld.coarsen(time=8, boundary='trim').mean()
print('Writing')
Member


Probably want to use logging here. Of course that was just my old print statement ;-)
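The swap is small; a minimal sketch (the logger name and the string-buffer handler are just for this self-contained example):

```python
import io
import logging

# Send log records to a string buffer so the example is self-checking;
# in the real module this would just be standard logging configuration.
buf = io.StringIO()
logger = logging.getLogger('seaexplorer_example')
logger.addHandler(logging.StreamHandler(buf))
logger.setLevel(logging.INFO)

# instead of print('Writing'):
logger.info('Writing %s', 'outfile.nc')
print(buf.getvalue().strip())  # Writing outfile.nc
```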

# Loop through the sensor netcdfs
sensor_ncs = glob.glob(f'{indir}*{kind}p_*.nc')
sensors = {}
for sensor_path in sensor_ncs:
Member


as above, I think we can get rid of these separate netcdfs altogether.

@callumrollo
Collaborator Author

Looks like some fairly major refactoring is in order. I think we should get PR #24 merged first. That way we'll have test data to ensure that changes to the intermediate steps aren't affecting the final netcdfs.

I'd like to keep this PR focused on adding flexibility for other seaexplorer payloads, with the extra example added to test this. Refactoring of the way sensor data are handled will be a separate PR.

@jklymak
Member

jklymak commented Oct 27, 2021

OK, well let's get this in as-is, and merge #24. Then you can maybe mess with the refactor, which we can give a more thorough code review. This didn't erase my hard drive or install a TOR server ;-)

Member

@jklymak left a comment


Feel free to merge when CI is done!

@callumrollo
Collaborator Author

Thanks for the review! Going to add logging to the sensor section you highlighted then merge in. I'll save adding the bitcoin miner until we have some more users ;)

@callumrollo
Collaborator Author

The new tests detected that the oxygen sensor coarsening section had not been hit, so the data saved to the netcdf changed. Useful proof of concept for it. Hopefully fixed now. Will merge once the CI is green.

@callumrollo callumrollo merged commit f359bbe into c-proof:main Oct 27, 2021
@callumrollo callumrollo deleted the callum-patch-5 branch November 9, 2021 08:34