detect instruments on import of pld1 files #14
Conversation
I think we can break things and rewrite the downstream processing, but what does this end up producing? Do you have an example netcdf you can share? We may want to have a quick zoom at some point to plan out what this should look like, i.e. how do we get from random sensors to science data with units etc, flexibly, reproducibly, and with minimal pain for the "end user".
I think the netcdf format you've made is very good. I'd like to keep that, but make the code flexible to handle other payloads. At the moment I envisage an end user supplying the raw data and some kind of metadata file, like your yaml, that specifies deployment info, sensor ids etc. pyglider should then take these and produce L0 netcdf files ready for ingestion by DACs and distribution through something like ERDDAP. Does that make sense? A zoom call would be great. It would help me get oriented on the project before I start breaking things! Are there any other interested parties? I've just started at VOTO, where I have responsibility for processing a wide array of SeaExplorer payloads. I've asked my manager what data I can share outside the organisation; we have lots of different sensor combos to process.
This commit makes the netcdf merge sensor-agnostic. One more step along the way. Currently testing this against the original dataset, and another SeaExplorer with legato, flntu, arod and ad2cp sensors.
This commit makes the timeseries function sensor-agnostic. It depends on the glider having either a GPCTD or LEGATO sensor for temp/conductivity etc. If other CT sails are available for the SeaExplorer, this will need to be updated.
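A minimal sketch of what that CT dispatch might look like (the helper name is hypothetical; only GPCTD and LEGATO are recognised, as described above):

```python
def find_ct_sensor(sensor_names):
    """Return the name of the CT sail present in this payload.

    Hypothetical helper: only GPCTD and LEGATO are recognised for now;
    other SeaExplorer CT sails would need to be added here.
    """
    for candidate in ('GPCTD', 'LEGATO'):
        if candidate in sensor_names:
            return candidate
    raise ValueError('no supported CT sensor (GPCTD or LEGATO) found')
```

Any additional CT sail would only need a new entry in the candidate tuple.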
With these changes, I was able to process a dataset from a SeaExplorer with a completely different set of sensors. It requires changes to the deployment yaml, and the plotting is still hardcoded, but I think this is enough changes for one PR.
This is awesome. I'll try and test in the next day or two on some of my raw data. I'd definitely be interested in a zoom to chat about directions, as I have increasing desire to have both the SX and slocum capabilities.
Great! Let me know if you hit issues. To process all the way through, you'll need to add your sensors to the deployment yaml. Without those it should at least make the whole-mission per-sensor netcdfs though. I can send you the sample dataset I've been working with on Monday. I'm free pretty much all next week for a zoom call. I'm on UK time, but happy to schedule it for my evening if that's easier for you two.
Here are the raw and processed files from the SX dataset I've been working on. Pretty happy with it, but there's something weird going on with salinity https://callumrollo.com/nextcloud/index.php/s/e5xMFcFc42sLsn9
@callumrollo If possible, please add the examples to the GitHub repo for testing?
Sure thing, shall I duplicate the structure of example-seaexplorer? Something like:
I think that would be fine, or feel free to rename the original one as well, perhaps by serial number or something that indicates the sensor suite:
Files added over in #18, feel free to pull that branch and try it out. It will require the changes made in this PR to run though.
Expected at least one of those checks to fail tbh! I think this is ready for review now. It's a bit of a monster, but you should be able to just pull this branch, run the script /example-seaexplorer-legato-flntu-arod-ad2cp/process_deploymentRealTime.py and check the output netcdfs/figures make sense. @richardsc or @jklymak would you mind reviewing this?
example-seaexplorer-legato-flntu-arod-ad2cp/process_deploymentRealTime.py
Looks great. Maybe we should go back, though, and simplify the logic, because the separate sensor streams were a hack anyway, and not integral to the processing. Sorry, should have noticed that earlier, but I haven't looked at this code for a while...
example-seaexplorer-legato-flntu-arod-ad2cp/process_deploymentRealTime.py
pldGPCTD = pldGPCTD.drop_vars(arod)
if pldGPCTD.indexes["time"].size > min_samples_in_file:
    pldGPCTD.to_netcdf(fnout[:-3] + '_gpctd.nc', 'w',
# Detect instruments in dataset
Ah, OK. This splitting was originally because I had trouble concatenating data sets in xarray. It turns out that there was some sort of locking problem. This was a dumb attempt to not have such monolithic netcdf files, which helped somewhat. However, with the lock fixed, I'm not sure it's needed any more.
I wonder if we can get rid of this hack, and just process without the prefix at all? That greatly simplifies all this code and we don't need to assume anything about the carrier instrument for each sensor.
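For illustration, merging sensor streams directly in xarray rather than via prefixed intermediate files might look like the following (toy data, and it assumes the streams share a time coordinate, which real pld1 payloads may not):

```python
import numpy as np
import xarray as xr

time = np.arange(5)
# toy stand-ins for two sensor streams on a common time base
ctd = xr.Dataset({'GPCTD_TEMPERATURE': ('time', np.linspace(10.0, 12.0, 5))},
                 coords={'time': time})
oxy = xr.Dataset({'AROD_FT_DO': ('time', np.linspace(200.0, 210.0, 5))},
                 coords={'time': time})

# one merged dataset, no per-sensor intermediate files needed
merged = xr.merge([ctd, oxy])
```

The interesting engineering question is what `xr.merge` should do when the time bases differ per sensor, which is where the original splitting hack came from.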
That sounds like a good simplification. I think a new PR should be opened for it though, this one is already pretty big
Sure, we can do it that way as well if you prefer...
    os.remove(outpld[:-5] + f'_{sensor}.nc')
except:
    pass
with xr.open_mfdataset(f'{indir}*pld*.{kind}.*{sensor}.nc', decode_times=False, parallel=False, lock=False,
It was this lock=False which was the magic incantation to allow robust merging of files. I think there is some mode where the merge can fail, but I'd rather it fail than the whole process lock up, which is what it was doing before.
pyglider/seaexplorer.py
with xr.open_mfdataset(f'{indir}*pld*.{kind}.*{sensor}.nc', decode_times=False, parallel=False, lock=False,
                       preprocess=_sort) as pld:
    print(pld)
    if sensor == 'AROD':
Ideally coarsening like this (which just made the file far smaller) would be a sensor yaml parameter?
That would be better, yeah. Maybe we should start a discussion on the format of the yaml? It will be a key control for automated processing and metadata inclusion.
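One possible shape for such a per-sensor entry in the deployment yaml (field names here are just a suggestion, not an existing pyglider schema):

```yaml
# hypothetical sketch of per-sensor import options in the deployment yaml
sensors:
  AROD:
    coarsen: 8            # average every 8 samples on import
    coarsen_boundary: trim
  LEGATO:
    coarsen: 1            # keep at native resolution
```

The import code would then look the coarsening factor up per sensor instead of hardcoding the AROD special case.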
pyglider/seaexplorer.py
if sensor == 'AROD':
    # this is the only one that's altered in the original code
    pld = pld.coarsen(time=8, boundary='trim').mean()
print('Writing')
Probably want to use logging here. Of course that was just my old print statement ;-)
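Swapping the print for the logging module could look like this (the helper and logger name are illustrative, not existing pyglider code):

```python
import logging

_log = logging.getLogger(__name__)

def report_write(sensor, fname):
    # hypothetical helper: emit progress through logging so the caller
    # controls verbosity, instead of an unconditional print('Writing')
    msg = f'writing {sensor} data to {fname}'
    _log.info(msg)
    return msg
```

Callers (or CI) can then silence or redirect the messages with `logging.basicConfig(level=...)` rather than editing the library.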
# Loop through the sensor netcdfs
sensor_ncs = glob.glob(f'{indir}*{kind}p_*.nc')
sensors = {}
for sensor_path in sensor_ncs:
as above, I think we can get rid of these separate netcdfs altogether.
Looks like some fairly major refactoring is in order. I think we should get PR #24 merged first. That way we'll have test data to ensure that changes to the intermediate steps aren't affecting the final netcdfs. I'd like to keep this PR focused on adding flexibility for other seaexplorer payloads, with the extra example added to test this. Refactoring of the way sensor data are handled will be a separate PR.
OK, well let's get this in as-is, and merge #24. Then you can maybe mess with the refactor, which we can give a more thorough code review. This didn't erase my hard drive or install a TOR server ;-)
Feel free to merge when CI is done!
Thanks for the review! Going to add logging to the sensor section you highlighted, then merge in. I'll save adding the bitcoin miner until we have some more users ;)
The new tests detected that the oxygen sensor coarsening section had not been hit, so the data saved to the netcdf changed. Useful proof of concept for it. Hopefully fixed now. Will merge once the CI is green.
First step to address Issue #3
This will be a pretty major rewrite. This commit improves seaexplorer.raw_to_rawnc so that it will detect the sensors. It assumes variables with a common prefix like FLNTU_ all belong to the same sensor. I think this is reasonable. The only major deviation from the hardcoded version is that it treats NAV_ as just another sensor. Is this desired behaviour? I'm not 100% clear on how NAV is treated in the current code.
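As a sketch, the prefix-based grouping described above amounts to something like the following (variable names are made up for illustration):

```python
from collections import defaultdict

def group_by_sensor(varnames):
    # assume everything before the first underscore identifies the sensor,
    # e.g. FLNTU_CHL_COUNT and FLNTU_TURB_COUNT both belong to FLNTU;
    # NAV_ is treated as just another sensor
    sensors = defaultdict(list)
    for name in varnames:
        sensors[name.split('_', 1)[0]].append(name)
    return dict(sensors)
```

A payload with FLNTU, NAV and AROD variables would then yield three groups without any hardcoded sensor list.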
This is gonna majorly fail the CI, as it only handles the nc import stage; I'll add more commits to fix the downstream processing. This will be a bit of a monster PR, but I'll try to keep it focused.
@jklymak should I add some tests? I'm not sure what's the best way to make sure this doesn't break pre-existing capability
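One option for such tests: keep a small reference output in the repo and compare the processed variables against it. A hedged sketch of the comparison helper (the function and its dict-of-arrays interface are hypothetical, e.g. arrays pulled from the output and reference netcdfs):

```python
import numpy as np

def assert_datasets_close(result, reference, tol=1e-8):
    # hypothetical regression check: both arguments are dicts of
    # variable name -> numpy array; fail loudly if any variable
    # appears, disappears, or drifts numerically
    assert set(result) == set(reference), 'variable sets differ'
    for name in sorted(result):
        np.testing.assert_allclose(result[name], reference[name], atol=tol)
```

This style of test is what later caught the un-exercised oxygen coarsening branch: any change to intermediate processing that alters the final numbers fails immediately.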