More time-efficient order in timeslice file creation #823

Closed
wants to merge 3 commits into from

JurgenZach-NOAA
Contributor

Changed the order of timeslice file import and dataframe creation. Previously, a pandas dataframe was created for each timeslice file, and the dataframes were then concatenated into a combined dataframe for 15-minute resampling. Now each file is read into numpy arrays only, and the arrays are then combined into one large dataframe.

Additions

Removals

Changes

  • For reading individual timeslices, added a routine _read_timeslice_file_numpy in nhd_io that returns only numpy arrays
  • Rewrote get_obs_from_timeslices so that observation_df is created in a single step
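
The idea behind these two changes can be sketched as follows. This is only an illustration under assumptions, not the code in this PR: the helper names (read_timeslice_arrays, build_observation_df), the dictionary records standing in for timeslice files, and the field names stationId and discharge are all hypothetical. It shows per-file numpy arrays being collected and turned into a single dataframe at the end, instead of building one dataframe per file.

```python
import numpy as np
import pandas as pd


def read_timeslice_arrays(record):
    """Hypothetical stand-in for a per-file reader that returns only numpy arrays."""
    station_ids = np.asarray(record["stationId"])
    discharge = np.asarray(record["discharge"], dtype=float)
    return station_ids, discharge


def build_observation_df(records, timestamps):
    """Collect arrays from all timeslices, then build one dataframe in a single step."""
    id_parts, time_parts, q_parts = [], [], []
    for record, ts in zip(records, timestamps):
        ids, q = read_timeslice_arrays(record)
        id_parts.append(ids)
        q_parts.append(q)
        # Repeat the timeslice timestamp once per station in this file.
        time_parts.append(np.full(len(ids), np.datetime64(ts, "ns")))

    # One long table for all files; no per-file dataframes are created.
    long_df = pd.DataFrame(
        {
            "stationId": np.concatenate(id_parts),
            "time": np.concatenate(time_parts),
            "discharge": np.concatenate(q_parts),
        }
    )
    # One row per station, one column per timeslice.
    # aggfunc="first" is an assumption about how to treat repeated station entries.
    return long_df.pivot_table(
        index="stationId", columns="time", values="discharge", aggfunc="first"
    )


records = [
    {"stationId": ["A", "B"], "discharge": [1.0, 2.0]},
    {"stationId": ["B", "C"], "discharge": [3.0, 4.0]},
]
obs = build_observation_df(records, ["2024-01-01T00:00", "2024-01-01T00:15"])
print(obs)
```

Stations missing from a given timeslice end up as NaN in that column, which a later resampling or interpolation step would have to handle.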

Testing

  1. Run any example that reads USGS timeslice files
  2. To benchmark the run time, the pyinstrument profiler is suggested; it creates a report with 1-millisecond time resolution:

from pyinstrument import Profiler
profiler = Profiler()
profiler.start()

# [CODE YOU WANT TO BENCHMARK]

profiler.stop()
profiler.print()

Screenshots

For the Lower Colorado example [test_AnA.yaml], the benchmark for get_obs_from_timeslices shows a 2-fold speedup:

BEFORE:

image

AFTER:

image

Notes

Todos

  • Consider removing the "quality" dataframe, which should result in a further speedup

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Visually tested in supported browsers and devices (see checklist below 👇)
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Target Environment support

  • Windows
  • Linux
  • Browser

Accessibility

  • Keyboard friendly
  • Screen reader friendly

Other

  • Is useable without CSS
  • Is useable without JS
  • Flexible from small to large screens
  • No linting errors or warnings
  • JavaScript tests are passing

@JurgenZach-NOAA JurgenZach-NOAA marked this pull request as draft August 9, 2024 14:58
@JurgenZach-NOAA JurgenZach-NOAA marked this pull request as ready for review August 12, 2024 14:58
@AminTorabi-NOAA
Contributor

I tested it on an example I had on vpu-17, and it fails on this line: timeslice_obs_df = pd.concat(dfList, axis = 1)
It raises pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
This is probably because the dataframes being concatenated have non-unique index values. Also, when I checked, dfList[0] and dfList[1] differed in length from each other and from the others.
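
One possible way to guard against this kind of failure is sketched below. It is an assumption-laden illustration, not a fix taken from this PR: the helper name concat_with_unique_index is hypothetical, and keeping the first row for each repeated index label is an arbitrary choice that may not be right for station data. The sketch checks each dataframe's index for duplicates and drops repeated labels before the column-wise concatenation.

```python
import pandas as pd


def concat_with_unique_index(frames):
    """Drop repeated index labels in each frame, then concatenate column-wise.

    Hypothetical helper: keeping the first occurrence is an assumption.
    """
    cleaned = []
    for frame in frames:
        if not frame.index.is_unique:
            # Boolean mask that is True for the first occurrence of each label.
            frame = frame[~frame.index.duplicated(keep="first")]
        cleaned.append(frame)
    # With unique indexes, pandas can align rows of differing lengths (outer join).
    return pd.concat(cleaned, axis=1)


df_a = pd.DataFrame({"x": [1.0, 9.0, 2.0]}, index=["a", "a", "b"])  # "a" repeated
df_b = pd.DataFrame({"y": [3.0, 4.0]}, index=["b", "c"])

combined = concat_with_unique_index([df_a, df_b])
print(combined)
```

Frames of different lengths are not a problem by themselves once the indexes are unique; labels missing from a frame simply become NaN in that frame's columns.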

@JurgenZach-NOAA JurgenZach-NOAA marked this pull request as draft August 20, 2024 15:22
@JurgenZach-NOAA
Contributor Author

Archived. End of Project.
