More time-efficient order in timeslice file creation #823

JurgenZach-NOAA · 2024-08-08T15:56:11Z

Changed the order of timeslice file import and creation of dataframes. Previously, a pandas dataframe was created for each timeslice, and then the dataframes were concatenated into a combined dataframe for 15-minute resampling. Now, only numpy arrays are imported, which are then combined into one large dataframe.
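The difference between the two orders can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `read_timeslice_arrays` is a hypothetical stand-in for a per-file reader, the data are random, and the sketch assumes every timeslice lists the same stations in the same order, which real timeslice files need not do.

```python
import numpy as np
import pandas as pd


def read_timeslice_arrays(n_gages, seed):
    """Hypothetical reader: returns plain numpy arrays for one timeslice."""
    rng = np.random.default_rng(seed)
    station_ids = np.array([f"gage{i:02d}" for i in range(n_gages)])
    discharge = rng.random(n_gages)
    return station_ids, discharge


timestamps = pd.date_range("2024-01-01", periods=4, freq="5min")
slices = [read_timeslice_arrays(3, seed) for seed in range(len(timestamps))]

# Previous order: one small dataframe per timeslice, then concatenate them.
per_slice = [
    pd.DataFrame({ts: q}, index=ids) for ts, (ids, q) in zip(timestamps, slices)
]
df_concat = pd.concat(per_slice, axis=1)

# New order: combine the numpy arrays first, then build a single dataframe.
# Assumption: all timeslices share the station list of the first one.
station_ids = slices[0][0]
values = np.column_stack([q for _, q in slices])
df_direct = pd.DataFrame(values, index=station_ids, columns=timestamps)
```

Under that assumption both routes hold the same values; the second avoids building and aligning many intermediate dataframes.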

Additions

Removals

Changes

For reading individual timeslices, added a routine _read_timeslice_file_numpy in nhd_io that returns only numpy arrays
Rewrote get_obs_from_timeslices: the observation_df is now created in one step

Testing

Run any example that reads USGS timeslice files.
To benchmark the run time, the pyinstrument profiler is suggested; it creates a report with 1-millisecond time resolution:

from pyinstrument import Profiler

profiler = Profiler()
profiler.start()

# [CODE YOU WANT TO BENCHMARK]

profiler.stop()
profiler.print()

Screenshots

For the Lower Colorado example [test_AnA.yaml], the benchmark for get_obs_from_timeslices shows a 2-fold speedup:

[Screenshots: profiler reports BEFORE and AFTER the change]

Notes

Todos

Consider deleting the "quality" dataframe, which should yield a further speedup

Checklist

Testing checklist

Target Environment support

Windows
Linux
Browser

Accessibility

Keyboard friendly
Screen reader friendly

Other

AminTorabi-NOAA · 2024-08-15T13:47:06Z

I tested it on an example I had on vpu-17, and this line fails: timeslice_obs_df = pd.concat(dfList, axis = 1)
It raises the error *** pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
This is probably because the dataframes being concatenated have non-unique index values. Also, when I checked, dfList[0] differed in length from dfList[1] and the others.
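
A quick way to check for this condition might look like the sketch below. The two frames are made-up stand-ins for entries of dfList (the gage IDs and values are invented), and keeping the first row for each repeated index label is only one possible workaround, not necessarily the right fix for the timeslice data.

```python
import pandas as pd

# Hypothetical stand-ins for two entries of dfList; the first repeats "g1".
df_a = pd.DataFrame({"2024-01-01 00:00": [1.0, 1.5, 2.0]}, index=["g1", "g1", "g2"])
df_b = pd.DataFrame({"2024-01-01 00:15": [3.0, 4.0]}, index=["g1", "g2"])
df_list = [df_a, df_b]

# Report position, length, and index uniqueness for each frame.
report = [(i, len(df), df.index.is_unique) for i, df in enumerate(df_list)]

# Possible workaround (an assumption): keep the first row per index label,
# so that every frame has a unique index before the column-wise concat.
deduped = [df[~df.index.duplicated(keep="first")] for df in df_list]
combined = pd.concat(deduped, axis=1)
```

Whether dropping duplicates is acceptable depends on why a station appears more than once in a timeslice; averaging, or keeping the row with the best quality flag, may be more appropriate.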

JurgenZach-NOAA · 2024-10-01T00:00:48Z

Archived. End of Project.

More time-efficient order in timeslice file creation

e08c8cd

JurgenZach-NOAA requested review from AminTorabi-NOAA, shorvath-noaa and kumdonoaa August 8, 2024 15:56

JurgenZach-NOAA self-assigned this Aug 8, 2024

Changed cpu_pool back to default

62594c8

JurgenZach-NOAA marked this pull request as draft August 9, 2024 14:58

Fixed situation with empty dataframes

4d0af2b

JurgenZach-NOAA marked this pull request as ready for review August 12, 2024 14:58

JurgenZach-NOAA marked this pull request as draft August 20, 2024 15:22

JurgenZach-NOAA closed this Oct 1, 2024
