
Gage Data - USGS and USACE

Preprocessing of USGS and USACE gage data - V3

In troute's V3 mode, gage data preprocessing is handled by the nwm_network_preprocess function, which prepares a network for simulations involving both diffusive and non-diffusive components. The following describes how it ingests and uses USGS and USACE gage data, and how the return variables carry the processed gage data:

Gage Data Integration:

  • Function call: connections, param_df, wbody_conn, gages = nnu.build_connections(supernetwork_parameters)
  • Usage: nnu.build_connections constructs the network connections graph and returns a dictionary, gages, that maps gage IDs to segment IDs.
  • Data Handling: The gages dictionary is converted into a DataFrame, link_gage_df, which maps segment IDs to gage IDs for further use (see the sketch below).
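
A minimal sketch of this conversion, assuming a simple dict of segment-to-gage pairs (the exact structure returned by build_connections may differ):

```python
import pandas as pd

# Minimal sketch (not troute's exact code): assume a simple dict of
# segment-to-gage pairs; the structure returned by build_connections may differ.
gages = {1001: "01010000", 2002: "01010500"}  # hypothetical IDs

# Build a segment-to-gage lookup analogous to link_gage_df.
link_gage_df = pd.DataFrame.from_dict(gages, orient="index", columns=["gages"])
link_gage_df.index.name = "link"
print(link_gage_df)
```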

Breaking Network at Gages:

  • Configuration Check: break_network_at_gages defaults to False and is set from streamflow_da.get('streamflow_nudging', False) when streamflow data assimilation parameters are supplied.
  • Usage: The presence of streamflow data assimilation parameters (streamflow_da) determines whether the network should be broken at gage locations.
  • Data Integration: If break_network_at_gages is True, the function will incorporate gage locations into the network structure, allowing for more refined simulations.
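
A hedged sketch of this configuration check, using the parameter names described above (the nesting of streamflow_da inside the data assimilation parameters is an assumption, and the actual logic in nwm_network_preprocess may differ):

```python
# Hypothetical sketch of the configuration check described above; the nesting
# of streamflow_da inside the data assimilation parameters is an assumption.
data_assimilation_parameters = {
    "streamflow_da": {"streamflow_nudging": True},  # example settings
}

break_network_at_gages = False
streamflow_da = data_assimilation_parameters.get("streamflow_da", False)
if streamflow_da:
    # Break the network at gage locations only when streamflow nudging is on.
    break_network_at_gages = streamflow_da.get("streamflow_nudging", False)
```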

Mapping Gage Data:

  • USGS and USACE Crosswalks:

If data assimilation parameters specify crosswalk files, these are read to obtain mappings of USGS and USACE gage data.

  • Function Call: waterbody_types_df, usgs_lake_gage_crosswalk, usace_lake_gage_crosswalk = nhd_io.read_reservoir_parameter_file(...)
  • Purpose: The crosswalk files facilitate the integration of gage data into the network by providing mappings between reservoir and gage IDs.

Return Variables Relating to USGS and USACE Data

link_gage_df:

  • Data: Contains gage IDs and their corresponding segment IDs.
  • Purpose: Provides a reference for segment-to-gage relationships within the network.

usgs_lake_gage_crosswalk:

  • Data: Mapping of USGS gage IDs to lake IDs.
  • Purpose: Used to cross-reference lake IDs with gage IDs for USGS reservoirs.

usace_lake_gage_crosswalk:

  • Data: Similar to the USGS crosswalk but for USACE gage IDs.
  • Purpose: Provides cross-references for USACE reservoirs.

waterbody_types_df:

  • Data: Contains information about different types of waterbodies, including USGS and USACE types if specified.
  • Purpose: Helps categorize waterbodies and integrate them into the network.
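
For orientation, a hypothetical illustration of a lake-gage crosswalk held as a DataFrame; the index and column names here are assumptions, not troute's exact schema:

```python
import pandas as pd

# Hypothetical lake-gage crosswalk: lake IDs paired with USGS gage IDs.
# The index and column names are illustrative, not troute's exact schema.
usgs_lake_gage_crosswalk = pd.DataFrame(
    {"usgs_gage_id": ["01010000", "01010500"]},
    index=pd.Index([167279, 167280], name="usgs_lake_id"),
)
```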

Preprocessing of USGS and USACE gage data - V4

Class Initializations: NudgingDA and PersistenceDA

In troute's V4 mode, gage data for data assimilation can be supplied either by passing BMI (Basic Model Interface) arrays or from files, as in V3.

USGS data are ingested in DataAssimilation when the NudgingDA class, which is derived from AbstractDA, is initialized. The class handles reading, storing, and updating the datasets used specifically for nudging in data assimilation. Upon initialization, the class sets up several parameters related to data assimilation and prepares member variables to store the data. It determines whether streamflow nudging is enabled and, if so, proceeds to handle USGS data ingestion.

As far as gage data are concerned, both branches of the NudgingDA class initialization (BMI vs. file input) share the task of reading and processing USGS timeslice files (which contain gage observations) and linking these observations to stream segments in the model's network. The resulting dataframe (usgs_df) is used for streamflow nudging and, optionally, for constructing reservoir dataframes. The following parameter sets are relevant for USGS data assimilation (an illustrative configuration sketch follows the list):

  • data_assimilation_parameters (dict): Contains user-defined parameters related to data assimilation, such as the directory for USGS timeslice files and quality control settings.
  • streamflow_da_parameters (dict): Contains streamflow-specific data assimilation parameters, including the gage-segment crosswalk file.
  • run_parameters (dict): Contains configuration settings for running the model, such as the timestep (dt) and multiprocessing options (notably cpu_pool).
  • network (object): Represents the hydrological network, including a DataFrame (link_gage_df) that links stream gages to stream segments.
  • da_run (list): List of USGS timeslice files that are processed during the data assimilation run.
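
Purely as orientation, a hypothetical layout of these parameter sets; key names not mentioned above, all values, and all paths are assumptions, so consult the troute configuration documentation for the actual schema:

```python
# Illustrative parameter sets only; key names not mentioned above, all values,
# and all paths are assumptions.
data_assimilation_parameters = {
    "usgs_timeslices_folder": "/path/to/usgs_timeslices",  # hypothetical path
    "qc_threshold": 1,
}
streamflow_da_parameters = {
    "streamflow_nudging": True,
    "gage_segID_crosswalk_file": "/path/to/crosswalk.nc",  # hypothetical path
}
run_parameters = {
    "dt": 300,       # routing timestep in seconds
    "cpu_pool": 4,   # workers used for parallel timeslice reading
}
da_run = ["usgs_timeslice_file_1.nc", "usgs_timeslice_file_2.nc"]  # hypothetical
```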

After NudgingDA, PersistenceDA is initialized; it processes reservoir outflow data for USGS gages (as applicable) and for USACE gages (all of which are located at larger reservoirs). The class handles the ingestion of time series data related to reservoir operations and outflow persistence, formats the data into pandas dataframes, and prepares it for further reservoir persistence analysis. In addition to the parameter sets used in the NudgingDA setup (data_assimilation_parameters, etc.), the following parameters from reservoir_da_parameters control USGS- and USACE-reservoir assimilation (a gating sketch follows the list):

  • reservoir_persistence_da (dict): Boolean flag for overall reservoir persistence
  • reservoir_persistence_usgs (dict): Boolean flag for USGS reservoir persistence
  • reservoir_persistence_usace (dict): Boolean flag for USACE reservoir persistence
  • lake_gage_crosswalk = network.usace_lake_gage_crosswalk or network.usgs_lake_gage_crosswalk (dict): USGS/USACE lake-gage crosswalk
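
A sketch of how these flags might gate which reservoir dataframes are built; the nesting of the keys and all values are assumptions, not troute's exact configuration structure:

```python
# Hypothetical gating of reservoir persistence DA based on the flags above;
# the nesting of the keys is an assumption.
reservoir_da_parameters = {
    "reservoir_persistence_da": {
        "reservoir_persistence_usgs": True,
        "reservoir_persistence_usace": False,
    }
}

persistence = reservoir_da_parameters.get("reservoir_persistence_da", {})
if persistence.get("reservoir_persistence_usgs"):
    # Build reservoir_usgs_df / reservoir_usgs_param_df for USGS-gaged reservoirs.
    pass
if persistence.get("reservoir_persistence_usace"):
    # Build reservoir_usace_df / reservoir_usace_param_df for USACE reservoirs.
    pass
```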

NudgingDA class and USGS dataframe initialization - BMI

In the NudgingDA initialization, when BMI is used (i.e., not from_files), the BMI array usgs_Array is read along with the corresponding time and station indexing information (datesSecondsArray_usgs and stationStringLengthArray_usgs, respectively). The library bmi_array2df ("a2df") is then used to unflatten the BMI array into _usgs_df, a member of NudgingDA, via:

  • a2df._unflatten_array
  • a2df._time_retrieve_from_arrays
  • a2df._stations_retrieve_from_arrays.
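
The bmi_array2df signatures are not reproduced here. The following self-contained sketch only illustrates the unflattening idea, reshaping a flat value array plus date and station index arrays into a station-by-time DataFrame; all names, shapes, values, and the flattening order are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Illustrative only: 2 stations x 3 times flattened into a 1D BMI-style array.
n_stations, n_times = 2, 3
values_flat = np.array([1.0, 1.2, 1.1, 5.0, 5.3, 5.1])         # flattened observations
dates_seconds = np.array([0, 300, 600])                         # seconds since a reference time
station_bytes = np.frombuffer(b"0101000001010500", dtype="S1")  # ASCII-encoded station IDs
station_lengths = np.array([8, 8])                              # length of each station ID

# Decode station IDs using the per-station lengths.
stations, pos = [], 0
for length in station_lengths:
    stations.append(b"".join(station_bytes[pos:pos + length]).decode("ascii"))
    pos += length

# Unflatten into a station-by-time DataFrame, analogous to _usgs_df.
usgs_df = pd.DataFrame(
    values_flat.reshape(n_stations, n_times),
    index=stations,
    columns=pd.to_datetime(dates_seconds, unit="s", origin="2024-01-01"),
)
```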

NudgingDA class and USGS dataframe initialization - From Files

When data are ingested from files, the helper function _create_usgs_df is called, which creates a DataFrame (usgs_df) containing USGS gage observations. After extracting information such as folder paths from the input parameters, the function get_obs_from_timeslices from the nhd_io library is called; it reads USGS observation data from timeslice files, processes it, and outputs a dataframe of observations linked to the model's network segments or waterbodies. This function plays a crucial role in integrating real-world gage observations into hydrological models for tasks like streamflow data assimilation and model calibration.

Key parameters processed in get_obs_from_timeslices are:

  • crosswalk_df (DataFrame): dataframe containing a crosswalk that maps USGS gage IDs to model destination IDs (e.g., segment IDs or waterbody IDs).
  • timeslice_files: A list of file paths to the USGS timeslice files that contain the observation data.
  • qc_threshold (int): Quality control threshold; observations with quality flags below this value are considered invalid and are removed.
  • interpolation_limit (int): The maximum gap duration (in minutes) over which missing observations can be interpolated.
  • frequency_secs (int): The desired frequency (in seconds) at which observations should be resampled and interpolated.
  • cpu_pool (int): Number of CPU cores to use for parallel processing.

Timeslice files are read in parallel using the Parallel function from the joblib library. For each file in timeslice_files, _read_timeslice_file is called in parallel; it reads an individual timeslice file and returns two dataframes:

  • Observation dataframe: contains the actual gage observations.
  • Quality dataframe: contains quality flags corresponding to each observation.
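
A minimal sketch of this parallel-read pattern with joblib; the reader below is a stand-in for _read_timeslice_file, whose actual NetCDF parsing is not shown:

```python
import pandas as pd
from joblib import Parallel, delayed

def read_timeslice_stub(path):
    # Stand-in for _read_timeslice_file: returns (observations, quality flags),
    # one column per timeslice, indexed by gage ID. The real function parses
    # the NetCDF timeslice file at `path`.
    obs = pd.DataFrame({path: [1.0]}, index=["01010000"])
    qual = pd.DataFrame({path: [100]}, index=["01010000"])
    return obs, qual

timeslice_files = ["sliceA.nc", "sliceB.nc"]  # hypothetical file names
cpu_pool = 2

# Read all timeslice files in parallel, then concatenate the per-file pieces.
results = Parallel(n_jobs=cpu_pool)(
    delayed(read_timeslice_stub)(f) for f in timeslice_files
)
timeslice_obs_df = pd.concat([obs for obs, _ in results], axis=1)
timeslice_qual_df = pd.concat([qual for _, qual in results], axis=1)
```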

After reading, the function checks if all dataframes are empty (in which case it logs a debug message and returns an empty dataframe). If they are not all empty, the Observation and Quality DataFrames from all timeslice files are concatenated into two dataframes:

  • timeslice_obs_df: combined observation data.
  • timeslice_qual_df: combined quality flags.

These observation and quality dataframes are then joined with the crosswalk dataframe on the gage ID (converted to strings), and the result is indexed by the crosswalk destination field, with non-numeric entries (NaN) excluded.

Quality control filtering then masks negative and out-of-range quality flags (setting the affected observations to NaN). The dataframe is subsequently resampled to the requested frequency_secs and interpolated with the interpolate() function, with the gap limit set by interpolation_limit; the interpolation is performed with the dataframe transposed to a time index and transposed back after resampling.
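
A self-contained pandas sketch of the masking, resampling, and interpolation steps described above; the valid quality-flag range, the threshold values, and the data are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Illustrative observations: gage IDs as rows, 15-minute timestamps as columns.
times = pd.date_range("2024-01-01 00:00", periods=3, freq="15min")
obs = pd.DataFrame([[1.0, np.nan, 1.4]], index=["01010000"], columns=times)
qual = pd.DataFrame([[100, -1, 80]], index=["01010000"], columns=times)

frequency_secs, interpolation_limit = 300, 59

# Mask observations whose quality flags are negative or outside an assumed
# valid range of [0, 100].
obs = obs.where((qual >= 0) & (qual <= 100))

# Resample/interpolate along time: transpose to a time index, interpolate with
# a gap limit, then transpose back to the gage-by-time layout.
obs = (
    obs.T.resample(f"{frequency_secs}s")
    .asfreq()
    .interpolate(limit=interpolation_limit, limit_direction="both")
    .T
)
```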

NudgingDA - Resulting USGS data format

The processed usgs_df containing the ingested USGS data has the following contents and format:

  • Index: Gage IDs from the geomodel.
  • Columns: Timestamps at the specified frequency (e.g., every 5 minutes).
  • Data: Interpolated USGS gage observations. In principle, observations with quality flags below qc_threshold have been removed; however, this feature is not implemented at the moment.

An example structure of usgs_df follows:

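The gage IDs, timestamps, and values below are made up for illustration only:

```python
import pandas as pd

# Purely illustrative usgs_df layout: gage IDs as the index, timestamps at the
# resampling frequency (here 5 minutes) as columns, interpolated flows as values.
columns = pd.date_range("2024-01-01 00:00", periods=4, freq="5min")
usgs_df = pd.DataFrame(
    [[1.05, 1.07, 1.10, 1.12],
     [5.30, 5.28, 5.25, 5.21]],
    index=["01010000", "01010500"],
    columns=columns,
)
```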

After each simulation run, the run_results are processed to update the last_obs_df based on new observations in update_after_compute, and usgs_df is updated for the next iteration of the data assimilation loop in update_for_next_loop.

PersistenceDA class and USGS/USACE reservoir initialization - BMI

In the initialization, BMI is used if the from_files flag is set to False, in which case the following BMI array data and index data are read, first for USGS:

  • usgs_reservoir_Array: 1D ndarray of usgs reservoir data
  • datesSecondsArray_reservoir_usgs: dates in seconds relative to dateNull
  • stationArray_reservoir_usgs and stationStringLengthArray_reservoir_usgs: 1D ndarray of the ASCII encoded station array, along with a key indicating the length of each station ID (also as ndarray) for decoding
  • nDates_reservoir_usgs / nStations_reservoir_usgs: dimensions of resulting usgs reservoir dataframe

And then for USACE:

  • usace_reservoir_Array: 1D ndarray of USACE reservoir data
  • datesSecondsArray_reservoir_usace: dates in seconds relative to dateNull
  • stationArray_reservoir_usace and stationStringLengthArray_reservoir_usace: 1D ndarray of the ASCII encoded station array, along with a key indicating the length of each station ID (also as ndarray) for decoding
  • nDates_reservoir_usace / nStations_reservoir_usace: dimensions of resulting usace reservoir dataframe

The imported BMI arrays are processed using the library bmi_array2df ("a2df") to unflatten the BMI arrays into reservoir_usgs_df and reservoir_usace_df:

  • a2df._unflatten_array
  • a2df._time_retrieve_from_arrays
  • a2df._stations_retrieve_from_arrays

Further, the reservoir_usgs_param_df and reservoir_usace_param_df persistence parameter dataframes are created.
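
A hypothetical sketch of initializing such a persistence-parameter dataframe; the column names and index are assumptions for illustration, not necessarily troute's exact schema:

```python
import pandas as pd

# Hypothetical: one row per gaged reservoir, columns for persistence bookkeeping.
# Column names and lake IDs are assumptions for illustration only.
lake_ids = [167279, 167280]
reservoir_usgs_param_df = pd.DataFrame(
    {
        "update_time": 0.0,                       # seconds since model t0 of last update
        "prev_persisted_outflow": float("nan"),   # last persisted outflow value
        "persistence_index": 0,                   # position in the persistence schedule
    },
    index=pd.Index(lake_ids, name="lake_id"),
)
```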

PersistenceDA class and USGS/USACE reservoir initialization - From Files

For USGS data: if usgs_df has already been read in, the following dataframes are created from network, as well as by resampling the usgs_df created in NudgingDA:

  • gage_lake_df: usgs_lake_gage_crosswalk, indexed to usgs gage ID
  • gage_link_df: link_gage_df, reindexed to gages
  • link_lake_df: crosswalk of segment- to lake IDs
  • usgs_df_15min: usgs_df resampled to 15 minutes, which becomes reservoir_usgs_df.

The usgs_df dataframe is then subset and re-indexed to lake IDs, where available, instead of network link IDs, and the dataframe reservoir_usgs_param_df, which will eventually hold the persistence parameters, is initialized. A minimal resampling sketch follows.
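
This sketch shows only the 15-minute resampling step; the data are illustrative, the use of .asfreq() is an assumption about the aggregation, and the re-indexing to lake IDs via the crosswalks is not shown:

```python
import pandas as pd

# Illustrative usgs_df at 5-minute resolution (gage-by-time layout).
times = pd.date_range("2024-01-01 00:00", periods=7, freq="5min")
usgs_df = pd.DataFrame([range(7)], index=["01010000"], columns=times, dtype=float)

# Resample along time to 15 minutes to obtain a reservoir_usgs_df-like frame.
# (.asfreq() is an assumption; the aggregation troute uses may differ.)
usgs_df_15min = usgs_df.T.resample("15min").asfreq().T
```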

In the event that usgs_df does not exist yet, reservoir_usgs_df and reservoir_usgs_param_df are instead built by the function _create_reservoir_df, a wrapper around get_obs_from_timeslices from the nhd_io library. The USGS timeslice list is passed to the latter through da_run as follows: reservoir_usgs_df, reservoir_usgs_param_df = _create_reservoir_df(data_assimilation_parameters, reservoir_da_parameters, streamflow_da_parameters, run_parameters, network, da_run, lake_gage_crosswalk = network.usgs_lake_gage_crosswalk, res_source = 'usgs')

The equivalent USACE reservoir dataframes are read in without first checking whether usgs_df already exists; the call to get_obs_from_timeslices through _create_reservoir_df is analogous.

As in NudgingDA, the update_for_next_loop function is called for each iteration of the data assimilation loop, where reservoir_usgs_param_df and reservoir_usace_param_df are updated.
