Forcings Ingestion

Neptune-Meister edited this page Sep 17, 2024 · 4 revisions

Forcing files - input

Building forcing sets - V3

Build forcing file sets - V3

In V3 mode, forcing input starts by running build_forcing_sets from nhd_network_utilities_v02 to obtain the set of forcing files for each loop (run_sets). The routine builds these sets from user-specified parameters: it first retrieves qlat_forcing_sets, qlat_input_folder, nts (number of time steps), max_loop_size, and dt (time step interval) from forcing_parameters. It then checks qlat_input_folder and raises an error if the folder is not specified or does not exist.
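
The parameter retrieval and folder check described above can be sketched as follows. This is a minimal illustration, not the t-route source: the function name read_forcing_parameters and the exception types are hypothetical, and the fallback of 12 for max_loop_size is borrowed from the V4 description further down.

```python
from pathlib import Path


def read_forcing_parameters(forcing_parameters):
    """Pull the settings named above out of a forcing_parameters dict."""
    run_sets = forcing_parameters.get("qlat_forcing_sets", None)
    qlat_input_folder = forcing_parameters.get("qlat_input_folder", None)
    nts = forcing_parameters.get("nts", None)
    # Assumption: default of 12, as stated for V4 on this page.
    max_loop_size = forcing_parameters.get("max_loop_size", 12)
    dt = forcing_parameters.get("dt", None)

    # Fail early if the folder is missing from the configuration or on disk.
    if qlat_input_folder is None:
        raise KeyError("qlat_input_folder must be specified")
    folder = Path(qlat_input_folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"qlat_input_folder does not exist: {folder}")

    return run_sets, folder, nts, max_loop_size, dt
```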

If run_sets are provided, the function loops through them and appends a final_timestamp to each set. The value is the model_output_valid_time of the last file in the set, read with nhd_io.get_param_str().

If no run_sets are provided, the function builds them. It first determines the time interval between forcing files (dt_qlat) by using get_param_str to read model_output_valid_time from the first and second files in the folder and comparing the two timestamps. From this interval it computes qts_subdivisions, the number of model time steps (dt) contained in one forcing interval, and ensures that dt_qlat is divisible by dt. Next, the total number of files required (nfiles) is calculated from the number of time steps (nts). The function generates a list of timestamp strings (datetime_list_str) for the files and constructs file names with the pattern YYYYMMDDHHMM.CHRTOUT_DOMAIN1. It then checks that each forcing file exists in the folder and raises an error if any file is missing.
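
The timing arithmetic above can be illustrated with a short sketch. The function name plan_forcing_files is hypothetical, the two timestamps are passed in directly instead of being read from files, and rounding nfiles up with a ceiling division is an assumption about how partial intervals are handled.

```python
from datetime import datetime, timedelta


def plan_forcing_files(t_first, t_second, dt, nts, suffix=".CHRTOUT_DOMAIN1"):
    """Derive dt_qlat, qts_subdivisions, nfiles and the expected file names."""
    # Interval between two consecutive forcing files, in seconds.
    dt_qlat = int((t_second - t_first).total_seconds())
    if dt_qlat <= 0 or dt_qlat % dt != 0:
        raise ValueError("dt_qlat must be a positive multiple of dt")

    # Number of model time steps covered by one forcing file.
    qts_subdivisions = dt_qlat // dt

    # Assumption: round up so the files cover all nts time steps.
    nfiles = -(-nts // qts_subdivisions)

    # File names follow the YYYYMMDDHHMM.CHRTOUT_DOMAIN1 pattern.
    file_names = [
        (t_first + timedelta(seconds=i * dt_qlat)).strftime("%Y%m%d%H%M") + suffix
        for i in range(nfiles)
    ]
    return dt_qlat, qts_subdivisions, nfiles, file_names


# Hourly forcing files, 5-minute model time step, one simulated day.
dt_qlat, qts, nfiles, names = plan_forcing_files(
    datetime(2021, 8, 23, 1, 0), datetime(2021, 8, 23, 2, 0), dt=300, nts=288
)
print(dt_qlat, qts, nfiles, names[0])
# prints: 3600 12 24 202108230100.CHRTOUT_DOMAIN1
```

In the real routine each generated name would then be checked against the contents of qlat_input_folder.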

Still in the case where no run_sets were provided, the forcing files are then grouped into sets of up to max_loop_size files each. The function calculates the number of time steps (nts) for each set and accumulates them until the total number of time steps is reached. For each set, it extracts final_timestamp from the last file in the group. Finally, it returns the list of constructed run_sets.
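
A minimal sketch of this grouping step is shown below. The function name group_into_run_sets and the read_final_timestamp callable (a stand-in for reading model_output_valid_time from a file) are hypothetical, the dictionary keys are assumed, and trimming the last set so that the per-set nts values sum to the total is an assumption.

```python
def group_into_run_sets(file_names, qts_subdivisions, nts, max_loop_size,
                        read_final_timestamp):
    """Split file_names into run sets of at most max_loop_size files."""
    run_sets = []
    nts_accum = 0
    for start in range(0, len(file_names), max_loop_size):
        remaining = nts - nts_accum
        if remaining <= 0:
            break  # all requested time steps are already covered
        chunk = file_names[start:start + max_loop_size]
        # Assumption: the last set is trimmed to the remaining time steps.
        nts_chunk = min(len(chunk) * qts_subdivisions, remaining)
        run_sets.append({
            "qlat_files": chunk,
            "nts": nts_chunk,
            "final_timestamp": read_final_timestamp(chunk[-1]),
        })
        nts_accum += nts_chunk
    return run_sets


# 24 hourly files, 12 model steps per file, at most 10 files per loop.
files = [f"file_{i:02d}" for i in range(24)]
sets = group_into_run_sets(files, 12, 288, 10,
                           read_final_timestamp=lambda name: name)
print([(len(s["qlat_files"]), s["nts"]) for s in sets])
# prints: [(10, 120), (10, 120), (4, 48)]
```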

Build forcing dataframe - V3

The run_sets from the previous step are further processed by build_qlateral_array from nhd_network_utilities_v02, which is called from nwm_forcing_preprocess. Besides forcing_parameters, the other key input is segment_index, the set of segment IDs used to filter the resulting dataframe. The following parameters are extracted from forcing_parameters:

  • qts_subdivisions: Number of subdivisions for time steps (defaults to 1).
  • nts: Total number of time steps (defaults to 1).
  • qlat_input_folder: Folder containing qlateral files.
  • qlat_input_file: Direct input file for qlateral data.

The subsequent assembly of the forcings dataframe depends on whether forcing files are sourced from a Qlat input folder, a single Qlat input file, or a Qlat constant value:

  • Qlat folder input:

The function uses qlat_files if present, or otherwise constructs a list of files matching qlat_file_pattern_filter. It then reads additional file format information, namely the column names (qlat_file_index_col, qlat_file_value_col, gw_bucket_col, terrain_ro_col). The CHRTOUT files are read in parallel, with each file read by one CPU using get_ql_from_chrtout from nhd_io, which imports the CHRTOUT files via netCDF4. For each file, get_ql_from_chrtout extracts the relevant qlateral data (q_lateral, gw_bucket, terrain_runoff columns), and the results are collected in a list, ql_list.

To build the dataframe, the feature index (idx) is first extracted from the first CHRTOUT file. The lateral inflow data from all files are then stacked into a 2D array and converted into a pandas dataframe, qlat_df. Rows represent segments (indexed by idx) and columns represent time steps (one per qlateral file). Finally, qlat_df is filtered to include only the rows (segments) that are present in segment_index.
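
The stacking and filtering can be sketched with numpy and pandas as below. The function name assemble_qlat_df is hypothetical, and the per-file arrays are assumed to be 1-D and already aligned with idx. Reading the CHRTOUT files themselves is left out.

```python
import numpy as np
import pandas as pd


def assemble_qlat_df(idx, ql_list, segment_index):
    """Stack per-file lateral inflows into a segments x time-steps dataframe."""
    # np.stack gives shape (n_files, n_segments); transpose so rows are segments.
    values = np.stack(ql_list).T
    qlat_df = pd.DataFrame(values, index=idx, columns=range(len(ql_list)))
    # Keep only the segments that belong to the network being routed.
    return qlat_df[qlat_df.index.isin(segment_index)]


idx = [101, 102, 103]                       # feature IDs from the first file
ql_list = [np.array([1.0, 2.0, 3.0]),       # inflows from file 0
           np.array([4.0, 5.0, 6.0])]       # inflows from file 1
qlat_df = assemble_qlat_df(idx, ql_list, segment_index=[101, 103])
print(qlat_df.shape)
# prints: (2, 2)
```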

  • Qlat file input:

In this case, the file must be in CSV format. It is read using get_ql_from_csv from nhd_io, which relies on the pandas CSV import function.

  • Qlat constant value:

This option is the default if neither an input folder (qlat_input_folder) nor an input file (qlat_input_file) is provided. In that case, the function creates a constant qlateral dataframe in which all lateral inflows are set to a constant value, qlat_const (default: 0). The dataframe has nts // qts_subdivisions time-step columns and is indexed by the segment IDs in segment_index.
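
A minimal sketch of the constant-value case, assuming pandas. The function name constant_qlat_df is hypothetical, and the layout (segments as rows, time steps as columns) follows the folder-input description above.

```python
import pandas as pd


def constant_qlat_df(segment_index, nts, qts_subdivisions=1, qlat_const=0.0):
    """Build a dataframe with every lateral inflow set to qlat_const."""
    n_cols = nts // qts_subdivisions
    # A scalar passed as data is broadcast over the whole index/columns grid.
    return pd.DataFrame(qlat_const, index=segment_index, columns=range(n_cols))


qlat_df = constant_qlat_df([101, 102, 103], nts=288, qts_subdivisions=12)
print(qlat_df.shape)
# prints: (3, 24)
```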

Building forcing sets - V4

Build forcing file sets - V4

In V4 mode, forcing sets are built within the AbstractNetwork class after it has been initialized through either HYFeaturesNetwork or NHDNetwork. The member function that builds run_sets, analogous to V3, is build_forcing_sets:

  • Parameter Extraction: The function build_forcing_sets extracts the following parameters from the configuration dictionary forcing_parameters:

    • qlat_forcing_sets: A pre-built set of forcing runs, if provided.
    • qlat_input_folder: The folder containing the qlateral forcing files.
    • nts: The total number of time steps in the simulation.
    • max_loop_size: The maximum number of time steps or files that can be processed in one loop (default is 12).
    • dt: The time step interval for the model.
    The function then verifies that the qlat_input_folder exists. If the folder does not exist or is not specified, the function halts.
  • Nexus File Conversion (if applicable): If the forcing files are of the type nex-* and a binary folder is specified, the function converts these nex-csv files to binary Parquet files using the helper function nex_files_to_binary, which in turn calls the pyarrow-based helper rewrite_to_parquet. After conversion, it updates forcing_parameters with the folder containing the Parquet files and with the new file pattern.

  • Assembly of the run sets: Depending on the input forcing configuration, the run_sets are built in one of the following three ways:

    1. forcing_glob_filter is nex-: The function retrieves all files from the qlat_input_folder that match the nex- pattern. It reads the timestamp from the last row of the first file to determine the final_timestamp for the run. It then stores the list of all qlateral files, the total number of time steps (nts), and the final_timestamp in a single run set.

    2. Forcing sets predefined (run_sets): run_sets are returned as is.

    3. qlat_input_folder is provided (and no forcing sets): A sorted list of forcing files is extracted from the input folder based on the forcing_glob_filter (e.g., *.CHRTOUT_DOMAIN1 or *NEXOUT). The time interval between forcing files (dt_qlat) is determined from the first two files. The number of subdivisions (qts_subdivisions) is then calculated as the ratio of dt_qlat to the model time step (dt), and the number of files needed to cover the full duration of the simulation (nfiles) is determined from the total number of time steps (nts) and qts_subdivisions. The list of forcing files is built from the resulting datetime list, and the existence of all forcing files is verified. Finally, the function loops through the forcing files in groups of up to max_loop_size and adds one run set per group to run_sets, accumulating the number of time steps processed until all required files are covered. For each run set, the number of time steps (nts) is the product of the number of files and qts_subdivisions, and final_timestamp is extracted from the metadata of the last file in the set.
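
The three-way choice above can be summarized in a small sketch. The function name select_run_set_strategy and the returned labels are hypothetical, and testing the filter with startswith("nex-") is an assumption about how the nex- case is detected. The order of the checks mirrors the numbered list.

```python
def select_run_set_strategy(forcing_glob_filter, run_sets=None,
                            qlat_input_folder=None):
    """Return a label for the way run_sets would be assembled."""
    # Case 1: nexus output files, bundled into a single run set.
    if forcing_glob_filter.startswith("nex-"):
        return "single_nexus_run_set"
    # Case 2: the user supplied qlat_forcing_sets, returned unchanged.
    if run_sets:
        return "predefined_run_sets"
    # Case 3: build the sets from the files found in the input folder.
    if qlat_input_folder:
        return "build_from_folder"
    raise ValueError("no forcing source configured")


print(select_run_set_strategy("*.CHRTOUT_DOMAIN1", qlat_input_folder="forcing/"))
# prints: build_from_folder
```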
