Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only read requested vars #22

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Only read requested vars #22

wants to merge 1 commit into from

Conversation

forsyth2
Copy link
Collaborator

@forsyth2 forsyth2 commented Apr 3, 2025

Summary

Objectives:

  • Only open the requested variables.

Issue resolution:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Small Change

  • To merge, I will use "Squash and merge". That is, this change should be a single commit.
  • Logic: I have visually inspected the entire pull request myself.
  • Pre-commit checks: All the pre-commits checks have passed.

@forsyth2 forsyth2 added the semver: bug Bug fix (will increment patch version) label Apr 3, 2025
@forsyth2 forsyth2 self-assigned this Apr 3, 2025
@forsyth2
Copy link
Collaborator Author

forsyth2 commented Apr 3, 2025

Currently testing runtimes on two cfgs, reduced versions of the example cfg in E3SM-Project/zppy#694, for comparison against the runtimes reported in E3SM-Project/zppy#694 (comment)

vars = "", plots_lnd = 25 vars
# vars = ""
# plots_lnd = 25 vars

[default]
output = /lcrc/group/e3sm/ac.forsyth2/test_zi_performance_try1/v3.LR.historical_0051
www = /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/test_zi_performance_try1
input = /lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051
environment_commands = "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.11.0rc13_chrysalis.sh"
campaign = "water_cycle"
case = v3.LR.historical_0051
partition = compute

[ts]
active = True
walltime = "00:50:00"
years = "1985:2014:5"

  [[ atm_monthly_glb ]]
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  mapping_file = "glb"

  [[ lnd_monthly_glb ]]
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  mapping_file = "glb"
  #vars = "FSH,RH2M,LAISHA,LAISUN,QINTR,QOVER,QRUNOFF,QSOIL,QVEGE,QVEGT,SOILWATER_10CM,TSA,H2OSNO,TOTLITC,CWDC,SOIL1C,SOIL2C,SOIL3C,SOIL4C,WOOD_HARVESTC,TOTVEGC,NBP,GPP,AR,HR"
  vars = "" # This will tell zppy to use all available variables.

[mpas_analysis]
active = True
anomalyRefYear = 1985
climo_years = "1985-2014",
enso_years = "1985-2014",
mesh = "IcoswISC30E3r5"
parallelTaskCount = 6
shortTermArchive = True
ts_years = "1985-2014",
walltime = "4:00:00"

[global_time_series]
active = True
climo_years ="1985-2014",
environment_commands = "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate zi_performance_20250403"
experiment_name = "v3.LR.historical_0051"
figstr = "v3.LR.historical_0051"
make_viewer = True
moc_file = "mocTimeSeries_1985-2014.nc"
plots_atm = "TREFHT" # This will plot in the atm component; it has no effect on the original plots
plots_lnd = "FSH,RH2M,LAISHA,LAISUN,QINTR,QOVER,QRUNOFF,QSOIL,QVEGE,QVEGT,SOILWATER_10CM,TSA,H2OSNO,TOTLITC,CWDC,SOIL1C,SOIL2C,SOIL3C,SOIL4C,WOOD_HARVESTC,TOTVEGC,NBP,GPP,AR,HR" # Set this to "all" to plot all land variables
# plots_original isn't set and so will default to the original 8 plots.
ts_num_years = 5
ts_years ="1985-2014",
walltime = "10:00:00"
years = "1985-2014",
vars = 25 vars, plots_lnd = 25 vars
# vars = 25 vars
# plots_lnd = 25 vars

[default]
output = /lcrc/group/e3sm/ac.forsyth2/test_zi_performance_try2/v3.LR.historical_0051
www = /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/test_zi_performance_try2
input = /lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051
environment_commands = "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.11.0rc13_chrysalis.sh"
campaign = "water_cycle"
case = v3.LR.historical_0051
partition = compute

[ts]
active = True
walltime = "00:50:00"
years = "1985:2014:5"

  [[ atm_monthly_glb ]]
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  mapping_file = "glb"

  [[ lnd_monthly_glb ]]
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  mapping_file = "glb"
  vars = "FSH,RH2M,LAISHA,LAISUN,QINTR,QOVER,QRUNOFF,QSOIL,QVEGE,QVEGT,SOILWATER_10CM,TSA,H2OSNO,TOTLITC,CWDC,SOIL1C,SOIL2C,SOIL3C,SOIL4C,WOOD_HARVESTC,TOTVEGC,NBP,GPP,AR,HR"
  #vars = "" # This will tell zppy to use all available variables.

[mpas_analysis]
active = True
anomalyRefYear = 1985
climo_years = "1985-2014",
enso_years = "1985-2014",
mesh = "IcoswISC30E3r5"
parallelTaskCount = 6
shortTermArchive = True
ts_years = "1985-2014",
walltime = "4:00:00"

[global_time_series]
active = True
climo_years ="1985-2014",
environment_commands = "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate zi_performance_20250403"
experiment_name = "v3.LR.historical_0051"
figstr = "v3.LR.historical_0051"
make_viewer = True
moc_file = "mocTimeSeries_1985-2014.nc"
plots_atm = "TREFHT" # This will plot in the atm component; it has no effect on the original plots
plots_lnd = "FSH,RH2M,LAISHA,LAISUN,QINTR,QOVER,QRUNOFF,QSOIL,QVEGE,QVEGT,SOILWATER_10CM,TSA,H2OSNO,TOTLITC,CWDC,SOIL1C,SOIL2C,SOIL3C,SOIL4C,WOOD_HARVESTC,TOTVEGC,NBP,GPP,AR,HR" # Set this to "all" to plot all land variables
# plots_original isn't set and so will default to the original 8 plots.
ts_num_years = 5
ts_years ="1985-2014",
walltime = "10:00:00"
years = "1985-2014",

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Apr 3, 2025

open_mfdataset doesn't appear in #19, so I think these can be independent pull requests.

@chengzhuzhang
Copy link
Collaborator

open_mfdataset doesn't appear in #19, so I think these can be independent pull requests.

We don't need #19 (which introduced other behavior changes), if this PR can resolve the issue.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Apr 3, 2025

I'm still thinking #19 might be necessary though. Tom's analysis work in #21 showed a directory that "contains 353 files", which matches the number of land vars. In E3SM-Project/zppy#694, we had a long runtime because we were trying to read data for 353 variables when we only needed 25 of them. That would mean in the case where we do actually want 353 variables, the runtime would still be long.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Apr 3, 2025

I'm running another test cfg for this case, to confirm:

vars = "", plots_lnd = "all"
# vars = ""
# plots_lnd = "all"

[default]
output = /lcrc/group/e3sm/ac.forsyth2/test_zi_performance_try3/v3.LR.historical_0051
www = /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/test_zi_performance_try3
input = /lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051
environment_commands = "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.11.0rc13_chrysalis.sh"
campaign = "water_cycle"
case = v3.LR.historical_0051
partition = compute

[ts]
active = True
walltime = "00:50:00"
years = "1985:2014:5"

  [[ atm_monthly_glb ]]
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  mapping_file = "glb"

  [[ lnd_monthly_glb ]]
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  mapping_file = "glb"
  #vars = "FSH,RH2M,LAISHA,LAISUN,QINTR,QOVER,QRUNOFF,QSOIL,QVEGE,QVEGT,SOILWATER_10CM,TSA,H2OSNO,TOTLITC,CWDC,SOIL1C,SOIL2C,SOIL3C,SOIL4C,WOOD_HARVESTC,TOTVEGC,NBP,GPP,AR,HR"
  vars = "" # This will tell zppy to use all available variables.

[mpas_analysis]
active = True
anomalyRefYear = 1985
climo_years = "1985-2014",
enso_years = "1985-2014",
mesh = "IcoswISC30E3r5"
parallelTaskCount = 6
shortTermArchive = True
ts_years = "1985-2014",
walltime = "4:00:00"

[global_time_series]
active = True
climo_years ="1985-2014",
environment_commands = "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate zi_performance_20250403"
experiment_name = "v3.LR.historical_0051"
figstr = "v3.LR.historical_0051"
make_viewer = True
moc_file = "mocTimeSeries_1985-2014.nc"
plots_atm = "TREFHT" # This will plot in the atm component; it has no effect on the original plots
plots_lnd = "all"
#plots_lnd = "FSH,RH2M,LAISHA,LAISUN,QINTR,QOVER,QRUNOFF,QSOIL,QVEGE,QVEGT,SOILWATER_10CM,TSA,H2OSNO,TOTLITC,CWDC,SOIL1C,SOIL2C,SOIL3C,SOIL4C,WOOD_HARVESTC,TOTVEGC,NBP,GPP,AR,HR" # Set this to "all" to plot all land variables
# plots_original isn't set and so will default to the original 8 plots.
ts_num_years = 5
ts_years ="1985-2014",
walltime = "10:00:00"
years = "1985-2014",

Comment on lines +18 to +21
file_path_list: List[str] = []
for var in var_list:
# `var` will be of the form `FSNS`, `FLNS`, etc.
file_path_list.append(f"{directory}{var.variable_name}*.nc")
Copy link
Collaborator Author

@forsyth2 forsyth2 Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From #21:

Xarray/xCDAT will attempt to concatenate all of these files into a single Dataset object, which can be slow depending on the number of files, the file sizes, and the shape/structure of the data.

I'm almost wondering if it'd be better to read in one variable at a time: at the beginning of globalAnnual open just that variable. As far as I can tell, with two important exceptions, variables aren't requested multiple times. Those exceptions are:

  1. The variables used for derivation in globalAnnualHelper can end up requested multiple times; they don't have to be re-loaded though because they already exist in the DatasetWrapper
  2. The variables used in both "original" and "atm"/"ocn" components are loaded twice -- as plots_original variables and as plots_atm/plots_ocn variables.
> git grep -n  dataset_wrapper
zppy_interfaces/global_time_series/coupled_global.py:10:from zppy_interfaces.global_time_series.coupled_global_dataset_wrapper import (
zppy_interfaces/global_time_series/coupled_global.py:168:            dataset_wrapper: DatasetWrapper = DatasetWrapper(exp[exp_key], var_list)
zppy_interfaces/global_time_series/coupled_global.py:180:                data_array, units = dataset_wrapper.globalAnnual(var)
zppy_interfaces/global_time_series/coupled_global.py:214:        del dataset_wrapper
zppy_interfaces/global_time_series/coupled_global.py:257:            dataset_wrapper = DatasetWrapper(exp["ocean"], [ohc_variable])
zppy_interfaces/global_time_series/coupled_global.py:258:            exp["annual"]["ohc"], _ = dataset_wrapper.globalAnnual(ohc_variable)
zppy_interfaces/global_time_series/coupled_global.py:264:            dataset_wrapper = DatasetWrapper(exp["vol"], [vol_variable])
zppy_interfaces/global_time_series/coupled_global.py:265:            exp["annual"]["volume"], _ = dataset_wrapper.globalAnnual(vol_variable)
> git grep -n set_area_tuple
zppy_interfaces/global_time_series/coupled_global_dataset_wrapper.py:28:    def set_area_tuple(self):
zppy_interfaces/global_time_series/coupled_global_dataset_wrapper.py:162:                    self.set_area_tuple()

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Apr 3, 2025

Unfortunately this fix isn't working yet. See errors below. It appears RESTOM wasn't defined as a variable we should get a nc file for. But that was the case before, so how was it working before? Well, the issue is we have now asked zppy to look for RESTOM specifically, but it's actually a derived variable:

        if var == "RESTOM":
            FSNT, _ = self.globalAnnualHelper(
                "FSNT", metric, scale_factor, original_units, final_units
            )
            FLNT, _ = self.globalAnnualHelper(
                "FLNT", metric, scale_factor, original_units, final_units
            )
            data_array = FSNT - FLNT

That is, we weren't looking for a RESTOM nc file, we were looking for FSNT AND FLNT nc files. That was a benefit of just downloading all the variables upfront. To make this PR work, we'd need to move the variable derivation much earlier (and actually maybe in two places -- once just to find what vars we need and once to actually fill in formulas).

So, this is a complication that needs to be addressed before merging this PR...

Errors picking up RESTOM
> cd /lcrc/group/e3sm/ac.forsyth2/test_zi_performance_try1/v3.LR.historical_0051/post/scripts
> grep -v "OK" *status
global_time_series_1985-2014.status:ERROR (8)
> emacs global_time_series_1985-2014.o721926

2025-04-03 16:23:07,286 [CRITICAL]: coupled_global.py(set_var:170) >> [Errno 2] No such file or directory: '/lcrc/group/e3sm/ac.forsyth2/test_zi_performance_try1/v3.LR.historical_0051/post/atm/glb/ts/monthly/5yr/RESTOM*.nc'
2025-04-03 16:23:07,286 [CRITICAL]: coupled_global.py(set_var:171) >> DatasetWrapper object could not be created for atmos=/lcrc/group/e3sm/ac.forsyth2/test_zi_performance_try1/v3.LR.historical_0051/post/atm/glb/ts/monthly/5yr/

AttributeError: 'DatasetWrapper' object has no attribute 'dataset'

> ls /lcrc/group/e3sm/ac.forsyth2/test_zi_performance_try1/v3.LR.historical_0051/post/atm/glb/ts/monthly/5yr/ | grep RESTOM

# No results

> grep -in vars ts_atm_monthly_glb_1985-1989-0005.settings
15:  'extra_vars': '',
47:  'vars': 'FSNTOA,FLUT,FSNT,FLNT,FSNS,FLNS,SHFLX,QFLX,TAUX,TAUY,PRECC,PRECL,PRECSC,PRECSL,TS,TREFHT,CLDTOT,CLDHGH,CLDMED,CLDLOW,U',
48:  'vars_exclude': 'H2OSOI,LAKEICEFRAC,O_SCALAR,PCT_LANDUNIT,PCT_NAT_PFT,SOILICE,SOILICE_ICE,SOILLIQ,SOILLIQ_ICE,SOILPSI,T_SCALAR,TLAKE,TSOI,TSOI_ICE,W_SCALAR',
> cd /lcrc/group/e3sm/ac.forsyth2/test_zi_performance_try2/v3.LR.historical_0051/post/scripts
> grep -v "OK" *status
global_time_series_1985-2014.status:ERROR (8)
> emacs global_time_series_1985-2014.o721940

2025-04-03 16:29:04,343 [CRITICAL]: coupled_global.py(set_var:170) >> [Errno 2] No such file or directory: '/lcrc/group/e3sm/ac.forsyth2/test_zi_performance_try2/v3.LR.historical_0051/post/atm/glb/ts/monthly/5yr/RESTOM*.nc'
2025-04-03 16:29:04,343 [CRITICAL]: coupled_global.py(set_var:171) >> DatasetWrapper object could not be created for atmos=/lcrc/group/e3sm/ac.forsyth2/test_zi_performance_try2/v3.LR.historical_0051/post/atm/glb/ts/monthly/5yr/

AttributeError: 'DatasetWrapper' object has no attribute 'dataset'

> ls /lcrc/group/e3sm/ac.forsyth2/test_zi_performance_try2/v3.LR.historical_0051/post/atm/glb/ts/monthly/5yr/ | grep RESTOM

# No results

> grep -in vars ts_atm_monthly_glb_1985-1989-0005.settings
15:  'extra_vars': '',
47:  'vars': 'FSNTOA,FLUT,FSNT,FLNT,FSNS,FLNS,SHFLX,QFLX,TAUX,TAUY,PRECC,PRECL,PRECSC,PRECSL,TS,TREFHT,CLDTOT,CLDHGH,CLDMED,CLDLOW,U',
48:  'vars_exclude': 'H2OSOI,LAKEICEFRAC,O_SCALAR,PCT_LANDUNIT,PCT_NAT_PFT,SOILICE,SOILICE_ICE,SOILLIQ,SOILLIQ_ICE,SOILPSI,T_SCALAR,TLAKE,TSOI,TSOI_ICE,W_SCALAR',

@chengzhuzhang
Copy link
Collaborator

as i earlier mentioned, it is a bit of challenging to debug at this point. I'd hope to see the original component is decoupled from other realm component, because they are essentially very different set.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Apr 3, 2025

Trying to think of possible paths forward here:

  1. Keep main -- that is, global_time_series will be slow if a user requested vars="" for the ts glb-lnd subtask but only a few variables for plots_lnd.
  2. Add any possible variables we'd need for derivation to the file_path_list to feed open_mf_dataset so that we know we have them. (This could cause the DatasetWrapper to not get created if any of those variables are missing, so that is not ideal).
  3. Run open_mfdataset at the start of each call to globalAnnual, per variable

@chengzhuzhang I'm thinking (1) at this point given the Unified timeline (and then determine how to improve it in the next development cycle). What do you think?

see the original component is decoupled from other realm component, because they are essentially very different set.

Yes, they have become quite different over the past half-year. We should discuss this more but it's not something we can fit in before the Unified release.

@chengzhuzhang
Copy link
Collaborator

yes this make sense, we can use the same land var list we used in the original .cfg for now in the template in ts-glb task, and make a note about potential performance impact of using var = " " on the global time series plotting.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Apr 3, 2025

I'd hope to see the original component is decoupled from other realm component, because they are essentially very different set.

#23

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: bug Bug fix (will increment patch version)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Potential performance bottleneck with DatasetWrapper.dataset
2 participants