Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds a time-slice capability to Omega IO and IOStream #189

Merged
merged 10 commits into from
Feb 6, 2025

Conversation

philipwjones
Copy link

@philipwjones philipwjones commented Dec 20, 2024

With these changes, users can write multiple time slices of a field to a given file. This is specified through the stream input YAML configuration by adding a file frequency in addition to the data frequency. Currently this only applies to output streams. While input multi-frame reads are supported at the base IO level, I am waiting for forcing designs to determine how best to support multiple slices in an input file. Unit tests for reading/writing multiple slices from a file have been added for the base IO level and all unit tests pass. I have modified the default YAML config to write 10-day high-frequency output for two months in two monthly files and verified that correct times have been written for the time field. The user and developer docs have been updated to describe how to configure and use this feature.

Checklist

  • Documentation:
    • Design document has been generated and added to the docs
    • User's Guide has been updated
    • Documentation has been built locally and changes look as expected
  • Testing
    • A comment in the PR documents testing used to verify the changes including any tests that are added/modified/impacted.
    • CTest unit tests for new features have been added per the approved design.
    • Polaris tests for new features have been added per the approved design (and included in a test suite)
    • Unit tests have passed. Please provide a relevant CDash build entry for verification.
    • Polaris test suite has passed

   Adds a capability to read/write multiple time slices of
   an unlimited time dimension to a single file. This modifies
   just the base IO layer - IOStream support will follow in
   another commit.
   The base IO unit test and documentation have been modified
   to test and document the new functionality.
  adds the ability to write multiple time slices to a single
  file in IOStreams.
@philipwjones
Copy link
Author

Passes unit tests on Chrys/Intel and Frontier Cray cpu and gpu

Copy link

@hyungyukang hyungyukang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philipwjones , this is great! This is what we need. Thanks a lot for your dedicated work on this, and I'm sorry for the delay in my review.

  • I have been testing this PR, and it performed as expected with the unit tests. However, I encountered an error while running the manufactured solution test. I will provide details on how to reproduce the error.

  • Should the time dimension and variable name time be changed to Time to match Omega's naming conventions?

TimeUnits FileUnits = TimeUnitsFromString(FileUnitStr);

// Various error checks
if (Err1 == 0 or Err2 == 0) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be helpful to include a log statement specifying the frequency (Freq and FreqUnits) and file frequency (FileFreq and FileFreqUnits) utilized by Omega.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we need to echo options to the log file, but I need to make some changes to the logging facility as part of the error handler so will make the changes then.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Thanks.

@hyungyukang
Copy link

hyungyukang commented Jan 13, 2025

Here is the omega.log when running the manufactured solution test with FileFreq = 1 and FileFreqUnits = days:

omega.log (bottom half)

[* info] [IOStream.cpp:2470] Successfully read stream InitialState from file OmegaMesh.nc
[* info] [OceanRun.cpp:56] ocnRun: Time step 1 complete, clock time: 0001-01-01-00:10:00.0000
[* info] [OceanRun.cpp:56] ocnRun: Time step 2 complete, clock time: 0001-01-01-00:20:00.0000
[* info] [OceanRun.cpp:56] ocnRun: Time step 3 complete, clock time: 0001-01-01-00:30:00.0000
[* info] [OceanRun.cpp:56] ocnRun: Time step 4 complete, clock time: 0001-01-01-00:40:00.0000
[* info] [OceanRun.cpp:56] ocnRun: Time step 5 complete, clock time: 0001-01-01-00:50:00.0000
[* error] [IO.cpp:334] PIO error while reading dimension MaxCellsOnEdge
[* error] [IO.cpp:334] PIO error while reading dimension TWO
[* error] [IO.cpp:334] PIO error while reading dimension MaxEdges
[* error] [IO.cpp:334] PIO error while reading dimension maxEdges
[* error] [IO.cpp:334] PIO error while reading dimension NCells
[* error] [IO.cpp:334] PIO error while reading dimension nCells
[* error] [IO.cpp:334] PIO error while reading dimension NEdges
[* error] [IO.cpp:334] PIO error while reading dimension nEdges
[* error] [IO.cpp:334] PIO error while reading dimension NTracers
[* error] [IO.cpp:334] PIO error while reading dimension nTracers
[* error] [IO.cpp:334] PIO error while reading dimension NVertLevels
[* error] [IO.cpp:334] PIO error while reading dimension nVertLevels
[* error] [IO.cpp:334] PIO error while reading dimension NVertices
[* error] [IO.cpp:334] PIO error while reading dimension nVertices
[* error] [IO.cpp:334] PIO error while reading dimension VertexDegree
[* error] [IO.cpp:334] PIO error while reading dimension vertexDegree
[* error] [IOStream.cpp:2624] Error defined dimensions for file ocn.hist.0001-01-01_00:00:00.nc
[* error] [IOStream.cpp:310] writeAll error in stream History
[* error] [IOStream.cpp:310] writeAll error in stream InitialState
[* error] [IOStream.cpp:310] writeAll error in stream RestartRead
[* critical] [OceanRun.cpp:51] Error writing streams at end of step
[* error] [OceanDriver.cpp:47] Error advancing Omega run interval
[* info] [Tracers.cpp:249] Tracers::clear() called
[* error] [OceanDriver.cpp:61] OMEGA terminating due to error

And here is an instruction to reproduce the error on Frontier (cpu):

Error reproduce
git clone git@github.com:philipwjones/E3SM.git
git checkout omega/io-time-slice
git submodule update --init --recursive externals/ekat externals/scorpio cime

cd components/omega

mkdir build

cd build

module load cmake

export PARMETIS_ROOT=/ccs/proj/cli115/software/polaris/frontier/spack/dev_polaris_0_4_0_crayclang_mpich/var/spack/environments/dev_polaris_0_4_0_crayclang_mpich/.spack-env/view

cmake \
   -DOMEGA_CIME_COMPILER=crayclang \
   -DOMEGA_BUILD_TYPE=Release \
   -DOMEGA_CIME_MACHINE=frontier \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT}\
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   ../

./omega_build.sh

# Copy OmegaMesh.nc (the manufactured solution init fields (200km)) and omega.yml
cp /lustre/orion/cli115/proj-shared/hgkang/shared/OmegaMesh.nc ./
cp /lustre/orion/cli115/proj-shared/hgkang/shared/omega.yml ./

salloc -A cli115 -J e3sm -t 1:00:00 -p batch -N 1

./omega_run.sh

When I commented out FileFreq and FileFreqUnits in omega.yml, the simulation completed successfully but produced several single-time frame output files as expected. I will also debug the issue.

@philipwjones
Copy link
Author

@hyungyukang Thanks, I'll look into the manufactured solution case, but it looks like it thinks the file should already exist and is trying to read dims for consistency. So it probably has to do with assumptions about whether a file was written at the alarm boundary, esp when a simulation starts on a boundary.

Regarding time vs Time, there is no way to add a std name to dimensions so the dimension has to keep time as the name for CF compliance unless we want to translate it later. So I left the time field name consistent with the dimension name, though I could keep Time with time as a std name in that case. I'll let @xylar weigh in on the preference - don't know whether we're going to have a translation step at some point anyway as part of CMORizing.

@xylar
Copy link

xylar commented Jan 13, 2025

@philipwjones, sorry for the delay on this. I'll get to this as soon as I can.

@philipwjones
Copy link
Author

Quick update - there are indeed problems with the way I've computed the frame number in some specific cases (in addition to the bug above, there is one associated with calendar-based intervals like monthly and annual). These are limited to a small chunk of code that I should be able to fix pretty quickly.

@xylar
Copy link

xylar commented Jan 16, 2025

Regarding time vs Time, there is no way to add a std name to dimensions so the dimension has to keep time as the name for CF compliance unless we want to translate it later. So I left the time field name consistent with the dimension name, though I could keep Time with time as a std name in that case. I'll let @xylar weigh in on the preference

I do feel strongly that the coordinate time and the dimension time need to have the same name. That's what lots of tools are expecting. And it should be lowercase.

don't know whether we're going to have a translation step at some point anyway as part of CMORizing.

Yes, I think that's still inevitable but we do want cf-compliant output to the extent possible even so.

Copy link

@xylar xylar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to run in this directory with the following omega.yml:

$ pwd
/lcrc/group/e3sm/ac.xylar/polaris_0.5/chrysalis/test_20250116/cosine_bell_io_time_slice/ocean/spherical/icos/cosine_bell/convergence_both/with_viz/forward/480km_1440s

$ cat omega.yml
Omega:
  TimeIntegration:
    CalendarType: No Leap
    TimeStepper: RungeKutta4
    TimeStep: 0000_00:24:00.000
    StartTime: 0001-01-01_00:00:00
    StopTime: none
    RunDuration: 0024_00:00:00.000
  Dimension:
    NVertLevels: 1
  Decomp:
    HaloWidth: 3
    DecompMethod: MetisKWay
  State:
    NTimeLevels: 2
  Advection:
    FluxThicknessType: Center
    FluxTracerType: Center
  Tendencies:
    ThicknessFluxTendencyEnable: false
    PVTendencyEnable: false
    KETendencyEnable: false
    SSHTendencyEnable: false
    VelDiffTendencyEnable: false
    ViscDel2: 1.0e3
    VelHyperDiffTendencyEnable: false
    ViscDel4: 1.2e11
    TracerHorzAdvTendencyEnable: true
    TracerDiffTendencyEnable: false
    EddyDiff2: 10.0
    TracerHyperDiffTendencyEnable: false
    EddyDiff4: 0.0
    UseCustomTendency: false
    ManufacturedSolutionTendency: false
  Tracers:
    Base: [Temperature, Salinity]
    Debug: [Debug1, Debug2, Debug3]
  ManufacturedSolution:
    WavelengthX: 5.0e6
    WavelengthY: 4.33013e6
    Amplitude: 1.0
  IOStreams:
    # InitialState should only be used when starting from scratch.
    # For restart runs, the frequency units should be changed from
    # "OnStartup" to "never" so that the initial state file is not read.
    InitialState:
      UsePointerFile: false
      Filename: init.nc
      Mode: read
      Precision: double
      Freq: 1
      FreqUnits: OnStartup
      UseStartEnd: false
      Contents:
      - State
      - Tracers
    History:
      UsePointerFile: false
      Filename: output.nc
      Mode: write
      IfExists: append
      Precision: double
      Freq: 2073600
      FreqUnits: seconds
      UseStartEnd: false
      Contents:
      - Tracers
      - LayerThickness
      - NormalVelocity

I am seeing these errors:

PIO: WARNING: Opening file (output.nc) with iotype=3 (PIO_IOTYPE_NETCDF4C) failed (ierr=2, No such file or directory). Retrying with iotype=PIO_IOTYPE_NETCDF
...

PIO: FATAL ERROR: Aborting... FATAL ERROR: No such file or directory (file = output.nc) (/home/ac.xylar/e3sm_work/omega/omega/io-time-slice/externals/scorpio/src/clib/pioc_support.c: 5090)

It seems like maybe the file is getting opened in append mode even if it doesn't exist, and that PIO is unhappy with that? Or am I doing something weird?

Comment on lines 97 to 98
- Append if you want to append (eg multiple time slices) to the existing
file (this option is not currently supported).
file
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philipwjones, could you clarify (here or elsewhere) a bit more about what happens when you choose append? In particular, if the the model goes back in time (e.g. as a result of a restart), will time entries that already exist in the file be skipped or overwritten (or, heaven forbid, appended!) when Omega attempts to rewrite the same time entry?

In MPAS-Ocean, the clobber model refers to specific time entries, not to files. So the equivalent of replace would mean overwriting time entries as they are reached, and append would mean skipping existing time entries and only adding new ones. I am not saying that this is what Omega should do, just that we want to be clear about what Omega does do. At the moment, it seems like Omega likely does only one or the other of the MPAS-Ocean behaviors. If that is overwriting time entries even in append mode, I think that's okay. If someone changes how the model runs, the output will get replaced with the updated version. But if time entries get skipped in append mode, that might be an unexpected behavior for users fiddling around with the yaml file. They could probably get the intended behavior by deleting files, though. So again, just a question of documenting what the behavior is.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, append is just an indicator that the file may already exist and that we should write to the existing file if it exists and if not, create a new one. It was really only meant for multi-slice files (for single-slice files, I think we want to replace typically). The intent for multi-slice files was to overwrite any time slices that already existed in the file if it's re-running the same interval. I will clarify in the docs. As I was debugging Hyun's problem, I found that I was only computing the frame/slice correctly for some frequency options (intervals that didn't depend on calendar like seconds, minutes, etc) so trying to fix that now for the calendar intervals.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, sounds good. That's the behavior I would like us to have.

Comment on lines 119 to 128
- **FileFreq:** An optional entry for including multiple time slices in a
single file (only supported for output presently)
- **FreqUnits:** A field that, combined with the integer frequency,
determines the frequency of new files when multiple time slices are
included in a file. The file frequency must be less than or equal to
the data frequency. Acceptable values include all of the frequency values
listed above except the one-time values (eg OnStartup/Shutdown or AtTime).
The filename template should reflect the frequency and the time used for
the file will correspond to the template at the time the file is first
opened.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should these be set to if you just want all time entries to go to a single file? Should they be omitted? Or set to a very large frequency?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Omega write contents to these files on startup (before any time steps have run)? Do we need an option for specifying whether to write on startup or not?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xylar yes, if you want all entries in a single file, you'll need a file frequency that's very large. And no, we currently don't have a write on startup option, though you could kludge it by defining a separate stream with freq OnStartup that writes to the same file. Can you remind me what the use case is for writing on startup? Would you just need contents and metadata? Or would you actually want a full write?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In MPAS-Ocean, it has been pretty standard to write out history files that include the initial state. This is particularly important for diagnostic variables that aren't part of an initial condition or restart file.

But we can cross that bridge when we come to it. I think there may be a few places in Polaris where we assume that the first output for test cases is the initial condition. We'll have to read the initial condition instead, but that's easy.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xylar yes, if you want all entries in a single file, you'll need a file frequency that's very large.

Is the fact that I didn't have FileFreq and FileFreqUnits perhaps why I got the errors I reported above? I can try adding those and running again.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but there may be other bugs too. Worth a try.

And on the startup thing, it's not a huge deal to add since I already have the OnStartup flag internally. But it'll probably require the explicit flag and then not using it as freq units - requiring a bit of change in interpretation for one-time read/writes. Maybe it's clearer that way anyway.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't feel like you have to include that change here. It might be something we need down the road so it sounds good to keep it in mind.

@rljacob
Copy link
Member

rljacob commented Jan 16, 2025

Yes to following CF conventions in output which uses "time".

@@ -114,6 +116,16 @@ a template can be:
- Hours for a frequency every Freq hours (*not* Freq times per hour)
- Minutes for a frequency every Freq minutes (*not* Freq times per minute)
- Seconds for a frequency every Freq seconds (*not* Freq times per seconds)
- **FileFreq:** An optional entry for including multiple time slices in a
single file (only supported for output presently)
- **FreqUnits:** A field that, combined with the integer frequency,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- **FreqUnits:** A field that, combined with the integer frequency,
- **FileFreqUnits:** A field that, combined with the integer frequency,

@xylar
Copy link

xylar commented Jan 17, 2025

@philipwjones, I'm still getting the No such file or directory (file = output.nc) error even when I added:

      FileFreq: 9999
      FileFreqUnits: years

My test is in:

/lcrc/group/e3sm/ac.xylar/polaris_0.5/chrysalis/test_20250117/cosine_bell_io_time_slice/ocean/spherical/icos/cosine_bell/convergence_both/with_viz/forward/480km_1440s

if you need a reproducer.

  The append option needs to know whether a file exists or not to
  use either the create or open function.
  This change allows developers to specify that a field (eg time)
  needs to retain their full precision even when included in an IOStream
  that requests reduced precision.
  - fixed frame calculation for multi-slice files so that it works for
    calendar intervals like months or years
  - fixes issues when overwriting existing frames or files in append mode
@philipwjones
Copy link
Author

Ok @xylar and @hyungyukang I believe I have fixed all the issues that you reported and this should be ready to review again. The prior version was not correctly opening and over-writing files and slices when in append mode, so I have fixed those. And the prior version was not computing the frames in a multi-slice file correctly for all time intervals. That has now been fixed.

@hyungyukang
Copy link

@philipwjones , thanks a lot. I just did some quick tests with the manufactured solution test, and now it is working properly!

dimensions:
	MaxCellsOnEdge = 2 ;
	MaxEdges = 6 ;
	NCells = 2500 ;
	NEdges = 7500 ;
	NTracers = 5 ;
	NVertLevels = 1 ;
	NVertices = 5000 ;
	VertexDegree = 3 ;
	time = UNLIMITED ; // (10 currently)

...

// global attributes:
		:SimulationTime = "0001-01-01_10:00:00" ;
		:SimulationTime0 = "0001-01-01_01:00:00" ;
		:SimulationTime1 = "0001-01-01_02:00:00" ;
		:SimulationTime2 = "0001-01-01_03:00:00" ;
		:SimulationTime3 = "0001-01-01_04:00:00" ;
		:SimulationTime4 = "0001-01-01_05:00:00" ;
		:SimulationTime5 = "0001-01-01_06:00:00" ;
		:SimulationTime6 = "0001-01-01_07:00:00" ;
		:SimulationTime7 = "0001-01-01_08:00:00" ;
		:SimulationTime8 = "0001-01-01_09:00:00" ;
		:SimulationTime9 = "0001-01-01_10:00:00" ;

I am going to test more under several circumstances, but I have a quick question:

The time-slice capability only works with IfExists: append, correct? When I rerun the model, I have to remove the existing output file. Otherwise, I received this error message:

[* error] [IOStream.cpp:253] Error reading stream RestartRead

I believe it would be beneficial for it to support replace as well or to have a corresponding function in append so users don't need to manually remove them beforehand, if its implementation is not too complex at this stage? Please correct me if I'm understanding wrong.

@philipwjones
Copy link
Author

@hyungyukang Time slices only work with append, otherwise you would lose all prior time slices when the file was replaced. Having said that, you shouldn't need to manually remove existing files. It should either replace files (under the replace option) or over-write time slices within an existing file (under the append option). I tested that by running Ctest twice without clearing the directory and it worked fine for me. What is your setup that's causing an error?

@hyungyukang
Copy link

@hyungyukang Time slices only work with append, otherwise you would lose all prior time slices when the file was replaced. Having said that, you shouldn't need to manually remove existing files. It should either replace files (under the replace option) or over-write time slices within an existing file (under the append option). I tested that by running Ctest twice without clearing the directory and it worked fine for me. What is your setup that's causing an error?

@philipwjones , I'm running the manufactured solution test. You can retrieve my setup by following this (building processes are the same):

# Copy OmegaMesh.nc (the manufactured solution init fields (200km)) and omega.yml
cp /lustre/orion/cli115/proj-shared/hgkang/shared/OmegaMesh.nc ./build
cp /lustre/orion/cli115/proj-shared/hgkang/shared/omega.yml ./build

salloc -A cli115 -J e3sm -t 1:00:00 -p batch -N 1

cd build/

# 1st run (ocn.hist.0001-01-01_00:00:00.nc is written properly.)
./omega_run.sh

# 2nd run (error encounter)
./omega_run.sh

Copy link

@xylar xylar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philipwjones, I really appreciate you working so hard on this an iterating with us.

In my testing with Polaris on Chrysalis, things are working as expected. I needed to add the following to history streams that use a single output file (output.nc):

      IfExists: append
      # effectively never
      FileFreq: 9999
      FileFreqUnits: years

When I do that, I can successfully output either every 10 years or every 1 year in a 10-year run, which the expected results. (I tried switching from outputting every 10 years to every 1 year without deleting the file and got a a critical error, which was very easy to understand.)

@xylar
Copy link

xylar commented Feb 4, 2025

I will note that I saw a transient error:

[* error] [IO.cpp:343] PIO error while reading dimension time

The output.nc file was invalid after that (could not run ncdump on it). However, when I reran Omega after removing the file, everything went fine.

@philipwjones
Copy link
Author

@hyungyukang In your case, the issue is that you're (correctly) using a pointer file for restarts. This is fine for normal restarts, but if you're trying to re-run an interval, you have to edit/delete the pointer file or else it will try to extend the run from the last restart written.

@xylar Yes, the behavior trying to switch output intervals is expected - if it finds a different interval, it doesn't know what it should do so bails with an error. The other error you encountered means that it couldn't find the time dimension while trying to read and if the file was invalid, then something might be amiss there.

@philipwjones
Copy link
Author

Actually @hyungyukang , maybe I'm misunderstanding what you're running. Was your second run supposed to extend from restart? In that case, you do have to modify the input yml file from an initial state read to a restart read (as noted in the comments in the yml file).

@philipwjones
Copy link
Author

@hyungyukang Sorry, being stupid this morning. Ignore the previous comments... trying to reproduce now.

@hyungyukang
Copy link

@hyungyukang Sorry, being stupid this morning. Ignore the previous comments... trying to reproduce now.

@philipwjones , no problem! Thank you for taking a look! 😃

  fixed a bug in which string metadata was being read but had not been
  allocated with a large enough size and needed a call to get the
  length first
@philipwjones
Copy link
Author

@hyungyukang There was indeed a bug - I had added a metadata read check to see if metadata already existed with a proper value, but for the string read, the string was not allocated with the proper length. Should be fixed now - worked for me anyway

@xylar It appears there is an issue with the conda environment for PR tests - see the error from the CI test. This PR satisfied the pre-commit checks that I ran manually so should be ok.

@xylar
Copy link

xylar commented Feb 5, 2025

@philipwjones, yeah, there's a known issue causing these intermittent failures. Hopefully, it will be fixed soon. A rerun nearly always takes care of it.

Copy link

@hyungyukang hyungyukang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philipwjones , I really appreciate your hard work on this. I have tested this PR with the manufactured solution test under various conditions by changing I/O related options in omega.yml. In all scenarios, I/O works as expected. I was also able to visualize SSH of the manufactured solution test using an output file containing 96 time levels. Thanks again!

out_res

@philipwjones
Copy link
Author

Thanks @xylar and @hyungyukang for the testing. Glad it's all finally working. Wish I could say this is the last time I need to touch streams...

@philipwjones philipwjones merged commit b193d4e into E3SM-Project:develop Feb 6, 2025
2 checks passed
@philipwjones philipwjones deleted the omega/io-time-slice branch February 6, 2025 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants