Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new test for mirroring with nested states #7

Merged
merged 10 commits into from
Dec 10, 2024

Conversation

uturuncoglu
Copy link
Contributor

This PR aims to add new NUOPC application prototype to test new mirroring option transferAllAsNests.

@uturuncoglu
Copy link
Contributor Author

@theurich I add a new simple prototype app for testing new option for mirroring. In this case, ocean component sends sst and atmosphere model extracts the nested state to reach the field provided by ocean component. I used existing miroring test in here but I remove lots of things from it to make it simple and understandable.

@uturuncoglu
Copy link
Contributor Author

@theurich this works without any issue in one processor but it seems handing if I use multiple. I could not understand way but if you have any idea or seeing anything missing just let me know. I'll also try to fix it.

@uturuncoglu
Copy link
Contributor Author

@theurich Okay. It is working if I comment following lines from the esm.F90.

diff --git a/AtmOcnMirrorFieldsNestedProto/esm.F90 b/AtmOcnMirrorFieldsNestedProto/esm.F90
index a1270d3..0123295 100644
--- a/AtmOcnMirrorFieldsNestedProto/esm.F90
+++ b/AtmOcnMirrorFieldsNestedProto/esm.F90
@@ -9,8 +9,8 @@
 !==============================================================================
 
 ! For testing with threading keep one or both SETVM macros active
-#define WITH_ATM_SETVM
-#define WITH_OCN_SETVM
+!#define WITH_ATM_SETVM
+!#define WITH_OCN_SETVM
 
 ! For testing with partially overlapping components keep one active, other _off
 #define WITH_ATM_PETLIST_off

I am not sure this part is necessary. Do we want to also test this with threading. Since editing mirroring test is testing threading maybe it would be nice to see this simple. Anyway, let me know what you think.

@uturuncoglu uturuncoglu marked this pull request as ready for review December 4, 2024 16:30
@uturuncoglu
Copy link
Contributor Author

@theurich Okay. I removed unnecessary things and i also run all the apps (using Intel compiler on Hercules) in my end and here is the results.

PASS: AsyncIOBlockingProto
PASS: AsyncIONonblockingProto
PASS: AtmOcnConOptsProto
PASS: AtmOcnConProto
PASS: AtmOcnCplListProto
PASS: AtmOcnCplSetProto
PASS: AtmOcnFDSynoProto
PASS: AtmOcnIceSimpleImplicitProto
PASS: AtmOcnImplicitProto
PASS: AtmOcnLndProto
PASS: AtmOcnLogNoneProto
PASS: AtmOcnMedIngestFromConfigProto
PASS: AtmOcnMedIngestFromInternalProto
PASS: AtmOcnMedPetListProto
PASS: AtmOcnMedPetListTimescalesProto
PASS: AtmOcnMedPetListTimescalesSplitFastProto
PASS: AtmOcnMedProto
PASS: AtmOcnMirrorFieldsProto
PASS: AtmOcnMirrorFieldsNestedProto
PASS: AtmOcnPetListProto
PASS: AtmOcnProto
PASS: AtmOcnRtmTwoTimescalesProto
PASS: AtmOcnScalarProto
PASS: AtmOcnSelectProto
PASS: AtmOcnSelectProto
PASS: AtmOcnSelectProto
PASS: AtmOcnSimpleImplicitProto
PASS: AtmOcnTransferGridProto
PASS: AtmOcnTransferLocStreamProto
PASS: AtmOcnTransferMeshProto
PASS: CustomFieldDictionaryProto
PASS: DriverInDriverDataDepProto
PASS: DriverInDriverProto
PASS: DynPhyProto
PASS: ExternalDriverAPIProto
PASS: ExternalDriverAPIWeakCplDAProto
PASS: GenericMediatorProto
FAIL: HierarchyProto
PASS: NamespaceProto
PASS: NestingMultipleProto
PASS: NestingSingleProto
PASS: NestingTelescopeMultipleProto
PASS: SingleModelProto
PASS: SingleModelOpenMPProto
PASS: SingleModelOpenMPUnawareProto
FAIL: ESMX_StartHereProto
FAIL: ESMX_StartHereProto-Step1
FAIL: ESMX_StartHereProto-Step2
FAIL: ESMX_StartHereProto-Step3
FAIL: ESMX_StartHereProto-Step4
FAIL: ESMX_SingleModelInFortranProto
FAIL: ESMX_SingleModelInFortranProto-DL
FAIL: ESMX_SingleModelInCProto-DL
FAIL: ESMX_AtmOcnProto
FAIL: ESMX_AtmOcnProto-Alt
FAIL: ESMX_AtmOcnFortranAndCProto
PASS: ESMX_ExternalDriverAPIProto

Some of tests are failing with following error,

HierarchyProto/PET0.ESMF_LogFile:20241204 124242.946 ERROR            PET0         /work2/noaa/nems/tufuk/COP/esmf/src/Infrastructure/Base/src/ESMCI_Info.C:830 Info::get(): Key not found (JSON trace will follow): /NUOPC/Instance/CompName
HierarchyProto/PET0.ESMF_LogFile:20241204 124242.946 ERROR            PET0 ESMCI_Info.C:832 Info::get()         Attribute not set  - [json.exception.out_of_range.403] key 'CompName' not found
HierarchyProto/PET0.ESMF_LogFile:20241204 124242.946 ERROR            PET0 ESMCI_Info.C:836 Info::get()         Attribute not set  - Internal subroutine call returned Error

The ESMf version that I am testing with does not have your fix (related with the info at FV3). So, maybe we are getting error because of it. Or maybe this leading to the edge case that you mentioned in the call. Any idea? I could also test again once you merge your fix.

@uturuncoglu
Copy link
Contributor Author

@theurich I think ESMX are fine since the make is having hard time to find ESMX_Builder. I am not sure why but in my installation esmf.mk file was showing wrong path: ESMF_APPSDIR=/work2/noaa/nems/tufuk/COP/esmf/apps/appsO/Linux.intel.64.intelmpi.default Once I fixed it it was fine.

1 similar comment
@uturuncoglu
Copy link
Contributor Author

@theurich I think ESMX are fine since the make is having hard time to find ESMX_Builder. I am not sure why but in my installation esmf.mk file was showing wrong path: ESMF_APPSDIR=/work2/noaa/nems/tufuk/COP/esmf/apps/appsO/Linux.intel.64.intelmpi.default Once I fixed it it was fine.

@theurich
Copy link
Member

theurich commented Dec 4, 2024

@theurich I think ESMX are fine since the make is having hard time to find ESMX_Builder. I am not sure why but in my installation esmf.mk file was showing wrong path: ESMF_APPSDIR=/work2/noaa/nems/tufuk/COP/esmf/apps/appsO/Linux.intel.64.intelmpi.default Once I fixed it it was fine.

For ESMX_Builder, if you use an esmf.mk from an ESMF build without make install, you must call make build_apps to make the ESMX_Builder available where esmf.mk thinks it is.

@uturuncoglu
Copy link
Contributor Author

Okay but in my case, esmf.mk is pointing to wrong directory for the ESMX_Builder. I had to go and fix it manually and change it from /work2/noaa/nems/tufuk/COP/esmf/apps/appsO/Linux.intel.64.intelmpi.default to /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/bin. I have no apps/ folder under esmf in my case. I was using following options to build ESMF,

export ESMF_BOPT=O
export ESMF_COMM=intelmpi
export ESMF_COMPILER=intel
export ESMF_DIR=/work2/noaa/nems/tufuk/COP/esmf
export ESMF_INSTALL_BINDIR=bin
export ESMF_INSTALL_LIBDIR=lib
export ESMF_INSTALL_MODDIR=include
export ESMF_INSTALL_PREFIX=$ESMF_DIR/install/esmf-dev
export ESMF_LAPACK=internal
export ESMF_NETCDF=nc-config
export ESMF_NFCONFIG=nf-config
export ESMF_OS=Linux
export ESMF_PIO=external
export ESMF_PIO_INCLUDE=/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/envs/unified-env/install/intel/2021.9.0/parallelio-2.5.10-5m25ri6/include
export ESMF_PIO_LIBPATH=/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/envs/unified-env/install/intel/2021.9.0/parallelio-2.5.10-5m25ri6/lib
export ESMF_SHARED_LIB_BUILD=OFF

So, maybe there is something wrong in there but let me double check.

@theurich
Copy link
Member

theurich commented Dec 4, 2024

So you were using make install I guess? If so, make sure you are also using the esmf.mk from within that installation, and not from within your build dir tree!

@uturuncoglu
Copy link
Contributor Author

@theurich Okay. That makes sense. That was my fault. Thanks.

@theurich
Copy link
Member

theurich commented Dec 5, 2024

@uturuncoglu Once you have esmf-org/esmf#241 updated against develop, please check if HierarchyProto now passes. If it still fails, we have to investigate.

@theurich
Copy link
Member

theurich commented Dec 5, 2024

... and also good to update this branch against the proto develop branch. Thanks!

@uturuncoglu
Copy link
Contributor Author

@theurich I am also updating in here. HierarchyProto is still failing.

@theurich
Copy link
Member

theurich commented Dec 5, 2024

@theurich I am also updating in here. HierarchyProto is still failing.

@ufuk - once the "Development Tests" action comes back clean on the ESMF side, I will take a look at the HierarchyProto. I will let you know what I find.

@theurich
Copy link
Member

theurich commented Dec 5, 2024

@theurich I am also updating in here. HierarchyProto is still failing.

@ufuk - once the "Development Tests" action comes back clean on the ESMF side, I will take a look at the HierarchyProto. I will let you know what I find.

Sorry, I meant to tag @uturuncoglu!

@uturuncoglu
Copy link
Contributor Author

@theurich Okay. that makes sense. BTW, in the particular test, ATM has two sub-component DYN and PHYS. So, maybe this is hitting to another edge case. Also, ESMX tests are faling too with something strange error. For example, ESMX_AtmOcnFortranAndCProto is facing like following,

ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x0): multiple definition of `label_InternalState'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x0): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x8): multiple definition of `label_Advance'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x8): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x10): multiple definition of `label_AdvanceClock'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x10): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x18): multiple definition of `label_CheckImport'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x18): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x20): multiple definition of `label_SetRunClock'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x20): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x28): multiple definition of `label_TimestampExport'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x28): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x30): multiple definition of `label_Finalize'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x30): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x38): multiple definition of `label_Advertise'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x38): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x40): multiple definition of `label_ModifyAdvertised'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x40): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x48): multiple definition of `label_RealizeProvided'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x48): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x50): multiple definition of `label_AcceptTransfer'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x50): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x58): multiple definition of `label_RealizeAccepted'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x58): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x60): multiple definition of `label_SetClock'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x60): first defined here
ld: /work2/noaa/nems/tufuk/COP/esmf/install/esmf-dev/lib/libesmf.a(NUOPC_F.o):(.data+0x68): multiple definition of `label_DataInitialize'; CMakeFiles/Catmos.dir/catmos.c.o:(.data+0x68): first defined here

I have also failure in ESMX_SingleModelInFortranProto-DL and ESMX_SingleModelInCProto-DL too. So, there might be issue also in that side independent from the StateReconcile fix.

@uturuncoglu
Copy link
Contributor Author

@theurich There is the full list of failures for your reference,

FAIL: HierarchyProto
FAIL: ESMX_SingleModelInFortranProto-DL
FAIL: ESMX_SingleModelInCProto-DL
FAIL: ESMX_AtmOcnFortranAndCProto

@theurich
Copy link
Member

theurich commented Dec 5, 2024

@theurich There is the full list of failures for your reference,

FAIL: HierarchyProto
FAIL: ESMX_SingleModelInFortranProto-DL
FAIL: ESMX_SingleModelInCProto-DL
FAIL: ESMX_AtmOcnFortranAndCProto

I am building your ESMF branch locally, and then will see if I can reproduce these failures...

@uturuncoglu
Copy link
Contributor Author

@theurich Okay. I am also building develop in my end to double check ESMX errors.

@uturuncoglu
Copy link
Contributor Author

@theurich JFYI, ESMX_AtmOcnFortranAndCProto also fails with same way with ESMF current develop. HierarchyProto is passing without any issue. So, probably failing in my branch due to newly introduced metadata.

@theurich
Copy link
Member

theurich commented Dec 5, 2024

@theurich JFYI, ESMX_AtmOcnFortranAndCProto also fails with same way with ESMF current develop. HierarchyProto is passing without any issue. So, probably failing in my branch due to newly introduced metadata.

@uturuncoglu I am pretty sure the ESMX failures you are seeing likely are an issue on your setup. They all pass for me (for both develop and also your ESMF branch). Also develop passed last night cleanly (including the ESMX tests), so I don't think it is a general problem.

I see the HierarchyProto failing with your ESMF branch for esmf-org/esmf#241, so I am looking into that now.

@uturuncoglu
Copy link
Contributor Author

@theurich Just to be sure that this is related with the StateReconcile work. I roll back to the esmf version that has mods for new metadata but not StateReconcile work. I was not trying to tun NUOPC app prototypes before merging state reconcile work.

Anyway, the last commit that I have in this ESMF version is,

commit 93a7b1a6d03fd043769977db32b4bd4f7503493b (tag: v8.8.0b05)
Author: Robert Oehmke <oehmke@ucar.edu>
Date:   Tue Nov 5 16:08:50 2024 -0700

    Fix error message.

and then I run newly added app proto (AtmOcnMirrorFieldsNestedProto) and HierarchyProto with this version and both of them worked without any issue. So, it seems that the error that we are getting from HierarchyProto is related to the changes that is done for the state reconcile.

@theurich
Copy link
Member

theurich commented Dec 6, 2024

@uturuncoglu Yes, that does make sense.... it turns out to be a combination of both: StateReconcile() optimization and the mirror changes. In fact we are hitting exactly the edge case in State level Info reconciliation that is no longer supported in the optimized StateReconcile(). However, I am working on a solution for this now. I plan to open a PR into your PR branch once I have the changes ready.... so you can review, and we can discuss.

@uturuncoglu
Copy link
Contributor Author

@theurich Okay. Sure. Thanks.

@theurich
Copy link
Member

theurich commented Dec 6, 2024

@uturuncoglu The code under esmf-org/esmf#328 passes both the HierarchyProto and your new AtmOcnMirrorFieldsNestedProto.
Locally I also made sure that the AtmOcnMirrorFieldsNestedProto passes even if the components are defined on different petLists's. Some of those combinations hung for me before, but now pass with esmf-org/esmf#328.

@uturuncoglu
Copy link
Contributor Author

@theurich It seems that we do not need to change anything in the proto code except if we change the name of the transfer attribute that needs to be consistent with the proto app. Right?

@theurich
Copy link
Member

theurich commented Dec 7, 2024

@theurich It seems that we do not need to change anything in the proto code except if we change the name of the transfer attribute that needs to be consistent with the proto app. Right?

Yes, that is correct. For now I wanted your prototype to keep working without any change. Once we decide on the renamed transfer option, we can adjust your proto code.

@theurich
Copy link
Member

theurich commented Dec 9, 2024

@uturuncoglu I pushed some small changes into your new prototype app. The two main items are:

  1. Under esm.F90 I added the petList handling again. This was to make sure the implemented functionality works for concurrent components. This does work now.
  2. Under atm.F90, i.e. the acceptor of mirror transfer, during Advance() the standard NUOPC Attribute "Namespace" is used now to identify the fields under the nested states in the importState. This is opposed to the nestedStateName, which would depend on the NUOPC implementation. The "Namespace" setting on the other hand, by NUOPC standard, will always be identical to the user provided component label used for the connected component as per RunSequence.

@uturuncoglu
Copy link
Contributor Author

@theurich Thanks for your help. It looks good to me. We could discuss the next step about both PR in the core call.

@theurich
Copy link
Member

theurich commented Dec 9, 2024

@uturuncoglu I made the updates. Please review.

Copy link
Contributor Author

@uturuncoglu uturuncoglu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me. Probably, you already run on your end but let me know if you want me to test too.

@theurich
Copy link
Member

theurich commented Dec 9, 2024

It looks good to me. Probably, you already run on your end but let me know if you want me to test too.

Yes I tested. The other thing though that would be nice if you could test is your actual use case code. As we know, it will need a little updating. It will be nice to have that working with the feature/provider_metadata branch before we merge it into develop, just to be sure we didn't miss anything you need for your actual use case.

@uturuncoglu
Copy link
Contributor Author

@theurich Okay. Let me test it. Update you soon.

@uturuncoglu
Copy link
Contributor Author

@theurich The real application is also running without any issue.

@theurich
Copy link
Member

@theurich The real application is also running without any issue.

Perfect... I'll go ahead and merge the PR.

@theurich theurich merged commit c216261 into esmf-org:develop Dec 10, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants