[redgreengpu] CRAY_ACC_ERROR: host region overlaps present region but is not contained for 'pgp3a(:,:,:,:)' #134

Open
okkevaneck opened this issue Aug 13, 2024 · 9 comments

@okkevaneck

I've compiled and installed the redgreengpu branch on LUMI-G and ran the ectrans-benchmark-gpu-dp binary. Unfortunately, this resulted in the following error message:

ACC: libcrayacc/acc_present.c:679 CRAY_ACC_ERROR - Host region (b6bc740 to b6fb140) overlaps present region (b6bc140 to b6fae40 index 64) but is not contained for 'pgp3a(:,:,:,:)' from ../../../pfs/lustrep4/scratch/project_465000527/ovaneck/ectrans_dwarf/src/sources/ectrans/src/trans/gpu/internal/trltog_mod.F90:460

I'm clueless as to what the problem may be, so I've also included my installation setup as a tar.gz for anyone to try:
ectrans_dwarf.tar.gz

Simply acquire an interactive LUMI-G compute node and execute ./install_redgreengpu.sh.
This will clone, build, and install all required sources.
Afterwards, go to a login node and cd into the run directory.
Then sbatch the run_sbatch_lumi-g.sh script to get the error output in the err.<slurm_job_id>.0 file within the results/sbatch/ folder. A condensed sketch of these steps is below.
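Condensed into a sketch (the script name and directory layout are as in the attached tarball; adjust for your own checkout):

# On an interactive LUMI-G compute node: clone, build, and install everything.
./install_redgreengpu.sh

# Back on a login node: submit the benchmark job from the run directory.
cd run
sbatch run_sbatch_lumi-g.sh

# Once the job finishes, inspect the error output.
cat results/sbatch/err.<slurm_job_id>.0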

@samhatfield
Collaborator

Hi @okkevaneck - we have seen these errors before. I have just tested redgreengpu on LUMI-G and I am able to run it under my own build/run framework. So the question is: what is different about yours? I'll look into it.

By the way, ecKit and FCKit are not dependencies of ecTrans so you don't need to build those.

More generally, so everyone is on the same page, let me summarise the current support of AMD GPUs with ecTrans:

  • ecTrans v1.4.0 is the first official release with GPU support. GPU support is provided through a combination of OpenACC directives and a HIP layer that offloads key kernels to hipFFT and hipBLAS. Support for CUDA/cuFFT/cuBLAS is provided by compile-time code replacement. The GPU-compatible source tree is taken from an optimised version of redgreengpu developed with Nvidia, which is very fast and mature on Nvidia GPUs. Unfortunately we have some longstanding problems on AMD GPUs, namely random failures in the hipFFT planning. This is a work in progress.
  • In parallel there is also the redgreengpu branch. This branch is missing key optimisations present in v1.4.0 but is relatively mature across both Nvidia and AMD GPUs. For now this is the only branch that works on LUMI. At some point in the future we hope to retire this branch and move to v1.4.0, but first we have to fix the hipFFT bug.

@okkevaneck
Author

Hi @samhatfield, thank you for the quick reply!
Interesting that it's different; let me know if I can provide you with any extra info.

Good to know eckit and fckit are not dependencies; this will reduce our installation time by a fair bit.

Also many thanks for the overview of the current state.
We heard from @reuterbal that we should use the redgreengpu branch as the main branch is currently not stable on AMD architectures, but it's also good to know about the ongoing developments.

@samhatfield
Collaborator

I wasn't able to follow your build instructions entirely successfully. I get the interactive node with

salloc --nodes=1 --tasks=1 --cpus-per-task=32 --account=project_465000454 --gpus-per-task=1 --partition=dev-g --time=00:30:00

(is this wrong?)

Then I execute

srun -n 1 ./install_redgreengpu.sh lumi

The build finishes, but when I look at src/build/ectrans.log, I see

-- HIP target architecture: gfx803

It should be gfx90a. Sure enough, when I test the resulting binary, it doesn't work:

> srun -n 1 ./src/build/ectrans/bin/ectrans-benchmark-gpu-dp
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
srun: error: nid005006: task 0: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=7971168.2

Is there something I'm missing?

@okkevaneck
Author

I allocate the node slightly differently and SSH onto the compute node; maybe that's what's causing the difference.

To allocate a node, I run:

#!/usr/bin/env bash

JOB_NAME="ia_gpu_dev"
GPUS_PER_NODE=8
NODES=1
NTASKS=8
PARTITION="dev-g"
ACCOUNT="project_465000454"
TIME="01:00:00"

# Allocate interactive node with the set variables above.
salloc \
    --gpus-per-node=$GPUS_PER_NODE \
    --exclusive \
    --nodes=$NODES \
    --ntasks=$NTASKS \
    --partition=$PARTITION \
    --account=$ACCOUNT \
    --time=$TIME \
    --mem=0 \
    --job-name=$JOB_NAME

Then to get onto the compute node, I execute the following from a login node:
ROCR_VISIBLE_DEVICES=0 srun --cpu-bind=mask_cpu:0xfe000000000000 --nodes=1 --pty bash -i

And then I execute the script without any SLURM command, as we're already on the compute node:
./install_redgreengpu.sh lumi

I forgot about ROCR_VISIBLE_DEVICES=0 and --cpu-bind=mask_cpu:0xfe000000000000; I think this could be what's causing the behaviour you're seeing.
Let me know if it helped!

@samhatfield
Collaborator

Will give it a go, thanks! It's taking quite a long time today to get allocated a node.

@samhatfield
Collaborator

Now I see

-- HIP target architecture: gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a

which is good. I still found it difficult to get an interactive session on a compute node:

> ROCR_VISIBLE_DEVICES=0 srun --cpu-bind=mask_cpu:0xfe000000000000 --nodes=1 --pty bash -i
srun: Warning: can't honor --ntasks-per-node set to 1 which doesn't match the requested tasks 8 with the number of requested nodes 1. Ignoring --ntasks-per-node.
srun: error: Unable to create step for job 7971505: More processors requested than permitted

Instead I ran

ROCR_VISIBLE_DEVICES=0 srun --ntasks=1 --pty bash -i

Now I've successfully built the binary, and I think I've found the cause of the problem. Could you try running without --nproma $NPROMA?

In my setup, I get the exact same error as you when I include --nproma 32. To be honest, this option is sort of irrelevant for ecTrans benchmarking because it determines the data layout in grid point space, but no calculations are done in grid point space. We usually don't specify this option at all when benchmarking ecTrans. But we do like to keep the option so we can replicate situations from the IFS (where NPROMA very much has consequences) in ecTrans. Therefore this option should work, and this is clearly a bug!

For now, if you just want to benchmark ecTrans, you can leave this option off; see the example invocations below. In the meantime I'll try to find the cause of this bug.
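To illustrate (a sketch only; the binary path and launcher follow my test above, so adjust for your own setup):

# Runs fine: let the benchmark choose the grid-point data layout itself.
srun -n 1 ./src/build/ectrans/bin/ectrans-benchmark-gpu-dp

# Currently triggers the CRAY_ACC_ERROR in trltog_mod.F90: explicit NPROMA of 32.
srun -n 1 ./src/build/ectrans/bin/ectrans-benchmark-gpu-dp --nproma 32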

@okkevaneck
Author

Hmm, interesting. I wonder why the interactive node works for me...

I tried running without --nproma 32 and it works, thank you very much!
It does make me wonder: how do you alter the workload size with this version?
I looked at an older version at the beginning of this year, which had options to scale the problem through the NLAT and NLON variables.

@samhatfield
Copy link
Collaborator

Great to hear it works. I'm figuring out how we might fix this so we can run with any NPROMA. Let's keep this issue open until we decide how to proceed.

With the benchmark program, the problem size in both spectral and grid point space can be set by a single parameter, -t/--truncation. This is the cutoff zonal and total wavenumber in spectral space. The higher this number, the higher the resolution and the bigger the work arrays.

By default the benchmark driver will use an octahedral grid for grid point space with a cubic-accuracy representation of waves, which basically means the number of latitudes must be 2 * (truncation + 1). --truncation 79 (the default if you don't specify the option) therefore gives an octahedral grid with 160 latitudes. The number of longitude points per latitude depends on the latitude: it is greatest at the equator and tapers to 20 at the poles. A sketch of how to scale the workload is below.
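As a rough sketch (binary path as in the earlier test; the truncation value 159 is just an illustrative choice):

# Default problem size: truncation 79 -> 2 * (79 + 1) = 160 latitudes.
srun -n 1 ./src/build/ectrans/bin/ectrans-benchmark-gpu-dp

# Larger problem: truncation 159 -> 2 * (159 + 1) = 320 latitudes.
srun -n 1 ./src/build/ectrans/bin/ectrans-benchmark-gpu-dp --truncation 159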

@okkevaneck
Author

Ah, that's how it works!
Many thanks, Sam!
