jobs failing with sigbus and unknown userid errors #377

mniederhuber · 2024-01-24T03:18:37Z

Description of the bug

This may be two separate issues. Jobs are failing either with a Java Runtime Environment SIGBUS (0x7) error or with FATAL: Couldn't determine user account information: user: unknown userid [some number]

In both cases, the process that fails and the file associated with the error change with each rerun.
On a rerun, the process and sample that failed previously complete, but a different process and sample error out. The pipeline can eventually be completed with multiple reruns.
The pipeline runs fine with the test profile.

From what I can figure out, the Java Runtime Environment error has something to do with the Java VM running out of memory,
but I'm stumped on the userid error.

I have made a small modification to the modules.config file to add the --no-model param to MACS, but I am not seeing any errors from MACS, and again, the test profile runs as expected.

Any help would be greatly appreciated!

Command used and terminal output

#!/bin/bash
#SBATCH --mem=8G
#SBATCH -t 8:00:00
#SBATCH -n 1
#SBATCH -o var/log/chip-%j.out
#SBATCH -e var/log/chip-%j.err

module load nextflow

nextflow -log var/log/.chipseq run nf-core/chipseq \
	-profile singularity \
	-c config/slurm.config \
	-resume \
	-params-file config/chip_params.yaml

Relevant files

bug.zip

System information

Nextflow: 23.04.2
Hardware: HPC
Executor: slurm
Container: Singularity
OS: RHEL8
nf-core/chipseq: 2.0.0

mniederhuber · 2024-01-25T18:48:54Z

It turns out this is almost certainly an issue on the HPC side.
Our Slurm setup (or the admins) was redirecting jobs submitted to one partition to a different partition.
By default, Nextflow polls job status only in the partition the job was submitted to. If the job is not found there, Nextflow loses track of it, which appears to be what caused the failures above.
We were able to fix this by adding the following to our config file:

executor {
    name = "slurm"
    queueGlobalStatus = true
}

This tells Nextflow to poll job status across all partitions rather than only the partition the job was submitted to.
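
For anyone wanting to confirm the same cause on their own cluster, here is a rough sketch of how to spot redirected jobs. It assumes the standard Slurm squeue client is available; the helper name redirected_jobs and the partition name general are made up for illustration and are not part of Nextflow or Slurm.

```shell
# Hypothetical helper: list jobs that are no longer in the partition they were
# submitted to. Reads "JOBID PARTITION" pairs on stdin; $1 is the partition
# that was requested at submission time.
redirected_jobs() {
    awk -v want="$1" '$2 != want { print $1, $2 }'
}

# Example (needs Slurm; "general" is a placeholder partition name):
#   squeue -h -u "$USER" -o "%i %P" | redirected_jobs general
```

If this prints anything, those jobs are sitting in a partition other than the one requested, which would be consistent with the redirect described above.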

It would be great if this were either the default behavior in Nextflow, or if there were a more informative error message when a job has been redirected.

mniederhuber added the bug Something isn't working label Jan 24, 2024

mniederhuber closed this as completed Jan 25, 2024
