bfile works but pfile errors out #394

Open
batzler opened this issue Dec 4, 2024 · 5 comments
batzler commented Dec 4, 2024

Description of the bug

I can run a pgsc_calc job successfully using the bfile format (PLINK bed). However, when I use the pfile format (PLINK pgen), the job fails, and I can't interpret the error.

The input files are as follows:

-rw-r----- 1 batzler bsi 23377581643 Dec 4 10:53 pg22.bed
-rw-r----- 1 batzler bsi 30106514 Dec 4 10:53 pg22.bim
-rw-r----- 1 batzler bsi 8074325 Dec 4 10:53 pg22.fam
-rw-r----- 1 batzler bsi 1396 Dec 4 10:53 pg22.log
-rw-r----- 1 batzler bsi 1294 Dec 4 13:34 pg_imputed22.log
-rw-r----- 1 batzler bsi 23982440039 Dec 4 13:34 pg_imputed22.pgen
-rw-r----- 1 batzler bsi 7259937 Dec 4 13:34 pg_imputed22.psam
-rw-r----- 1 batzler bsi 78753551 Dec 4 13:34 pg_imputed22.pvar

The PLINK bed (binary) files were created from the pgen files:

$PLINK2 --pfile pg_imputed22 --make-bed --out pg$CHR

When I run the pipeline with format bfile, everything executes properly.

When I run it with format pfile and the pgen files, it fails with the following traceback:

Traceback (most recent call last):
  File "/app/pgscatalog.utils/.venv/bin/pgscatalog-match", line 8, in <module>
    sys.exit(run_match())
             ^^^^^^^^^^^
  File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/match_cli.py", line 87, in run_match
    ipc_path = get_match_candidates(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/match_cli.py", line 124, in get_match_candidates
    with variants as target_df:
  File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/variantframe.py", line 54, in __enter__
    self.arrowpaths = loose(self.variants, tmpdir=self._tmpdir)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/_arrow.py", line 94, in _
    return batch_read(reader, tmpdir=tmpdir, cols_keep=cols_keep)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/_arrow.py", line 102, in batch_read
    batches = reader.next_batches(batch_size)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/polars/io/csv/batched_reader.py", line 134, in next_batches
    batches = self._reader.next_batches(n)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: found more fields than defined in 'Schema'

Consider setting 'truncate_ragged_lines=True'.
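
The error message says that a row contained more fields than the reader's schema allowed. As a rough way to check a file for that condition outside the pipeline, the sketch below counts tab-separated fields per line. It is illustrative only: the function name and the expected count of five columns (#CHROM, POS, ID, REF, ALT) are assumptions, not taken from the pgscatalog source.

```python
import io


def lines_exceeding_schema(handle, n_expected, sep="\t"):
    """Return (line_number, n_fields) for every line with more fields than expected.

    Lines starting with '##' are treated as meta lines and skipped.
    """
    offending = []
    for line_number, line in enumerate(handle, start=1):
        if line.startswith("##"):
            continue
        n_fields = len(line.rstrip("\n").split(sep))
        if n_fields > n_expected:
            offending.append((line_number, n_fields))
    return offending


# Toy pvar-like content, made up for illustration (not the reporter's data).
example = io.StringIO(
    "##source=example\n"
    "#CHROM\tPOS\tID\tREF\tALT\tFILTER\tINFO\n"
    "22\t100\trs1\tA\tG\tPASS\t.\n"
)

# Assumption: a reader that only expects the five core columns.
print(lines_exceeding_schema(example, n_expected=5))
```

On a real file, the same function could be called with an open file handle, for example `open("pg_imputed22.pvar")`.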

Command used and terminal output

nextflow run pgscatalog/pgsc_calc -profile singularity --min_overlap 0.0001 --input ${samplesheet} --scorefile ${scorefile} --output ${outdir} -r ${pgsc_calc_version} -c ${project}/nxf_config.config --target_build ${target_build} --genotypes_cache $cachedir

The Nextflow command is the same for format bfile and format pfile. The only thing I change is the samplesheet, to reflect the different path_prefix and format.
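
To make the difference between the two runs concrete, here is a minimal sketch that writes the two samplesheets. It assumes the four-column layout (sampleset, path_prefix, chrom, format) described in the pgsc_calc documentation; the sampleset name and the /data directory are hypothetical placeholders, and the helper function is not part of any library.

```python
import csv
import io


def make_samplesheet(sampleset, path_prefix, chrom, fmt):
    """Return a one-row samplesheet as CSV text.

    Column names are an assumption based on the pgsc_calc documentation.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sampleset", "path_prefix", "chrom", "format"])
    writer.writerow([sampleset, path_prefix, chrom, fmt])
    return buffer.getvalue()


# Hypothetical sampleset name and directory; only path_prefix and format differ.
bfile_sheet = make_samplesheet("pg", "/data/pg22", "22", "bfile")
pfile_sheet = make_samplesheet("pg", "/data/pg_imputed22", "22", "pfile")

print(bfile_sheet)
print(pfile_sheet)
```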

Relevant files

No response

System information

Nextflow version
nextflow/23.04.2

slurm executor
apptainer/singularity
linux

batzler added the bug label Dec 4, 2024
nebfield (Member) commented Dec 5, 2024

Thanks for the bug report. I think your problems are related to the columns in the pvar file. What column names do you have in the pvar file?
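
A quick way to answer this without opening the whole file is to read only the header. The sketch below assumes an uncompressed, tab-separated pvar file whose column header line starts with `#CHROM` and is preceded by optional `##` meta lines; the function name is made up for illustration.

```python
import io


def pvar_columns(handle):
    """Return the column names from a pvar header line, or None if there is none."""
    for line in handle:
        if line.startswith("##"):
            # Meta-information lines come before the column header.
            continue
        if line.startswith("#CHROM"):
            return line.rstrip("\n").lstrip("#").split("\t")
        # First non-meta line is not a header, so the file has no header line.
        break
    return None


# Toy content, made up for illustration.
example = io.StringIO(
    "##source=example\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "22\t100\trs1\tA\tG\t.\tPASS\t.\n"
)
print(pvar_columns(example))
```

For a real file this would be called as `pvar_columns(open("pg_imputed22.pvar"))`.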

batzler (Author) commented Dec 5, 2024 via email

nebfield (Member) commented Dec 5, 2024

Thanks for the details. I think the problem is related to PGScatalog/pygscatalog#29

If you try:

$ plink2 --pfile pg_imputed22 --make-just-pvar cols=-xheader,-maybequal,-maybefilter,-maybeinfo,-maybecm --out pg_imputed22

This should overwrite the existing pvar file to remove the FILTER and INFO columns, which are causing the problem.

If the plink command does remove the columns, please do the same for all of your chromosomes and test the calculator again 😄
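
Repeating the command for every chromosome can be scripted. The sketch below only builds and prints the command lines; it does not run plink2. The flags are copied from the command above, while the `pg_imputed{chrom}` prefix pattern and the 1 to 22 chromosome range are assumptions extrapolated from the single chromosome 22 example, and the helper names are made up.

```python
import shlex

# Column options copied from the plink2 command suggested above.
COLS = "cols=-xheader,-maybequal,-maybefilter,-maybeinfo,-maybecm"


def make_just_pvar_command(prefix, plink2="plink2"):
    """Build the argument list that rewrites <prefix>.pvar in place."""
    return [plink2, "--pfile", prefix, "--make-just-pvar", COLS, "--out", prefix]


def commands_for_chromosomes(chromosomes, prefix_template="pg_imputed{chrom}"):
    """Build one command per chromosome (the prefix pattern is an assumption)."""
    return [
        make_just_pvar_command(prefix_template.format(chrom=chrom))
        for chrom in chromosomes
    ]


for argv in commands_for_chromosomes(range(1, 23)):
    print(shlex.join(argv))
```

Because `--out` uses the same prefix as the input, each run overwrites the existing pvar file, so it may be worth keeping a copy first. Each argument list could then be passed to `subprocess.run(argv, check=True)`.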

batzler (Author) commented Dec 5, 2024 via email

nebfield (Member) commented Dec 6, 2024

Great, thanks! I'll leave this issue open because the calculator should ignore these extra columns automatically. We'll fix it in the next release 😄

nebfield added this to the v2.1.0 milestone Dec 6, 2024