Error: results missing for single sample #283

Closed
Fiwx opened this issue Apr 18, 2024 · 6 comments · Fixed by #274
Labels
bug (Something isn't working), user-query (User queries & requests)

Comments

@Fiwx
Fiwx commented Apr 18, 2024

Description of the bug

In the current dev build, the report is generated, but every results column except SUM is empty (NA):

   sampleset    IID           PGS           SUM Z_MostSimilarPop Z_norm1 Z_norm2
   <chr>        <chr>         <chr>       <dbl> <lgl>            <lgl>   <lgl>  
 1 testfile     testfile.txt  PGS00075… -0.220  NA               NA      NA     
 2 reference    HG00096       PGS00075…  0.258  NA               NA      NA     
 3 reference    HG00097       PGS00075…  0.0698 NA               NA      NA     
 4 reference    HG00099       PGS00075…  0.0588 NA               NA      NA     
 5 reference    HG00100       PGS00075…  0.382  NA               NA      NA     
 6 reference    HG00101       PGS00075…  0.525  NA               NA      NA     
 7 reference    HG00102       PGS00075… -0.327  NA               NA      NA     
 8 reference    HG00103       PGS00075… -0.285  NA               NA      NA     
 9 reference    HG00105       PGS00075…  0.214  NA               NA      NA     
10 reference    HG00106       PGS00075… -1.13   NA               NA      NA 

This occurs in both the .html report and the raw testfile_pgs.txt.gz file.

In that file, only the SUM column is populated.
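
A quick way to confirm which columns are entirely empty in the raw file is sketched below. It assumes the score file is a gzipped, tab-delimited table with a header row and that missing values are written as `NA` or left blank; the helper name `all_missing_columns` is hypothetical and is not part of pgsc_calc.

```python
import csv
import gzip


def all_missing_columns(path, delimiter="\t", missing=("", "NA", "NaN", "nan")):
    """Return the names of columns in which every value is missing.

    Assumes a gzipped, delimited text file with a header row.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        columns = list(reader.fieldnames or [])
        still_empty = set(columns)
        for row in reader:
            # Drop a column from the candidate set as soon as one real value is seen.
            for name in list(still_empty):
                if (row[name] or "").strip() not in missing:
                    still_empty.discard(name)
    # Preserve the original column order in the result.
    return [name for name in columns if name in still_empty]
```

Calling `all_missing_columns("testfile_pgs.txt.gz")` on the affected output would be expected to list the Z-score and percentile columns if they really are all NA. Note that a file with a header but no data rows reports every column as empty.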

However, with the previous release (alpha 4), all columns are populated correctly (even though that run technically fails at the report-generation step, #242).

I know the build is dev and not released yet, but it might happen on alpha 5 too (I'm unable to test it because of the _vcf filename error).

Command used and terminal output

nextflow run pgscatalog/pgsc_calc -profile singularity --input samplesheet.csv --pgs_id PGS000758 --target_build GRCh37 --min_overlap 0.0 --run_ancestry pgsc_1000G_v1.tar.zst -c custom.config -r v2.0.0-alpha.4e

Or

nextflow run pgscatalog/pgsc_calc -profile singularity --input samplesheet.csv --pgs_id PGS000758 --target_build GRCh37 --min_overlap 0.0 --run_ancestry pgsc_1000G_v1.tar.zst -c custom.config -r dev

Relevant files

No response

System information

Ubuntu, Docker, Singularity, current Nextflow

@Fiwx Fiwx added the bug Something isn't working label Apr 18, 2024
@nebfield
Member

Thanks for the bug report! Sorry, I can't reproduce on the dev branch. Here's what I tried:

$ cd /path/to/pgsc_calc
$ rm -r work results  # guarantee a fresh run
$ nextflow run main.nf -profile docker,arm \
    --run_ancestry ../pgsc_1000G_v1.tar.zst \
    --input ../hgdp/split/samplesheet.csv \
    --pgs_id PGS000758 \
    --target_build GRCh38
$ head <(gzcat results/hgdp/score/hgdp_pgs.txt.gz) | column -t
sampleset  IID        PGS                     SUM                  Z_MostSimilarPop      Z_norm1               Z_norm2              percentile_MostSimilarPop
hgdp       HGDP00001  PGS000758_hmPOS_GRCh38  -0.3588901000000001  1.664053980073892     0.8584426674256617    0.8134837998289269   95.49902152641879
hgdp       HGDP00003  PGS000758_hmPOS_GRCh38  -0.40938197          1.573172743793694     0.7932776666255601    0.7547724248426887   94.71624266144813
hgdp       HGDP00005  PGS000758_hmPOS_GRCh38  0.0158892699999999   2.338626192945269     1.6154600095825224    1.527703433039602    98.4344422700587
hgdp       HGDP00007  PGS000758_hmPOS_GRCh38  -0.9636511           0.5755336433533662    -0.2936306959068118   -0.2642672147215713  72.40704500978474

I noticed --min_overlap is 0 on your run. What kind of variant matching rates do you normally get?

@smlmbrt is the expert (and out of office 🌴 until next week), but perhaps low variant match rates could contribute to NA values. Some changes were made to the ancestry normalisation steps to handle low-variance cases in the most recent release.

@Fiwx
Author

Fiwx commented Apr 19, 2024

Thanks for testing it. I'll try again. The match rates were high (99.x%) in the alpha 4 version run, and the genome is imputed (between 20-30 million variants).

@Fiwx
Author

Fiwx commented Apr 20, 2024

I tried again with a clean new setup and got the same results.

$ nextflow run pgscatalog/pgsc_calc -profile singularity --input \
    /home/ubuntu/custom/dev/ca/samplesheet.csv --pgs_id PGS000758 \
    --target_build GRCh37 --min_overlap 0.0 --run_ancestry \
    /home/ubuntu/custom/data/pgsc_1000G_v1.tar.zst -c \
    /home/ubuntu/custom/references/custom.config -r dev

Version: 2.0.0-alpha.5

  reference  n target  N variants in panel  n (matched)  % matched
1 1000G      27904794  85277655             27904796     32.72

Sampleset     Scoring file            Number of variants  Passed matching  Match %  Total matched  Total unmatched
newautosomal  PGS000758_hmPOS_GRCh37  33938               TRUE             99.2     33668          270

   sampleset    IID           PGS           SUM Z_MostSimilarPop Z_norm1 Z_norm2
   <chr>        <chr>         <chr>       <dbl> <lgl>            <lgl>   <lgl>  
 1 newautosomal autosomal.txt PGS00075… -0.220  NA               NA      NA     
 2 reference    HG00096       PGS00075…  0.258  NA               NA      NA     
 3 reference    HG00097       PGS00075…  0.0698 NA               NA      NA     
 4 reference    HG00099       PGS00075…  0.0588 NA               NA      NA     
 5 reference    HG00100       PGS00075…  0.382  NA               NA      NA     
 6 reference    HG00101       PGS00075…  0.525  NA               NA      NA     
 7 reference    HG00102       PGS00075… -0.327  NA               NA      NA     
 8 reference    HG00103       PGS00075… -0.285  NA               NA      NA     
 9 reference    HG00105       PGS00075…  0.214  NA               NA      NA     
10 reference    HG00106       PGS00075… -1.13   NA               NA      NA   
# ℹ 2,481 more rows
# ℹ 1 more variable: percentile_MostSimilarPop <lgl>

@smlmbrt
Member

smlmbrt commented Apr 22, 2024

I think @nebfield is right: this has probably triggered the exception linked here, which should only be applied to the target when there are more than 3 samples: https://github.com/PGScatalog/pgscatalog_utils/blob/b5962bf5f12bb2aba9d51a3c569a0d831072ecf0/pgscatalog_utils/ancestry/tools.py#L250-L253

@Fiwx
Author

Fiwx commented Apr 22, 2024

@smlmbrt Great! I'll try a test with more samples sometime.

Also, is it possible to recover z-scores and percentiles which are not normalized for ancestry from only the values in SUM?

My understanding is that SUM is calculated only on variants that are a subset of the variants in the submitted sample(s), the reference panel (e.g. 1000G), and the scorefile.

Since the variants are the same between the reference and submitted samples, I assume it would be a "fair" comparison (e.g., no normalization for matched variant number needed).

@smlmbrt smlmbrt changed the title from "Results missing" to "Error: results missing for single sample" Apr 23, 2024
@smlmbrt
Member

smlmbrt commented Apr 23, 2024

> Also, is it possible to recover z-scores and percentiles which are not normalized for ancestry from only the values in SUM?

The percentiles and Z-scores for the most similar population are not normalised; they simply use that population as the reference distribution.

> My understanding is that SUM is calculated only on variants that are a subset of the variants in the submitted sample(s), the reference panel (e.g. 1000G), and the scorefile.

Correct.

> Since the variants are the same between the reference and submitted samples, I assume it would be a "fair" comparison (e.g., no normalization for matched variant number needed).

I think this will depend on your use case. The reason for using reference populations is that the mean and variance of the PGS distribution are driven by allele frequency and LD. If these are unmatched, then an individual's relative place in a distribution will be incorrect.
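
For anyone wanting to recover an unnormalised Z-score and percentile from SUM alone, a minimal sketch follows. It is not the pgsc_calc implementation: the function names are hypothetical, the percentile is assumed to mean "share of reference values at or below the target", and the nine reference SUM values are the rounded ones from the first reference rows shown earlier in this thread, treated as one population purely for illustration. In practice the reference values should come from a population that matches the individual, for the allele frequency and LD reasons given above.

```python
from statistics import mean, stdev


def z_score(target_sum, reference_sums):
    """Z-score of target_sum against a reference distribution of SUM values."""
    mu = mean(reference_sums)
    sd = stdev(reference_sums)  # sample SD; needs at least two reference values
    if sd == 0:
        raise ValueError("reference distribution has zero variance")
    return (target_sum - mu) / sd


def percentile(target_sum, reference_sums):
    """Percentage of reference SUM values at or below target_sum (assumed definition)."""
    at_or_below = sum(1 for value in reference_sums if value <= target_sum)
    return 100.0 * at_or_below / len(reference_sums)


# Illustrative only: rounded SUM values copied from the table in this thread.
reference = [0.258, 0.0698, 0.0588, 0.382, 0.525, -0.327, -0.285, 0.214, -1.13]
target = -0.220

print(f"Z = {z_score(target, reference):.3f}")
print(f"percentile = {percentile(target, reference):.1f}")
```

Swapping in a reference list from a different population changes both the mean and the SD, which is why the same SUM can land at very different percentiles depending on the reference chosen.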

@smlmbrt smlmbrt added the user-query User queries & requests label Apr 25, 2024
@smlmbrt smlmbrt linked a pull request May 23, 2024 that will close this issue
@nebfield nebfield mentioned this issue May 23, 2024