Error: results missing for single sample #283

Fiwx · 2024-04-18T23:21:52Z

Description of the bug

In the current dev build, the report is generated, but SUM is the only score column that is populated; every other column is NA:

   sampleset    IID           PGS           SUM Z_MostSimilarPop Z_norm1 Z_norm2
   <chr>        <chr>         <chr>       <dbl> <lgl>            <lgl>   <lgl>  
 1 testfile     testfile.txt  PGS00075… -0.220  NA               NA      NA     
 2 reference    HG00096       PGS00075…  0.258  NA               NA      NA     
 3 reference    HG00097       PGS00075…  0.0698 NA               NA      NA     
 4 reference    HG00099       PGS00075…  0.0588 NA               NA      NA     
 5 reference    HG00100       PGS00075…  0.382  NA               NA      NA     
 6 reference    HG00101       PGS00075…  0.525  NA               NA      NA     
 7 reference    HG00102       PGS00075… -0.327  NA               NA      NA     
 8 reference    HG00103       PGS00075… -0.285  NA               NA      NA     
 9 reference    HG00105       PGS00075…  0.214  NA               NA      NA     
10 reference    HG00106       PGS00075… -1.13   NA               NA      NA

This is in both the .html report as well as the raw testfile_pgs.txt.gz file.

In that file, only the SUM column is populated.
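For anyone who wants to confirm the same symptom in their own output, here is a minimal sketch that counts the non-missing values per column. It assumes the file is a gzipped, tab-separated table with a header row; `count_populated` and the set of strings treated as missing are illustrative choices, not part of pgsc_calc.

```python
import csv
import gzip


def count_populated(path, missing=("", "NA", "NaN", "nan")):
    """Return {column: number of non-missing values} for a gzipped TSV.

    Assumption: the score file is tab-separated with a header row.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        counts = {name: 0 for name in (reader.fieldnames or [])}
        for row in reader:
            for name, value in row.items():
                # Skip overflow fields (name is None) and short rows (value is None).
                if name in counts and value is not None and value.strip() not in missing:
                    counts[name] += 1
    return counts


# Example (path is a placeholder):
# for column, n in count_populated("testfile_pgs.txt.gz").items():
#     print(column, n)
```

A column whose count is 0 is entirely NA, which is what the Z_* columns show in the table above.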

However, with the previous release (alpha 4), all columns are populated correctly (even though that run technically fails at the report-making step, #242).

I know the build is dev and not released yet, but it might happen on alpha 5 too (I'm unable to test it because of the _vcf filename error).

Command used and terminal output

nextflow run pgscatalog/pgsc_calc -profile singularity --input samplesheet.csv --pgs_id PGS000758 --target_build GRCh37 --min_overlap 0.0 --run_ancestry pgsc_1000G_v1.tar.zst -c custom.config -r v2.0.0-alpha.4e

Or

nextflow run pgscatalog/pgsc_calc -profile singularity --input samplesheet.csv --pgs_id PGS000758 --target_build GRCh37 --min_overlap 0.0 --run_ancestry pgsc_1000G_v1.tar.zst -c custom.config -r dev

Relevant files

No response

System information

Ubuntu, Docker, Singularity, current Nextflow


nebfield · 2024-04-19T11:03:27Z

Thanks for the bug report! Sorry, I can't reproduce on the dev branch. Here's what I tried:

$ cd /path/to/pgsc_calc
$ rm -r work results  # guarantee a fresh run
$ nextflow run main.nf -profile docker,arm \
    --run_ancestry ../pgsc_1000G_v1.tar.zst \
    --input ../hgdp/split/samplesheet.csv \
    --pgs_id PGS000758 \
    --target_build GRCh38
$ head <(gzcat results/hgdp/score/hgdp_pgs.txt.gz) | column -t
sampleset  IID        PGS                     SUM                  Z_MostSimilarPop      Z_norm1               Z_norm2              percentile_MostSimilarPop
hgdp       HGDP00001  PGS000758_hmPOS_GRCh38  -0.3588901000000001  1.664053980073892     0.8584426674256617    0.8134837998289269   95.49902152641879
hgdp       HGDP00003  PGS000758_hmPOS_GRCh38  -0.40938197          1.573172743793694     0.7932776666255601    0.7547724248426887   94.71624266144813
hgdp       HGDP00005  PGS000758_hmPOS_GRCh38  0.0158892699999999   2.338626192945269     1.6154600095825224    1.527703433039602    98.4344422700587
hgdp       HGDP00007  PGS000758_hmPOS_GRCh38  -0.9636511           0.5755336433533662    -0.2936306959068118   -0.2642672147215713  72.40704500978474

I noticed --min_overlap is 0 on your run. What kind of variant matching rates do you normally get?

@smlmbrt is the expert (and out of office 🌴 until next week), but perhaps low variant match rates could contribute to the NA values. Some changes were made to the ancestry normalisation steps to handle low-variance cases in the most recent release.

Fiwx · 2024-04-19T15:21:27Z

Thanks for testing it. I'll try again. The match rates were high (99.x%) in the alpha 4 run, and the genome is imputed (between 20 and 30 million variants).

Fiwx · 2024-04-20T21:20:56Z

I tried again after creating a clean new setup and got the same results.

$ nextflow run pgscatalog/pgsc_calc -profile singularity --input \
    /home/ubuntu/custom/dev/ca/samplesheet.csv --pgs_id PGS000758 \
    --target_build GRCh37 --min_overlap 0.0 --run_ancestry \
    /home/ubuntu/custom/data/pgsc_1000G_v1.tar.zst -c \
    /home/ubuntu/custom/references/custom.config -r dev

Version
2.0.0-alpha.5

reference	n target	N variants in panel	n (matched)	% matched
1	1000G	27904794	85277655	27904796	32.72

Sampleset	Scoring file	Number of variants	Passed matching	Match %	Total matched	Total unmatched
newautosomal	PGS000758_hmPOS_GRCh37	33938	TRUE	99.2	33668	270

   sampleset    IID           PGS           SUM Z_MostSimilarPop Z_norm1 Z_norm2
   <chr>        <chr>         <chr>       <dbl> <lgl>            <lgl>   <lgl>  
 1 newautosomal autosomal.txt PGS00075… -0.220  NA               NA      NA     
 2 reference    HG00096       PGS00075…  0.258  NA               NA      NA     
 3 reference    HG00097       PGS00075…  0.0698 NA               NA      NA     
 4 reference    HG00099       PGS00075…  0.0588 NA               NA      NA     
 5 reference    HG00100       PGS00075…  0.382  NA               NA      NA     
 6 reference    HG00101       PGS00075…  0.525  NA               NA      NA     
 7 reference    HG00102       PGS00075… -0.327  NA               NA      NA     
 8 reference    HG00103       PGS00075… -0.285  NA               NA      NA     
 9 reference    HG00105       PGS00075…  0.214  NA               NA      NA     
10 reference    HG00106       PGS00075… -1.13   NA               NA      NA   
# ℹ 2,481 more rows
# ℹ 1 more variable: percentile_MostSimilarPop <lgl>

smlmbrt · 2024-04-22T08:41:37Z

I think @nebfield is right: it has probably triggered this exception, which should only be applied to the target when there are more than 3 samples: https://github.com/PGScatalog/pgscatalog_utils/blob/b5962bf5f12bb2aba9d51a3c569a0d831072ecf0/pgscatalog_utils/ancestry/tools.py#L250-L253
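To illustrate the idea (this is not the code in pgscatalog_utils), here is a hypothetical sketch of a variance guard that is skipped for very small target sets. The function names, the `MIN_TARGET_SAMPLES` constant, and the tolerance are all assumptions made for illustration; only the "more than 3 samples" condition comes from the comment above.

```python
import math
import statistics

# Hypothetical threshold: the variance check only applies above this size.
MIN_TARGET_SAMPLES = 3


def target_variance_ok(target_scores, tolerance=1e-12):
    """Hypothetical guard deciding whether normalisation may proceed."""
    if len(target_scores) <= MIN_TARGET_SAMPLES:
        # Too few target samples to estimate a variance, so skip the check
        # instead of treating a single sample as a zero-variance failure.
        return True
    return statistics.variance(target_scores) > tolerance


def z_scores(target_scores, reference_scores):
    """Z-scores against the reference; NaN for all targets if the guard fails."""
    if not target_variance_ok(target_scores):
        return [math.nan] * len(target_scores)
    mean = statistics.fmean(reference_scores)
    sd = statistics.stdev(reference_scores)
    return [(score - mean) / sd for score in target_scores]
```

With a guard like this, a single-sample target still gets a Z-score, while a larger target set whose scores are all identical is still flagged.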

Fiwx · 2024-04-22T10:54:52Z

@smlmbrt Great! I'll try a test with more samples sometime.

Also, is it possible to recover z-scores and percentiles which are not normalized for ancestry from only the values in SUM?

My understanding is that SUM is calculated only on variants that are a subset of the variants in the submitted sample(s), the reference panel (e.g. 1000G), and the scorefile.

Since the variants are the same between the reference and submitted samples, I assume it would be a "fair" comparison (e.g., no normalization for matched variant number needed).

smlmbrt · 2024-04-23T09:51:14Z

> Also, is it possible to recover z-scores and percentiles which are not normalized for ancestry from only the values in SUM?

The percentiles and Z-scores for the most similar population are not normalised; they simply use that population as the reference distribution.

> My understanding is that SUM is calculated only on variants that are a subset of the variants in the submitted sample(s), the reference panel (e.g. 1000G), and the scorefile.

Correct.

> Since the variants are the same between the reference and submitted samples, I assume it would be a "fair" comparison (e.g., no normalization for matched variant number needed).

I think this will depend on your use case. The reason for using reference populations is that the mean and variance of the PGS distribution are determined by allele frequency and LD. If these are unmatched, then an individual's relative place in the distribution will be incorrect.
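As a rough sketch of what "using a population as the reference distribution" can look like when starting from SUM alone: the function below takes one target SUM and the SUM values of reference samples from a single population. The name `z_and_percentile` and the percentile definition (share of reference scores at or below the target) are assumptions for illustration and may differ from what the pipeline computes.

```python
import statistics


def z_and_percentile(target_sum, reference_sums):
    """Place one target SUM within a reference SUM distribution.

    Assumption: reference_sums come from one population scored on the
    same set of matched variants as the target.
    """
    mean = statistics.fmean(reference_sums)
    sd = statistics.stdev(reference_sums)  # sample standard deviation
    z = (target_sum - mean) / sd
    # Percentile as the share of reference scores at or below the target.
    at_or_below = sum(1 for s in reference_sums if s <= target_sum)
    percentile = 100.0 * at_or_below / len(reference_sums)
    return z, percentile
```

The caveat above still applies: the result is only meaningful if the chosen reference population matches the individual in allele frequency and LD structure.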

Fiwx added the bug label Apr 18, 2024

smlmbrt mentioned this issue Apr 22, 2024

Fix variance check for single-samples PGScatalog/pgscatalog_utils#87

Closed

smlmbrt changed the title ~~Results missing~~ Error: results missing for single sample Apr 23, 2024

smlmbrt added the user-query label Apr 25, 2024

smlmbrt mentioned this issue Apr 25, 2024

ToDo: test case with single sample #287

Open

smlmbrt linked a pull request May 23, 2024 that will close this issue

alpha.6 #274

Merged

nebfield mentioned this issue May 23, 2024

alpha.6 #274

Merged

nebfield closed this as completed in #274 May 24, 2024
