Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug report on ReadLevel_Features_extraction.py #30

Open
JunhuiLi1017 opened this issue Mar 3, 2023 · 4 comments
Open

bug report on ReadLevel_Features_extraction.py #30

JunhuiLi1017 opened this issue Mar 3, 2023 · 4 comments

Comments

@JunhuiLi1017
Copy link

Hi there,

It seems that there is a bug in the script ReadLevel_Features_extraction.py

I used the same snv list and bam file to extract features of snv, it will output different numbers of features in different times, and also will output the different number of mosaic variants.

Thanks,
Junhui

@douym
Copy link
Collaborator

douym commented Mar 5, 2023

Hi @JunhuiLi1017 ,

Thanks for your info. That’s odd, since ReadLevel_Features_extraction.py should not have any stochastic procedures, and ReadLevel_Features_extraction.py would not give genotype predictions.

Could you compare the two output files of ReadLevel_Features_extraction.py and tell me what features are different? For example, for one specific variant, are the feature(s) different in two output files? Further, what do the different lines that present in only one file look like?

Thanks,

Yanmei

@JunhuiLi1017
Copy link
Author

JunhuiLi1017 commented Mar 6, 2023

Hi @douym,

Thanks for your response.

Here are the details of the difference between 2 outputs(output1 and output2) from the same input file and script.

For all samples, output1 and output2 are all different. For example, I have an SNV list with 660 variants.

in the output1:
we got 275 variants with feature information.

in the output2:
we got 266 variants with feature information.

The no. of the variant with a common position between output1 and output2 is 224, where 35 variants are different for GC content and context. Here is an example of a specific mutation with the same position but GC content and context are different.

<style> </style>
id GCcontent context
-chr1-96186790-A-G 0.45454546 GCA
-chr1-96186790-A-G 0.33333333 CTT

@douym
Copy link
Collaborator

douym commented Mar 9, 2023

Hi @douym,

Thanks for your response.

Here are the details of the difference between 2 outputs(output1 and output2) from the same input file and script.

For all samples, output1 and output2 are all different. For example, I have an SNV list with 660 variants.

in the output1: we got 275 variants with feature information.

in the output2: we got 266 variants with feature information.

The no. of the variant with a common position between output1 and output2 is 224, where 35 variants are different for GC content and context. Here is an example of a specific mutation with the same position but GC content and context are different.

<style> </style>

id GCcontent context
-chr1-96186790-A-G 0.45454546 GCA
-chr1-96186790-A-G 0.33333333 CTT

Hi @JunhuiLi1017 ,

Thanks for kindly share a mini bam with me, but I could not repeat the bug you report. I ran for five times of the following command and the results are all the same:
python ReadLevel_Features_extraction.py debug.bed debug.features ./ ../reference/hg19.fa hg19/k24.umap.wg.bw 1 bam

cat debug.features*|sort|uniq -c
5 id conflict_num mappability type length GCcontent ref_softclip alt_softclip querypos_p leftpos_p seqpos_p mapq_p baseq_p baseq_t ref_baseq1b_p ref_baseq1b_t alt_baseq1b_p alt_baseq1b_t sb_p context major_mismatches_mean minor_mismatches_mean mismatches_p AF dp mosaic_likelihood het_likelihood refhom_likelihood althom_likelihood mapq_difference sb_read12_p dp_diff indel_proportion_SNPonly alt2_proportion_SNPonly
5 debugchr196186790AG 0 1.0 SNP 0 0.5238095238095238 0.0 0.0 0.05368455874274253 0.05368455874274253 0.7627475790602667 1.0 0.0001268495297680491 3.832496257329701 0.0015666667700549417 2.837196152055328 0.8500332141963067 0.6843881185220807 1.0 ATG 0.006867794947265147 0.015011037527593824 0.0025382235725089314 0.35714285714285715 42 0.5066377597179058 0.4933622402820942 1.6222399541098435e-29 6.569311047411848e-75 0.0 0.530584882534201 0.0 0.0

@JunhuiLi1017
Copy link
Author

JunhuiLi1017 commented Mar 10, 2023

Thanks, @douym , when I run bsub -n 5 -R rusage[mem=1000] -W 0:20 -q short -e $sam.snv.err -o $sam.snv.out -R 'span[hosts=1]' 'source ~/anaconda3/etc/profile.d/conda.sh && conda activate py3.7.1 && python ReadLevel_Features_extraction.py $snv_bed $snv_out $bam_dir $REF $umap 4 bam > $snv_log', it always gives me inconsistent results, while when I run python ReadLevel_Features_extraction.py $snv_bed $snv_out $bam_dir $REF $umap 4 bam > $snv_log directly, it will output consistent results.

update1: if I use bsub -n 1 instead of bsub -n 5, the output is always consistent.
update2: if I use bsub -n 5 and python ReadLevel_Features_extraction.py $snv_bed $snv_out $bam_dir $REF $umap 1 bam instead of python ReadLevel_Features_extraction.py $snv_bed $snv_out $bam_dir $REF $umap 4 bam, the output is always consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants