bug report on ReadLevel_Features_extraction.py #30

JunhuiLi1017 · 2023-03-03T19:54:13Z

Hi there,

It seems that there is a bug in the script ReadLevel_Features_extraction.py

I used the same snv list and bam file to extract features of snv, it will output different numbers of features in different times, and also will output the different number of mosaic variants.

Thanks,
Junhui

The text was updated successfully, but these errors were encountered:

douym · 2023-03-05T08:53:28Z

Hi @JunhuiLi1017 ,

Thanks for your info. That’s odd, since ReadLevel_Features_extraction.py should not have any stochastic procedures, and ReadLevel_Features_extraction.py would not give genotype predictions.

Could you compare the two output files of ReadLevel_Features_extraction.py and tell me what features are different? For example, for one specific variant, are the feature(s) different in two output files? Further, what do the different lines that present in only one file look like?

Thanks,

Yanmei

JunhuiLi1017 · 2023-03-06T21:41:42Z

Hi @douym,

Thanks for your response.

Here are the details of the difference between 2 outputs(output1 and output2) from the same input file and script.

For all samples, output1 and output2 are all different. For example, I have an SNV list with 660 variants.

in the output1:
we got 275 variants with feature information.

in the output2:
we got 266 variants with feature information.

The no. of the variant with a common position between output1 and output2 is 224, where 35 variants are different for GC content and context. Here is an example of a specific mutation with the same position but GC content and context are different.

id	GCcontent	context
-chr1-96186790-A-G	0.45454546	GCA
-chr1-96186790-A-G	0.33333333	CTT

douym · 2023-03-09T15:04:18Z

Hi @douym,

Thanks for your response.

Here are the details of the difference between 2 outputs(output1 and output2) from the same input file and script.

For all samples, output1 and output2 are all different. For example, I have an SNV list with 660 variants.

in the output1: we got 275 variants with feature information.

in the output2: we got 266 variants with feature information.

The no. of the variant with a common position between output1 and output2 is 224, where 35 variants are different for GC content and context. Here is an example of a specific mutation with the same position but GC content and context are different.
<style> </style>
id GCcontent context
-chr1-96186790-A-G 0.45454546 GCA
-chr1-96186790-A-G 0.33333333 CTT

Hi @JunhuiLi1017 ,

Thanks for kindly share a mini bam with me, but I could not repeat the bug you report. I ran for five times of the following command and the results are all the same:
python ReadLevel_Features_extraction.py debug.bed debug.features ./ ../reference/hg19.fa hg19/k24.umap.wg.bw 1 bam

cat debug.features*|sort|uniq -c
5 id conflict_num mappability type length GCcontent ref_softclip alt_softclip querypos_p leftpos_p seqpos_p mapq_p baseq_p baseq_t ref_baseq1b_p ref_baseq1b_t alt_baseq1b_p alt_baseq1b_t sb_p context major_mismatches_mean minor_mismatches_mean mismatches_p AF dp mosaic_likelihood het_likelihood refhom_likelihood althom_likelihood mapq_difference sb_read12_p dp_diff indel_proportion_SNPonly alt2_proportion_SNPonly
5 debug~~chr1~~96186790AG 0 1.0 SNP 0 0.5238095238095238 0.0 0.0 0.05368455874274253 0.05368455874274253 0.7627475790602667 1.0 0.0001268495297680491 3.832496257329701 0.0015666667700549417 2.837196152055328 0.8500332141963067 0.6843881185220807 1.0 ATG 0.006867794947265147 0.015011037527593824 0.0025382235725089314 0.35714285714285715 42 0.5066377597179058 0.4933622402820942 1.6222399541098435e-29 6.569311047411848e-75 0.0 0.530584882534201 0.0 0.0

JunhuiLi1017 · 2023-03-10T14:04:29Z

Thanks, @douym , when I run bsub -n 5 -R rusage[mem=1000] -W 0:20 -q short -e $sam.snv.err -o $sam.snv.out -R 'span[hosts=1]' 'source ~/anaconda3/etc/profile.d/conda.sh && conda activate py3.7.1 && python ReadLevel_Features_extraction.py $snv_bed $snv_out $bam_dir $REF $umap 4 bam > $snv_log', it always gives me inconsistent results, while when I run python ReadLevel_Features_extraction.py $snv_bed $snv_out $bam_dir $REF $umap 4 bam > $snv_log directly, it will output consistent results.

update1: if I use bsub -n 1 instead of bsub -n 5, the output is always consistent.
update2: if I use bsub -n 5 and python ReadLevel_Features_extraction.py $snv_bed $snv_out $bam_dir $REF $umap 1 bam instead of python ReadLevel_Features_extraction.py $snv_bed $snv_out $bam_dir $REF $umap 4 bam, the output is always consistent.

JunhuiLi1017 closed this as completed Mar 10, 2023

JunhuiLi1017 reopened this Mar 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug report on ReadLevel_Features_extraction.py #30

bug report on ReadLevel_Features_extraction.py #30

JunhuiLi1017 commented Mar 3, 2023

douym commented Mar 5, 2023

JunhuiLi1017 commented Mar 6, 2023 •

edited

Loading

douym commented Mar 9, 2023

JunhuiLi1017 commented Mar 10, 2023 •

edited

Loading

bug report on ReadLevel_Features_extraction.py #30

bug report on ReadLevel_Features_extraction.py #30

Comments

JunhuiLi1017 commented Mar 3, 2023

douym commented Mar 5, 2023

JunhuiLi1017 commented Mar 6, 2023 • edited Loading

douym commented Mar 9, 2023

JunhuiLi1017 commented Mar 10, 2023 • edited Loading

JunhuiLi1017 commented Mar 6, 2023 •

edited

Loading

JunhuiLi1017 commented Mar 10, 2023 •

edited

Loading