Correct measure of fit #8

Wychor · 2019-09-18T11:19:37Z

The measure of fit produced by this pipeline should be corrected for factors that inflate scores purely by chance.
One such factor is chromosome size: a larger chromosome will naturally account for a larger portion of a superscaffold purely by chance.
This matters most in genomes with a large size difference between the largest and smallest chromosomes.
Additionally, this correction should be based on the length of the chromosome minus the number of N nucleotides it contains, since N positions are not aligned.
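
As a rough illustration of the "length minus N nucleotides" idea, the effective size could be computed as below. This is only a sketch: `effective_length` is a hypothetical helper, not something the pipeline currently has, and it assumes the chromosome sequence is available as a plain string with gaps written as `N` or `n`.

```python
def effective_length(sequence: str) -> int:
    """Return the number of bases that can be aligned.

    Assumption: unalignable positions are encoded as 'N' (or soft-masked 'n').
    """
    upper = sequence.upper()
    return len(upper) - upper.count("N")


# Toy sequence: 14 characters, 6 of which are N/n.
example_size = effective_length("ACGTNNNNacgtnn")
```

The resulting value would then stand in for the raw chromosome length wherever the size correction is applied.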

Wychor · 2019-09-18T11:36:37Z

One possible fix is to establish a background measure of fit: take all fits except the best, average them, and correct for the reference size.
Here is an example of why this is an issue and how it could be corrected.
There are 6 references: references 1-5 are 5 Mbp each and reference 6 is 10 Mbp.
Say reference 1 scored 40%, references 2-5 scored around 20%, and reference 6 scored just under 40%. Reference 1 scores best, but because reference 6 scores almost as high, the analysis reports no certainty that reference 1 is the right assignment.
Now we correct for size: corrected% = % / (currentREF / smallestREF)
Reference 1 becomes 40% / (5 Mbp / 5 Mbp) = 40%
References 2-5 become 20% / (5 Mbp / 5 Mbp) = 20%
Reference 6 becomes roughly 40% / (10 Mbp / 5 Mbp) = 20%
Now the analysis would find reference 1 clearly different from the rest and would assign the query to reference 1.
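
The correction above could be sketched roughly as follows. The function names `correct_scores` and `background_fit` are made up for illustration and are not part of the pipeline. The sketch assumes scores are percentages, sizes are in Mbp, and the size correction is applied before the non-best fits are averaged (the order is not settled in this issue). Reference 6 is rounded to 40% to match the arithmetic above.

```python
def correct_scores(scores, sizes):
    """Divide each score by (reference size / smallest reference size)."""
    smallest = min(sizes.values())
    return {ref: score / (sizes[ref] / smallest) for ref, score in scores.items()}


def background_fit(corrected):
    """Average every corrected fit except the best one (assumed order of steps)."""
    best = max(corrected, key=corrected.get)
    rest = [value for ref, value in corrected.items() if ref != best]
    return sum(rest) / len(rest)


# Example values from this comment: sizes in Mbp, scores in percent.
sizes = {"ref1": 5, "ref2": 5, "ref3": 5, "ref4": 5, "ref5": 5, "ref6": 10}
scores = {"ref1": 40.0, "ref2": 20.0, "ref3": 20.0,
          "ref4": 20.0, "ref5": 20.0, "ref6": 40.0}

corrected = correct_scores(scores, sizes)
background = background_fit(corrected)
```

With these numbers reference 1 stays at 40% while reference 6 drops to 20%, so the best fit stands well above the background. Using the N-corrected length instead of the raw size would only change the values passed in as `sizes`.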
