Match Scores #12

imk1 · 2017-02-03T18:55:22Z

What is the exact definition of the match score? If I somehow missed it in the paper, it would be great if you could direct me to it. Thanks!

jhkorhonen · 2017-02-07T17:51:48Z

Section 2 of the IEEE Transactions paper gives a good overview of this, so I recommend you check it out (let me know if you can't get the pdf). I guess there are roughly two answers to this:

At the very core, the matrices are assumed to be additive scoring matrices, so the score is the sum of the entries selected by the current string context (Section 2.1 of the paper, especially equation (1)). For example, this is what you want if your input is already in PWM format; in the command-line tool this corresponds to giving inputs with -S.
For count and frequency matrices, direct additive scoring doesn't make sense, but it is conventional to convert them to additive PWMs using log-likelihood scoring (Section 2.2 of the paper). This is what the command-line tool does when given inputs with -m.
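To make the two cases concrete, here is a minimal sketch in plain Python of converting a count matrix to log-odds scores and then scoring a site additively. It does not call MOODS itself; the uniform background, the pseudocount of 1, the row order A, C, G, T, and the toy counts are all assumptions made for illustration, and the exact formula MOODS uses may differ in details.

```python
import math

def log_odds_matrix(counts, bg=(0.25, 0.25, 0.25, 0.25), pseudocount=1.0):
    """Convert a 4 x m count matrix (rows A, C, G, T) into log-odds scores.

    Assumption: the pseudocount is distributed according to the background.
    """
    m = len(counts[0])
    pwm = [[0.0] * m for _ in range(4)]
    for j in range(m):
        total = sum(counts[a][j] for a in range(4)) + pseudocount
        for a in range(4):
            freq = (counts[a][j] + pseudocount * bg[a]) / total
            pwm[a][j] = math.log(freq / bg[a])  # natural logarithm
    return pwm

def score(pwm, site):
    """Additive score: sum the matrix entry for each base of the site."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    return sum(pwm[index[ch]][j] for j, ch in enumerate(site))

# Hypothetical 3-column count matrix (not taken from any real motif).
counts = [
    [8, 0, 1],  # A
    [0, 0, 1],  # C
    [0, 8, 1],  # G
    [0, 0, 5],  # T
]
pwm = log_odds_matrix(counts)
best = score(pwm, "AGT")   # consensus site, scores above zero
worst = score(pwm, "CCC")  # poor match, scores below zero
```

With a matrix that is already additive (the -S case), only the `score` step would apply.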

Hope this helps! Let me know if there is something still unclear.

zetamui · 2018-07-17T14:18:10Z

I guess I have the same question as @imk1: I wonder what the match score in the 5th field means. I am looking to use it for filtering, so I need to know whether a larger value is better or vice versa, and what positive and negative values mean. Thanks!

jhkorhonen · 2018-07-18T12:22:33Z

I've updated the wiki to describe the scoring model in more detail. I recommend going over it to see all the details.

In short, the score can be viewed as the log of the ratio between the probability of the model (PWM) generating the pattern and the probability of the background distribution generating the pattern, so a bigger number is better. Note that for MOODS it's generally better to fix the threshold beforehand, as this makes scanning faster.
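If you still want to filter afterwards, a sketch along these lines should work. It assumes comma-separated output with the score in the fifth field; the helper name `filter_hits` and the two example rows are made up for illustration and are not real MOODS output.

```python
import csv
import io

def filter_hits(lines, min_score):
    """Yield rows whose fifth field (the score) is at least min_score."""
    for row in csv.reader(lines):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        if float(row[4]) >= min_score:
            yield row

# Hypothetical rows: sequence, matrix file, position, strand, score, site.
example = io.StringIO(
    "seq1,motif.pfm,10,+,7.25,ACGTACGT,\n"
    "seq1,motif.pfm,42,-,-1.50,TTGTACGA,\n"
)
kept = list(filter_hits(example, min_score=0.0))
```

Since larger is better, keeping rows with a score of at least 0 retains only hits that the model explains at least as well as the background.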

zetamui · 2018-07-18T12:44:20Z

Thanks for explaining more, and for the quick response! :-D

dktanwar · 2019-12-10T14:30:03Z

Hi, is there a threshold that could be set?

I mean, what score could be called significant?

alexlenail · 2020-02-02T04:27:37Z

Hi @jhkorhonen! :)

Thanks for updating the scoring documentation. I'm still confused about one point: how should I interpret a negative score? I had thought MOODS would only report positive scores, i.e. cases where the PWM model was more likely than the null. I'm using a p-value cutoff of 0.0002.

ckuenne · 2021-02-09T10:17:57Z

Hi,

I'm also running into negative scores. Bug?

test.fas:

>test
AATAATTCTATGGTT

test.matrix:

>MA0869.1	Sox11
305    305      0    305    305      0      0     41      0    305      0      0      0      0      1
0      0    305      0      0      0      1     39    305      0      0      0      0      0     14
0      0      0      0      0      0      0    305      0      0    305      2    305      0      5
0      0      0      0      0    305    305    425      1      0     99    305      0    305    305

cmd:
moods-dna.py -m test.matrix -s test.fas -p 0.0001 -o out_moods.txt

out_moods.txt:
test,test.matrix,0,+,-6.17774439196,AATAATTCTATGGTT,

jhkorhonen · 2021-02-15T11:14:35Z

Basically, p-value thresholds are defined as follows: if you set the p-value threshold to p, the concrete score threshold T is set, roughly speaking, so that the probability that the background distribution generates a sequence with score at least T is p. In other words, you can think of a p-value threshold p as giving you the top-scoring fraction p of all positions.

While this seems to be a common approach in the literature, it does indeed interact a bit oddly with log-likelihood scoring, which has its own probabilistic interpretation. If your matrix is very specific, then the top p fraction of hits can include hits that are better explained by the background distribution, because the probability of the background generating a sequence with a positive log-likelihood score is very small (possibly smaller than p).
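As a rough illustration of how a p-value maps to a score threshold, and why that threshold can be negative, here is a brute-force sketch. It enumerates every possible site, so it is only feasible for short matrices and is not how MOODS computes thresholds; the function name, the uniform background, and the toy two-column matrix are assumptions for the example.

```python
import itertools
import math

def threshold_from_pvalue(pwm, p, bg=(0.25, 0.25, 0.25, 0.25)):
    """Lowest score T such that P_background(score >= T) <= p.

    pwm is a 4 x m additive matrix (rows A, C, G, T).
    Returns math.inf if even the best score alone is more likely than p.
    """
    m = len(pwm[0])
    dist = {}  # score -> total background probability of sites with it
    for site in itertools.product(range(4), repeat=m):
        s = round(sum(pwm[a][j] for j, a in enumerate(site)), 9)
        dist[s] = dist.get(s, 0.0) + math.prod(bg[a] for a in site)
    tail = 0.0
    threshold = math.inf
    for s in sorted(dist, reverse=True):
        if tail + dist[s] > p + 1e-12:
            break  # including this score would exceed the p-value
        tail += dist[s]
        threshold = s
    return threshold

# Toy, very specific matrix: A strongly preferred in both columns.
pwm = [
    [1.0, 1.0],    # A
    [-3.0, -3.0],  # C
    [-3.0, -3.0],  # G
    [-3.0, -3.0],  # T
]
strict = threshold_from_pvalue(pwm, 0.1)  # only "AA" qualifies
loose = threshold_from_pvalue(pwm, 0.5)   # also admits one-mismatch sites
```

Here only 1/16 of background sites score above zero, so any p-value larger than that pulls the threshold down to a negative score.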

Alternatively, one can simply set an absolute threshold T on the log-likelihood scores: if you want hits where the probability that the PSSM generated the sequence is at least C times the probability that the background generated it, set T = log C (note that MOODS uses the natural logarithm by default). The complication is that there is generally no guarantee that a given PSSM can produce scores as high as T, e.g. if the PSSM distribution is close to the background.
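A small sketch of the absolute-threshold approach, including the sanity check that the matrix can reach T at all. The helper names and the toy matrix are hypothetical; the maximum achievable score of an additive matrix is simply the sum of the per-column maxima.

```python
import math

def threshold_for_ratio(c, base=math.e):
    """Score threshold T = log(C) in the given log base (natural by default)."""
    return math.log(c) / math.log(base)

def max_score(pwm):
    """Best achievable additive score: sum of the maximum of each column."""
    return sum(max(column) for column in zip(*pwm))

# Toy 4 x 2 log-odds matrix that is close to the background.
pwm = [
    [0.3, 0.2],    # A
    [-0.1, -0.2],  # C
    [-0.1, 0.1],   # G
    [-0.2, -0.3],  # T
]
t = threshold_for_ratio(100)      # model at least 100x more likely
reachable = max_score(pwm) >= t   # False here: the matrix tops out at 0.5
```

If `reachable` is false, no position can pass the threshold, so either a smaller C or a p-value threshold is needed for that matrix.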

(See also https://github.com/jhkorhonen/MOODS/wiki/Brief-theoretical-introduction)

ckuenne · 2021-02-15T11:32:28Z

Got it. Thanks for this detailed explanation.

jhkorhonen added the question label Feb 7, 2017

snystrom mentioned this issue Sep 10, 2019

scores and p-values GreenleafLab/motifmatchr#3

Open
