Match Scores #12

Open
imk1 opened this issue Feb 3, 2017 · 9 comments

@imk1

imk1 commented Feb 3, 2017

What is the exact definition of the match score? If I somehow missed it in the paper, it would be great if you could direct me to it. Thanks!

@jhkorhonen
Owner

Section 2 of the IEEE Transactions paper gives a good overview of this, so I recommend you check it out (let me know if you can't get the pdf). I guess there are roughly two answers to this:

  • At the very core, the matrices are just assumed to be additive scoring matrices, so the score is just the sum of the entries as defined by the current string context (Section 2.1 of the paper, especially equation (1)). For example, this is what you want to do if your input is already in PWM format; in the commandline tool this corresponds to giving inputs with -S.
  • For count and frequency matrices the direct additive scoring doesn't make sense, but it's conventional to convert them to additive PWMs using the log-likelihood scoring (Section 2.2 of the paper). This is what the commandline tool does when giving inputs with -m.
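A minimal sketch of the two cases above, for illustration only: converting a count matrix to log-likelihood (log-odds) scores and then summing the entries picked out by a sequence. The function names, the uniform background, the pseudocount value and the way it is spread over the background are all assumptions made for this example, not a description of what MOODS does internally.

```python
import math

INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def counts_to_log_odds(counts, background=(0.25, 0.25, 0.25, 0.25), pseudocount=1.0):
    """Convert a 4 x L count matrix (rows A, C, G, T) into a log-odds PWM.

    The pseudocount scheme here is an assumption chosen to avoid log(0);
    other tools may smooth the counts differently.
    """
    length = len(counts[0])
    pwm = [[0.0] * length for _ in range(4)]
    for j in range(length):
        column_total = sum(counts[a][j] for a in range(4))
        for a in range(4):
            freq = (counts[a][j] + pseudocount * background[a]) / (column_total + pseudocount)
            pwm[a][j] = math.log(freq / background[a])  # natural log of model/background
    return pwm

def additive_score(pwm, site):
    """Additive score: sum of the PWM entries selected by each base of the site."""
    return sum(pwm[INDEX[base]][j] for j, base in enumerate(site))

# Toy 3-column count matrix (made-up numbers), rows in A, C, G, T order.
counts = [
    [8, 0, 1],
    [0, 0, 1],
    [0, 8, 1],
    [0, 0, 5],
]
pwm = counts_to_log_odds(counts)
print(additive_score(pwm, "AGT"))  # consensus site: positive score
print(additive_score(pwm, "CCC"))  # poor match: negative score
```

If the input is already a PWM, only `additive_score` is needed; the conversion step applies to count or frequency matrices.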

Hope this helps! Let me know if there is something still unclear.

@zetamui

zetamui commented Jul 17, 2018

I guess I have the same question as @imk1: I wonder what the match score in the 5th field is about. I am looking to use it for filtering, so I would need to know whether a larger value is better, or vice versa, and what positive and negative values mean. Thanks!

@jhkorhonen
Owner

I've updated the wiki to describe the scoring model in more detail. I recommend going over it to see all the details.

In short, the score can be viewed as the log of the ratio between the probability of the model (PWM) generating the pattern and the probability of the background distribution generating the pattern. So a bigger number is better. Note that for MOODS it's generally better to fix the threshold beforehand, as this makes scanning faster.
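For filtering on the 5th field after the fact, something along these lines could work. It is only a sketch: it assumes comma-separated output lines shaped like the sample posted further down in this thread, the helper name `filter_hits` is made up, and the second sample line is invented for the example. Since MOODS uses the natural logarithm by default, `exp(score)` gives back the model-to-background likelihood ratio.

```python
import csv
import io
import math

def filter_hits(lines, min_score):
    """Yield (sequence, matrix, position, strand, score, site) for hits with score >= min_score."""
    for row in csv.reader(lines):
        if len(row) < 6:
            continue  # skip blank or malformed lines
        score = float(row[4])  # 5th field: the match score
        if score >= min_score:
            yield row[0], row[1], int(row[2]), row[3], score, row[5]

sample = io.StringIO(
    "test,test.matrix,0,+,-6.17774439196,AATAATTCTATGGTT,\n"
    "test,other.matrix,3,-,4.2,AATTCTATG,\n"  # hypothetical hit, invented for this example
)

# Keep only hits that the model explains at least as well as the background.
hits = list(filter_hits(sample, min_score=0.0))
for hit in hits:
    print(hit, "likelihood ratio ~", math.exp(hit[4]))
```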

@zetamui

zetamui commented Jul 18, 2018

Thanks for explaining more, and for the quick response! :-D

@dktanwar

Hi, is there a threshold that could be set?

I mean, what score could be called significant?

@alexlenail

Hi @jhkorhonen! :)

Thanks for updating the scoring documentation. I'm still confused on one point: how should I interpret a negative score? I had thought MOODS would only report positive scores, i.e. cases where the PWM model was more likely than the null. I'm using a p-value cutoff of 0.0002.

@ckuenne

ckuenne commented Feb 9, 2021

Hi,

I'm also running into negative scores. Bug?

test.fas:

>test
AATAATTCTATGGTT

test.matrix:

>MA0869.1	Sox11
305    305      0    305    305      0      0     41      0    305      0      0      0      0      1
0      0    305      0      0      0      1     39    305      0      0      0      0      0     14
0      0      0      0      0      0      0    305      0      0    305      2    305      0      5
0      0      0      0      0    305    305    425      1      0     99    305      0    305    305

cmd:
moods-dna.py -m test.matrix -s test.fas -p 0.0001 -o out_moods.txt

out_moods.txt:
test,test.matrix,0,+,-6.17774439196,AATAATTCTATGGTT,

@jhkorhonen
Owner

Basically the way p-value thresholds are defined is that if you set the p-value threshold to p, the actual concrete score threshold T is set, roughly speaking, so that the probability that the background distribution generates a sequence with score at least T is p. In other words, you can think about p-value threshold p giving you the top-scoring p fraction of all positions.

While this seems to be a common approach in the literature, it does indeed work a bit weirdly with the log-likelihood scoring, which has its own probabilistic interpretation. If your matrix is very specific, then the top p fraction of best hits will include hits that are better explained by the background distribution, as the probability of the background generating a sequence with a positive log-likelihood score is very small.
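To make the effect concrete, here is a toy sketch that derives a score threshold from a p-value by brute force. Everything in it is an assumption for illustration: the 6-column matrix, the uniform background, the pseudocount, and the convention "lowest score T whose background tail probability is at most p". MOODS computes its thresholds with its own, more efficient method; enumerating all sites is only feasible for very short matrices.

```python
import itertools
import math

BASES = "ACGT"

def log_odds_from_counts(counts, bg=0.25, pseudocount=1.0):
    """counts: list of columns (dict base -> count). Returns columns of log-odds scores."""
    pwm = []
    for column in counts:
        total = sum(column.values())
        pwm.append({
            b: math.log((column[b] + pseudocount * bg) / (total + pseudocount) / bg)
            for b in BASES
        })
    return pwm

def threshold_from_p(pwm, p, bg=0.25):
    """Lowest score T with P_background(score >= T) <= p, found by enumerating all sites.

    Returns (T, tail probability at T); T is None if even the best score is too common.
    """
    site_prob = bg ** len(pwm)  # uniform background: every site is equally likely
    mass = {}
    for site in itertools.product(BASES, repeat=len(pwm)):
        score = round(sum(col[b] for col, b in zip(pwm, site)), 9)  # merge float noise
        mass[score] = mass.get(score, 0.0) + site_prob
    threshold, tail = None, 0.0
    for score in sorted(mass, reverse=True):
        if tail + mass[score] > p:
            break
        tail += mass[score]
        threshold = score
    return threshold, tail

# Very specific toy matrix: 20 counts on one base per column (arbitrary consensus).
consensus = "TTGACA"
counts = [{b: (20 if b == c else 0) for b in BASES} for c in consensus]
pwm = log_odds_from_counts(counts)
for p in (0.005, 0.05):
    threshold, tail = threshold_from_p(pwm, p)
    print(p, threshold, tail)
```

With this toy matrix, the stricter p-value yields a positive threshold, while p = 0.05 yields a negative one: too few sites score above zero to fill the top 5% of all possible sites.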

Alternatively, one can simply set an absolute threshold T with the log-likelihood scores: if you want hits where the probability that the PSSM generated the sequence is at least C times the probability that the background generated the sequence, set T = log C (note that by default MOODS uses the natural logarithm). The complication is that there is generally no guarantee that a given PSSM can give scores as high as T, e.g. if the PSSM distribution is close to the background.

(See also https://github.com/jhkorhonen/MOODS/wiki/Brief-theoretical-introduction)
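A short sketch of the absolute-threshold approach described above, including a check for the caveat that a PSSM may not be able to reach T at all. The helper names and the two small log-odds matrices are made up for illustration; the best score a PSSM can achieve is simply the sum of its per-column maxima.

```python
import math

def absolute_threshold(ratio_c):
    """Threshold T = ln(C) for a required model-to-background likelihood ratio C."""
    return math.log(ratio_c)

def max_score(pwm):
    """Best achievable additive score: the sum of each column's maximum entry."""
    return sum(max(column) for column in pwm)

# Hypothetical log-odds PSSMs, given as a list of columns in A, C, G, T order.
sharp = [[1.3, -3.0, -3.0, -3.0]] * 4  # strongly prefers one base per column
flat = [[0.1, -0.1, 0.0, 0.0]] * 4     # close to the background distribution

T = absolute_threshold(100)  # hits must be at least 100 times more likely under the PSSM
for name, pwm in (("sharp", sharp), ("flat", flat)):
    reachable = max_score(pwm) >= T
    print(name, "max score:", max_score(pwm), "threshold reachable:", reachable)
```

Running a check like this before scanning tells you whether a given T can produce any hits for a particular matrix.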

@ckuenne

ckuenne commented Feb 15, 2021

Got it. Thanks for this detailed explanation.
