Different log_odds output between MOODS and Biopython #15
Comments
Hi, this is definitely a valid question. I've seen quite a few variations in how different software handles this, and if you really get down to it, the question of what is the correct way to model something like TF binding on a DNA sequence gets complicated quickly. My general impression is that the PWM framework overall is somewhat ad hoc...

Anyway, the short answer is that MOODS uses a slightly more general way of doing the additive smoothing of the input matrices than BioPython. Basically, the BioPython way corresponds to the "prior" assumption that the bases are equally likely, and the MOODS formula corresponds to the "prior" assumption that the bases are distributed according to the background. (Indeed, one can formulate the pseudocounts as Dirichlet priors when interpreting this in the Bayesian framework.) This does not matter much if you assume a flat background distribution (i.e. the background is the same for each row/nucleotide); the only thing that changes is the scaling of the pseudocount (which corresponds to changing the weight of the prior assumption).

I am not 100% sure on this, but I think using constant pseudocounts is problematic if you don't assume a flat background: if you consider an empty model, i.e. a count matrix with zeroes in all positions (that is, you have not actually observed anything), then I would expect the resulting position weight matrix to assign the same score to all possible sequences. With constant pseudocounts and a non-flat background, that is not the case.

The practical answer is that you can always do the log-odds calculation outside MOODS in any way you prefer, and tell MOODS to treat the input matrices as PWMs (giving the input files with …).

Hope this helps, or at least doesn't make things even more confusing. Please let me know if you have additional questions!
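The empty-model argument can be checked numerically. The following is only a sketch, not code from MOODS or Biopython: the function names, the background values, and the exact denominators (chosen so each column sums to 1) are assumptions made for illustration, and base-2 logarithms are used for both variants so the numbers are comparable.

```python
import math

def smooth_constant(counts, pseudocount):
    # Same pseudocount added to every symbol (flat prior).
    total = sum(counts) + pseudocount * len(counts)
    return [(c + pseudocount) / total for c in counts]

def smooth_background(counts, background, pseudocount):
    # Pseudocount mass split across symbols according to the background.
    total = sum(counts) + pseudocount
    return [(c + pseudocount * b) / total for c, b in zip(counts, background)]

def log_odds(probs, background):
    # Per-symbol log2(p / b).
    return [math.log2(p / b) for p, b in zip(probs, background)]

# Hypothetical non-flat background for A, C, G, T and an empty count column.
background = [0.1, 0.4, 0.4, 0.1]
empty_column = [0, 0, 0, 0]

constant_scores = log_odds(smooth_constant(empty_column, 1.0), background)
background_scores = log_odds(
    smooth_background(empty_column, background, 1.0), background
)

print("constant pseudocounts:  ", constant_scores)
print("background pseudocounts:", background_scores)
```

Under these assumptions the background-weighted variant scores every symbol at zero for the empty column, while the constant variant rewards symbols that are rare in the background and penalises common ones.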
Thanks for the explanation! It really helps. I can see now how the Biopython approach can be problematic with non-uniform backgrounds, which might save me from some future headaches. But then I believe there is a mistake in the MOODS code implementing the "generalised Laplace smoothing". In the case of a uniform background, it should give the same output as Biopython, shouldn't it? The Wikipedia article you linked seems to agree on this. The code change would be (I've added a comment to the changed lines):
As you say, right now the difference is just in scaling (by a factor of 4 for the ATCG alphabet), but with bigger alphabets (amino acids?) the difference becomes non-trivial. Do you agree, or am I misinterpreting the formulas?
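The scaling mentioned here can be written out explicitly. This is a sketch under the assumption that both schemes normalise each column to sum to 1; the symbols are introduced only for illustration: c_a is the count of symbol a in a column, N the column total, k the alphabet size, α the constant pseudocount, and β the background-weighted pseudocount.

```latex
p_a^{\text{const}} = \frac{c_a + \alpha}{N + k\,\alpha},
\qquad
p_a^{\text{bg}} = \frac{c_a + \beta\, b_a}{N + \beta}
\;\overset{b_a = 1/k}{=}\;
\frac{c_a + \beta / k}{N + \beta}
```

With a uniform background the two expressions coincide when α = β / k, so the same nominal pseudocount differs in effective weight by a factor of k: 4 for DNA, 20 for amino acids.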
I would say it's mostly that, for whatever historical reason, the interpretation of the parameter …
Hi,
I've noticed that MOODS uses a slightly different formula for the log-odds calculation than what BioPython uses, which leads to a different output.
Biopython uses an "intuitive" algorithm, that is:
log(p / b, 2)
to get log-odds.

MOODS pretty much follows the same procedure, but pseudocounts are multiplied by background frequency.
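To make the difference concrete, here is a minimal sketch of both procedures on a small count matrix. It is not the script mentioned in this issue and does not call either library: the function name, the `scale_by_background` flag, the counts, the background, and the normalising denominators are all hypothetical, and `log(p / b, 2)` is applied in both cases.

```python
import math

def log_odds_matrix(counts, background, pseudocount, scale_by_background):
    """counts: dict mapping symbol -> list of per-position counts."""
    symbols = sorted(counts)
    width = len(counts[symbols[0]])
    result = {s: [] for s in symbols}
    for i in range(width):
        column_total = sum(counts[s][i] for s in symbols)
        for s in symbols:
            if scale_by_background:
                # Pseudocount multiplied by the background frequency.
                p = (counts[s][i] + pseudocount * background[s]) / (
                    column_total + pseudocount
                )
            else:
                # Same pseudocount for every symbol.
                p = (counts[s][i] + pseudocount) / (
                    column_total + pseudocount * len(symbols)
                )
            result[s].append(math.log(p / background[s], 2))
    return result

# Hypothetical two-position count matrix and non-uniform background.
counts = {"A": [8, 0], "C": [1, 5], "G": [1, 5], "T": [0, 0]}
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

constant = log_odds_matrix(counts, background, 1.0, scale_by_background=False)
scaled = log_odds_matrix(counts, background, 1.0, scale_by_background=True)

for symbol in sorted(counts):
    print(symbol, constant[symbol], scaled[symbol])
```

The largest gaps appear for symbols with zero counts, where the pseudocount term dominates the estimate.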
Which one is correct? (or, is there no "correct", well established formula for this?)
Any clarification you could give on this would be much appreciated.
PS: I've made a small script re-implementing both formulas, for clarity. It's verified against the actual output of Biopython and MOODS from one of my experiments.