Different log_odds output between MOODS and Biopython #15

Open
fabio-t opened this issue May 24, 2017 · 3 comments

fabio-t (Contributor) commented May 24, 2017

Hi,

I've noticed that MOODS uses a slightly different formula for the log-odds calculation than Biopython does, which leads to different output.

Biopython uses an "intuitive" algorithm, that is:

  • add pseudocounts to count matrix
  • normalise count matrix by column total to get frequency matrix
  • apply log(p / b, 2) to get log-odds

MOODS follows much the same procedure, but the pseudocounts are multiplied by the background frequency.

Which one is correct? (Or is there no "correct", well-established formula for this?)

Any clarification you could give on this would be much appreciated.

PS: I've made a small script re-implementing both formulas, for clarity. It's verified against the actual output of Biopython and MOODS from one of my experiments.

from math import log

# verified output on my data (first element of each list below)
# biopython: 0.6571122864769917
# moods:     0.6727053953628221

c = [4.0, 1.0, 2.0, 3.0]  # one column of the count matrix
ps = 0.1                  # pseudocount
bg = 0.25                 # flat background frequency

print("biopython:\n")

tot = sum(x + ps for x in c)
cn = [(x + ps) / tot for x in c]
print([log(x / bg, 2) for x in cn])

print("\n#########\n")

print("moods:\n")

tot = sum(x + ps * bg for x in c)
cn = [(x + ps * bg) / tot for x in c]
print([log(x / bg, 2) for x in cn])

jhkorhonen (Owner) commented
Hi,

this is definitely a valid question. I've seen quite a few variations in how different software handles this, and if you really get down to it, the question of what is the correct way to even model something like TF binding on a DNA sequence gets quite complicated. My general impression is that the PWM framework as a whole is somewhat ad hoc...


Anyway, the short answer is that MOODS uses a slightly more general way of doing the additive smoothing of the input matrices than Biopython. Basically, you can think of the Biopython version as corresponding to the "prior" assumption that the bases are equally likely, while the MOODS formula corresponds to the "prior" assumption that the bases are distributed according to the background. (Indeed, one can formulate the pseudocounts as Dirichlet priors when interpreting this in the Bayesian framework.)

In particular, this does not matter if you assume a flat background distribution (i.e. the background is the same for each row/nucleotide); the only thing that changes is the scaling of the pseudocount (which corresponds to changing the weight of the prior assumption).
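
To make that concrete, here is a minimal sketch (plain Python, not the actual MOODS or Biopython code; all names are mine) showing that under a flat background the two smoothing rules coincide once the pseudocount is rescaled by the alphabet size:

from math import log

counts = [4.0, 1.0, 2.0, 3.0]   # one matrix column
a = len(counts)                 # alphabet size
bg = [1.0 / a] * a              # flat background
q = 0.1                         # per-row pseudocount (Biopython convention)
ps = a * q                      # total pseudocount (MOODS convention)
n = sum(counts)

# Biopython-style: constant pseudocount added to each row
biopython = [log((x + q) / (n + a * q) / b, 2) for x, b in zip(counts, bg)]

# MOODS-style: total pseudocount distributed according to the background
moods = [log((x + ps * b) / (n + ps) / b, 2) for x, b in zip(counts, bg)]

assert all(abs(u - v) < 1e-12 for u, v in zip(biopython, moods))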

I am not 100% sure on this, but I think using constant pseudocounts is problematic if you don't assume a flat background: if you consider an empty model, i.e. a count matrix with zeroes in all positions (that is, you have not actually observed anything), then I would expect the resulting position weight matrix to assign the same score to all possible sequences. With constant pseudocounts the smoothed frequencies come out uniform, so the log-odds against a non-flat background differ from row to row; with background-scaled pseudocounts the frequencies equal the background and all the log-odds are zero.
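
To illustrate (again just a sketch, not MOODS code): with an all-zero column and a non-flat background, constant pseudocounts give row-dependent log-odds, while background-scaled ones give zero everywhere:

from math import log

counts = [0.0, 0.0, 0.0, 0.0]   # "empty" model: nothing observed
bg = [0.4, 0.1, 0.1, 0.4]       # non-flat background
ps = 0.1
a = len(counts)
n = sum(counts)

# constant pseudocounts: frequencies come out uniform, log-odds vary with bg
constant = [log((x + ps) / (n + a * ps) / b, 2) for x, b in zip(counts, bg)]
print(constant)   # [-0.678..., 1.321..., 1.321..., -0.678...]

# background-scaled pseudocounts: frequencies equal bg, log-odds all zero
scaled = [log((x + ps * b) / (n + ps) / b, 2) for x, b in zip(counts, bg)]
print(scaled)     # [0.0, 0.0, 0.0, 0.0]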


The practical answer is that you can always do the log-odds calculation outside MOODS in any way you prefer, and tell MOODS to treat the input matrices as PWMs (giving the input files with -S instead of -m in the python tool, for example).
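
For instance, something along these lines (a sketch only; the file name is made up, and the exact row layout should be adapted to the matrix format your MOODS version expects) computes Biopython-style log-odds outside MOODS and writes a plain whitespace-separated matrix that could then be given with -S:

from math import log

# counts[j][i] = count of base j (rows A, C, G, T) at position i
counts = [[4.0, 8.0],
          [1.0, 1.0],
          [2.0, 0.0],
          [3.0, 1.0]]
bg = [0.25, 0.25, 0.25, 0.25]
q = 0.1  # per-row pseudocount; use whichever smoothing scheme you prefer

a = len(counts)
n_pos = len(counts[0])

pssm = []
for j in range(a):
    row = []
    for i in range(n_pos):
        total = sum(counts[k][i] for k in range(a)) + a * q
        row.append(log((counts[j][i] + q) / total / bg[j], 2))
    pssm.append(row)

# one whitespace-separated row per base
with open("my_matrix.pssm", "w") as f:
    for row in pssm:
        f.write(" ".join(str(v) for v in row) + "\n")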


Hope this helps, or at least doesn't make things even more confusing. Please let me know if you have additional questions!


fabio-t (Contributor, Author) commented May 24, 2017

Thanks for the explanation! It really helps.

I can really see now how the Biopython approach can be problematic with non-uniform backgrounds, which might save me from some future headaches.

But then I believe there is a mistake in how the MOODS code implements the generalised Laplace smoothing: in the case of a uniform background, it should give the same output as Biopython, shouldn't it? The Wikipedia article you linked seems to agree on this.

The code change would be (I've added a comment to the changed lines):

// Transforms a weight matrix into a PSSM
score_matrix log_odds(const score_matrix &mat, const vector<double> &bg, const double ps)
{
    size_t a = mat.size();
    size_t n = mat[0].size();

    score_matrix ret(a, vector<double>(n));

    for (size_t i = 0; i < n; ++i)
    {
        double column_sum = 0;
        for (size_t j = 0; j < a; ++j){
            column_sum += mat[j][i] + ps; // changed: total (after the loop) equals N + alpha * d
        }
        for (size_t j = 0; j < a; ++j) {
            ret[j][i] = log((mat[j][i] + ps*bg[j]*a)/column_sum) - log(bg[j]); // changed: (x_i + mu_i * alpha * d) / column_sum
        }
    }
    return ret;
}
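
As a quick numerical check (a Python sketch of the formulas, not the actual patch): with a flat background, the modified normalisation reproduces Biopython's values without any rescaling of ps:

from math import log

c = [4.0, 1.0, 2.0, 3.0]
ps = 0.1
a = len(c)
bg = [1.0 / a] * a
n = sum(c)

biopython = [log((x + ps) / (n + a * ps) / b, 2) for x, b in zip(c, bg)]
# proposed C++ change: numerator x + ps*bg*a, column_sum = n + a*ps
proposed = [log((x + ps * b * a) / (n + a * ps) / b, 2) for x, b in zip(c, bg)]

assert all(abs(u - v) < 1e-12 for u, v in zip(biopython, proposed))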

As you say, right now the difference is just in the scaling of the pseudocount (by a factor of 4 in the ATCG case), but with bigger alphabets (amino acids?) the difference becomes non-trivial.

Do you agree, or am I misinterpreting the formulas?


jhkorhonen (Owner) commented
I would say it's mostly that, for whatever historical reason, the parameter ps is interpreted as the total amount of added pseudocount rather than the pseudocount per row. Though I should at the very least document this better.

jhkorhonen self-assigned this May 25, 2017