Different log_odds output between MOODS and Biopython #15

Open
fabio-t opened this issue May 24, 2017 · 3 comments

fabio-t (Contributor) commented May 24, 2017

Hi,

I've noticed that MOODS uses a slightly different formula for the log-odds calculation than Biopython does, which leads to different output.

Biopython uses an "intuitive" algorithm, that is:

  • add pseudocounts to count matrix
  • normalise count matrix by column total to get frequency matrix
  • apply log(p / b, 2) to get log-odds

MOODS follows much the same procedure, but the pseudocounts are multiplied by the background frequency.

Which one is correct? (Or is there no "correct", well-established formula for this?)

Any clarification you could give on this would be much appreciated.

PS: I've made a small script re-implementing both formulas, for clarity. It's verified against the actual output of Biopython and MOODS from one of my experiments.

from math import log

# verified output on my data (first element of each list below)
# biopython: 0.6571122864769917
# moods:     0.6727053953628221

c = [4.0, 1.0, 2.0, 3.0]  # one column of the count matrix
ps = 0.1                  # pseudocount
bg = 0.25                 # flat background frequency

print("biopython:\n")

tot = sum(x + ps for x in c)
cn = [(x + ps) / tot for x in c]
print([log(x / bg, 2) for x in cn])

print("\n#########\n")

print("moods:\n")

tot = sum(x + ps * bg for x in c)
cn = [(x + ps * bg) / tot for x in c]
print([log(x / bg, 2) for x in cn])

jhkorhonen (Owner) commented
Hi,

this is definitely a valid question. I've seen quite a few variations in how different software handles this, and if you really get down to it, the question of what is the correct way to even model something like TF binding on a DNA sequence gets quite complicated. My general impression is that the PWM framework as a whole is somewhat ad hoc...


Anyway, the short answer is that MOODS uses a slightly more general way of doing the additive smoothing of the input matrices than Biopython. Basically, you can think of the Biopython version as corresponding to the "prior" assumption that the bases are equally likely, while the MOODS formula corresponds to the "prior" assumption that the bases are distributed according to the background. (Indeed, one can formulate the pseudocounts as Dirichlet priors when interpreting this in the Bayesian framework.)

In particular, this does not matter if you assume a flat background distribution (i.e. the background is the same for each row/nucleotide); the only thing that changes is the scaling of the pseudocount (which corresponds to changing the weight of the prior assumption).
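
To make that concrete, here is a minimal sketch (plain Python, not the actual MOODS or Biopython code; all names are mine) showing that under a flat background the two smoothing rules coincide once the pseudocount is rescaled by the alphabet size:

from math import log

counts = [4.0, 1.0, 2.0, 3.0]   # one matrix column
a = len(counts)                 # alphabet size
bg = [1.0 / a] * a              # flat background
q = 0.1                         # per-row pseudocount (Biopython convention)
ps = a * q                      # total pseudocount (MOODS convention)
n = sum(counts)

# Biopython-style: constant pseudocount added to each row
biopython = [log((x + q) / (n + a * q) / b, 2) for x, b in zip(counts, bg)]

# MOODS-style: total pseudocount distributed according to the background
moods = [log((x + ps * b) / (n + ps) / b, 2) for x, b in zip(counts, bg)]

assert all(abs(u - v) < 1e-12 for u, v in zip(biopython, moods))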

I am not 100% sure on this, but I think using constant pseudocounts is problematic if you don't assume a flat background: if you consider an empty model, i.e. a count matrix with zeroes in all positions (that is, you have not actually observed anything), then I would expect the resulting position weight matrix to assign the same score to all possible sequences. With constant pseudocounts the smoothed frequencies come out uniform, so the log-odds against a non-flat background differ from row to row; with background-scaled pseudocounts the frequencies equal the background and all the log-odds are zero.
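
To illustrate (again just a sketch, not MOODS code): with an all-zero column and a non-flat background, constant pseudocounts give row-dependent log-odds, while background-scaled ones give zero everywhere:

from math import log

counts = [0.0, 0.0, 0.0, 0.0]   # "empty" model: nothing observed
bg = [0.4, 0.1, 0.1, 0.4]       # non-flat background
ps = 0.1
a = len(counts)
n = sum(counts)

# constant pseudocounts: frequencies come out uniform, log-odds vary with bg
constant = [log((x + ps) / (n + a * ps) / b, 2) for x, b in zip(counts, bg)]
print(constant)   # [-0.678..., 1.321..., 1.321..., -0.678...]

# background-scaled pseudocounts: frequencies equal bg, log-odds all zero
scaled = [log((x + ps * b) / (n + ps) / b, 2) for x, b in zip(counts, bg)]
print(scaled)     # [0.0, 0.0, 0.0, 0.0]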


The practical answer is that you can always do the log-odds calculation outside MOODS in any way you prefer, and tell MOODS to treat the input matrices as PWMs (giving the input files with -S instead of -m in the python tool, for example).
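
For instance, something along these lines (a sketch only; the file name is made up, and the exact row layout should be adapted to the matrix format your MOODS version expects) computes Biopython-style log-odds outside MOODS and writes a plain whitespace-separated matrix that could then be given with -S:

from math import log

# counts[j][i] = count of base j (rows A, C, G, T) at position i
counts = [[4.0, 8.0],
          [1.0, 1.0],
          [2.0, 0.0],
          [3.0, 1.0]]
bg = [0.25, 0.25, 0.25, 0.25]
q = 0.1  # per-row pseudocount; use whichever smoothing scheme you prefer

a = len(counts)
n_pos = len(counts[0])

pssm = []
for j in range(a):
    row = []
    for i in range(n_pos):
        total = sum(counts[k][i] for k in range(a)) + a * q
        row.append(log((counts[j][i] + q) / total / bg[j], 2))
    pssm.append(row)

# one whitespace-separated row per base
with open("my_matrix.pssm", "w") as f:
    for row in pssm:
        f.write(" ".join(str(v) for v in row) + "\n")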


Hope this helps, or at least doesn't make things even more confusing. Please let me know if you have additional questions!


fabio-t (Contributor, Author) commented May 24, 2017

Thanks for the explanation! It really helps.

I can really see now how the Biopython approach can be problematic with non-uniform backgrounds, which might save me from some future headaches.

But then I believe there is a mistake in how the MOODS code implements the generalised Laplace smoothing: in the case of a uniform background, it should give the same output as Biopython, shouldn't it? The Wikipedia article you linked seems to agree on this.

The code change would be (I've added a comment to the changed lines):

// Transforms a weight matrix into a PSSM
score_matrix log_odds(const score_matrix &mat, const vector<double> &bg, const double ps)
{
    size_t a = mat.size();
    size_t n = mat[0].size();

    score_matrix ret(a, vector<double>(n));

    for (size_t i = 0; i < n; ++i)
    {
        double column_sum = 0;
        for (size_t j = 0; j < a; ++j){
            column_sum += mat[j][i] + ps; // changed: total (after the loop) equals N + alpha * d
        }
        for (size_t j = 0; j < a; ++j) {
            ret[j][i] = log((mat[j][i] + ps*bg[j]*a)/column_sum) - log(bg[j]); // changed: (x_i + mu_i * alpha * d) / column_sum
        }
    }
    return ret;
}
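
As a quick numerical check (a Python sketch of the formulas, not the actual patch): with a flat background, the modified normalisation reproduces Biopython's values without any rescaling of ps:

from math import log

c = [4.0, 1.0, 2.0, 3.0]
ps = 0.1
a = len(c)
bg = [1.0 / a] * a
n = sum(c)

biopython = [log((x + ps) / (n + a * ps) / b, 2) for x, b in zip(c, bg)]
# proposed C++ change: numerator x + ps*bg*a, column_sum = n + a*ps
proposed = [log((x + ps * b * a) / (n + a * ps) / b, 2) for x, b in zip(c, bg)]

assert all(abs(u - v) < 1e-12 for u, v in zip(biopython, proposed))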

As you say, right now the difference is just in the scaling of the pseudocount (by a factor of 4 in the ATCG case), but with bigger alphabets (amino acids?) the difference becomes non-trivial.

Do you agree, or am I misinterpreting the formulas?


jhkorhonen (Owner) commented
I would say it's mostly that, for whatever historical reason, the parameter ps is interpreted as the total amount of added pseudocount rather than the pseudocount per row. Though I should at the very least document this better.

jhkorhonen self-assigned this May 25, 2017