Inconsistency in likelihood and loss calculations for dynamical model #743

Open
paula-tataru opened this issue Nov 1, 2021 · 1 comment
Labels
bug Something isn't working

paula-tataru commented Nov 1, 2021

Hi,

I noticed that the reported likelihood is calculated on only a subset of the cells, while the loss function (which is what gets minimized while fitting the parameters) is calculated on a superset of those cells.

The likelihood is defined here and is calculated by default with weighted = "upper".

The loss is defined here and is calculated by default with weighted = True, which leads to a superset of the cells being used compared to the likelihood.

This seems problematic: parameter fitting can settle on parameters with a better loss but a worse likelihood than some other parameter set, so the reported likelihood does not fully reflect what is being optimized.
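The concern can be reproduced on toy numbers, independent of scVelo. The sketch below is not scVelo's implementation: `sse_loss`, `gaussian_score`, the fixed variance, and the 50th-percentile mask are all assumptions, chosen only to illustrate how a score computed on a subset of the cells can reverse the ranking given by a loss computed on all of them.

```python
import numpy as np


def sse_loss(obs, pred, mask=None):
    """Sum of squared residuals over the selected cells (all cells if mask is None)."""
    res = obs - pred
    if mask is not None:
        res = res[mask]
    return float(np.sum(res ** 2))


def gaussian_score(obs, pred, mask=None, var=1.0):
    """Likelihood-like score (hypothetical): exp(-0.5 * mean squared residual / var)."""
    res = obs - pred
    if mask is not None:
        res = res[mask]
    return float(np.exp(-0.5 * np.mean(res ** 2) / var))


# Ten toy "cells"; the mask keeps the upper half of the observed values.
obs = np.arange(10, dtype=float)
upper = obs >= np.percentile(obs, 50)

# Candidate A fits the upper cells well (residual 0.5) and the lower cells badly (residual 2.0).
pred_a = obs - np.where(upper, 0.5, 2.0)
# Candidate B is uniformly mediocre (residual 1.0 everywhere).
pred_b = obs - 1.0

# Loss on all cells versus score on the upper subset only.
loss_a, loss_b = sse_loss(obs, pred_a), sse_loss(obs, pred_b)
score_a = gaussian_score(obs, pred_a, mask=upper)
score_b = gaussian_score(obs, pred_b, mask=upper)

print("loss  (all cells):  A =", loss_a, " B =", loss_b)
print("score (upper only): A =", score_a, " B =", score_b)
```

Here candidate B has the lower loss on all cells, yet candidate A has the higher score on the upper subset. If the score is evaluated without the mask, B also wins on the score, so in this toy setting the disagreement comes entirely from the two quantities using different cells.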

I managed to illustrate that this actually happens in practice on a small subset of the pancreas data:

import scvelo as scv
import numpy as np
import scanpy as sp

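def getValues(adata):
    # Per-gene likelihood reported after fitting
    lk = adata.var["fit_likelihood"].tolist()
    # Per-gene loss values; NaN entries are dropped below
    loss_trace = adata.varm['loss']
    loss = []
    for i in range(len(loss_trace)):
        ll = loss_trace[i][~np.isnan(loss_trace[i])]
        # Keep the last non-NaN loss, or NaN if there is none
        loss.append(ll[-1] if len(ll) > 0 else np.nan)

    return (lk, loss)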


def compare(i, l1, l2, reverse = True):
    # likelihood is greater or equal, but loss is worse
    if l1[0][i] >= l2[0][i] and l1[1][i] > l2[1][i]:
        return True
    if reverse:
        return compare(i, l2, l1, reverse = False)
    return False

adata = scv.datasets.pancreas()
# Pick 50 cells and the top 20 genes on a copy, then restrict the full data to them
subset = sp.pp.subsample(adata, n_obs=50, copy = True)
scv.pp.filter_and_normalize(subset, min_shared_counts=20, n_top_genes=20)
genes = subset.var_names
cells = subset.obs_names
adata = adata[cells, genes]
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=20)

scv.pp.moments(adata, n_neighbors=30, use_rep="X")

# First fit: at most 5 iterations
scv.tl.recover_dynamics(adata, max_iter = 5)
l1 = getValues(adata)

# Second fit: at most 10 iterations
scv.tl.recover_dynamics(adata, max_iter = 10)
l2 = getValues(adata)

for i in range(len(l1[0])):
    if compare(i, l1, l2):
        print(i, "(", l1[0][i], l1[1][i], ") and (", l2[0][i], l2[1][i], ")")

Running the above code prints the following:

2 ( 0.32460918352860496 18.544373756518986 ) and ( 0.3224038242020261 18.405310456355128 )

The two runs find different parameters for the gene at index 2: the first run (max_iter = 5) has the higher, i.e. better, likelihood, but also the higher, i.e. worse, loss.

Why is the likelihood not calculated on the same cells as the loss?

/Paula

paula-tataru added the bug label Nov 1, 2021
WeilerP (Member) commented Nov 20, 2021

@paula-tataru, sorry for getting back to you only now - thanks for looking at this in detail. I'll look into it and get back to you ASAP.
