Inconsistency in likelihood and loss calculations for dynamical model #743

paula-tataru · 2021-11-01T16:06:42Z

Hi,

I noticed that the reported likelihood is calculated on only a subset of the cells used for the loss function, which is the quantity minimized while fitting the parameters.

The likelihood is defined here and is calculated by default with weighted = "upper".

The loss is defined here and is calculated by default with weighted = True, which leads to a superset of the cells being used compared to the likelihood.

This seems problematic: the parameter fitting could settle on parameters with a better loss but a worse likelihood than some other set of parameters. The likelihood therefore does not fully reflect what is being optimized.
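A toy illustration of how this can happen, using plain NumPy rather than scVelo's actual weighting. The per-cell errors, the two masks, and the helper `mean_error` are all made up for the example. It only shows that a score computed on a subset of cells can rank two parameter sets differently from a score computed on all of them.

```python
import numpy as np

# Hypothetical per-cell squared residuals for two candidate parameter sets.
err_a = np.array([0.1, 0.1, 4.0, 4.0])
err_b = np.array([1.0, 1.0, 1.0, 1.0])

# Hypothetical masks (assumption, not scVelo's real cell selection):
# the "loss" uses all cells, the "likelihood" only a subset of them.
mask_loss = np.array([True, True, True, True])
mask_lik = np.array([True, True, False, False])


def mean_error(err, mask):
    """Mean squared residual over the cells selected by mask."""
    return float(np.mean(err[mask]))


loss_a, loss_b = mean_error(err_a, mask_loss), mean_error(err_b, mask_loss)
sub_a, sub_b = mean_error(err_a, mask_lik), mean_error(err_b, mask_lik)

# On all cells B has the smaller error; on the subset A has the smaller error.
print("all cells:", loss_a, loss_b)
print("subset:   ", sub_a, sub_b)
```

An optimizer that minimizes the all-cells error would prefer B here, while a score reported on the subset would favor A.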

I managed to show that this actually happens in practice on a small subset of the pancreas data:

import scvelo as scv
import numpy as np
import scanpy as sp

def getValues(adata):
    lk = adata.var["fit_likelihood"]
    lk = lk.tolist()
    loss = adata.varm['loss']
    l = list()
    for i in range(len(loss)):
        ll = loss[i][np.where(~np.isnan(loss[i]))]
        if len(ll) > 0:
            l.append(ll[-1])
        else:
            l.append(np.nan)
    loss = l
    
    return (lk, loss)


def compare(i, l1, l2, reverse = True):
    # likelihood is greater or equal, but loss is worse
    if l1[0][i] >= l2[0][i] and l1[1][i] > l2[1][i]:
        return True
    if reverse:
        return compare(i, l2, l1, reverse = False)
    return False

adata = scv.datasets.pancreas()
subset = sp.pp.subsample(adata, n_obs=50, copy = True)
scv.pp.filter_and_normalize(subset, min_shared_counts=20, n_top_genes=20)
genes = subset.var_names
cells = subset.obs_names
adata = adata[cells, genes]
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=20)

scv.pp.moments(adata, n_neighbors=30, use_rep="X")

scv.tl.recover_dynamics(adata, max_iter = 5)
l1 = getValues(adata)

scv.tl.recover_dynamics(adata, max_iter = 10)
l2 = getValues(adata)

for i in range(len(l1[0])):
    if compare(i, l1, l2):
        print(i, "(", l1[0][i], l1[1][i], ") and (", l2[0][i], l2[1][i], ")")

Running the above code prints the following:

2 ( 0.32460918352860496 18.544373756518986 ) and ( 0.3224038242020261 18.405310456355128 )

The two runs find two different sets of parameters for the gene at index 2: the run with max_iter = 5 has the better (higher) likelihood but the worse (higher) loss.
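The final loss per gene in the script above is the last non-NaN entry of each row of `adata.varm['loss']`. A compact sketch of that extraction on a made-up array is given here. The helper name `last_finite_per_row` and the toy values are assumptions for illustration and are not part of scVelo.

```python
import numpy as np


def last_finite_per_row(arr):
    """Return the last non-NaN value of each row, or NaN for all-NaN rows."""
    arr = np.asarray(arr, dtype=float)
    finite = ~np.isnan(arr)
    # Column index of the last non-NaN entry per row
    # (the value is meaningless for all-NaN rows and is overwritten below).
    last_idx = arr.shape[1] - 1 - np.argmax(finite[:, ::-1], axis=1)
    out = arr[np.arange(arr.shape[0]), last_idx]
    out[~finite.any(axis=1)] = np.nan
    return out


# Toy loss history: rows are genes, columns are iterations, NaN means "not run".
toy_loss = np.array([
    [3.0, 2.0, np.nan],
    [np.nan, np.nan, np.nan],
    [5.0, 4.0, 3.5],
])
final_loss = last_finite_per_row(toy_loss)
print(final_loss)
```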

Why is the likelihood not calculated on the same cells as the loss?

/Paula

WeilerP · 2021-11-20T16:39:47Z

@paula-tataru, sorry for getting back to you only now - thanks for looking at this in detail. I'll look into it and get back to you ASAP.

paula-tataru added the "bug" label on Nov 1, 2021.
