Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set alpha=Inf as the default for testEmptyDrops. #118

Merged
merged 1 commit into from
Dec 10, 2024
Merged

Set alpha=Inf as the default for testEmptyDrops. #118

merged 1 commit into from
Dec 10, 2024

Conversation

LTLA
Copy link
Collaborator

@LTLA LTLA commented Dec 10, 2024

This is because a finite alpha is not universally safer; despite representing an overdispersed multinomial, using it to compute p-values for a multinomial-distributed vector will actually be anticonservative!

This means that an inaccurate alpha cannot be brushed under the carpet. Previously, I was trusting that any finite alpha would be safer than alpha=Inf, but it seems that this is not true. We can't just ignore poor estimates of alpha that we get from the assumed-ambient counts.

So, we just keep it simple and default alpha=Inf, which is exactly correct for the multinomial case (which should hopefully handle most use cases). This also reduces runtime and aligns with the CellRanger defaults.

This is because a finite alpha is not universally safer; despite
representing an overdispersed multinomial, using it to compute p-values
for a multinomial-distributed vector will actually be anticonservative!

This means that an inaccurate alpha cannot be brushed under the carpet.
Previously, I was trusting that any finite alpha would be safer than
alpha=Inf, but it seems that this is not true. We can't just ignore poor
estimates of alpha that we get from the assumed-ambient counts.

So, we just keep it simple and default alpha=Inf, which is exactly
correct for the multinomial case (which should hopefully be most cases).
This also reduces runtime and aligns with the CellRanger defaults.
@LTLA
Copy link
Collaborator Author

LTLA commented Dec 10, 2024

To demonstrate:

alpha <- 2000 # assumed "incorrect" alpha
nbarcodes <- 1000
ngenes <- 10000

# Simulating multinomial draws and computing "incorrect" log-likelihoods under a Dirichlet multinomial.
total <- 800
ambient <- runif(ngenes)
ambient <- ambient / sum(ambient)

am.alpha <- ambient * alpha
pm <- numeric(1000)
for (it in seq_along(pm)) {
    sim <- rmultinom(1, total, prob=ambient)
    # Only considering the data-dependent part of the PMF, for simplicity.
    pm[it] <- sum(lgamma(sim + am.alpha) - lfactorial(sim) - lgamma(am.alpha)) 
}

# Performing Monte-Carlo draws from a Dirichlet-multinomial and computing their log-likelihoods.
sim.probs <- numeric(1000)
for (it in seq_along(sim.probs)) {
    sim.p <- rgamma(ngenes, shape=am.alpha, rate=1)
    sim.x <- rmultinom(1, total, prob=sim.p)
    sim.probs[it] <- sum(lgamma(sim.x + am.alpha) - lfactorial(sim.x) - lgamma(am.alpha))
}

boxplot(list(Observed=pm, Expected=sim.probs), ylab="Dirichlet-multinomial log-likelihood")

res

So we can see that the expected log-likelihoods are always higher than the observed ones, resulting in Monte-Carlo p-values of zero that are obviously anticonservative. This is unexpected as overstating the variance usually results in higher p-values when testing for deviation, which is why I thought that alpha < Inf would be generally safe in the first place.

The underlying reason is because the DM at low alpha expects the count vector to be sparse, as probability mass is concentrated in a subset of the entries after the gamma sampling. However, the pure multinomial yields denser count vectors, which is unusual from a DM perspective. This results in a lower log-likelihood as observed above.

@LTLA LTLA merged commit e33763c into devel Dec 10, 2024
1 check passed
@LTLA LTLA deleted the alpha_inf branch December 10, 2024 22:47
@lcolladotor
Copy link

Thanks Aaron @LTLA!

Does this mean that DropletUtils::emptyDrops() now fully matches the CellRanger parameters for EmptyDrops? Or are there still some differences between the two?

In other words, since CellRanger both OrdMag and EmptyDrops as noted at https://www.10xgenomics.com/support/software/cell-ranger/latest/algorithms-overview/cr-gex-algorithm, is now the only difference the OrdMag part when analyzing the filtered CellRanger outputs versus the raw outputs in R and filtered using DropletUtils::emptyDrops()?

Thanks!

Best,
Leo

@LTLA
Copy link
Collaborator Author

LTLA commented Dec 11, 2024

There are still a few differences between the algorithms, which are currently described in ?emptyDropsCellRanger. This change eliminates one of the undocumented differences.

@lcolladotor
Copy link

Thanks Aaron!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants