Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of missing values in PCA plots #171

Open
bpolacco opened this issue Jan 28, 2021 · 3 comments
Open

Handling of missing values in PCA plots #171

bpolacco opened this issue Jan 28, 2021 · 3 comments

Comments

@bpolacco
Copy link
Contributor

bpolacco commented Jan 28, 2021

I noticed recently that the artMS PCA plots were very different from my own PCA plots on the same dataset. The difference was in how missing values were treated. I remove rows with missing values. It looks like artMS sets missing values to 0. This has major consequences when the mean log2.intensity is about 25 and standard deviation on the order of ~1. Adding zeros injects huge and un-interesting variance that the PCA will work to display, to the detriment of displaying actual interesting variance in the data. Here's some toy code using random data to demonstrate

plotPCA  <- function(mat){
  pca <- prcomp(t(mat))
  pca.dt <- as.data.table(pca$x, keep.rownames = TRUE)
  print (ggplot (pca.dt, aes(x = PC1, y = PC2, label = rn)) + geom_label())
}

# set up a toy matrix
mat <- sapply (1:10, function(x)rnorm(26, mean = 20+x))
colnames(mat) <- LETTERS[1:10]
rownames(mat) <- paste0("protein.", letters)

# inject some missing data in an alternative matrix
mat.w0 <- mat
for (rn in sample(rownames(mat), 13)){
  mat.w0[rn, sample(colnames(mat), 3)] <- 0
}

# no missing data
# this looks as expected with a clear trend A->H in PC1
plotPCA(mat)

# with missing data
# this probably looks random in both PC1 and PC2 because its describing mostly the pattern of missing values
plotPCA(mat.w0)

I think the easiest way to deal with this is to limit the matrix to complete cases; complete.cases function is handy for this. This may discard too much data on large or noisy datasets or a dataset with one very sparse run, so a check and possible warning would be good. Alternatively, there are packages to impute missing values in PCA, but I don't know enough about them to recommend a single package, especially not one robust to all cases.

@bpolacco
Copy link
Contributor Author

From my code above, here's run with no missing values:

image

Same but with missing values added:
image

@bpolacco
Copy link
Contributor Author

bpolacco commented Jan 28, 2021

Another possible solution is to "center" the log2.intensity at zero before setting missing to zero, then the effect will be less. Looks good with my toy data, not sure with real world data yet.

mat <- t(scale(t(mat), center = TRUE, scale = FALSE))

@biodavidjm
Copy link
Owner

Thanks @bpolacco
Let me carefully study it and get back to you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants