You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I noticed recently that the artMS PCA plots were very different from my own PCA plots on the same dataset. The difference was in how missing values were treated. I remove rows with missing values. It looks like artMS sets missing values to 0. This has major consequences when the mean log2.intensity is about 25 and standard deviation on the order of ~1. Adding zeros injects huge and un-interesting variance that the PCA will work to display, to the detriment of displaying actual interesting variance in the data. Here's some toy code using random data to demonstrate
plotPCA <- function(mat){
pca <- prcomp(t(mat))
pca.dt <- as.data.table(pca$x, keep.rownames = TRUE)
print (ggplot (pca.dt, aes(x = PC1, y = PC2, label = rn)) + geom_label())
}
# set up a toy matrix
mat <- sapply (1:10, function(x)rnorm(26, mean = 20+x))
colnames(mat) <- LETTERS[1:10]
rownames(mat) <- paste0("protein.", letters)
# inject some missing data in an alternative matrix
mat.w0 <- mat
for (rn in sample(rownames(mat), 13)){
mat.w0[rn, sample(colnames(mat), 3)] <- 0
}
# no missing data
# this looks as expected with a clear trend A->H in PC1
plotPCA(mat)
# with missing data
# this probably looks random in both PC1 and PC2 because its describing mostly the pattern of missing values
plotPCA(mat.w0)
I think the easiest way to deal with this is to limit the matrix to complete cases; complete.cases function is handy for this. This may discard too much data on large or noisy datasets or a dataset with one very sparse run, so a check and possible warning would be good. Alternatively, there are packages to impute missing values in PCA, but I don't know enough about them to recommend a single package, especially not one robust to all cases.
The text was updated successfully, but these errors were encountered:
Another possible solution is to "center" the log2.intensity at zero before setting missing to zero, then the effect will be less. Looks good with my toy data, not sure with real world data yet.
mat <- t(scale(t(mat), center = TRUE, scale = FALSE))
I noticed recently that the artMS PCA plots were very different from my own PCA plots on the same dataset. The difference was in how missing values were treated. I remove rows with missing values. It looks like artMS sets missing values to 0. This has major consequences when the mean log2.intensity is about 25 and standard deviation on the order of ~1. Adding zeros injects huge and un-interesting variance that the PCA will work to display, to the detriment of displaying actual interesting variance in the data. Here's some toy code using random data to demonstrate
I think the easiest way to deal with this is to limit the matrix to complete cases;
complete.cases
function is handy for this. This may discard too much data on large or noisy datasets or a dataset with one very sparse run, so a check and possible warning would be good. Alternatively, there are packages to impute missing values in PCA, but I don't know enough about them to recommend a single package, especially not one robust to all cases.The text was updated successfully, but these errors were encountered: