-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathsparseSvd.Rmd
509 lines (369 loc) · 16.6 KB
/
sparseSvd.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
---
title: "4. Sparse Singular Value Decomposition"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
output:
bookdown::pdf_document2:
toc: true
number_sections: true
latex_engine: xelatex
always_allow_html: true
---
```{r, child="_setup.Rmd"}
```
```{r echo = FALSE, warning = FALSE}
library(tidyverse)
library(gridExtra)
```
# Introduction
In high dimensional data the PCs may have unclear interpretation because they depend on all $p$ original variables.
## Example: Toxicogenomics in early drug development
### Background
- Effect of compound on gene expression.
- Insight in action and toxicity of drug in early phase
- Determine activity with bio-assay: e.g. binding afinity of compound to cellwall recepter (target, IC50).
- Early phase: 20 to 50 compounds
- Based on in vitro results one aims to get insight in how to build better compound (higher on-target activity less toxicity.
- Small variantions in moelculair structure lead to variations in BA and gene expression.
- Aim: Build model to predict bio-activity based on gene expression in liver cell line.
### Data
- 30 chemical compounds have been screened for toxicity
- Bioassay data on toxicity screening
- Gene expressions in a liver cell line are profiled for each compound (4000 genes)
```{r}
toxData <- read_csv(
"https://raw.githubusercontent.com/statOmics/HDA2020/data/toxDataCentered.csv",
col_types = cols()
)
svdX <- svd(toxData[,-1])
```
Data is already centered:
```{r}
toxData %>%
colMeans %>%
range
```
```{r}
toxData %>%
names %>%
head
```
- First column contains data on Bioassay.
- The higher the score on Bioassay the more toxic the compound
- Other columns contain data on gene expression X1, ... , X4000
### Data exploration
```{r}
toxData %>%
ggplot(aes(x="",y=BA)) +
geom_boxplot(outlier.shape=NA) +
geom_point(position="jitter")
```
```{r}
svdX <- toxData[,-1] %>%
svd
k <- 2
Vk <- svdX$v[,1:k]
Uk <- svdX$u[,1:k]
Dk <- diag(svdX$d[1:k])
Zk <- Uk%*%Dk
colnames(Zk) <- paste0("Z",1:k)
colnames(Vk) <- paste0("V",1:k)
pca <- Zk %>%
as.data.frame %>%
mutate(BA = toxData %>% pull(BA)) %>%
ggplot(aes(x= Z1, y = Z2, color = BA)) +
geom_point(size = 3) +
scale_colour_gradient2(low = "blue",mid="white",high="red") +
geom_point(size = 3, pch = 21, color = "black")
pca
```
- Scores on the first two principal components (or MDS plot). - Each point corresponds to a compound.
- Color code refers to the toxicity score (higher score more toxic).
- Clear separation between compounds according to toxicity.
---
- Next logic step in a PCA is to interpret the principal components.
- We thus have to assess the loadings.
- We can add a vector for each gene to get a biplot, but this would require plotting 4000 vectors, which would render the plot unreadable.
Alternative graph to look at the many loadings of the first two PCs.
```{r}
grid.arrange(
Vk %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vk)) %>%
ggplot(aes(x = geneID, y = V1)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vk[,1]), col = "red") ,
Vk %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vk)) %>%
ggplot(aes(x = geneID, y = V2)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vk[,2]), col = "red"),
ncol=2)
```
- It is almost impossible to interpret the PCs because there are 4000 genes contributing to each PC.
- In an attempt to find the most important genes (in the sense that they drive the interpretation of the PCs), the plots show horizontal reference lines: the average of the loadings, and the average ± twice the standard deviation of the loadings. In between the lines we expects about 95% of the loadings (if they were normally distributed).
- The points outside the band come from the genes that have rather large loadings (in absolute value) and hence are important for the interpretation of the PCs.
- Note, that particularly for the first PC, only a few genes show a markedly large loadings that are negative. This means that an upregulation of these genes will lead to low scores on PC1.
- These genes will very likely play an important role in the toxicity mechanism.
- Indeed, low scores on PC1 are in the direction of more toxicity.
---
## Sparse matrix decomposition
Basic idea of sparse matrix decomposition of an $n \times p$ matrix $\mathbf{X}$}
Find a rank $k$ approximation of $\mathbf{X}$ of the form
\[
\mathbf{X}_k = \sum_{j=1}^k \delta_j \mathbf{u}_j\mathbf{v}_j^t
\]
such that many of the elements in the $\mathbf{v}_j$ and or $\mathbf{u}_j$ are exactly zero.
We focus on the **Penalised Matrix Decomposition** (PMD) method of Witten *et al.* (2009, Biostatistics, 10, 3, 515-534).
When many of the loadings in $\mathbf{v}_j$ are zero, the interpretation of the PC only depends on the features that correspond to the remaining non-zero loadings.
---
# Penalised Matrix Decomposition
## Rank 1 approximation
The rank-1 PMD approximation of the $n \times p$ matrix $\mathbf{X}$ is of the form
\[
\mathbf{X}_1 = \delta_1\mathbf{u}_1\mathbf{v}_1^t,
\]
and it is the solution to the following minimisation problem:
\[
\min_{\delta,u,v} \Vert \mathbf{X} - \delta\mathbf{u}\mathbf{v}^t\Vert_F^2 \text{ subject to }
\delta \geq 0, \Vert \mathbf{u}\Vert_2^2=\Vert \mathbf{v}\Vert_2^2=1, \Vert\mathbf{u}\Vert_1\leq c_1, \Vert \mathbf{v}\Vert_1 \leq c_2
\]
for some user-defined constants $c_1,c_2>0$.
The constraints $\Vert\mathbf{u}\Vert_1\leq c_1$ and $\Vert \mathbf{v}\Vert_1 \leq c_2$ are known as $L_1$ or lasso constraints.
Note,
\[
\Vert\mathbf{u}\Vert_1\ = \sum_{i=1}^n \vert u_i \vert \text{ and }\Vert\mathbf{v}\Vert_1\ = \sum_{i=1}^p \vert v_i\vert.
\]
If $c_1=+\infty$, then the constraint on $\mathbf{u}$ is removed (similar for $c_2$).
---
Note that
\[
\min_{\delta,u,v} \Vert \mathbf{X} - \delta\mathbf{u}\mathbf{v}^t\Vert_F^2 = \min_{A:\text{rank}(A)=1} \Vert \mathbf{X} - A\Vert_F^2
\]
with $A=X_1$ (truncated SVD of $\mathbf{X}$, truncated after 1 term). However, the PMD requires a constrained minimisation. The unconstrained solution arises with $c_1=c_2=+\infty$.
Note that $\Vert \mathbf{u}\Vert_2^2=\Vert \mathbf{v}\Vert_2^2=1$ may also be written as $\mathbf{u}^t\mathbf{u}=\mathbf{v}^t\mathbf{v}=1$ (i.e. the vectors are normalised, just like for the singular vectors of the SVD).
- We also applied the $L_1$ constraint to the least-squares estimator of the regression parameters in a linear regression model.
- Effect was that many of the parameter estimates are set exactly to zero.
- Here, the same effect will take place. Many of the elements in $\mathbf{u}$ and $\mathbf{v}$ will be set exactly to zero.
- In particular, the smaller $c_2$, the more elements of $\mathbf{v}$ will be set to zero (shrinkage).
- Similarly, the smaller $c_1$, the more elements of $\mathbf{u}$ will be set to zero.
- The choice of $c_1$ and $c_2$ depends on the objective of the data analysis and will be discussed later.
The solution is known as a rank-1 PMD approximation of $\mathbf{X}$ because $\delta_1\mathbf{u}_1\mathbf{v}_1^t$ is a matrix of rank 1.
---
## Rank-2 approximation
The rank-2 PMD approximation is found as follows:
1. find the rank-1 PMD as before: $\mathbf{X}_1=\delta_1\mathbf{u}_1\mathbf{v}_1^t$;
2. compute
\[
\tilde{\mathbf{X}}_2 = \mathbf{X}-\mathbf{X}_1,
\]
and compute the rank-1 PMD of $\tilde{\mathbf{X}}_2$, which gives $\delta_2, \mathbf{u}_2, \mathbf{v}_2$;
3. the rank-2 PMD of $\mathbf{X}$ is then given by
\[
\mathbf{X}_2 = \delta_1\mathbf{u}_1\mathbf{v}_1^t+\delta_2\mathbf{u}_2\mathbf{v}_2^t.
\]
Note that the $\mathbf{u}$'s and $\mathbf{v}$'s are not necessarily orthogonal (only if $c_1, c_2=+\infty$).
---
## Rank-k approximation
The rank-$k$ PMD approximation is found as follows:
1. find the rank-1 PMD as before: $\tilde{\mathbf{X}}_1=\delta_1\mathbf{u}_1\mathbf{v}_1^t$;
2. set $j=1$;
3. compute \[
\tilde{\mathbf{X}}_{j+1} = \mathbf{X}-\tilde{\mathbf{X}}_1-\cdots - \tilde{\mathbf{X}}_j,
\] and compute the rank-1 PMD of $\tilde{\mathbf{X}}_{j+1}$, which gives $\delta_{j+1}, \mathbf{u}_{j+1}, \mathbf{v}_{j+1}$;
4. if $j<k$, increase $j$ with one and go back to step 3; otherwise go to step 5;
5. the rank-$k$ PMD of $\mathbf{X}$ is then given by
\[
\mathbf{X}_k = \sum_{j=1}^k \delta_j\mathbf{u}_j\mathbf{v}_j^t.
\]
Without the $L_1$ constraints, this algorithm gives the ordinary SVD.
As before the $\delta_j$s, $\mathbf{u}_j$s and $\mathbf{v}_j$s are stacked into matrices $\mathbf{D}_k$, $\mathbf{U}_k$ and $\mathbf{V}_k$: $\mathbf{X}_k = \mathbf{U}_k\mathbf{D}_k\mathbf{V}_k^t$.
Note that the $\mathbf{u}$ vectors are generally not orthogonal (i.e. $\mathbf{u}_i^t\mathbf{u}_j\neq 0$ for $i\neq j$) (the same holds for the $\mathbf{v}$ vectors).
---
## Sparse PCA via PMD
Just like the SVD can be used to find the PCA solution, the PMD can be used to find a sparse PCA solution (SPC).
For a PCA, it is only important to have sparse loadings (in the $\mathbf{v}$s); the scores (involving the $\mathbf{u}$s) need not be sparse.
The rank-1 solutions thus satisfy
\[
\min_{\delta,u,v} \Vert \mathbf{X} - \delta\mathbf{u}\mathbf{v}^t\Vert_F^2 \text{ subject to }
\delta \geq 0, \Vert \mathbf{u}\Vert_2^2=\Vert \mathbf{v}\Vert_2^2=1, \Vert \mathbf{v}\Vert_1 \leq c
\]
(i.e. no sparsity constraint on $\mathbf{u}$).
Hence the PCA scores
\[
\mathbf{Z}_k = \mathbf{U}_k\mathbf{D}_k
\]
relate to $\mathbf{X}_k$ through the sparse loadings in $\mathbf{V}_k$.
---
It can be shown that the SPC solution (rank-1) also maximises
\[
\mathbf{v}^t\mathbf{X}^t\mathbf{X}\mathbf{v} \text{ subject to }
\Vert \mathbf{u}\Vert_2^2=\Vert \mathbf{v}\Vert_2^2=1, \Vert \mathbf{v}\Vert_1 \leq c
\]
(i.e. it maximises the variance of the PCs, subject to sparsity constraint on loadings).
From these two slides we may conclude that we can use the PMD as the basis of a sparse PCA, and that the interpretation is as before in terms of maximising variance. Only now we hope that only a few features will give a non-zero loading in the $\mathbf{v}$-vectors. Biplots may again be constructed (with fewer vectors to be plotted).
---
# Toxicogenomics analysis with a sparse PCA
## SPCA
```{r}
library(PMA)
spcX <- SPC(
toxData[,-1] %>%
as.matrix,
K=30,
sumabsv = 5)
par(mfrow=c(1,2))
barplot(spcX$d^2,ylab="eigenvalue",cex.lab=1.5)
barplot(spcX$prop.var.explained,
ylab="cummulative variance",cex.lab=1.5)
```
- The plot on the left shows the eigenvalues (squared singular values) from the PMD.
- This suggests that the first one or two are the most important, but all eigenvalues remain substantially larger than zero.
- The figure on the right shows the cumulative percentage of variance retained in the PCs.
- These percentages are now no longer calculated as before as ratio's of sums of eigenvalues.
- A more complicated calculation is required now, because the PCs are no longer uncorrelated with one another.
- Thus part of the information (variance) of one PC is shared with part of the information of another PCs (i.e. non-zero correlation or covariance).
## Loading
Loadings of the first two PCs, with constraint $\Vert\mathbf{v}\Vert_1 < 5$.
```{r}
k <- 2
Vk <- spcX$v[,1:k]
colnames(Vk) <- paste0("V",1:k)
grid.arrange(
Vk %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vk)) %>%
ggplot(aes(x = geneID, y = V1)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vk[,1]), col = "red") ,
Vk %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vk)) %>%
ggplot(aes(x = geneID, y = V2)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vk[,2]), col = "red"),
ncol=2)
```
These plots show the loadings of the first two sparse PCs. Many of the loadings are now exactly equal to zero. Only few genes show large loadings.
## Change constraint
We now repeat the analysis, but with the constraint $\Vert \mathbf{v} \Vert_1 < 1$. This gives a much sparser solution with only two genes with non-zero loadings in each dimension.
```{r}
spcX1 <- SPC(
toxData[,-1] %>%
as.matrix,
K=30,
sumabsv = 1)
k <- 2
Vk <- spcX1$v[,1:k]
colnames(Vk) <- paste0("V", 1:k)
grid.arrange(
Vk %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vk)) %>%
ggplot(aes(x = geneID, y = V1)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vk[,1]), col = "red") ,
Vk %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vk)) %>%
ggplot(aes(x = geneID, y = V2)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vk[,2]), col = "red"),
ncol=2)
```
---
We now repeat the analysis, but with the constraint $\Vert \mathbf{v} \Vert_1 < 10$. This gives a much less sparse solution with several genes with non-zero loadings.
```{r}
spcX10 <- SPC(
toxData[,-1] %>%
as.matrix,
K=30,
sumabsv = 10)
k <- 2
Vk <- spcX10$v[,1:k]
colnames(Vk) <- paste0("V",1:k)
grid.arrange(
Vk %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vk)) %>%
ggplot(aes(x = geneID, y = V1)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vk[,1]), col = "red") ,
Vk %>%
as.data.frame %>%
mutate(geneID = 1:nrow(Vk)) %>%
ggplot(aes(x = geneID, y = V2)) +
geom_point(pch=21) +
geom_hline(yintercept = c(-2,0,2)*sd(Vk[,2]), col = "red"),
ncol=2)
```
---
## Variance explained
Variance explained for $c=1,5,10$ and $+\infty$.
```{r}
data.frame(
PCs = 1:30,
"c1" = spcX1$prop.var.explained,
"c5" = spcX$prop.var.explained,
"c10" = spcX10$prop.var.explained,
"conventional" = cumsum(svdX$d^2)/sum(svdX$d^2)
) %>%
gather("method","varExpl",-1) %>%
ggplot(aes(x = PCs, y = varExpl, color = method)) +
geom_line() +
geom_point() +
ylab("Variance explained (%)")
```
This graph illustrates that the percentages of variance explained by the PCs are smaller for sparser solutions. Thus in choosing an appropriate penalty parameter $c_2$ one should not only look at the sparseness of the solution, but also at the percentage of variance contained in the first few PCs. If this percentage is very small, then the graphs of the scores (or biplots) are no longer very informative (i.e. they will not tell the full story). Thus a compromise must be looked for.
```{r}
library(latex2exp)
Uk5 <- spcX$u[,1:k]
Dk5 <- diag(spcX$d[1:k])
Zk5 <- Uk5%*%Dk5
colnames(Zk5) <- paste0("Z",1:k)
spc5 <- Zk5 %>%
as.data.frame %>%
mutate(BA = toxData %>% pull(BA)) %>%
ggplot(aes(x= Z1, y = Z2, color = BA)) +
geom_point(size = 3) +
scale_colour_gradient2(low = "blue",mid="white",high="red") +
geom_point(size = 3, pch = 21, color = "black") +
ggtitle(TeX("$c_2 = 5$"))
Uk1 <- spcX1$u[,1:k]
Dk1 <- diag(spcX1$d[1:k])
Zk1 <- Uk1%*%Dk1
colnames(Zk1) <- paste0("Z",1:k)
spc1 <- Zk1 %>%
as.data.frame %>%
mutate(BA = toxData %>% pull(BA)) %>%
ggplot(aes(x= Z1, y = Z2, color = BA)) +
geom_point(size = 3) +
scale_colour_gradient2(low = "blue",mid="white",high="red") +
geom_point(size = 3, pch = 21, color = "black") +
ggtitle(TeX("$c_2 = 1$"))
Uk10 <- spcX10$u[,1:k]
Dk10 <- diag(spcX10$d[1:k])
Zk10 <- Uk10%*%Dk10
colnames(Zk10) <- paste0("Z",1:k)
spc10 <- Zk10 %>%
as.data.frame %>%
mutate(BA = toxData %>% pull(BA)) %>%
ggplot(aes(x= Z1, y = Z2, color = BA)) +
geom_point(size = 3) +
scale_colour_gradient2(low = "blue",mid="white",high="red") +
geom_point(size = 3, pch = 21, color = "black") +
ggtitle(TeX("$c_2 = 10$"))
grid.arrange(spc1, spc5, spc10, pca + ggtitle(TeX("$c_2 = \\infty")), ncol = 2)
```
Finally we show the plots of the scores on the first two sparse PCs for $c_2=1$ (top-left), $c_2=5$ (top-right) and $c_2=10$ (bottom-left) and $c_2=\infty$ (bottom-right, convention SVD).
Note, that the most sparse solution (top-left) shows a strong positive correlation between the first two PCs (as a consequence of the loss of orthogonality of the singular vectors).
All graphs succeed quite well in separating the toxic from the non-toxic compounds based on their gene expression.
- Hence, we may expect that the two genes identified from the most sparse solution play an important role in the development of toxic effects.
- Note, however, that our data analysis did not provide a scientific proof for this conclusion: (sparse) PCA is only an exploratory or descriptive data analysis method that helps the data-analyst to gain insight into the data.
- The fact that we found a few genes that appear to be associated with toxicity is of course a nice result.
- It is important to understand that the toxicity outcome was not used in the (sparse) PCA. We only used the toxicity to color the points in the graphs. PCA is an unsupervised method.
# Acknowledgement {-}
- Olivier Thas for sharing his materials of Analysis of High Dimensional Data 2019-2020, which I used as the starting point for this chapter.
```{r, child="_session-info.Rmd"}
```