-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path03-method.Rmd
294 lines (212 loc) · 12.2 KB
/
03-method.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
# Methods for Disease Module Identification and Disease Similarity {#methods}
In this chapter, I will introduce the main methods used in Network Medicine. We will start by understanding what a Disease Module is (Session \@ref(diseasemodule)), how we can calculate its significance, and also understand its importance. Next, we will explore the disease separation (Session \@ref(dissep)), how to calculate, and make interpretations.
## Disease Module {#diseasemodule}
In biological networks, genes are often involved in the same topological communities are also associated with similar biological processes [@Ahn2010]. It also reflects on *how diseases localized themselves in the interaction*; meaning that, disease modules are highly localized in specific network neighborhoods [@Menche2015] (Figure \@ref(fig:diseasemodule)).
### Largest connected component
The size of the largest connected component (LCC) is the number of nodes that form a connected subgraph (in our case, it is the number of proteins that are interconnected in the PPI). Many properties of this quantity allow us to understand how a particular disease interacts with the interactome. It is important to note here that this measure is highly dependent on the completeness of an interactome. If a link between a protein and its counterparts is unknown -- therefore missing -- we might say that that particular node is not involved in a disease module (or that the LCC is not significant).
```{r diseasemodule,fig.cap='Disease-Module. In a schematic of a PPI, in pink, we see genes associated with a disease, forming a connected component of size 4.', results='hide', echo=FALSE, warning=FALSE, message=FALSE}
require(NetSci)
require(magrittr)
require(dplyr)
require(igraph)
set.seed(124)
N = 25
DM = c("C", "F", "N", "B", "K")
A = data.frame(source = sample(LETTERS[1:5], size = N, replace = T),
target = sample(LETTERS[1:15], replace = T, size = N), type = "PPI")
A$type = ifelse(A$source %in% DM & A$target %in% DM, "DM", "no")
A = unique(A)
A %<>% filter(source != target)
g = igraph::graph_from_data_frame(A, directed = F)
V(g)$color = "#FFCDB2"
V(g)$size = (degree(g)+1)*5
V(g)$label.color = '#B5838D'
V(g)$color = ifelse(V(g)$name %in% DM, '#B65064', V(g)$color )
E(g)$color = '#E5989B'
E(g)$color = ifelse(E(g)$type == "DM", E(g)$color, "gray70")
E(g)$width = 0.6
E(g)$width = ifelse(E(g)$type == "DM", 0.6, E(g)$width)
E(g)$curved = 0.1
V(g)$frame.color = V(g)$color
par(mar = c(0,0,0,0))
plot(g)
```
However, just computing this number might not be informative, and it is expected a randomness. To calculate this randomness, we often calculate the significance of the LCC by selecting proteins in the interactome with similar degrees (aka degree preserving randomization).
To calculate the significance of the LCC, one can calculate its Z-Score or simply calculate the empirical probability under the curve from the empirical distribution. The Z-score is given by:
$$
Z-Score_{LCC} = \frac{LCC - \mu_{LCC}}{\sigma_{LCC}}.
$$
### Example in real data
Our first task now is to understand if some diseases, from our `Cleaned_GDA` are able to form a Disease-Module. Let's start doing it for Schizophrenia and later we will add some more diseases.
The idea now is: Gather the genes associated to our disease in the data, find them in the PPI, check if they form a connected component, check the significance of the component and visualize the Disease-Module.
```{r, echo=TRUE, results='hide', warning=FALSE, message=FALSE}
# First, let's attach all packages we will need.
require(NetSci)
require(magrittr)
require(dplyr)
require(igraph)
```
```{r, results='hide'}
#First, let's select genes that are associated with Schizophrenia.
SCZ_Genes =
Cleaned_GDA %>%
filter(diseaseName %in% 'schizophrenia') %>%
pull(geneSymbol) %>%
unique()
# Next, let's see how they are localized in the PPI.
# Fist, we have to make sure all genes are in the PPI.
# Later, we calculate the LCC.
# And lastly, let's visualize it.
SCZ_PPI = SCZ_Genes[SCZ_Genes %in% V(gPPI)$name]
gScz = gPPI %>%
induced.subgraph(., SCZ_PPI)
components(gScz)
```
```{r}
components(gScz)$csize %>% max
```
```{r}
# The size of the LCC is 683. But... How does it compare to a random selection genes?
LCC_scz = LCC_Significance(N = 1000, Targets = SCZ_PPI,
G = gPPI)
Histogram_LCC(LCC_scz)
```
```{r}
gScz
V(gScz)$size = degree(gScz) %>%
CoDiNA::normalize()
V(gScz)$size = (V(gScz)$size + 0.1)*5
V(gScz)$color = '#83c5be'
V(gScz)$frame.color = '#006d77'
V(gScz)$label = ifelse(V(gScz)$size > 4, V(gScz)$name, NA )
V(gScz)$label.color = '#e29578'
E(gScz)$width = edge.betweenness(gScz, directed = F) %>% CoDiNA::normalize()
E(gScz)$width = E(gScz)$width + 0.01
E(gScz)$weight = E(gScz)$width
par(mar = c(0,0,0,0))
plot(gScz)
```
```{r}
gScz %<>% delete.vertices(., degree(.) == 0)
V(gScz)$size = degree(gScz) %>%
CoDiNA::normalize()
V(gScz)$size = (V(gScz)$size + 0.1)*5
V(gScz)$color = '#83c5be'
V(gScz)$frame.color = '#006d77'
V(gScz)$label = ifelse(V(gScz)$size > 4, V(gScz)$name, NA )
V(gScz)$label.color = '#e29578'
E(gScz)$width = edge.betweenness(gScz, directed = F) %>% CoDiNA::normalize()
E(gScz)$width = E(gScz)$width + 0.01
E(gScz)$weight = E(gScz)$width
par(mar = c(0,0,0,0))
plot(gScz)
```
### Exercises
1. Calculate the LCC, and visualize the modules for the following diseases:
- Autistic Disorder;
- Obesity;
- Hyperlipidemia;
- Rheumatoid Arthritis.
2. Choose any disease of your interest and do the same thing.
## Gene Overlap
A first intuitive way to measure the overlap of two gene sets is by calculating its overlap, or its normalized overlap, the **Jaccard Index**. The Jaccard index is calculated by taking the ratio of **Intersection of two sets over the Union of those sets**. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:
$$
J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}.
$$
Note that by design, $0 \leq J(A,B) \leq 1$. If A and B are both empty, define $J(A,B) = 1$.
Let's calculate the Jaccard Index for the five diseases we calculated its LCCs.
```{r}
Dis_Ex1 = c('schizophrenia',
"autistic disorder",
'obesity',
'hyperlipidemia',
'rheumatoid arthritis')
GDA_Interest = Cleaned_GDA %>%
filter(diseaseName %in% Dis_Ex1) %>%
select(diseaseName, geneSymbol) %>%
unique()
Jaccard_Ex2 = Jaccard(GDA_Interest)
Jaccard_Ex2
```
```{r}
# Let's visualize the Venn diagram (Euler Diagram) of those overlaps.
require(eulerr)
Euler_List = list (
SCZ = GDA_Interest$geneSymbol[GDA_Interest$diseaseName == 'schizophrenia'],
ASD = GDA_Interest$geneSymbol[GDA_Interest$diseaseName == 'autistic disorder'],
OB = GDA_Interest$geneSymbol[GDA_Interest$diseaseName == 'obesity'],
HD = GDA_Interest$geneSymbol[GDA_Interest$diseaseName == 'hyperlipidemia'],
RA = GDA_Interest$geneSymbol[GDA_Interest$diseaseName == 'rheumatoid arthritis'])
EULER = euler(Euler_List)
plot(EULER, quantities = TRUE)
```
## Disease Separation {#dissep}
When looking into the Jaccard Index, we have a sense of how similar two diseases are based on genes that are **known** to be associated with both diseases. The main problem with this is that we assume that all genes associated with a disease is known, and we do not take the topology of the underlying network into account.
The **separation** is a complementary quantity that is a bit less sensitive to the incompleteness of the PPI, we can measure the distances $d_s$ of each disease-associated node to all other disease associated nodes. Taking into account only the shortest distance between them results in a distribution $P(d_s)$. The mean value $<d_s>$ can be interpreted as the diameter of the disease model. **Note** the diameter here is the average distance instead of the maximal distance.
The **concept of network localization** can be further generalized to examine the relationship between any different sets of nodes, for example, proteins associated with two different diseases.
The network serves as a **map**, where diseases are represented by different neighborhoods.
How close and the degree of overlap of two network neighborhoods can be found to be highly predictive of the pathological similarity of those diseases [@Menche2015] (Figure \@ref(fig:separation)).
To quantify the distance of two sets of nodes A and B, we first compute the distribution $P(d_{AB})$ of all shortest distances $d_{AB}$ between nodes A and B and the respective mean distance $<d_{AB}>$.
The network based separation $S_{AB}$ can be obtained by comparing the mean shortest distance **within** the respective node sets and the mean shortest distance **between** them.
$$
S_{AB} = <d_{AB}> - \frac{<d_{AA}> + <d_BB>}{2}.
$$
**Note**: negative $S_{AB}$ indicates topological overlap of the two node sets, while a positive $S_{AB}$ indicates a topological separation of the two node sets.
The size of the overlap is highly predictive of pathological and functional similarity, elevated co-expression, symptoms similarity, and high comorbidity diseases.
```{r separation,fig.cap='Disease-Separation. In a schematic PPI, we see genes associated with a disease A (in pink), and genes associated to disease B (in green).', results='hide', echo=FALSE}
set.seed(124)
N = 35
DM = c("C", "K", "N", "E", "O")
DM2 = c("A", "G", "L", "M")
A = data.frame(source = sample(LETTERS[1:6], size = N, replace = T),
target = sample(LETTERS[1:15], replace = T, size = N), type = "PPI")
A$type = ifelse(A$source %in% DM & A$target %in% DM, "DM1", "no")
A$type = ifelse(A$source %in% DM2 & A$target %in% DM2, "DM2", A$type)
A = unique(A)
A %<>% filter(source != target)
g = igraph::graph_from_data_frame(A, directed = F) %>%
simplify(remove.multiple = F)
V(g)$color = "#FFCDB2"
V(g)$size = (degree(g)+1)*5
V(g)$label.color = '#B5838D'
V(g)$color = ifelse(V(g)$name %in% DM, '#B65064', V(g)$color )
V(g)$color = ifelse(V(g)$name %in% DM2, '#74c69d', V(g)$color )
V(g)$label.color = ifelse(V(g)$name %in% DM, 'gray90', V(g)$label.color )
V(g)$label.color = ifelse(V(g)$name %in% DM2, '#2d6a4f', V(g)$label.color )
E(g)$color = 'gray70'
E(g)$color = ifelse(E(g)$type == "DM1", "#FFB4A2", E(g)$color)
E(g)$color = ifelse(E(g)$type == "DM2", "#95d5b2", E(g)$color)
E(g)$width = 0.6
E(g)$width = ifelse(E(g)$type == "DM1" |E(g)$type == "DM2" , 1.5, E(g)$width)
E(g)$curved = 0.1
V(g)$frame.color = V(g)$color
par(mar = c(0,0,0,0))
plot(g)
```
The separation of diseases A and B is given by: $$
<d_{AA}> = 1.5
$$
$$
<d_{BB}> = 1.5
$$
$$
<d_{AB}> = 2.7
$$ $$
S_{AB} = 2.7 - \frac{1.5+ 1.5}2 = 1.2.
$$
### Example in real data
```{r}
sab = separation(gPPI, GDA_Interest)
Sep_ex2 = sab$Sab %>% as.matrix()
Sep_ex2[lower.tri(Sep_ex2)] = t(Sep_ex2)[lower.tri(Sep_ex2)]
```
We can visualize the network separation of the diseases using a heatmap.
```{r, }
Sep_ex2 %>% heatmap(., symm = T)
```
## Exercises
1. If we go back to our PPI, can we identify that the modules are indeed close or separated? Plot the network for those diseases.
2. Calculate the **Jaccard Index** and the **Separation** for the following diseases:
- Schizophrenia, Bipolar Disorder, Intellectual Disability, Depressive disorder, Autistic Disorder, Unipolar Depression, Mental Depression, Major Depressive Disorder, Mood Disorders, Cocaine Dependence, Cocaine Abuse, Cocaine-Related Disorders, Substance abuse problem, Drug abuse, Drug Dependence, Drug habituation, Drug Use Disorders, Substance-Related Disorders, Psychotic Disorders, Obesity, hyperlipidemia, Rheumatoid Arthritis, Prostatic Neoplasms, Mammary Neoplasms, Mammary Neoplasms, Human Malignant neoplasm of stomach, Stomach Neoplasms, Colorectal Neoplasms, Malignant neoplasm of lung, Lung Neoplasms, Malignant neoplasm of prostate.
3. Optional: Try to make the network visualization for the heatmap of `Sep_ex2`. Use diseases as nodes, and their weight as links.
4. Optional: Plot the PPI with genes selected in `GDA_Interest`, where each node is a piechart representing which diseases are associated with that particular gene. Tip: Check `vertex.shape.pie` for help.