# Data Commonly Used in Network Medicine
In NetMed, we are often interested in understanding *how genes associated with a particular disease can influence each other*, *how two diseases are similar (or different)*, and *how a drug can be used in different set-ups.*
For that, we need data sets that can represent those associations: **Protein-Protein Interactions** serve as a map of the interactions inside our cells (Section \@ref(PPI)); **Gene-Disease Associations** identify genes that were previously associated with diseases, often through a GWAS approach (Section \@ref(GDA)); and **Drug-Target interactions** are often measured by identifying the physical binding of a therapeutic compound to a protein (Section \@ref(DTs)).
## Protein-Protein Interaction Networks {#PPI}
In PPI networks, the nodes represent proteins, and they are connected by a link if they physically interact with each other [@rual2005]. Typically, these interactions are measured experimentally, for instance, with the Yeast-Two-Hybrid (Y2H) system [@uetz2000] or by protein complex immunoprecipitation followed by high-throughput mass spectrometry [@zhang2008; @koh2012], or they are inferred computationally based on sequence similarity [@fong2004]. PPIs can be used to infer gene functions and the association of sub-networks with diseases [@Menche2015]. In this type of network, a highly connected protein tends to interact with proteins that are less connected, probably to prevent unwanted cross-talk between functional modules. As mentioned, most of the methods in network medicine are based on PPIs.
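One way to put a number on the tendency of hubs to connect to less connected proteins is the degree assortativity coefficient. This measure is an illustrative choice here, not one prescribed by the text. The sketch below uses a made-up hub-and-spoke network (`toy_star` is a placeholder, not workshop data). Negative values indicate that highly connected nodes tend to link to poorly connected ones.

```{r}
library(igraph)

# Toy network: one hub connected to five low-degree nodes (hypothetical data).
toy_star = make_star(6, mode = "undirected")

# Degree assortativity: values below zero mean hubs mostly link to low-degree nodes.
toy_assort = assortativity_degree(toy_star, directed = FALSE)
toy_assort
```

The same call could be applied to a real interactome once it has been loaded as an igraph object.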
### Measuring PPIs
Protein-Protein Interactions can be identified mainly in three different ways:
1. By the creation of Protein-Protein interaction maps derived from existing scientific literature;
2. Using computational predictions of PPIs based on available orthogonal information; and
3. By systematic experimental mapping of proteins to identify co-complex associations and/or binary interactions. We will focus here only on the third.
Co-complex association mapping interrogates the protein composition of a protein complex in one (or several) cell lines. The most common approach uses affinity purification to extract the proteins that associate with the *bait* proteins, followed by mass spectrometry to identify them. This approach is often used for simple organisms; however, similar approaches have been reported for humans. Unfortunately, achieving stable expression of bait proteins is challenging. Co-complex maps are composed of indirect and some direct binary associations. However, raw association data cannot distinguish indirect from direct associations; therefore, co-complex datasets have to be filtered using prior knowledge, which might bias them towards super-star (heavily studied) genes. For the experimental determination of binary interactions, on the other hand, all possible pairs of proteins are systematically tested to generate a data set of all possible biophysical interactions.
Because the human genome contains \~20,000 unique genes, not even considering their isoforms, there are \~200 million possible pairs to test. Yeast-Two-Hybrid (Y2H) technology is the only one that can identify interactions robustly and systematically at this scale: it is able to interrogate hundreds of millions of human protein pairs for binary interactions. In short, the method works as follows: a protein of interest X is fused to a DNA binding domain (DBD-X) to form the *bait*. The fusion of a transcriptional activation domain with a cDNA library Y (AD-Y) results in the *prey*. Those two form the basis of the protein--protein interaction detection system. Without a bait--prey interaction, the activation domain is unable to drive expression of the reporter gene.
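The figure of roughly 200 million pairs follows from counting unordered pairs among about 20,000 genes. As a quick check, assuming exactly 20,000 genes and ignoring isoforms and self-interactions (`n_genes` and `n_pairs` are names introduced only for this illustration):

```{r}
# Assumed round number of unique human genes (isoforms ignored).
n_genes = 20000

# Number of unordered gene pairs: n * (n - 1) / 2.
n_pairs = choose(n_genes, 2)
n_pairs  # 199,990,000 pairs, i.e. roughly 200 million
```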
### Commonly used data sources for PPIs
PPIs can be obtained from different sources. I list here some well-known databases.
1. Binary PPIs derived from high-throughput yeast-two hybrid (Y2H) experiments:
- HI-Union [@Luck2020].
2. Binary PPIs derived from three-dimensional (3D) protein structures:
- Interactome3D [@Mosca2013];
- Instruct [@Meyer2013];
- Insider [@Meyer2018].
3. Binary PPIs derived from literature curation:
- PINA [@Cowley2012];
- MINT [@Licata2012];
- LitBM17 [@Luck2020];
- Interactome3D;
- Instruct;
- Insider;
- BioGrid [@Chatr-Aryamontri2017];
- HINT [@Das2012];
- HIPPIE [@Alanis-Lobato2017];
- APID [@Alonso-Lopez2019a];
- InWeb [@li2016].
4. PPIs identified by affinity purification followed by mass spectrometry:
- BioPlex [@Huttlin2017];
- QUBIC [@Hein2015];
- CoFrac [@wan2015];
- HINT;
- HIPPIE;
- APID;
- LitBM17;
- InWeb.
5. Kinase substrate interactions:
- KinomeNetworkX [@cheng2014];
- PhosphoSitePlus [@Hornbeck2015].
6. Signaling interactions:
- SignaLink [@Fazekas2013];
- InnateDB [@Breuer2013].
7. Regulatory interactions:
- ENCODE consortium.
### Understanding a PPI
For this workshop, we will be using a manually curated PPI that combines all the previous data sets. The data can be [found here](https://github.com/deisygysi/NetMed_Workshop/blob/master/data/PPI_Symbol_Entrez.csv). This PPI was previously published in @Gysi2020a.
Before we can start any analysis using this interactome, let us first understand this data.
The PPI contains the EntrezID and the HGNC symbol of each gene, and some genes might not have a proper mapping. Those entries should be removed from further analysis. Moreover, we might have self-loops, and those should also be removed.
Let us begin by preparing our environment and calling all libraries we will need at this point.
```{r prepare_enviroment, warning=FALSE, results='hide', message=FALSE}
library(data.table)
library(tidyr)
library(igraph)
library(dplyr)
library(magrittr)
library(ggplot2)
library(kableExtra)  # provides kbl() and kable_styling(), used to print tables below
```
Let's read in our data.
```{r}
PPI = fread("./data/PPI_Symbol_Entrez.csv")
```
```{r, results='hide'}
head(PPI)
```
```{r, results='markup', echo=FALSE}
head(PPI) %>% kbl() %>% kable_styling()
```
Let's transform our edge-list into a network.
```{r}
gPPI = PPI %>%
  select(starts_with("Symbol")) %>%
  # drop interactions where either gene lacks a symbol
  filter(Symbol_A != "") %>%
  filter(Symbol_B != "") %>%
  graph_from_data_frame(., directed = FALSE) %>%
  # remove self-loops and duplicated edges
  simplify()
gPPI
```
How many genes do we have? How many interactions?
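Both questions can be answered with igraph's `vcount()` and `ecount()`. Here is a minimal sketch on a made-up edge list (the `toy_*` names and gene labels are placeholders, not part of the workshop data). Note how `simplify()` drops the self-loop and the duplicated edge before counting:

```{r}
library(igraph)

# Hypothetical edge list with one self-loop (C-C) and one duplicated edge (B-A).
toy_edges = data.frame(
  Symbol_A = c("A", "A", "B", "C", "B"),
  Symbol_B = c("B", "C", "C", "C", "A")
)

toy_g = simplify(graph_from_data_frame(toy_edges, directed = FALSE))

vcount(toy_g)  # number of nodes (genes)
ecount(toy_g)  # number of edges (interactions)
```

Applying the same two calls to `gPPI` gives the size of the interactome.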
Next, let's check the degree distribution:
```{r degree-distribution, fig.cap="PPI Degree Distribution."}
dd = degree(gPPI) %>% table() %>% as.data.frame()
names(dd) = c('Degree', "Nodes")
dd$Degree %<>% as.character %>% as.numeric()
dd$Nodes %<>% as.character %>% as.numeric()
ggplot(dd) +
aes(x = Degree, y = Nodes) +
geom_point(colour = "#1d3557") +
scale_x_continuous(trans = "log10") +
scale_y_continuous(trans = "log10") +
theme_minimal()
```
Most of the proteins have few connections, and very few proteins have lots of connections. Which protein is the most connected?
```{r, results='markup'}
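# store gene symbols and degrees in explicitly named columns
deg = degree(gPPI)
data.frame(Symbol = names(deg), Degree = as.numeric(deg)) %>%
  arrange(desc(Degree)) %>%
  filter(Degree > 1000)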
```
### Exercises
Now it is your turn. Spend a few minutes understanding the data and getting familiar with it.
1. What are the 10 genes with the highest degree?
2. Are those genes connected to each other?
## Gene Disease Association {#GDA}
Gene-Disease-Association (GDA) databases are typically used to understand the association of genes with diseases and to model the underlying mechanisms of complex diseases. Those associations often come from GWAS and knock-out studies.
### Commonly used data sources for GDAs
Like PPIs, GDAs can be obtained from different sources, with different levels of evidence for each gene-disease association. I list here some well-known databases.
- CTD -- Curated scientific literature [@davis2020]
- OMIM -- Curated scientific literature [@mckusick2007]
- DisGeNet -- Based on OMIM, ClinVar, and other databases [@piñero2019]
- Orphanet -- Validated - and non-validated - GDAs
- ClinGen -- Validated - and non-validated - GDAs [@rehm2015]
- ClinVar -- Different levels of evidence [@landrum2019]
- GWAS Catalog -- GWAS associations to diseases [@buniello2018]
- PheGenI -- GWAS associations to diseases [@ramos2013]
- lncRNADisease -- Experimentally validated lncRNAs in diseases [@chen2012]
- HMDD -- Experimentally validated miRNAs in diseases [@huang2018]
### Understanding a GDA dataset
In this workshop, we will use Gene-Disease Associations from DisGeNet. The data can be [found here](https://github.com/deisygysi/NetMed_Workshop/blob/master/data/curated_gene_disease_associations.tsv).
Similar to the PPI, let us first get some familiarity with the data, before performing any analysis.
Let's read in the data and, again, do some basic statistics.
```{r, results='hide'}
GDA = fread(file = 'data/curated_gene_disease_associations.tsv', sep = '\t')
head(GDA)
```
```{r, results='markup', echo=FALSE}
head(GDA) %>% kbl() %>% kable_styling()
```
The first thing to notice is the inconsistency in the disease names. To be able to work with them, let's first convert every disease name to lower case.
```{r}
Cleaned_GDA = GDA %>%
  filter(diseaseType == 'disease') %>%
  mutate(diseaseName = tolower(diseaseName)) %>%
  select(geneSymbol, diseaseName, diseaseSemanticType) %>%
  unique()
dim(Cleaned_GDA)
dim(GDA)
numGenes = Cleaned_GDA %>%
  group_by(diseaseName) %>%
  summarise(numGenes = n()) %>%
  ungroup() %>%
  group_by(numGenes) %>%
  summarise(numDiseases = n())
```
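To see why lower-casing matters before counting genes per disease, here is a small sketch on invented rows (`toy_gda` is hypothetical and only mimics two of the DisGeNet columns). The same disease written with different capitalisation would otherwise be counted as several diseases.

```{r}
library(dplyr)

# Hypothetical associations: one disease spelled with three capitalisations.
toy_gda = data.frame(
  geneSymbol  = c("TP53", "TP53", "BRCA1"),
  diseaseName = c("Breast Carcinoma", "breast carcinoma", "Breast carcinoma")
)

# Number of distinct disease names before and after normalising the case.
n_before = n_distinct(toy_gda$diseaseName)
n_after  = n_distinct(tolower(toy_gda$diseaseName))
c(before = n_before, after = n_after)
```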
Let's also understand the degree distribution of the diseases.
```{r, fig.cap= "Gene-Disease degree distribution."}
ggplot(numGenes) +
aes(x = numGenes, y = numDiseases) +
geom_point(colour = "#1d3557") +
scale_x_continuous(trans = "log10") +
scale_y_continuous(trans = "log10") +
labs(x = "Genes", y = "Diseases")+
theme_minimal()
```
Because we want to focus on well-studied diseases that are also known to be complex, let's filter for diseases with more than 10 genes.
```{r}
Cleaned_GDA %<>%
  group_by(diseaseName) %>%
  mutate(numGenes = n()) %>%
  filter(numGenes > 10) %>%
  # drop the grouping so that later steps operate on the whole table
  ungroup()
Cleaned_GDA$diseaseName %>%
  unique() %>%
  length()
```
### Exercises
Now it is your turn. Spend a few minutes understanding the data and getting familiar with it.
1. Which 10 genes are involved in the most diseases? What are those diseases?
2. What are the 10 most polygenic diseases?
3. What are the 10 most polygenic disease classes?
## Drug-Targets {#DTs}
A *druggable* target is a protein, peptide, or nucleic acid that has an activity which can be modulated by a drug. A *drug* can be any small molecular weight chemical compound (SMOL) or a biologic (BIOL), such as an antibody or a recombinant protein that can treat a disease or a symptom.
### Properties of an ideal drug target:
An ideal drug target has several properties that are highly desirable when designing a drug [@Gashaw2011]:
- Target is disease-modifying and/or has a proven function in the pathophysiology of a disease.
- Modulation of the target is less important under physiological conditions or in other diseases.
- If the druggability is not obvious (e.g., as for kinases), a 3D-structure for the target protein or a close homolog should be available for a druggability assessment.
- Target has a favorable 'assayability' enabling high throughput screening.
- Target expression is not uniformly distributed throughout the body.
- A target/disease-specific biomarker exists to monitor therapeutic efficacy.
- Favorable prediction of potential side effects according to phenotype data (e.g., in k.o. mice or genetic mutation databases).
- Target has a favorable IP situation (no competitors on target, freedom to operate).
### Commonly used data sources for Drug-Target interactions
There are several really good data sets that report drug-target interactions. I list here three good examples:
1. DrugBank [@Wishart2006; @wishart2017]
2. CTD [@davis2020]
3. Broad Institute Drug Repositioning Hub [@corsello2017]
### Understanding a Drug-Target dataset
For this workshop, we will use the DrugBank drug-target dataset, which can be [found here](https://github.com/deisygysi/NetMed_Workshop/blob/master/data/DB_DrugTargets_1201.csv). It has been parsed in advance for your convenience: the original DrugBank file is an XML file and needs to be handled carefully to extract the information needed.
Similar to the PPI and the GDA, let us understand a little bit of the data set, and what kind of information we have here.
```{r, results='hide'}
DT = fread(file = 'data/DB_DrugTargets_1201.csv')
head(DT)
```
```{r, results='markup', echo=FALSE}
head(DT[,-c(4,10)]) %>% knitr::kable()
```
```{r}
Cleaned_DT = DT %>%
  filter(organism == 'Humans') %>%
  select(Gene_Target, Name, ID, Type, known_action) %>%
  unique()
dim(Cleaned_DT)
dim(DT)
head(Cleaned_DT)
```
```{r}
TargetDist = Cleaned_DT %>%
group_by(Gene_Target) %>%
summarise(numDrugs = n())
DrugDist = Cleaned_DT %>%
group_by(ID) %>%
summarise(numTargets = n())
```
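The two summaries above count rows per target and per drug. As a minimal sketch of what these counts mean, here is the same idea on an invented table (`toy_dt`, the gene labels and the drug IDs are placeholders, not DrugBank entries), using `dplyr::count()` as a shorthand for `group_by()` followed by `summarise(n())`:

```{r}
library(dplyr)

# Hypothetical drug-target pairs: G1 is hit by three drugs, D2 hits two targets.
toy_dt = data.frame(
  Gene_Target = c("G1", "G1", "G2", "G1"),
  ID          = c("D1", "D2", "D2", "D3")
)

# Drugs per target and targets per drug.
toy_targets = toy_dt %>% count(Gene_Target, name = "numDrugs")
toy_drugs   = toy_dt %>% count(ID, name = "numTargets")

toy_targets
toy_drugs
```

Each row is counted once, so a drug-target pair that appears in several rows would be counted several times. Whether that can happen depends on which columns are kept before calling `unique()`.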
```{r, fig.cap= "Target distribution", warning=FALSE, message=FALSE}
ggplot(TargetDist) +
aes(x = numDrugs) +
geom_histogram(colour = "#1d3557", fill = "#a8dadc" ) +
labs(x = "Number of drugs per target", y = "Number of targets") +
theme_minimal()
```
Which gene is targeted by the most drugs?
```{r, results='markup'}
TargetDist %>%
arrange(desc(numDrugs)) %>%
filter(numDrugs > 400)
```
```{r, fig.cap= "Drug distribution", warning=FALSE, message=FALSE}
ggplot(DrugDist) +
aes(x = numTargets) +
geom_histogram(colour = "#1d3557", fill = "#a8dadc" ) +
labs(x = "Number of targets per drug", y = "Number of drugs") +
theme_minimal()
```
### Exercises
Let us understand a little bit more about the data.
1. What are the 10 genes targeted by the most drugs? What types are they mostly?
2. What are the 10 most promiscuous drugs? What are their indications?