
# Data Commonly Used in Network Medicine

In network medicine (NetMed), we are often interested in understanding *how genes associated with a particular disease influence each other*, *how similar (or different) two diseases are*, and *how a drug can be used in different settings*.

For that, we need data sets that can represent those associations: **Protein-Protein Interactions** are used as a map of the interactions inside our cells (Section \@ref(PPI)); **Gene-Disease-Associations** are used to identify genes that were previously associated with diseases, often through a GWAS approach (Section \@ref(GDA)); and **Drug-Target interactions**, often measured by identifying the physical binding of a therapeutic compound (usually a drug) to a protein, link drugs to the proteins they act on (Section \@ref(DTs)).

## Protein-Protein Interaction Networks {#PPI}

In PPI networks, the nodes represent proteins, and two proteins are connected by a link if they physically interact with each other [@rual2005]. Typically, these interactions are measured experimentally, for instance with the Yeast-Two-Hybrid (Y2H) system [@uetz2000] or by protein complex immunoprecipitation followed by high-throughput mass spectrometry [@zhang2008; @koh2012], or they are inferred computationally from sequence similarity [@fong2004]. PPIs can be used to infer gene functions and to associate sub-networks with diseases [@Menche2015]. In this type of network, a highly connected protein tends to interact with proteins that are less connected, probably to prevent unwanted cross-talk between functional modules. As mentioned, most of the methods in network medicine are based on PPIs.
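The tendency of hubs to link to poorly connected proteins can be quantified with the degree assortativity coefficient, which is negative when high-degree nodes preferentially attach to low-degree ones. The chunk below is only a minimal sketch on a toy star graph, not on real PPI data; `toy_star` and `toy_r` are hypothetical names used for illustration.

```{r}
library(igraph)

# Toy network: one hub (node 1) connected to five low-degree nodes.
toy_star = make_star(6, mode = "undirected")

# Degree assortativity: correlation between the degrees found at the
# two ends of each link. Negative values mean that hubs tend to
# connect to low-degree nodes (a disassortative network).
toy_r = assortativity_degree(toy_star, directed = FALSE)
toy_r
```

The same function can be applied to the interactome once it has been loaded as `gPPI` later in this chapter.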

### Measuring PPIs

PPI maps are mainly obtained in three different ways:

1.  By creating Protein-Protein interaction maps from the existing scientific literature;

2.  By computationally predicting PPIs from available orthogonal information; and

3.  By systematic experimental mapping of proteins to identify co-complex associations and/or binary interactions. We will focus here only on the third.

Co-complex association mapping interrogates the protein composition of protein complexes in one (or several) cell lines. The most common approach uses affinity purification to extract the proteins that associate with a *bait* protein, followed by mass spectrometry to identify them. This approach is often used for simple organisms; however, similar approaches have been reported for humans. Unfortunately, achieving stable expression of bait proteins is challenging. Co-complex maps are composed of indirect and some direct binary associations. However, raw association data cannot distinguish indirect from direct associations; co-complex datasets therefore have to be filtered and combined with prior knowledge, which might lead to a bias towards super-star genes. On the other hand, for the experimental determination of binary interactions between proteins, all possible pairs of proteins are systematically tested to generate a data set of all possible biophysical interactions.

Because the human genome is composed of \~20,000 unique genes, not even considering their isoforms, there are \~200 million possible pairwise combinations to test. To identify interactions robustly and systematically at this scale, Yeast-Two-Hybrid (Y2H) technology is the only one that can meet this requirement. This technology is able to interrogate hundreds of millions of human protein pairs for binary interactions. In short, the method works as follows: the protein of interest X is fused to a DNA-binding domain (DBD-X) to form the *bait*. A transcriptional activation domain is fused to a protein Y, typically from a cDNA library, (AD-Y) to form the *prey*. Those two form the basis of the protein--protein interaction detection system. When bait and prey interact, the DNA-binding and activation domains are brought together and drive the expression of a reporter gene; without a bait--prey interaction, the activation domain is unable to drive reporter gene expression.
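The figure of roughly 200 million pairs follows from counting unordered pairs of genes. The short sketch below reproduces that arithmetic, assuming a round number of 20,000 genes and ignoring isoforms; the variable names are illustrative only.

```{r}
# Assumed round number of genes (isoforms ignored).
n_genes = 20000

# Number of unordered pairs of distinct proteins: n * (n - 1) / 2.
n_pairs = choose(n_genes, 2)
n_pairs

# Also testing each protein against itself (homodimers) adds n more pairs.
n_pairs_with_self = n_pairs + n_genes
n_pairs_with_self
```

Both counts are on the order of 2 x 10^8, which is why a screening technology must scale to hundreds of millions of tests.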

### Commonly used data sources for PPIs

PPIs are available from different sources. Some well-known databases are listed below.

1.  Binary PPIs derived from high-throughput yeast-two hybrid (Y2H) experiments:

-   HI-Union [@Luck2020].

2.  Binary PPIs derived from three-dimensional (3D) protein structures:

-   Interactome3D [@Mosca2013];
-   Instruct [@Meyer2013];
-   Insider [@Meyer2018].

3.  Binary PPIs derived from literature curation:

-   PINA [@Cowley2012];
-   MINT [@Licata2012];
-   LitBM17 [@Luck2020];
-   Interactome3D;
-   Instruct;
-   Insider;
-   BioGrid [@Chatr-Aryamontri2017];
-   HINT [@Das2012];
-   HIPPIE [@Alanis-Lobato2017];
-   APID [@Alonso-Lopez2019a];
-   InWeb [@li2016].

4.  PPIs identified by affinity purification followed by mass spectrometry:

-   BioPlex [@Huttlin2017];
-   QUBIC [@Hein2015];
-   CoFrac [@wan2015];
-   HINT;
-   HIPPIE;
-   APID;
-   LitBM17;
-   InWeb.

5.  Kinase substrate interactions:

-   KinomeNetworkX [@cheng2014];
-   PhosphoSitePlus [@Hornbeck2015].

6.  Signaling interactions:

-   SignaLink [@Fazekas2013];
-   InnateDB [@Breuer2013].

7.  Regulatory interactions:

-   ENCODE consortium.

### Understanding a PPI

For this workshop, we will use a manually curated PPI that combines all the previous data sets. The data can be [found here](https://github.com/deisygysi/NetMed_Workshop/blob/master/data/PPI_Symbol_Entrez.csv). This PPI was previously published in @Gysi2020a.

Before we can start any analysis using this interactome, let us first understand this data.

The PPI contains the EntrezID and the HGNC symbol of each gene, and some genes might not have a proper mapping. Those entries should be removed from further analysis. Moreover, the network might contain self-loops, and those should also be removed.

Let us begin by preparing our environment and calling all libraries we will need at this point.

```{r prepare_enviroment, warning=FALSE, results='hide', message=FALSE}
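library(data.table)  # fread() for fast file reading
library(tidyr)
library(igraph)      # network construction and analysis
library(dplyr)       # data manipulation verbs
library(magrittr)    # %<>% assignment pipe
library(ggplot2)     # plots
library(kableExtra)  # kbl() and kable_styling(), used to render tables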
```

Let's read in our data.

```{r}
PPI = fread("./data/PPI_Symbol_Entrez.csv")
```

```{r, results='hide'}
head(PPI)
```

```{r, results='markup', echo=FALSE}
head(PPI) %>% kbl() %>% kable_styling()
```

Let's transform our edge-list into a network.

```{r}
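gPPI = PPI %>% 
  select(starts_with("Symbol")) %>%   # keep only the two symbol columns
  filter(Symbol_A != "") %>%          # drop interactions with an unmapped gene
  filter(Symbol_B != "") %>%
  graph_from_data_frame(., directed = FALSE) %>%
  simplify()                          # remove self-loops and duplicated links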

gPPI
```

How many genes do we have? How many interactions?
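To see what the cleaning steps do and how these two questions can be answered, here is a minimal sketch on a made-up edge list. The object `toy_PPI` and its gene names are hypothetical; it contains one unmapped symbol, one self-loop, and one interaction listed twice.

```{r}
library(dplyr)
library(igraph)

# Hypothetical edge list (not real interactions).
toy_PPI = data.frame(
  Symbol_A = c("A", "A", "B", "",  "C", "A"),
  Symbol_B = c("B", "C", "B", "D", "D", "B")
)

toy_g = toy_PPI %>%
  filter(Symbol_A != "", Symbol_B != "") %>%   # drop the unmapped gene
  graph_from_data_frame(directed = FALSE) %>%
  simplify()                                   # drop the loop and the duplicate

vcount(toy_g)  # number of genes (nodes)
ecount(toy_g)  # number of interactions (links)
```

Calling `vcount(gPPI)` and `ecount(gPPI)` gives the corresponding numbers for the interactome.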

Next, let's check the degree distribution:

```{r degree-distribution, fig.cap="PPI Degree Distribution."}
dd = degree(gPPI) %>% table() %>% as.data.frame()
names(dd) = c('Degree', "Nodes")
dd$Degree %<>% as.character %>% as.numeric()
dd$Nodes  %<>% as.character %>% as.numeric()

ggplot(dd) +
  aes(x = Degree, y = Nodes) +
  geom_point(colour = "#1d3557") +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  theme_minimal()
```

Most proteins have few connections, and very few proteins have a lot of connections. Which proteins are those?

```{r, results='markup'}
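# Build an explicit table instead of relying on a column named "."
data.frame(Symbol = V(gPPI)$name, Degree = degree(gPPI)) %>% 
  arrange(desc(Degree)) %>%
  filter(Degree > 1000) 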
```

### Exercises

Now it is your turn. Spend a few minutes exploring the data and getting familiar with it.

1.  What are the top 10 genes with the highest degree?

2.  Are those genes connected?

## Gene Disease Association {#GDA}

Gene-Disease-Association (GDA) databases are typically used to understand the association of genes with diseases and to model the underlying mechanisms of complex diseases. Those associations often come from GWAS and knock-out studies.

### Commonly used data sources for GDAs

Like PPIs, GDAs can be found in different sources, each with a different level of evidence for the gene-disease associations. Some well-known databases are listed below.

-   CTD -- Curated scientific literature [@davis2020]

-   OMIM -- Curated scientific literature [@mckusick2007]

-   DisGeNet -- Based on OMIM, ClinVar, and other databases [@piñero2019]

-   Orphanet -- Validated - and non-validated - GDAs

-   ClinGen -- Validated - and non-validated - GDAs [@rehm2015]

-   ClinVar -- Different levels of evidence [@landrum2019]

-   GWAS catalogue -- GWAS associations to diseases [@buniello2018]

-   PheGenI -- GWAS associations to diseases [@ramos2013]

-   lncRNADisease -- Experimentally validated lncRNAs in diseases [@chen2012]

-   HMDD -- Experimentally validated miRNAs in diseases [@huang2018]

### Understanding a GDA dataset

In this workshop, we will use Gene-Disease-Associations from DisGeNet. The data can be [found here](https://github.com/deisygysi/NetMed_Workshop/blob/master/data/curated_gene_disease_associations.tsv).

Similar to the PPI, let us first get some familiarity with the data, before performing any analysis.

Let's read in the data and, again, do some basic statistics.

```{r, results='hide'}
GDA = fread(file = 'data/curated_gene_disease_associations.tsv', sep = '\t')

head(GDA)
```

```{r, results='markup', echo=FALSE}
head(GDA) %>% kbl() %>% kable_styling()
```

The first thing to notice is the inconsistent capitalisation of the disease names. To be able to work with them, let's first convert every disease name to lower-case. We also keep only the entries whose `diseaseType` is `disease`.

```{r}
Cleaned_GDA = GDA %>% filter(diseaseType == 'disease') %>%
  mutate(diseaseName = tolower(diseaseName)) %>%
  select(geneSymbol, diseaseName, diseaseSemanticType) %>%
  unique() 

dim(Cleaned_GDA)
dim(GDA)

numGenes = Cleaned_GDA %>% 
  group_by(diseaseName) %>%
  summarise(numGenes = n()) %>%
  ungroup() %>%
  group_by(numGenes) %>%
  summarise(numDiseases = n())

```
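To illustrate why the lower-casing step matters, the sketch below applies the same idea to a made-up table in which one disease is spelled with three different capitalisations. The object names and the gene and disease labels are hypothetical and are not taken from DisGeNet.

```{r}
library(dplyr)

# Hypothetical associations with inconsistent capitalisation.
toy_GDA = data.frame(
  geneSymbol  = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  diseaseName = c("Disease X", "DISEASE X", "disease x", "Disease Y")
)

toy_clean = toy_GDA %>%
  mutate(diseaseName = tolower(diseaseName)) %>%
  distinct()   # drop rows that became identical after lower-casing

# Genes per disease after harmonising the names.
toy_counts = toy_clean %>% count(diseaseName, name = "numGenes")
toy_counts
```

Without the lower-casing, the three spellings would be counted as three separate diseases with one gene each.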

Let's also understand the degree distribution of the diseases.

```{r, fig.cap= "Gene-Disease degree distribution."}
ggplot(numGenes) +
  aes(x = numGenes, y = numDiseases) +
  geom_point(colour = "#1d3557") +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  labs(x = "Genes", y = "Diseases")+
  theme_minimal()

```

Because we want to focus on well-studied diseases that are also known to be complex, let's keep only the diseases with more than 10 associated genes.

```{r}
Cleaned_GDA %<>% 
  group_by(diseaseName) %>%
  mutate(numGenes = n()) %>%
  filter(numGenes > 10)

Cleaned_GDA$diseaseName %>%
  unique() %>%  
  length()
```

### Exercises

Now it is your turn. Spend a few minutes exploring the data and getting familiar with it.

1.  What are the top 10 genes involved in the most diseases? What are those diseases?

2.  What are the top 10 most polygenic diseases?

3.  What are the top 10 most polygenic disease classes?

## Drug-Targets {#DTs}

A *druggable* target is a protein, peptide, or nucleic acid that has an activity which can be modulated by a drug. A *drug* can be any small molecular weight chemical compound (SMOL) or a biologic (BIOL), such as an antibody or a recombinant protein that can treat a disease or a symptom.

### Properties of an ideal drug target:

An ideal drug target has several properties that are highly desirable when developing a drug [@Gashaw2011]:

-   Target is disease-modifying and/or has a proven function in the pathophysiology of a disease.

-   Modulation of the target is less important under physiological conditions or in other diseases.

-   If the druggability is not obvious (e.g., as for kinases), a 3D-structure for the target protein or a close homolog should be available for a druggability assessment.

-   Target has a favorable 'assayability' enabling high throughput screening.

-   Target expression is not uniformly distributed throughout the body.

-   A target/disease-specific biomarker exists to monitor therapeutic efficacy.

-   Favorable prediction of potential side effects according to phenotype data (e.g., in k.o. mice or genetic mutation databases).

-   Target has a favorable IP situation (no competitors on target, freedom to operate).

### Commonly used data sources for Drug-Targets

Several high-quality data sets report drug-target interactions. Three good examples are:

1.  DrugBank [@Wishart2006; @wishart2017]

2.  CTD [@davis2020]

3.  Broad Institute Drug Repositioning Hub [@corsello2017]

### Understanding a Drug-Target dataset

For this workshop, we will use the DrugBank drug-target dataset, which can be [found here](https://github.com/deisygysi/NetMed_Workshop/blob/master/data/DB_DrugTargets_1201.csv). The dataset has been parsed beforehand for your convenience: the original DrugBank file is in XML format and needs to be handled carefully to extract the required information.
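To give an idea of what such parsing involves, the following sketch flattens a tiny, made-up XML document into a drug-target table using the `xml2` package (assumed to be installed). The element and attribute names are hypothetical and far simpler than the real DrugBank schema, so this is not the procedure used to build the workshop file.

```{r}
library(xml2)

# Hypothetical, heavily simplified XML (not the DrugBank format).
toy_xml = read_xml('<drugs>
  <drug id="D1"><name>DrugOne</name>
    <target gene="GENE_A"/><target gene="GENE_B"/>
  </drug>
  <drug id="D2"><name>DrugTwo</name>
    <target gene="GENE_A"/>
  </drug>
</drugs>')

toy_drugs = xml_find_all(toy_xml, ".//drug")

# Build one row per drug-target pair, one drug at a time.
toy_rows = lapply(seq_along(toy_drugs), function(i) {
  d = toy_drugs[[i]]
  data.frame(
    ID          = xml_attr(d, "id"),
    Name        = xml_text(xml_find_first(d, "./name")),
    Gene_Target = xml_attr(xml_find_all(d, "./target"), "gene")
  )
})
toy_DT = do.call(rbind, toy_rows)
toy_DT
```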

Similar to the PPI and the GDA, let us understand a little bit of the data set, and what kind of information we have here.

```{r, results='hide'}
DT = fread(file = 'data/DB_DrugTargets_1201.csv')

head(DT)
```

```{r, results='markup', echo=FALSE}
head(DT[,-c(4,10)]) %>% knitr::kable()
```

```{r}
Cleaned_DT = DT %>% 
  filter(organism == 'Humans') %>%
  select(Gene_Target, Name,ID, Type, known_action) %>%
  unique() 

dim(Cleaned_DT)
dim(DT)
head(Cleaned_DT)
```

```{r}
TargetDist = Cleaned_DT %>% 
  group_by(Gene_Target) %>%
  summarise(numDrugs = n()) 

DrugDist = Cleaned_DT %>% 
  group_by(ID) %>%
  summarise(numTargets = n()) 
```
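Note that `n()` counts rows, so a drug-target pair that appears in more than one row (for example with different annotations) is counted more than once. The sketch below, on a made-up table with hypothetical names, contrasts `n()` with `n_distinct()`; whether this matters for the workshop data is something to check rather than assume.

```{r}
library(dplyr)

# Hypothetical table: the pair D1-GENE_A appears twice because the
# two rows differ in the annotation column.
toy_DT2 = data.frame(
  ID           = c("D1", "D1", "D1", "D2", "D3"),
  Gene_Target  = c("GENE_A", "GENE_A", "GENE_B", "GENE_A", "GENE_A"),
  known_action = c("yes", "unknown", "yes", "yes", "yes")
)

toy_TargetDist = toy_DT2 %>%
  group_by(Gene_Target) %>%
  summarise(numRows  = n(),             # rows per target
            numDrugs = n_distinct(ID))  # distinct drugs per target
toy_TargetDist
```

If the two columns differ for the real data, `n_distinct(ID)` is the safer count of drugs per target.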

```{r, fig.cap= "Target distribution", warning=FALSE, message=FALSE}
ggplot(TargetDist) +
  aes(x = numDrugs) +
  geom_histogram(colour = "#1d3557", fill = "#a8dadc" ) +
  labs(x = "Drugs per target", y = "Number of targets")+
  theme_minimal()
```

Which gene is targeted by the most drugs?

```{r, results='markup'}
TargetDist %>%
  arrange(desc(numDrugs)) %>%
  filter(numDrugs > 400)
```

```{r, fig.cap= "Drug distribution", warning=FALSE, message=FALSE}
ggplot(DrugDist) +
  aes(x = numTargets) +
  geom_histogram(colour = "#1d3557", fill = "#a8dadc" ) +
  labs(x = "Targets per drug", y = "Number of drugs")+
  theme_minimal()
```

### Exercises

Let us understand a little bit more about the data.

1.  What are the top 10 genes most targeted by drugs? What types are they mostly?

2.  What are the top 10 most promiscuous drugs? What are their indications?