---
title: "Gene-specific tuning of integrative GRN inference : analysing inferred GRNs (MSE, precision and recall)"
output: 
  html_document:
    df_print: paged
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: hide
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = F, message = F, fig.width = 10)

source('inference_functions/weightedRF.R')
source('inference_functions/weightedLASSO.R')
source('inference_functions/evaluateNetwork.R')
source('inference_functions/data_integration_optimization.R')

library(ggplot2)
library(tidyverse)
library(ggpubr)
library(patchwork)
library(ggrepel)
library(ggVennDiagram)
library(ComplexHeatmap)
# library(clusterProfiler)
library(circlize)

theme_set(theme_bw())
```

This document loads the results generated by `Gene_specific_optimisation_of_TFBM_integration.Rmd` and studies the inferred GRNs. In particular, it shows how effective data integration (EDI), MSE and TFBM support behave as alpha increases, and reports precision and recall of the inferred GRNs against DAP-Seq interactions. It can be used to reproduce the main figures of the paper.

# Data import

Import the expression data and the binding-motif data for the nitrate-responsive genes and regulators:

```{r}
load('rdata/inference_input_N_response_varala.rdata')
genes <- input_data$grouped_genes; length(genes)
tfs <- input_data$grouped_regressors; length(tfs)
counts <- input_data$counts; dim(counts)
load("rdata/pwm_occurrences_N_response_varala.rdata")
dim(pwm_occurrence)

ALPHAS <- seq(0,1, by = 0.1)
```


# Showing GRN results for a global optimisation of alpha

The code below examines the stored results.

## Plotting EDI behaviour for all genes

How does the effective data integration (EDI) vary when alpha increases?

### In weightedRF

```{r, fig.height=20}

col_fun = colorRamp2(c(1, 201), hcl_palette = "Blue-Red 3")

# loads the mats object (importance of TF-target interactions)
load("results/rdata/weightedRF_importances_100rep.rdata")


# getting mean EDI values for all genes as a function of alpha
edi_rf <- mcsapply(genes, draw_gene_effective_integration, mats = mats,
                          return=T, mc.cores=40)

edi <- t(as.data.frame(edi_rf[3,]))
colnames(edi) = ALPHAS
edi <- na.omit(edi)
ha = HeatmapAnnotation(
    alpha = anno_simple(colnames(edi)),
    annotation_name_side = "left")

rf <- Heatmap(edi, col = col_fun,
        show_column_names = T,
         width = ncol(edi)*unit(10, "mm"), 
      height = nrow(edi)*unit(0.2, "mm"),
      name = "EDI",
        cluster_columns = F, show_row_names = F)

png("results/supp_figures/EDI_all_genes_weightedRF.png", 
    res = 300, height = 1700+2200, width = 2000)
rf
dev.off()
```

### In weightedLASSO

```{r, fig.height=20}
# loads the mats object (importance of TF-target interactions)
# load("results/rdata/weightedLASSO_importances_50rep.rdata")
load("results/rdata/weightedEN_importances_50rep.rdata")

# getting mean EDI values for all genes as a function of alpha
edi_lasso <- mcsapply(genes, draw_gene_effective_integration, mats = mats,
                          return=T, mc.cores=40)

edi <- t(as.data.frame(edi_lasso[3,]))
colnames(edi) = ALPHAS
edi <- na.omit(edi)
lasso <- Heatmap(edi, 
        col = col_fun, show_column_names = T,
         width = ncol(edi)*unit(10, "mm"), 
      height = nrow(edi)*unit(0.2, "mm"),
      name = "EDI",
        cluster_columns = F, show_row_names = F)
png("results/supp_figures/EDI_all_genes_weightedEN.png", 
    res = 300, height = 1700+2200, width = 2000)
lasso
dev.off()
```


## TFBM support of globally optimized networks

```{r}
# loads lasso edges
load(file = "results/rdata/weightedLASSO_edges_50rep.rdata")
edges_num <- lapply(edges, function(df) df[sapply(df, is.numeric)])
d <- data.frame(settings = names(unlist(lapply(edges_num, FUN = nrow))),
                pwm = unlist(lapply(edges_num, FUN = colMeans)))
d[c("model", "alpha", "dataset", "rep", "density")] <- str_split_fixed(d$settings, '_', 5)
d_lasso <- d


# loads rf edges
load(file = "results/rdata/weightedRF_edges_100rep.rdata")
edges_num <- lapply(edges, function(df) df[sapply(df, is.numeric)])
d <- data.frame(settings = names(unlist(lapply(edges_num, FUN = nrow))),
                pwm = unlist(lapply(edges_num, FUN = colMeans)))
d[c("model", "alpha", "dataset", "rep", "density")] <- str_split_fixed(d$settings, '_', 5)
d <- rbind.data.frame(d, d_lasso)

# plots the TFBM support for both models :
pwm_support <- d %>%
  mutate(alpha = as.numeric(alpha),
         model = str_replace(model, "bRF", "weightedRF"),
         model = str_replace(model, "LASSO", "weightedLASSO"),
         density = paste("D =", density)) %>%
  ggplot(aes(color = dataset, x = alpha, y = pwm)) +
  geom_point() + 
  geom_smooth() +
  theme_bw()+
  ggh4x::facet_nested_wrap(vars(model, density), ncol =3, nest_line = T) + 
  theme(strip.background = element_blank(), axis.title.x = element_text(size = 22),
        title = element_text(size = 16), strip.text = element_text(size = 16), 
        legend.text = element_text(size = 15), axis.text = element_text(size = 12)) +
  xlab(expression(alpha)) + ylab("Mean of TFBM scores") + 
  scale_color_manual(values = setNames(c("grey", "#70AD47"), c("shuffled", "trueData")))+
  ggtitle("Mean of TFBM scores in inferred GRN edges") ; pwm_support
ggsave(pwm_support, file = "results/supp_figures/TFBM_support_en.pdf", width = 11, height = 8)
```

We do reach a TFBM support of 1 for $\alpha = 1$, which is the desired behavior.
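
The chunk above recovers the settings of each network by splitting its name on underscores into five fields (model, alpha, dataset, replicate, density). Here is a minimal base-R sketch of that parsing step; the two names in `toy_settings` are made up for illustration and only assume the five-field layout used above.

```{r}
# Hypothetical network names following the five-field layout
# model_alpha_dataset_rep_density (values are made up)
toy_settings <- c("LASSO_0.4_trueData_3_0.005", "bRF_1_shuffled_12_0.01")

# Split every name on "_" and bind the pieces into one row per network
toy_fields <- do.call(rbind, strsplit(toy_settings, "_", fixed = TRUE))
colnames(toy_fields) <- c("model", "alpha", "dataset", "rep", "density")

toy_df <- as.data.frame(toy_fields, stringsAsFactors = FALSE)
# alpha is parsed as text, so it is converted before being used as an x axis
toy_df$alpha <- as.numeric(toy_df$alpha)
toy_df
```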


## Distributions of optimal alphas

How is the optimal value of TFBM integration distributed across all nitrate-responsive genes?

```{r}
# optimal values of alpha :
load("results/rdata/gene_specific/alpha_per_gene_weightedLASSO_student.rdata")
load("results/rdata/gene_specific/alpha_per_gene_weightedRF_student.rdata")

hists <- data.frame(weightedLASSO = stud_lasso, 
                    weightedRF = stud_rf) %>%
  gather(key = "model", value = "alpha") %>%
ggplot(aes(x = alpha, fill = model)) +
  geom_histogram()+
  facet_wrap(~model, ncol = 2)+
  theme(axis.title.y = element_blank(), legend.position = "none")+
  scale_fill_manual( values = c("#C55A11","#E67F87"))+theme_bw()+
  theme(strip.background = element_blank(),
        legend.position = "none", strip.text = element_text(size = 12),
        axis.title = element_text(size = 12))+
  xlab(expression(alpha))+ ylab("Number of genes")
ggexport(hists, filename  = "results/main_figures/alpha_histograms.pdf", height = 2.5, width = 5.5)
```


# Plotting MSE behaviour for all genes

How does the MSE change with $\alpha$? Genes are split into two classes, depending on whether their optimal alpha is different from 0 or not.
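
The heatmaps in this section show the MSE standardised per gene, so that the shape of the curve along alpha can be compared between genes with different absolute errors. Here is a minimal sketch of that row-wise standardisation on a made-up matrix; the `toy_*` objects and their values are hypothetical, and base R `apply(..., 1, sd)` is used here in place of `matrixStats::rowSds`.

```{r}
# Toy illustration only: 3 hypothetical genes x 3 alpha values (made-up MSEs)
toy_mse <- matrix(c(0.30, 0.25, 0.20,
                    0.10, 0.10, 0.40,
                    0.50, 0.60, 0.70),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("gene", 1:3),
                                  paste("alpha", c(0, 0.5, 1))))

# Centre each gene (row) on its mean MSE and divide by its standard deviation.
# A vector of length nrow() is recycled down the columns, so the operation
# is applied row by row.
toy_z <- (toy_mse - rowMeans(toy_mse)) / apply(toy_mse, 1, sd)
round(toy_z, 2)
```

After this transformation every row has mean 0 and standard deviation 1, so the colour scale reflects where each gene's MSE is low or high relative to its own range.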

## weightedRF

```{r, fig.height=10}
load("results/rdata/weightedRF_mse_100rep.rdata")
load(file = "results/rdata/gene_specific/alpha_per_gene_weightedRF_student.rdata")
alphas_rf <-stud_rf

pos_class <- names(alphas_rf[alphas_rf!=0])
mse <- lmses[str_detect(colnames(lmses), "trueData")]
for(alpha in seq(0,1, by = 0.1)){
  mse[,paste("alpha",alpha)] <- rowMeans(mse[,str_detect(colnames(mse), paste0("RF_",as.character(alpha), "_"))])
}
mse <- as.matrix(mse[str_detect(colnames(mse), "alpha")])
mse_interest <- mse[pos_class,]
mse_other <- mse[setdiff(rownames(mse), pos_class),]

col_fun = colorRamp2(c(-2, 0, 2), hcl_palette = "Blue-Red 3")

ha = HeatmapAnnotation(
    alpha = anno_simple(as.numeric(str_remove(colnames(mse), "alpha "))),
    annotation_name_side = "left")

png("results/main_figures/rf_mse_interest.png", res = 300, height = 2000, width = 2000)
Heatmap((mse_interest-rowMeans(mse_interest))/matrixStats::rowSds(mse_interest), 
        col = col_fun, show_column_names = F,name = "MSE",
         width = ncol(mse_interest)*unit(10, "mm"), 
      height = nrow(mse_interest)*unit(0.2, "mm"),
        cluster_columns = F, show_row_names = F)+ 
  rowAnnotation(class = ifelse(alphas_rf[rownames(mse_interest)]>0, 
                               "TFBM integration", "no TFBM integration"), 
                col = list(class = setNames(c("#70AD47", "grey"), 
                                            nm = c("TFBM integration", "no TFBM integration"))))
dev.off()
png("results/main_figures/rf_mse_others.png", res = 300, height = 2000, width = 2000)
Heatmap((mse_other-rowMeans(mse_other))/matrixStats::rowSds(mse_other),
        col = col_fun,show_column_names = F,
        width = ncol(mse_other)*unit(10, "mm"), 
      height = nrow(mse_other)*unit(0.2, "mm"),
        cluster_columns = F, show_row_names = F)+ 
  rowAnnotation(class = ifelse(alphas_rf[rownames(mse_other)]>0, 
                               "TFBM integration", "no TFBM integration"), 
                col = list(class = setNames(c("#70AD47", "grey"), 
                                            nm = c("TFBM integration", "no TFBM integration"))))
dev.off()
```

## weightedLASSO

```{r, fig.height=10}

load("results/rdata/weightedLASSO_mse_50rep.rdata")
load("results/rdata/gene_specific/alpha_per_gene_weightedLASSO_true_sd.rdata")
alphas_lasso <- stud_lasso
pos_class <- names(alphas_lasso[alphas_lasso!=0])

mse <- lmses[str_detect(colnames(lmses), "trueData")]
for(alpha in seq(0,1, by = 0.1)){
  mse[,paste("alpha",alpha)] <- rowMeans(mse[,str_detect(colnames(mse), paste0("LASSO_",as.character(alpha), "_"))])
}
mse <- as.matrix(mse[str_detect(colnames(mse), "alpha")])
mse_interest <- mse[pos_class,]
mse_other <- mse[setdiff(rownames(mse), pos_class),]

library(circlize)
col_fun = colorRamp2(c(-2, 0, 2), hcl_palette = "Blue-Red 3")

ha = HeatmapAnnotation(
    alpha = anno_simple(as.numeric(str_remove(colnames(mse), "alpha "))),
    annotation_name_side = "left")

png("results/main_figures/lasso_mse_interest.png", res = 300, height = 2000, width = 2000)
Heatmap((mse_interest-rowMeans(mse_interest))/matrixStats::rowSds(mse_interest), 
        col = col_fun, show_column_names = F,
         width = ncol(mse_interest)*unit(10, "mm"), 
      height = nrow(mse_interest)*unit(0.2, "mm"),
        cluster_columns = F, show_row_names = F)+ 
  rowAnnotation(class = ifelse(alphas_lasso[rownames(mse_interest)]>0, 
                               "TFBM integration", "no TFBM integration"), 
                col = list(class = setNames(c("#70AD47", "grey"), 
                                            nm = c("TFBM integration", "no TFBM integration"))))
dev.off()
png("results/main_figures/lasso_mse_others.png", res = 300, height = 2000, width = 2000)
Heatmap((mse_other-rowMeans(mse_other))/matrixStats::rowSds(mse_other),
        col = col_fun,show_column_names = F,
        width = ncol(mse_other)*unit(10, "mm"), 
      height = nrow(mse_other)*unit(0.2, "mm"),
        cluster_columns = F, show_row_names = F)+ 
  rowAnnotation(class = ifelse(alphas_lasso[rownames(mse_other)]>0, 
                               "TFBM integration", "no TFBM integration"), 
                col = list(class = setNames(c("#70AD47", "grey"), 
                                            nm = c("TFBM integration", "no TFBM integration"))))
dev.off()
```

# Properties of inferred GRNs

## Precision and recall of global optimisation of alpha

The following code validates globally inferred GRNs, with a given density, against an experimental gold standard (here, DAP-Seq interactions).
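
As a reminder of what these two metrics measure, here is a minimal sketch on made-up edges. The edge names and the `toy_*` objects are hypothetical, and the functions in `evaluateNetwork.R` may apply additional filtering (for instance on which TFs have DAP-Seq data), so this illustrates the plain definitions rather than the exact evaluation used here.

```{r}
# Hypothetical predicted edges and gold-standard edges, written as "TF-target"
toy_pred <- c("TF1-geneA", "TF1-geneB", "TF2-geneA", "TF3-geneC")
toy_gold <- c("TF1-geneA", "TF2-geneA", "TF2-geneB")

# True positives: predicted edges that are also in the gold standard
toy_tp <- length(intersect(toy_pred, toy_gold))

# Precision: fraction of predicted edges that are supported
toy_precision <- toy_tp / length(toy_pred)
# Recall: fraction of gold-standard edges that are recovered
toy_recall <- toy_tp / length(toy_gold)

c(precision = toy_precision, recall = toy_recall)
```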


```{r, fig.width=10, fig.height=10}

load(file = "results/rdata/weightedRF_validation_100rep.rdata")
val_rf <- val_dap
load(file = "results/rdata/weightedLASSO_validation_50rep.rdata")
val_lasso <- val_dap

precision_lasso <- draw_validation(validation = val_lasso)+
  plot_annotation(title = "weightedEN") & 
  theme(plot.title = element_text(size = 20, hjust = 0.5) )

# ggexport(precision_lasso, filename = "results/supp_figures/precision_recall_weightedLASSO.pdf", 
#          width = 10, height = 10)

precision_rf <- draw_validation(validation = val_rf)+
  plot_annotation(title = "weightedRF") & 
  theme(plot.title = element_text(size = 20, hjust = 0.5))

# 
# ggexport(precision_rf, filename = "results/supp_figures/precision_recall_weightedRF.pdf", 
#          width = 10, height = 10)
```


## Precision and recall of gene-specific optimisation of alpha


```{r, warning=TRUE}
settings <- c("model", "dataset", "rep", "density")

load("results/rdata/gene_specific/gene_specific_mse_student.rdata")
load("results/rdata/gene_specific/gene_specific_validation_student.rdata")

val_spec <- val_specific %>%
  separate(network_name, into = settings, sep = "_") %>%
    filter(density == 0.005) %>%
  mutate(density = paste("D =", density),
           model = str_replace(model, "RF", "weightedRF"),
           model = str_replace(model, "LASSO", "weightedLASSO"), 
         alpha_type = "gene-specific")

data_val <- rbind.data.frame(val_rf, val_lasso) %>%
    filter(density %in% c(0.005))%>%
    group_by(model, alpha, dataset, density) %>%
    mutate(mean_precision = mean(precision, na.rm = T),
           sd_precision = sd(precision, na.rm = T),
           mean_recall = mean(recall, na.rm = T),
           sd_recall = sd(recall, na.rm = T),
           density = paste("D =", density),
           model = str_replace(model, "bRF", "weightedRF"),
           model = str_replace(model, "LASSO", "weightedLASSO")) %>%
  dplyr::select(-network_name) %>%
  mutate(alpha = as.numeric(alpha), alpha_type = "global")

```

## MSE, precision and recall of all networks


```{r, fig.width=12, fig.height=10}
settings <- c("model", "dataset", "rep", "density")


plot_MSE_gene_specific <- function(model_){
  lmses %>%
rownames_to_column("gene") %>%
reshape2::melt()%>%
  separate(variable,
           into = c("model", "dataset",  "rep"),
           sep = "_") %>%
filter(model == model_ & dataset == "trueData") %>%
group_by(dataset , rep) %>%
summarise(median_MSE = median(value, na.rm=T))%>%
  ggplot(aes(x=dataset, y = median_MSE)) + 
geom_boxplot(size = 1, alpha = 0.5, outlier.alpha = 0, 
             color = "#4670CD", fill = "#4670CD") +
theme_pubr() + geom_jitter(width = 0.2, size = 2,  color = "#4670CD")+
theme(
  strip.background = element_blank(),
  axis.line = element_blank(), 
  axis.title=element_blank(),
  axis.text=element_blank(),
  axis.ticks=element_blank(), 
  title = element_text(size = 12),
  strip.text = element_text(size = 12),
  legend.text = element_text(size = 15),
  legend.position = 'none'
)+ xlab("")+ ylab("Median MSE")
}
# gene specific MSE
mse_lasso_spec <- plot_MSE_gene_specific("LASSO")
mse_rf_spec <- plot_MSE_gene_specific("RF")

# MSE for global alphas
draw_mse <- function(model){
data <- lmses[as.numeric(str_split_fixed(colnames(lmses), '_', 4)[,4]) <=10] %>%
rownames_to_column("gene") %>%
reshape2::melt()%>%
  separate(variable,
           into = c("model", "alpha", "dataset",  "rep"),
           sep = "_") %>%
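# note: inside filter(), `model` refers to the column, so `model == model`
# is always TRUE; only the dataset condition restricts the rows here
filter(model == model & dataset == "trueData") %>%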
group_by(dataset, rep, alpha) %>%
summarise(median_MSE = median(value, na.rm=T))%>%
  group_by(alpha, dataset) %>%
  mutate(mean_median_mse = mean(median_MSE),
         sd_median_mse = sd(median_MSE)) %>%
  mutate(alpha = as.numeric(alpha))
data %>%
  ggplot(aes(x=alpha, y = median_MSE)) + 
  geom_line(aes(y=mean_median_mse, group = dataset), color = "#70AD47") +
  geom_ribbon(aes(ymin = mean_median_mse - sd_median_mse , 
                    ymax = mean_median_mse + sd_median_mse), 
                alpha = .4, color = "#70AD47", fill = "#70AD47")+ 
  xlab(expression(alpha)) +
theme_pubr() + geom_point(size = 2, color = "#70AD47")+
theme(
  strip.background = element_blank(),
  axis.title.x = element_text(size = 17),
  title = element_text(size = 12),
  strip.text = element_text(size = 12),
  legend.text = element_text(size = 15),
  legend.position = 'top'
)+ ylab("Median MSE")
}

load("results/rdata/weightedLASSO_mse_50rep.rdata")
mse_lasso <- draw_mse("LASSO")

load("results/rdata/weightedRF_mse_100rep.rdata")
mse_rf <- draw_mse("RF")
x_min <- 0.18


plot <- (mse_lasso + ylim(c(x_min,0.26)) +labs(title = "WeightedLASSO")+ 
           theme(plot.title = element_text(hjust = 1))+
  mse_lasso_spec+ ylim(c(x_min,0.26)) +ylab("") +
  mse_rf +  ylim(c(x_min,0.26)) +labs(title = "WeightedRF")+
    theme(plot.title = element_text(hjust = 0.75))+ylab("") +
  mse_rf_spec+ ylim(c(x_min,0.26)) + ylab(""))+
  plot_layout(guides = "collect", ncol = 4, widths = c(3,1,3,1)) & 
  theme(legend.position = 'bottom', legend.text = element_text(size = 15))


pr_curves <- data_val %>%
  filter(dataset == "trueData") %>%
  ggplot(aes(x=recall, y=precision, 
             label = alpha, fill = alpha_type, color = alpha_type)) +
  geom_point(size = 0.65)+
  geom_ribbon(aes(ymin = mean_precision - sd_precision , 
                    ymax = mean_precision + sd_precision, x=mean_recall  ), alpha = 0.4)+
  theme_pubr()+ ggh4x::facet_nested_wrap(vars(model), nest_line = T)+
  geom_line(aes(y=mean_precision, x=mean_recall), size=2)+
  geom_label(aes(y=mean_precision, x=mean_recall), nudge_y = 0.02, fill = "white", show.legend = F)+
  theme(strip.background = element_blank(), 
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 18))+
  geom_hline(color = "#C55A11", linetype='dashed',
             yintercept = 0.331042, size= 1.5, show.legend = T) +
  geom_point(aes(x=recall, y=precision), data = val_spec[val_spec$dataset=="trueData",])+
  scale_color_manual(name = "Data integration",
                     values = setNames(c( "#4670CD", "#70AD47"), 
                                       c("gene-specific", "global")))+
  scale_fill_manual(name = "Data integration",
                     values = setNames(c( "#4670CD", "#70AD47"), 
                                       c("gene-specific", "global")))+
  xlab("Recall") + ylab("Precision")


plot_final <- plot/ pr_curves+theme(legend.position = "bottom",
                      axis.title = element_text(size= 15),
                      strip.text = element_text(size = 18))+
  plot_layout(heights = c(1,1))

plot_final
ggexport(plot_final, filename = "results/specific_grns_student.pdf",
         width = 11, height = 10)

```

In summary, gene-specific optimization of alpha minimizes the MSE while still achieving near-optimal precision and good recall in both models.

The same figure, with comparisons to existing methods added:

```{r, fig.width=8.5, fig.height=10}

# first line: mse of full models
mse_full <- (mse_lasso + ylim(c(0.18,0.26)) + ggtitle("weightedLASSO")+
  mse_lasso_spec+ ylim(c(0.18,0.26)) +ylab("") +xlab("")+
    ggtitle("DIOgene\nweighted\nLASSO")+
  mse_rf +  ylim(c(0.18,0.26)) +ggtitle("weightedRF")+
   ylab("") +
  mse_rf_spec+ ylim(c(0.18,0.26)) + ylab(""))+xlab("")+
    ggtitle("DIOgene\nweighted\nRF")+
  plot_layout(guides = "collect", ncol = 4, widths = c(3,1,3,1)) 

# second lines: mse with comparable sparsity levels
load("results/comparisons_to_existing_methods/restricted_mse_top3_results_linear.rdata")


mse_top_3_without_weighteds <-(data_for_restricted_mse_top_3[
  data_for_restricted_mse_top_3$case=="Linear case",] %>%
               filter(model != "weightedEN") %>%
                 filter(model != "weightedLASSO") %>%
                 mutate(alpha = str_replace(alpha, "Gene-specific", "DIOgene\nweighted\nLASSO")) %>%
ggplot(aes(y=mse_to_plot, x =model, fill = model)) +
  geom_segment(aes(xend = model, y = 0.18, yend = mse_to_plot, color = model),size = 2.5) + 
  ylab("Restricted median MSE")+  ylim(c(0.18,0.3))+
  geom_point(color = "black", aes(y=mse), show.legend = F)+
  geom_label(aes(label = round(mse, 2), y=0.275), show.legend = F, color = "white")+
  ggh4x::facet_nested_wrap(vars(ifelse(!str_detect(alpha, "DIOgene"), paste("alpha =", alpha), alpha)), 
                           nest_line = T, scales = "free_x", ncol = 8)+
  scale_color_manual(name = "Linear\nmethods", values = c( "#4670CD",  
                                                  "#668877", "#70AD47" ))+
  scale_fill_manual(name = "Linear\nmethods", values = c( "#4670CD",  
                                                  "#668877", "#70AD47" )))+
(data_for_restricted_mse_top_3[data_for_restricted_mse_top_3$case=="Non-linear Case",] %>%
   filter(model != "weightedRF") %>%
   mutate(alpha = str_replace(alpha, "Gene-specific", "DIOgene\nweighted\nRF")) %>%
  ggplot(aes(y=mse_to_plot, x =model, fill = model)) +
  geom_segment(aes(xend = model, y = 0.18, yend = mse_to_plot, color = model),size = 2.5) + 
  ylab("Median MSE")+  ylim(c(0.18,0.3))+
  geom_point(color = "black", aes(y=mse_to_plot), show.legend = F)+
    geom_label(aes(label = round(mse, 2), y=0.275), show.legend = F, color = "white")+
  ggh4x::facet_nested_wrap(vars(ifelse(!str_detect(alpha, "DIOgene"), paste("alpha =", alpha), alpha)), 
                           nest_line = T, scales = "free_x", ncol = 8)+
  scale_color_manual(name = "Non linear\nmethods", values = c( "#4670CD", "#668877", "#70AD47" ))+
  scale_fill_manual(name = "Non linear\nmethods", values = c( "#4670CD", "#668877","#70AD47" ))) &
  theme_pubr() + theme(strip.background = element_blank(), axis.text.x = element_blank(), 
                     axis.ticks.x= element_blank(), axis.title.x=element_blank(),
                     legend.position = "top",text = element_text(size = 15),
        title = element_text(size = 13),
        axis.title = element_text(size = 15),
        axis.text = element_text(size = 15));mse_top_3_without_weighteds


# compute precision and recall of competing methods
load(file = "results/mLASSO_stars_weightedLASSO_comparison_reps.rdata")
linear_pr <- results
load("results/irafnet_weightedRF_comparison_reps.rdata")
nonlinear_pr <- results

linear_pr$model = "weightedLASSO"
nonlinear_pr$model = "weightedRF"

comparisons_pr <- rbind.data.frame(linear_pr, nonlinear_pr) %>%
  filter(network_name == 0.005) %>%
  filter(!str_detect(method, "weighted")) %>%
  group_by(method, alpha) %>%
  summarise(precision = mean(precision), recall = mean(recall), 
            model = unique(model))
 

 diogene_mean_data <- val_spec[val_spec$dataset=="trueData",] %>%
   group_by(model) %>%
   summarise(precision = mean(precision), recall = mean(recall)) %>%
   mutate(alpha = "DIOgene") 

 
label_sizes <- 4
pr_comps <- data_val %>%
  filter(dataset == "trueData") %>%
  ggplot(aes(x=recall, y=precision, 
             label = alpha, color = alpha_type)) +
  geom_point(size = 0.65)+
  geom_ribbon(aes(fill = alpha_type, ymin = mean_precision - sd_precision , 
                    ymax = mean_precision + sd_precision, x=mean_recall  ), alpha = 0.4)+
  theme_pubr()+ ggh4x::facet_nested_wrap(vars(model), nest_line = T)+
  geom_line(aes(y=mean_precision, x=mean_recall), size=2)+
  geom_label(size = label_sizes,
             aes(y=mean_precision, x=mean_recall), nudge_y = 0.02, fill = "white", show.legend = F)+
  theme(strip.background = element_blank(), 
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 18))+
  geom_hline(color = "#C55A11", linetype='dashed',
             yintercept = 0.331042, size= 1.5, show.legend = T) +
  scale_color_manual(name = "Data integration",
                     values = setNames(c( "#4670CD", "#70AD47"), 
                                       c("gene-specific", "global")))+
  scale_fill_manual(name = "Data integration",
                     values = setNames(c( "#4670CD", "#70AD47"), 
                                       c("gene-specific", "global")))+
  xlab("Recall") + ylab("Precision") +
  geom_point(data = comparisons_pr, color = "#668877" )+
  geom_label_repel(size = label_sizes, 
                   nudge_y = -0.035, force = 2, data = comparisons_pr, 
                   color = "#668877", aes(label = paste(method, alpha)))+
  geom_point(aes(x=recall, y=precision), 
             data = val_spec[val_spec$dataset=="trueData",] %>%
               mutate(alpha = ""))+
  geom_label_repel(size = label_sizes, 
                   nudge_y = 0.01, data =diogene_mean_data,
                   color = "#4670CD", aes(label = paste(alpha)))+
  theme(legend.position = "none", strip.text = element_blank())
 
 
```

```{r, fig.width=8.5, fig.height=10}

main_figure <- (mse_full & theme(plot.title = element_text(hjust = 0.5),
                                 axis.title.y = element_text(size = 15)))/
  mse_top_3_without_weighteds /
  (pr_comps +theme(text = element_text(size = 15),
        title = element_text(size = 13),
        axis.title = element_text(size = 15),
        axis.text = element_text(size = 15))) +
  plot_layout(heights = c(0.7,0.6,1)) 
  
ggexport(main_figure, filename = "results/main_figures/Figure_3_mse_precision_recall.pdf", 
         width = 9.5, height = 11)
```


# Comparison of our approach with the simplest minimal MSE criterion

What is the difference between our approach (maximal divergence in MSE as compared to a synthetic baseline) and directly minimizing the MSE?
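
Here is a minimal sketch of how the two criteria can be cross-tabulated, using made-up optimal alphas for four hypothetical genes. The `toy_*` names and values are assumptions for illustration; the real vectors are loaded from the `.rdata` files in the next chunk.

```{r}
# Hypothetical optimal alphas per gene under each criterion (made-up values)
toy_alpha_div <- c(g1 = 0.3, g2 = 0,   g3 = 0.5, g4 = 0)  # divergence criterion
toy_alpha_min <- c(g1 = 0.2, g2 = 0.1, g3 = 0,   g4 = 0)  # minimal MSE criterion

# 2 x 2 table: does each gene receive TFBM integration (alpha > 0)?
toy_table <- table(div = toy_alpha_div > 0, min = toy_alpha_min > 0)
toy_table

# Genes integrated only under the divergence criterion
toy_saved <- setdiff(names(toy_alpha_div[toy_alpha_div > 0]),
                     names(toy_alpha_min[toy_alpha_min > 0]))
# Genes integrated only under the minimal MSE criterion
toy_lost <- setdiff(names(toy_alpha_min[toy_alpha_min > 0]),
                    names(toy_alpha_div[toy_alpha_div > 0]))
```

In this toy case `g3` is integrated only under the divergence criterion and `g2` only under minimal MSE, which mirrors the `saved_*` and `lost_*` gene sets computed on the real data.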


Differences between the two criteria for the same model, starting with weightedLASSO:

```{r}

load("results/rdata/gene_specific/alpha_per_gene_weightedLASSO_min.rdata")
load("results/rdata/gene_specific/alpha_per_gene_weightedRF_min.rdata")

alphas_lasso_min <- alphas_lasso
alphas_rf_min <- alphas_rf

load("results/rdata/gene_specific/alpha_per_gene_weightedLASSO_student.rdata")
load("results/rdata/gene_specific/alpha_per_gene_weightedRF_student.rdata")


alphas_lasso_div <- stud_lasso
alphas_rf_div <- stud_rf

saved_lasso <- setdiff(names(alphas_lasso_div[alphas_lasso_div > 0]), 
                             names(alphas_lasso_min[alphas_lasso_min > 0]))

saved_rf <- setdiff(names(alphas_rf_div[alphas_rf_div > 0]), 
                             names(alphas_rf_min[alphas_rf_min > 0]))

same_rf <- intersect(names(alphas_rf_div[alphas_rf_div > 0]), 
                             names(alphas_rf_min[alphas_rf_min > 0]))

lost_lasso <- setdiff(names(alphas_lasso_min[alphas_lasso_min > 0]),
                            names(alphas_lasso_div[alphas_lasso_div > 0]))

lost_rf <- setdiff(names(alphas_rf_min[alphas_rf_min > 0]),
                         names(alphas_rf_div[alphas_rf_div > 0]))

same_lasso <- intersect(names(alphas_lasso_min[alphas_lasso_min > 0]),
                        names(alphas_lasso_div[alphas_lasso_div > 0]))

matrix(c(length(same_lasso), length(lost_lasso), length(saved_lasso), 
         length(genes) - length(same_lasso) - length(lost_lasso) - length(saved_lasso)), 
       nrow = 2, byrow = F, dimnames = list(c("data integration div", "no data integration div"),
                                            c("data integration min", "no data integration min")))

```

For weightedRF:

```{r}
matrix(c(length(same_rf), length(lost_rf), length(saved_rf), 
         length(genes) - length(same_rf) - length(lost_rf) - length(saved_rf)), 
       nrow = 2, byrow = F, dimnames = list(c("data integration div", "no data integration div"),
                                            c("data integration min", "no data integration min")))
```


## MSE, precision and recall analysis

Our approach shows only a small increase in MSE, while precision and recall both increase, even though TFBM integration is performed for fewer genes.


```{r, fig.width=10}
load("results/rdata/gene_specific/gene_specific_validation_min.rdata")
val_specific_min <- val_specific %>%
  mutate(criterion = "Minimal MSE")
load("results/rdata/gene_specific/gene_specific_validation_student.rdata")

val_specific_dev <- val_specific%>%
  mutate(criterion = "Proposed approach")

precision_recall_min_dev <- rbind.data.frame(val_specific_dev, val_specific_min) %>%
  filter(dataset == "trueData" & density < 0.05) %>%
  mutate(model = paste0("weighted", model),
         density = paste0("D = ", density)) %>%
  ggplot(aes(y = precision, x = recall, color = criterion)) + 
  geom_jitter()+
  ggh4x::facet_nested_wrap(vars(density, model), ncol = 2, scales = "free") + theme_bw()+
  theme(strip.background = element_blank()) +
  stat_ellipse() + ggtitle("Precision and recall")
```


```{r, fig.width = 10, fig.height=5}
load("results/rdata/gene_specific/gene_specific_mse_min.rdata")
mse_min <- lmses
colnames(mse_min) <- paste0(colnames(mse_min), '_min')
mse_min$genes <- rownames(mse_min)

load("results/rdata/gene_specific/gene_specific_mse_student.rdata")

mse_dev <- lmses
colnames(mse_dev) <- paste0(colnames(mse_dev), '_dev')
mse_dev$genes <- rownames(mse_dev)


mse_plot <- full_join(mse_dev, mse_min, by = c("genes"))  %>%
reshape2::melt()%>%
  separate(variable,
           into = c("model", "dataset",  "rep", "criterion"),
           sep = "_") %>%
filter(dataset == "trueData") %>%
group_by(rep, model, criterion) %>%
summarise(median_MSE = median(value, na.rm=T))%>%
  mutate(criterion = str_replace(criterion, "min", "Minimal MSE"),
         criterion = str_replace(criterion, "dev", "Proposed approach"),
         model = paste0("weighted", model)) %>%
  ggplot(aes(x=criterion, y = median_MSE, color = criterion, fill = criterion)) +
  geom_jitter(show.legend = F)+ ylab("Median MSE") + xlab("") +
  ggh4x::facet_nested_wrap(vars(model), nest_line = T) +
  theme_bw() + theme(strip.background = element_blank(), 
                     legend.position = "none", axis.text.x = element_blank()) +
  ggtitle("Median MSE")+stat_compare_means()


```


The following chunks compare, against gold-standard (DAP-Seq) interactions, the importance values obtained for the targets that receive TFBM integration under only one of the two approaches.


```{r, fig.width=10, fig.height=8}

load("rdata/connectf_N_responsive_genes.rdata")
validated_edges <- validated_edges %>%
  filter(type == "DAPSeq")

load("results/rdata/gene_specific/gene_specific_grns_student.rdata")

mats_dev <- mats[!str_detect(names(mats), "shuffled")]

load("results/rdata/gene_specific/gene_specific_grns_min.rdata")
mats_min <- mats[!str_detect(names(mats), "shuffled")]

mean_mats_min_lasso <- apply(simplify2array(mats_min[str_detect(names(mats_min), "LASSO")]), 1:2, mean)[tfs,genes]
mean_mats_min_rf <- apply(simplify2array(mats_min[str_detect(names(mats_min), "RF")]), 1:2, mean)[tfs,genes]

mean_mats_dev_lasso <- apply(simplify2array(mats_dev[str_detect(names(mats_dev), "LASSO")]), 1:2, mean)[tfs,genes]
mean_mats_dev_rf <- apply(simplify2array(mats_dev[str_detect(names(mats_dev), "RF")]), 1:2, mean)[tfs,genes]


```
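
The chunk above averages the importance matrices over replicates with `simplify2array()` followed by `apply(..., 1:2, mean)`. Here is a minimal sketch of that pattern on two made-up 2 x 2 matrices; the `toy_*` objects, TF names and gene names are hypothetical.

```{r}
# Two hypothetical replicates of a TF x target importance matrix (made-up values)
toy_dimnames <- list(c("TF1", "TF2"), c("geneA", "geneB"))
toy_mats <- list(
  rep1 = matrix(c(1, 2, 3, 4), nrow = 2, dimnames = toy_dimnames),
  rep2 = matrix(c(3, 4, 5, 6), nrow = 2, dimnames = toy_dimnames)
)

# Stack the replicates into a 2 x 2 x 2 array, then average over the
# third dimension by applying mean() to every (row, column) cell
toy_mean <- apply(simplify2array(toy_mats), 1:2, mean)
toy_mean
```

Row and column names are carried over from the first matrix, which is what allows the averaged matrices to be indexed by `tfs` and `genes` afterwards.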


```{r}
saved_curve_dev <- evaluate_fully_connected(mean_mats_dev_rf[,saved_rf], 
                                            validation = "DAPSeq", nCores = 20, 
                                            input_tfs = tfs,
                                            input_genes = c(saved_rf))%>%
  mutate(targets = "Proposed approach\nSpecific TFBM\nintegration")

lost_curve_min <- evaluate_fully_connected(mean_mats_min_rf[,lost_rf], 
                                            validation = "DAPSeq", nCores = 20, 
                                           input_tfs = tfs,
                                            input_genes = c(lost_rf)) %>%
  mutate(targets = "Minimal MSE\nSpecific TFBM\nintegration")


 lost_saved_rf <- rbind.data.frame(saved_curve_dev, lost_curve_min) %>%
  ggplot(aes(x=recall, y = precision, color = targets)) + 
  geom_line() + geom_point()+ggtitle("weightedRF")+ylim(c(0.2, 0.6))

```

```{r}
# Precision and recall against DAP-Seq for the genes in saved_lasso,
# computed from the mean importances of mean_mats_dev_lasso
saved_curve_dev <- evaluate_fully_connected(mean_mats_dev_lasso[, saved_lasso],
                                            validation = "DAPSeq", nCores = 20,
                                            input_tfs = tfs,
                                            input_genes = c(saved_lasso)) %>%
  mutate(targets = "Proposed approach\nSpecific TFBM\nintegration")

# Same evaluation for the genes in lost_lasso,
# computed from the mean importances of mean_mats_min_lasso
lost_curve_min <- evaluate_fully_connected(mean_mats_min_lasso[, lost_lasso],
                                           validation = "DAPSeq", nCores = 20,
                                           input_tfs = tfs,
                                           input_genes = c(lost_lasso)) %>%
  mutate(targets = "Minimal MSE\nSpecific TFBM\nintegration")

# Both curves on a single precision-recall plot
lost_saved_lasso <- rbind.data.frame(saved_curve_dev, lost_curve_min) %>%
  ggplot(aes(x = recall, y = precision, color = targets)) +
  geom_line() + geom_point() +
  ggtitle("weightedLASSO") + ylim(c(0.2, 0.6))


pr_curves <- mse_plot + theme(legend.position = "none")+ 
  mse_plot + theme(legend.position = "none")+ 
  lost_saved_lasso + theme(legend.position = "none")+ 
  lost_saved_rf + theme(legend.position = "none")+
  plot_annotation(tag_levels = "a") & 
  theme(legend.text = element_text(size = 12))

ggexport(pr_curves, filename = "results/supp_figures/min_mse_comparison_curves.pdf", 
         width = 9, height = 9)
```
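
The curves above plot precision against recall for edges validated by DAP-Seq. As a reminder of how such a curve is built from a ranked list of edges, here is a minimal sketch on made-up data. It is not the implementation of `evaluate_fully_connected()`, and `toy_edges`, its columns and its values are illustrative assumptions:

```{r}
# Made-up edges with an importance score and a validation status
toy_edges <- data.frame(importance = c(0.4, 0.9, 0.2, 0.6, 0.7),
                        validated  = c(FALSE, TRUE, FALSE, TRUE, FALSE))

# Rank edges from the most to the least important
toy_edges <- toy_edges[order(toy_edges$importance, decreasing = TRUE), ]

# Number of validated edges among the top k, for k = 1..5
toy_tp <- cumsum(toy_edges$validated)

# Precision: validated fraction of the top k edges
# Recall: fraction of all validated edges found in the top k
toy_pr <- data.frame(k = seq_along(toy_tp),
                     precision = toy_tp / seq_along(toy_tp),
                     recall = toy_tp / sum(toy_edges$validated))
toy_pr
```

Each row of `toy_pr` corresponds to one point of a precision-recall curve, obtained by keeping the `k` highest-ranked edges.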


## Additional AUC/AUPR analyses


# Computing precision and recall on full models

Required matrices are loaded:

```{r}
load("results/rdata/weightedEN_importances_50rep.rdata")
mats_en <- mats

load("results/rdata/weightedLASSO_importances_50rep.rdata")
mats_lasso <- mats

load("results/rdata/weightedRF_importances_100rep.rdata")
mats_rf <- mats

```

Evaluating all GRNs for the different settings:

```{r}

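# Gene list from data/Ngenes.csv (AGI column), restricted to the studied genes and regulators
N_genes <- read.table("data/Ngenes.csv", header = TRUE, sep = ";")
n_genes <- intersect(N_genes$AGI, genes)
n_tfs <- intersect(N_genes$AGI, tfs)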

results <- NULL

for(alpha in ALPHAS){
  
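  for(model in c("EN", "LASSO", "RF")){
    # importance matrices of the current model
    if(model == "EN") mats = mats_en
    if(model == "LASSO") mats = mats_lasso
    if(model == "RF") mats = mats_rf
    
    # replicates 1 to 10 on true data for this alpha;
    # weightedRF matrices are stored with a "b" prefix ("bRF_...")
    if(model=="RF")
      mats_10 <- mats[paste0("b", model,"_",alpha,"_trueData_",1:10)]
    else
      mats_10 <- mats[paste0(model,"_",alpha,"_trueData_",1:10)]
    # element-wise mean over the replicates
    mat_mean <- apply(simplify2array(mats_10), 1:2, mean)
    
    # mse <- median(mat_mean["mse",], na.rm = T)
    # metrics for the whole GRN, then metrics per target gene
    eval <- evaluate_grn(mat_mean[n_tfs, genes])
    genes_eval <- evaluate_genes(mat_mean[n_tfs, genes], nCores = 40)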
    
    to_add <- genes_eval %>%
      mutate(alpha = alpha, model = paste0("weighted", model), 
             grn_auc = eval$auc, grn_auc_lower = eval$auc.lower,
             grn_auc_higher = eval$auc.higher, grn_aupr = eval$aupr,
             grn_aupr_rand = eval$aupr_rand,
             grn_partial_auc = eval$partial_auc) 
    
    if(is.null(results))
      results <- to_add
    else
      results <- rbind.data.frame(results, to_add)
  }
}

# loading DIOgene importance matrices
load("results/rdata/gene_specific/gene_specific_grns_true_sd.rdata")
mats <- mats[str_detect(names(mats),"true")]

mats_lasso_diogene <- mats[str_detect(names(mats),"LASSO")]
mats_rf_diogene <- mats[str_detect(names(mats),"RF")]

load("results/rdata/gene_specific/gene_specific_grns_true_sd_en.rdata")
mats <- mats[str_detect(names(mats),"true")]
# in this file, the weightedEN matrices are stored under names containing "LASSO"
mats_en_diogene <- mats[str_detect(names(mats),"LASSO")]


for(model in c("EN", "LASSO", "RF")){
    if(model == "EN") mats = mats_en_diogene
    if(model == "LASSO") mats = mats_lasso_diogene
    if(model == "RF") mats = mats_rf_diogene
    
    if(model=="EN")
      mats_10 <- mats[paste0("LASSO_trueData_",1:10)]
    else
      mats_10 <- mats[paste0(model,"_trueData_",1:10)]
    mat_mean <- apply(simplify2array(mats_10), 1:2, mean)
    
    # mse <- median(mat_mean["mse",], na.rm = T)
    eval <- evaluate_grn(mat_mean[n_tfs, genes])
    genes_eval <- evaluate_genes(mat_mean[n_tfs, genes], nCores = 40)
    
    to_add <- genes_eval %>%
      mutate(alpha = "DIOgene", model = paste0("weighted", model), 
             grn_auc = eval$auc, grn_auc_lower = eval$auc.lower,
             grn_auc_higher = eval$auc.higher, grn_aupr = eval$aupr,
             grn_aupr_rand = eval$aupr_rand,
             grn_partial_auc = eval$partial_auc)
    
    results <- rbind.data.frame(results, to_add)
}

save(results, file = "results/comparisons_to_existing_methods/auc_global_results_EN_LASSO_RF_nitrate_tfs.rdata")
```
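
The two loops above grow `results` with repeated `rbind` calls, and appending the rows labelled "DIOgene" turns the `alpha` column into character. A minimal sketch of an equivalent pattern collects one data frame per setting in a list and binds them once. Here `toy_eval()` is a hypothetical stand-in for the evaluation functions and returns made-up values:

```{r}
# Hypothetical stand-in for the per-gene evaluation: made-up AUC values
toy_eval <- function(alpha, model) {
  data.frame(gene = c("g1", "g2"),
             auc = c(0.5, 0.6),
             alpha = as.character(alpha),  # character from the start
             model = paste0("weighted", model),
             stringsAsFactors = FALSE)
}

# All combinations of alpha and model to evaluate
toy_settings <- expand.grid(alpha = c(0, 0.5, 1),
                            model = c("LASSO", "RF"),
                            stringsAsFactors = FALSE)

# One data frame per setting, stored in a list
toy_list <- lapply(seq_len(nrow(toy_settings)), function(i) {
  toy_eval(toy_settings$alpha[i], toy_settings$model[i])
})

# Gene-specific results carry a label instead of a numeric alpha
toy_list[[length(toy_list) + 1]] <- toy_eval("DIOgene", "RF")

# Single bind at the end
toy_results <- do.call(rbind, toy_list)
str(toy_results)
```

Storing `alpha` as character from the start makes the later use of `alpha` as a discrete x axis explicit, with "DIOgene" as one extra category.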


Plots of the AUC and partial AUC, for the whole GRN and per gene:

```{r, fig.width=15, fig.height = 8}
auc_plot <- results %>%
  group_by(model, alpha) %>%
  distinct(grn_auc, grn_auc_lower,grn_auc_higher,grn_partial_auc) %>%
  ungroup() %>%
  mutate(Data_integration = ifelse(alpha=="DIOgene", "DIOgene", "Global")) %>%
  ggplot(aes(x=alpha, color =Data_integration, y = grn_auc)) +
  geom_point()+geom_hline(yintercept = 0.5, color = "grey", size = 2)+
  geom_segment(aes(xend = alpha, y = grn_auc_lower, yend = grn_auc_higher))+
  ggh4x::facet_nested_wrap(vars(model))+
  scale_color_manual(values = c( "#4670CD", "#70AD47" ))+
  ggtitle("AUC for all GRN interactions")+
results %>%
  group_by(model, alpha) %>%
  distinct(grn_auc, grn_auc_lower,grn_auc_higher, grn_partial_auc) %>%
  ungroup() %>%
  mutate(Data_integration = ifelse(alpha=="DIOgene", "DIOgene", "Global")) %>%
  ggplot(aes(x=alpha, color =Data_integration, y = grn_partial_auc)) +
  geom_point()+geom_hline(yintercept = 0.5, color = "grey", size = 2)+
  # geom_segment(aes(xend = alpha, y = grn_auc_lower, yend = grn_auc_higher))+
  ggh4x::facet_nested_wrap(vars(model))+
  scale_color_manual(values = c( "#4670CD", "#70AD47" ))+
  ggtitle("pAUC for all GRN interactions (0.9-1 specificity)")+
results %>%
  mutate(Data_integration = ifelse(alpha=="DIOgene", "DIOgene", "Global")) %>%
  ggplot(aes(x=alpha, fill =Data_integration, y = auc)) +
  geom_hline(yintercept = 0.5, color = "grey", size = 2)+
  geom_violin(draw_quantiles = c(0.5), width = 0.5)+
  ggh4x::facet_nested_wrap(vars(model))+
  scale_fill_manual(values = c( "#4670CD", "#70AD47" ))+
  ggtitle("AUC per gene")+
results %>%
  mutate(Data_integration = ifelse(alpha=="DIOgene", "DIOgene", "Global")) %>%
  ggplot(aes(x=alpha, fill =Data_integration, y = p_auc)) +
  geom_violin(draw_quantiles = c(0.5), width = 0.5)+
  ggh4x::facet_nested_wrap(vars(model))+
  scale_fill_manual(values = c( "#4670CD", "#70AD47" ))+
  ggtitle("pAUC per gene (0.9-1 specificity)")+
  ylim(c(0,0.025))+
  plot_layout(nrow = 4) & theme(strip.background = element_blank(),
                                strip.text = element_text(size =15))

ggexport(auc_plot, filename = "results/comparisons_to_existing_methods/auc_global_results_EN_LASSO_RF_nitrate_tfs.pdf", width = 15, height = 13)

```
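
The grey line at 0.5 in the AUC panels is the value expected for an uninformative ranking. A minimal sketch of a rank-based AUC on made-up scores illustrates this. It is a generic Mann-Whitney computation, not necessarily the one used in `evaluate_grn()` or `evaluate_genes()`, and `toy_auc()`, `toy_scores` and `toy_labels` are hypothetical:

```{r}
# Rank-based AUC: probability that a validated edge scores higher than a
# non-validated one, with ties counting for one half
toy_auc <- function(scores, labels) {
  r <- rank(scores)  # average ranks for ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_scores <- c(0.9, 0.8, 0.4, 0.3, 0.2, 0.1)
toy_labels <- c(1, 0, 1, 0, 0, 0)

# Informative scores: 7 of the 8 (validated, non-validated) pairs are well ordered
toy_auc(toy_scores, toy_labels)

# Constant scores carry no information and give the 0.5 baseline
toy_auc(rep(1, 6), toy_labels)
```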

Area under the precision-recall curve (AUPR), for the whole GRN and per gene relative to the random expectation:

```{r}
aupr_plot <- results %>%
  group_by(model, alpha) %>%
  distinct(grn_auc, grn_auc_lower,grn_auc_higher,grn_partial_auc, grn_aupr, grn_aupr_rand) %>%
  ungroup() %>%
  mutate(Data_integration = ifelse(alpha=="DIOgene", "DIOgene", "Global")) %>%
  ggplot(aes(x=alpha, color =Data_integration, y = grn_aupr)) +
  geom_point()+geom_hline(yintercept = unique(results$grn_aupr_rand), 
                          color = "grey", size = 2)+
  ggh4x::facet_nested_wrap(vars(model))+
  scale_color_manual(values = c( "#4670CD", "#70AD47" ))+
  ggtitle("AUPR for all GRN interactions")+
results %>%
  mutate(Data_integration = ifelse(alpha=="DIOgene", "DIOgene", "Global")) %>%
  ggplot(aes(x=alpha, fill =Data_integration, y = (aupr-aupr_rand)/aupr_rand)) +
  geom_hline(yintercept = 0, color = "grey", size = 2)+
  geom_violin(draw_quantiles = c(0.5), width = 0.5)+
  ggh4x::facet_nested_wrap(vars(model))+
  ylim(c(-0.5,1))+
  scale_fill_manual(values = c( "#4670CD", "#70AD47" ))+
  ggtitle("(AUPR-AUPR_random)/AUPR_random per gene")+
  plot_layout(nrow = 2) & theme(strip.background = element_blank(),
                                strip.text = element_text(size =15))

ggexport(aupr_plot, filename = "results/comparisons_to_existing_methods/aupr_global_results_EN_LASSO_RF_nitrate_tfs.pdf", 
         width = 15, height = 8)
```
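
The per-gene panel above shows the relative gain over random, `(aupr - aupr_rand) / aupr_rand`, within `ylim(c(-0.5, 1))`. Here is a minimal sketch of this quantity on made-up values; `toy_gene_eval` and its numbers are illustrative assumptions, and how `evaluate_genes()` computes `aupr_rand` is not shown here:

```{r}
# Made-up per-gene AUPR values and their random expectation
toy_gene_eval <- data.frame(gene = c("g1", "g2", "g3"),
                            aupr = c(0.30, 0.10, 0.08),
                            aupr_rand = c(0.10, 0.10, 0.10))

# Relative gain: 0 means "as good as random", 1 means "twice as good"
toy_gene_eval$rel_gain <- with(toy_gene_eval, (aupr - aupr_rand) / aupr_rand)

# With ylim(), values outside the limits are dropped before the violins
# are computed, so genes like g1 (gain of about 2) would not contribute
toy_gene_eval$within_ylim <- toy_gene_eval$rel_gain >= -0.5 &
  toy_gene_eval$rel_gain <= 1
toy_gene_eval
```

If the out-of-range genes should still contribute to the violins, `coord_cartesian(ylim = c(-0.5, 1))` zooms without discarding data.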

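Are AUC and AUPR higher for the genes to which DIOgene assigns alpha > 0? This is probably the case. The violin plots below compare, for each model, the per-gene AUC and the per-gene AUPR gain over random between genes with alpha = 0 and genes with alpha > 0.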

```{r}
# the weightedEN file stores its alphas in an object named alphas_lasso:
# copy it before the weightedLASSO file is loaded
load("results/rdata/gene_specific/alpha_per_gene_weightedEN_true_sd.rdata")
alphas_en <- alphas_lasso

load("results/rdata/gene_specific/alpha_per_gene_weightedLASSO_true_sd.rdata")

load("results/rdata/gene_specific/alpha_per_gene_weightedRF_true_sd.rdata")

results %>%
  filter(alpha == "DIOgene") %>%
  mutate(alpha_opt = c(alphas_en, alphas_lasso, alphas_rf)) %>%
  ggplot(aes(x=alpha_opt > 0, y=auc))+
  geom_violin(draw_quantiles = c(0.5))+
  facet_wrap(~model)+
results %>%
  filter(alpha == "DIOgene") %>%
  mutate(alpha_opt = c(alphas_en, alphas_lasso, alphas_rf)) %>%
  ggplot(aes(x=alpha_opt > 0, y=aupr-aupr_rand))+
  geom_violin(draw_quantiles = c(0.5))+
  facet_wrap(~model)
```
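
The chunk above attaches the optimal alphas by position: the "DIOgene" rows of `results` are ordered EN, LASSO, RF, and the genes within each model are assumed to follow the order of the alpha vectors. Here is a minimal sketch of a name-based lookup that does not depend on row order. It assumes the alpha vectors are named by gene and the results carry a gene identifier column; both are assumptions, and `toy_dio`, `toy_alphas` and the `gene` column are hypothetical:

```{r}
# Made-up per-gene results, deliberately not sorted by gene
toy_dio <- data.frame(gene  = c("g2", "g1", "g2", "g1"),
                      model = c("weightedLASSO", "weightedLASSO",
                                "weightedRF", "weightedRF"),
                      auc   = c(0.60, 0.50, 0.70, 0.55),
                      stringsAsFactors = FALSE)

# Made-up optimal alphas, one named vector per model
toy_alphas <- list(weightedLASSO = c(g1 = 0, g2 = 0.3),
                   weightedRF    = c(g1 = 0.5, g2 = 0))

# Look up each row's alpha by model and gene name
toy_dio$alpha_opt <- mapply(function(m, g) toy_alphas[[m]][[g]],
                            toy_dio$model, toy_dio$gene,
                            USE.NAMES = FALSE)
toy_dio
```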