website: finalize forestplot

morinlab · Mar 22, 2024 · 6fef030 · 6fef030
1 parent bce9c61
commit 6fef030
Show file tree

Hide file tree

Showing 11 changed files with 485 additions and 24 deletions.
diff --git a/docs/search.json b/docs/search.json
diff --git a/docs/tutorials/forestplot.html b/docs/tutorials/forestplot.html
diff --git a/docs/tutorials/forestplot.qmd b/docs/tutorials/forestplot.qmd
@@ -2,8 +2,8 @@
 title: "Tutorial: The prettiest forestplot"
 warning: false
 message: false
-fig.width: 8
-fig.height: 5
+fig.width: 10
+fig.height: 8
 fig.align: "center"
 ---
 
@@ -28,7 +28,9 @@ We will first import the necessary packages:
 ```{r load_packages}
 # Load packages
 library(GAMBLR.data)
+library(GAMBLR.utils)
 library(GAMBLR.viz)
+library(tibble)
 library(dplyr)
 ```
 
@@ -83,8 +85,6 @@ providing the metadata and data frame with mutations in standard maf format.
 Here is an example of the output with all default parameters:
 
 ```{r default}
-#| fig-width: 10
-#| fig-height: 15
 comparison_column <- "EBV_status_inf" # character of column name for comparison
 fp <- prettyForestPlot(
     metadata = metadata,
@@ -107,3 +107,215 @@ arranged side-by-side
 ```{r}
 names(fp)
 ```
+
+## Report only significant differences
+
+By default, all of the genes of interest are reported in the output. After the
+Fisher's test is performed, the `prettyForestPlot` also calculates FDR and we
+can use it to only report significant differences by providing a significance
+cutoff with the parameter `max_q`:
+
+```{r fdr}
+max_q <- 0.1 # only those qith Q value <= 0.1 will be reported
+fp <- prettyForestPlot(
+    metadata = metadata,
+    maf = maf,
+    genes = genes,
+    comparison_column = comparison_column,
+    max_q = max_q
+)
+```
+
+We now can take a look at what genes are passing the significance cutoff:
+```{r fdr_plot}
+fp$arranged
+```
+
+## Comparing categories with more than two groups
+
+As the `prettyForestPlot` construcst the 2x2 contingency tables to run Fisher's
+test to find significant differences, it can only operate on comparing 2 groups
+between themselves - but what if you have more than that and want to see the
+difference between some of them?
+To handle this scenario, we can take advantage of the `comparison_values`
+parameter, which will be used to subset the metadata to only requested groups
+and only perform testing and plotting on this subset. Let's see it in action:
+
+```{r comp_groups}
+comparison_column <- "genetic_subgroup" # change the comparison column
+comparison_values <- c("IC-BL", "Q53-BL")
+fp <- prettyForestPlot(
+    metadata = metadata,
+    maf = maf,
+    genes = genes,
+    comparison_column = comparison_column,
+    comparison_values = comparison_values,
+    max_q = max_q
+)
+
+fp$arranged
+```
+
+This plot is exactly reproducing the Supplemmental Figure 12D from the
+[Thomas et al](https://doi.org/10.1182/blood.2022016534) study!
+
+## Separating genes with hotspots
+
+We can additionally separate hotspots from the other mutations and compare those
+separately. First, we need to annotate the maf data, for which we will use the
+`annotate_hotspots` from GAMBLR family. This function will add a new column to
+the maf named `hot_spot` indicating whether or not the specific mutation is in
+the hotspot region.
+
+```{r annotate_maf}
+# Annotate hotspots
+maf <- annotate_hotspots(maf)
+
+# What are the hotspots?
+maf %>%
+    filter(hot_spot) %>%
+    select(Hugo_Symbol, hot_spot) %>%
+    table()
+```
+::: {.callout-note}
+The GAMBLR.data version of the `annotate_hotspots` only handles very specific
+genes and does not have functionality to annotate all hotspots.
+:::
+
+Oh no! Looks like there is no hotspots in this maf data. This does not make
+sense, so what happened? Aha, the hotspot annotation in GAMBLR.data works only
+on the data in grch37 projection. But our maf is in hg38, so what should we do?
+One way is to lift the maf data to another projection using the UCSC's liftOver,
+and GAMBLR family has exactly the function that serves this purpose:
+
+```{r lift_maf}
+maf_grch37 <- liftover(
+    maf,
+    mode = "maf",
+    target_build = "grch37"
+) %>%
+mutate(Chromosome = gsub("chr", "", Chromosome)) %>%
+select(-hot_spot) # since it is empty we can just drop it
+
+```
+
+Can we annotate the hotspots now?
+
+```{r}
+maf_grch37 <- annotate_hotspots(maf_grch37)
+
+# What are the hotspots?
+maf_grch37 %>%
+    filter(hot_spot) %>%
+    select(Hugo_Symbol, hot_spot) %>%
+    table()
+```
+
+Indeed, the hotspots are properly annotated once we have maf in correct
+projection. Now, we can simply toggle the `separate_hotspots` parameter to
+perform separate comparisons within hotspots:
+
+```{r comp_hotspots}
+comparison_column <- "EBV_status_inf"
+fp <- prettyForestPlot(
+    metadata = metadata,
+    maf = maf_grch37,
+    genes = genes,
+    comparison_column = comparison_column,
+    max_q = max_q,
+    separate_hotspots = TRUE
+)
+
+fp$arranged
+```
+
+## Using binary matrix as input
+
+Sometimes it might be useful to have different input format instead of maf - for example, binary matrix of features. Can we use the `prettyForestPlot` in this
+case? Yes, sure we can!
+
+First, let's construct the binary matrix. We will supplement our maf with the
+non-coding mutations to look at the aSHM regions in addition to coding
+mutations, and this will already give us the data in correct projection:
+```{r bin_mat}
+maf <- get_ssm_by_samples(
+    these_samples_metadata = metadata
+)
+maf$Variant_Classification %>% table
+```
+
+Now we convert this maf into binary matrix:
+```{r cod_mat}
+# Generate binary matrix
+coding_matrix <- get_coding_ssm_status(
+    these_samples_metadata = metadata,
+    maf_data = maf,
+    gene_symbols = genes,
+    include_hotspots = TRUE,
+    review_hotspots = TRUE
+)
+
+```
+
+Next, supplement this with the matrix of non-coding mutation across aSHM regions
+
+```{r ashm_mat}
+# Use aSHM regions from GAMBLR.data
+regions_bed <- somatic_hypermutation_locations_GRCh37_v0.2
+
+# Add convenient name column
+regions_bed <- regions_bed %>%
+    mutate(
+        name = paste(gene, region, sep = "-")
+    )
+
+# Generate matrix of mutations per each site
+ashm_matrix <- get_ashm_count_matrix(
+    regions_bed = regions_bed,
+    maf_data = maf,
+    these_samples_metadata = metadata
+)
+
+# Binarize matrix using arbitrary 3 muts/region cutoff
+ashm_matrix[ashm_matrix <= 3] = 0
+ashm_matrix[ashm_matrix > 3] = 1
+ashm_matrix <- ashm_matrix %>%
+    rownames_to_column("sample_id")
+```
+
+
+We can now combine both coding and non-coding features into single matrix:
+```{r mat}
+feature_matrix <- left_join(
+    coding_matrix,
+    ashm_matrix
+)
+
+# Drop any fearures absent across at least 10 samples to clean any noise
+feature_matrix <- feature_matrix %>%
+    select_if(is.numeric) %>%
+    select(where(~ sum(. > 0, na.rm = TRUE) >= 10)) %>%
+    bind_cols(
+        feature_matrix %>% select(sample_id),
+        .
+    )
+```
+
+Now we can provide the binary matrix to the `prettyForestPlot` and regenerate
+the Supplemmental Figure 12C from the
+[Thomas et al](https://doi.org/10.1182/blood.2022016534) study!
+
+```{r comp_mat}
+comparison_column <- "genetic_subgroup"
+comparison_values <- c("DGG-BL", "Q53-BL")
+fp <- prettyForestPlot(
+    metadata = metadata,
+    mutmat = feature_matrix,
+    genes = genes,
+    comparison_column = comparison_column,
+    comparison_values = comparison_values,
+    max_q = max_q
+)
+
+fp$arranged
+```
diff --git a/docs/tutorials/forestplot_files/figure-html/comp_groups-1.png b/docs/tutorials/forestplot_files/figure-html/comp_groups-1.png
diff --git a/docs/tutorials/forestplot_files/figure-html/comp_hotspots-1.png b/docs/tutorials/forestplot_files/figure-html/comp_hotspots-1.png
diff --git a/docs/tutorials/forestplot_files/figure-html/comp_mat-1.png b/docs/tutorials/forestplot_files/figure-html/comp_mat-1.png
diff --git a/docs/tutorials/forestplot_files/figure-html/fdr_plot-1.png b/docs/tutorials/forestplot_files/figure-html/fdr_plot-1.png