Skip to content

Commit

Permalink
website: finalize forestplot
Browse files Browse the repository at this point in the history
  • Loading branch information
Kdreval committed Mar 22, 2024
1 parent bce9c61 commit 6fef030
Show file tree
Hide file tree
Showing 11 changed files with 485 additions and 24 deletions.
34 changes: 31 additions & 3 deletions docs/search.json

Large diffs are not rendered by default.

231 changes: 229 additions & 2 deletions docs/tutorials/forestplot.html

Large diffs are not rendered by default.

220 changes: 216 additions & 4 deletions docs/tutorials/forestplot.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,8 @@
title: "Tutorial: The prettiest forestplot"
warning: false
message: false
fig.width: 8
fig.height: 5
fig.width: 10
fig.height: 8
fig.align: "center"
---

Expand All @@ -28,7 +28,9 @@ We will first import the necessary packages:
```{r load_packages}
# Load packages
library(GAMBLR.data)
library(GAMBLR.utils)
library(GAMBLR.viz)
library(tibble)
library(dplyr)
```

Expand Down Expand Up @@ -83,8 +85,6 @@ providing the metadata and data frame with mutations in standard maf format.
Here is an example of the output with all default parameters:

```{r default}
#| fig-width: 10
#| fig-height: 15
comparison_column <- "EBV_status_inf" # character of column name for comparison
fp <- prettyForestPlot(
metadata = metadata,
Expand All @@ -107,3 +107,215 @@ arranged side-by-side
```{r}
names(fp)
```

## Report only significant differences

By default, all of the genes of interest are reported in the output. After the
Fisher's test is performed, the `prettyForestPlot` also calculates FDR and we
can use it to only report significant differences by providing a significance
cutoff with the parameter `max_q`:

```{r fdr}
max_q <- 0.1 # only those qith Q value <= 0.1 will be reported
fp <- prettyForestPlot(
metadata = metadata,
maf = maf,
genes = genes,
comparison_column = comparison_column,
max_q = max_q
)
```

We now can take a look at what genes are passing the significance cutoff:
```{r fdr_plot}
fp$arranged
```

## Comparing categories with more than two groups

As the `prettyForestPlot` construcst the 2x2 contingency tables to run Fisher's
test to find significant differences, it can only operate on comparing 2 groups
between themselves - but what if you have more than that and want to see the
difference between some of them?
To handle this scenario, we can take advantage of the `comparison_values`
parameter, which will be used to subset the metadata to only requested groups
and only perform testing and plotting on this subset. Let's see it in action:

```{r comp_groups}
comparison_column <- "genetic_subgroup" # change the comparison column
comparison_values <- c("IC-BL", "Q53-BL")
fp <- prettyForestPlot(
metadata = metadata,
maf = maf,
genes = genes,
comparison_column = comparison_column,
comparison_values = comparison_values,
max_q = max_q
)
fp$arranged
```

This plot is exactly reproducing the Supplemmental Figure 12D from the
[Thomas et al](https://doi.org/10.1182/blood.2022016534) study!

## Separating genes with hotspots

We can additionally separate hotspots from the other mutations and compare those
separately. First, we need to annotate the maf data, for which we will use the
`annotate_hotspots` from GAMBLR family. This function will add a new column to
the maf named `hot_spot` indicating whether or not the specific mutation is in
the hotspot region.

```{r annotate_maf}
# Annotate hotspots
maf <- annotate_hotspots(maf)
# What are the hotspots?
maf %>%
filter(hot_spot) %>%
select(Hugo_Symbol, hot_spot) %>%
table()
```
::: {.callout-note}
The GAMBLR.data version of the `annotate_hotspots` only handles very specific
genes and does not have functionality to annotate all hotspots.
:::

Oh no! Looks like there is no hotspots in this maf data. This does not make
sense, so what happened? Aha, the hotspot annotation in GAMBLR.data works only
on the data in grch37 projection. But our maf is in hg38, so what should we do?
One way is to lift the maf data to another projection using the UCSC's liftOver,
and GAMBLR family has exactly the function that serves this purpose:

```{r lift_maf}
maf_grch37 <- liftover(
maf,
mode = "maf",
target_build = "grch37"
) %>%
mutate(Chromosome = gsub("chr", "", Chromosome)) %>%
select(-hot_spot) # since it is empty we can just drop it
```

Can we annotate the hotspots now?

```{r}
maf_grch37 <- annotate_hotspots(maf_grch37)
# What are the hotspots?
maf_grch37 %>%
filter(hot_spot) %>%
select(Hugo_Symbol, hot_spot) %>%
table()
```

Indeed, the hotspots are properly annotated once we have maf in correct
projection. Now, we can simply toggle the `separate_hotspots` parameter to
perform separate comparisons within hotspots:

```{r comp_hotspots}
comparison_column <- "EBV_status_inf"
fp <- prettyForestPlot(
metadata = metadata,
maf = maf_grch37,
genes = genes,
comparison_column = comparison_column,
max_q = max_q,
separate_hotspots = TRUE
)
fp$arranged
```

## Using binary matrix as input

Sometimes it might be useful to have different input format instead of maf - for example, binary matrix of features. Can we use the `prettyForestPlot` in this
case? Yes, sure we can!

First, let's construct the binary matrix. We will supplement our maf with the
non-coding mutations to look at the aSHM regions in addition to coding
mutations, and this will already give us the data in correct projection:
```{r bin_mat}
maf <- get_ssm_by_samples(
these_samples_metadata = metadata
)
maf$Variant_Classification %>% table
```

Now we convert this maf into binary matrix:
```{r cod_mat}
# Generate binary matrix
coding_matrix <- get_coding_ssm_status(
these_samples_metadata = metadata,
maf_data = maf,
gene_symbols = genes,
include_hotspots = TRUE,
review_hotspots = TRUE
)
```

Next, supplement this with the matrix of non-coding mutation across aSHM regions

```{r ashm_mat}
# Use aSHM regions from GAMBLR.data
regions_bed <- somatic_hypermutation_locations_GRCh37_v0.2
# Add convenient name column
regions_bed <- regions_bed %>%
mutate(
name = paste(gene, region, sep = "-")
)
# Generate matrix of mutations per each site
ashm_matrix <- get_ashm_count_matrix(
regions_bed = regions_bed,
maf_data = maf,
these_samples_metadata = metadata
)
# Binarize matrix using arbitrary 3 muts/region cutoff
ashm_matrix[ashm_matrix <= 3] = 0
ashm_matrix[ashm_matrix > 3] = 1
ashm_matrix <- ashm_matrix %>%
rownames_to_column("sample_id")
```


We can now combine both coding and non-coding features into single matrix:
```{r mat}
feature_matrix <- left_join(
coding_matrix,
ashm_matrix
)
# Drop any fearures absent across at least 10 samples to clean any noise
feature_matrix <- feature_matrix %>%
select_if(is.numeric) %>%
select(where(~ sum(. > 0, na.rm = TRUE) >= 10)) %>%
bind_cols(
feature_matrix %>% select(sample_id),
.
)
```

Now we can provide the binary matrix to the `prettyForestPlot` and regenerate
the Supplemmental Figure 12C from the
[Thomas et al](https://doi.org/10.1182/blood.2022016534) study!

```{r comp_mat}
comparison_column <- "genetic_subgroup"
comparison_values <- c("DGG-BL", "Q53-BL")
fp <- prettyForestPlot(
metadata = metadata,
mutmat = feature_matrix,
genes = genes,
comparison_column = comparison_column,
comparison_values = comparison_values,
max_q = max_q
)
fp$arranged
```
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit 6fef030

Please sign in to comment.