Update vignettes

igordot · Jan 5, 2024 · 3dffd15
1 parent 69f1448
commit 3dffd15
Showing 5 changed files with 70 additions and 75 deletions.
diff --git a/README.md b/README.md
@@ -1,27 +1,10 @@
 # clustermole: exploratory scRNA-seq cell type analysis
 
-<!-- badges: start -->
-[![CRAN](https://www.r-pkg.org/badges/version/clustermole)](https://cran.r-project.org/package=clustermole)
-[![R-CMD-check](https://github.com/igordot/clustermole/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/igordot/clustermole/actions/workflows/R-CMD-check.yaml)
-[![codecov](https://codecov.io/gh/igordot/clustermole/branch/master/graph/badge.svg)](https://codecov.io/gh/igordot/clustermole)
-<!-- badges: end -->
+The clustermole R package (available on [CRAN](https://cran.r-project.org/package=clustermole)) provides methods to query cell identity markers sourced from a variety of databases.
+It includes three primary features:
 
-![clustermole-book](https://user-images.githubusercontent.com/6363505/72761156-12414280-3ba9-11ea-87de-57ff6cd690bb.png)
-
-## Overview
-
-Assignment of cell type labels to single-cell RNA sequencing (scRNA-seq) clusters is often a time-consuming process that involves manual inspection of the cluster marker genes complemented with a detailed literature search.
-This can be especially challenging when unexpected or poorly described populations are present.
-The clustermole R package provides methods to query thousands of human and mouse cell identity markers sourced from a variety of databases.
-
-The clustermole package provides three primary features:
-
-* a database of markers for thousands of cell types
+* a meta-database of human and mouse markers for thousands of cell types
 * cell type prediction based on marker genes
 * cell type prediction based on the full expression matrix
 
 Check the [documentation website](https://igordot.github.io/clustermole/) for more information.
-
----
-
-*Image credit: "A Child's Primer Of Natural History" by Oliver Herford*
diff --git a/index.md b/index.md
@@ -0,0 +1,15 @@
+
+<!-- badges: start -->
+[![CRAN](https://www.r-pkg.org/badges/version/clustermole)](https://cran.r-project.org/package=clustermole)
+[![R-CMD-check](https://github.com/igordot/clustermole/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/igordot/clustermole/actions/workflows/R-CMD-check.yaml)
+[![codecov](https://codecov.io/gh/igordot/clustermole/graph/badge.svg?token=YoTQTU1EDk)](https://codecov.io/gh/igordot/clustermole)
+<!-- badges: end -->
+
+Assignment of cell type labels to single-cell RNA sequencing (scRNA-seq) clusters is often a time-consuming process that involves manual inspection of the cluster marker genes complemented with a detailed literature search.
+This can be especially challenging when unexpected or poorly described populations are present.
+The clustermole R package provides methods to query cell identity markers sourced from a variety of databases.
+
+---
+
+![](https://user-images.githubusercontent.com/6363505/72761156-12414280-3ba9-11ea-87de-57ff6cd690bb.png)
+*Image credit: "A Child's Primer Of Natural History" by Oliver Herford*
diff --git a/vignettes/clustermole-intro.Rmd b/vignettes/clustermole-intro.Rmd
@@ -20,18 +20,15 @@ options(pillar.min_title_chars = 10)
 
 ## Overview
 
-A typical computational pipeline to process single-cell RNA sequencing (scRNA-seq) data includes clustering of cells as one of the steps.
-Assignment of cell type labels to those clusters is often a time-consuming process that involves manual inspection of the cluster marker genes complemented with a detailed literature search.
-This is especially challenging when unexpected or poorly described populations are present.
-The clustermole R package provides methods to query thousands of human and mouse cell identity markers sourced from a variety of databases.
+The clustermole R package is designed to simplify the assignment of cell type labels to scRNA-seq clusters.
+It provides methods to query cell identity markers sourced from a variety of databases.
+The package includes three primary features:
 
-The clustermole package provides three primary features:
+* a meta-database of human and mouse markers for thousands of cell types (`clustermole_markers()`)
+* cell type prediction based on marker genes (`clustermole_overlaps()`)
+* cell type prediction based on the full expression matrix (`clustermole_enrichment()`)
 
-* a database of markers for thousands of cell types (`clustermole_markers`)
-* cell type prediction based on marker genes (`clustermole_overlaps`)
-* cell type prediction based on the full expression matrix (`clustermole_enrichment`)
-
-## Usage
+## Setup
 
 You can install clustermole from [CRAN](https://cran.r-project.org/package=clustermole).
 
@@ -45,7 +42,7 @@ Load clustermole.
 library(clustermole)
 ```
 
-### Retrieve cell type markers
+## Cell type markers
 
 You can use clustermole as a simple database and get a data frame of all cell type markers.
 
@@ -57,36 +54,24 @@ markers
 Each row contains a gene and a cell type associated with it.
 The `gene` column is the gene symbol and the `celltype_full` column contains the full cell type string, including the species and the original database. Human or mouse versions can be retrieved.
 
-Some tools require input as a list.
-To convert the markers from a data frame to a list format, you can use `gene` as the values and `celltype_full` as the grouping variable.
+Many tools that work with gene sets require input as a list.
+To convert the markers from a data frame to a list, you can use `gene` as the values and `celltype_full` as the grouping variable.
 
 ```{r celltypes-list}
 markers_list <- split(x = markers$gene, f = markers$celltype_full)
 ```
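Note: the data-frame-to-list conversion above can be tried without the package. This is a minimal sketch on a toy data frame; the genes, cell type labels, and database names are invented for illustration and only the `gene` and `celltype_full` column names come from the vignette.

```r
# Toy stand-in for the data frame returned by clustermole_markers().
# The rows below are hypothetical and not taken from the real database.
markers <- data.frame(
  gene = c("CD19", "MS4A1", "CD3E", "CD3D", "CD19"),
  celltype_full = c(
    "B cell | db1", "B cell | db1",
    "T cell | db1", "T cell | db1",
    "B cell | db2"
  ),
  stringsAsFactors = FALSE
)

# One list element per cell type, each holding a character vector of genes.
markers_list <- split(x = markers$gene, f = markers$celltype_full)

# Inspect the gene set names and their sizes.
names(markers_list)
lengths(markers_list)
```

The result is a named list, which is the shape most gene set tools expect.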
 
-Check the number of cell types in the database.
-
-```{r celltypes-count}
-length(unique(markers$celltype_full))
-```
-
-Check the cell type source databases.
-
-```{r celltypes-dbs}
-sort(unique(markers$db))
-```
-
-### Cell types based on marker genes
+## Cell types based on marker genes
 
 If you have a character vector of genes, such as cluster markers, you can test whether they overlap known cell type markers (overrepresentation analysis).
 
 ```{r overlaps, eval=FALSE}
 my_overlaps <- clustermole_overlaps(genes = my_genes_vec, species = "hs")
 ```
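Note: for readers unfamiliar with overrepresentation analysis, the idea can be sketched with a generic hypergeometric test in base R. This illustrates the concept only and may differ from what `clustermole_overlaps()` computes internally. The gene universe and the reference set are hypothetical.

```r
# Hypothetical gene universe, one reference marker set, and cluster markers.
universe <- paste0("G", 1:1000)
ref_set <- universe[1:50]
my_genes_vec <- c(universe[1:10], universe[901:910])

# Number of cluster markers that are also reference markers.
overlap <- length(intersect(my_genes_vec, ref_set))

# Probability of seeing at least this many shared genes by chance
# when drawing length(my_genes_vec) genes from the universe.
p_value <- phyper(
  q = overlap - 1,
  m = length(ref_set),
  n = length(universe) - length(ref_set),
  k = length(my_genes_vec),
  lower.tail = FALSE
)

overlap
p_value
```

A small p-value suggests that the cluster markers share more genes with the reference set than expected by chance.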
 
-### Cell types based on expression matrix
+## Cell types based on an expression matrix
 
-If you have expression values, such as average expression across clusters, you can perform cell type enrichment based on a given gene expression matrix (log-transformed CPM/TPM/FPKM values).
+If you have expression values, such as average expression for each cluster, you can perform cell type enrichment based on the full gene expression matrix (log-transformed CPM/TPM/FPKM values).
 The matrix should have genes as rows and clusters/samples as columns.
 The underlying enrichment method can be changed using the `method` parameter.
 

diff --git a/vignettes/db.Rmd b/vignettes/db.Rmd
@@ -19,37 +19,47 @@ library(dplyr)
 You can use clustermole as a simple database and get a table of all cell type markers.
 
 ```{r markers}
-markers = clustermole_markers(species = "hs")
+markers <- clustermole_markers(species = "hs")
 markers
 ```
 
 Each row contains a gene and a cell type associated with it.
 The `gene` column is the gene symbol (human or mouse) and the `celltype_full` column contains the detailed cell type string including the species and the original database.
 
+## Number of cell types
+
 Check the total number of the available cell types.
 
 ```{r celltypes-length}
-markers %>% distinct(celltype_full) %>% nrow()
+length(unique(markers$celltype_full))
 ```
 
+## Number of cell types by source database
+
 Check the source databases and the number of cell types from each.
 
 ```{r count-db}
-markers %>% distinct(celltype_full, db) %>% count(db)
+distinct(markers, celltype_full, db) |> count(db)
 ```
 
+## Number of cell types by species
+
 Check the number of cell types per species (not available for all cell types).
 
 ```{r count-species}
-markers %>% distinct(celltype_full, species) %>% count(species)
+distinct(markers, celltype_full, species) |> count(species)
 ```
 
+## Number of cell types by organ
+
 Check the number of available cell types per organ (not available for all cell types).
 
 ```{r count-organ}
-markers %>% distinct(celltype_full, organ) %>% count(organ, sort = TRUE)
+distinct(markers, celltype_full, organ) |> count(organ, sort = TRUE)
 ```
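Note: the per-database count can also be computed without dplyr, in the same spirit as the base R `length(unique(...))` call introduced above. This is a sketch on a toy data frame; the rows and database names are hypothetical, and only the column names `celltype_full` and `db` come from the vignette.

```r
# Toy stand-in for the clustermole_markers() output (hypothetical rows).
markers <- data.frame(
  gene = c("CD19", "MS4A1", "CD3E", "ALB", "CD19"),
  celltype_full = c(
    "B cell | dbA", "B cell | dbA", "T cell | dbA",
    "Hepatocyte | dbB", "B cell | dbB"
  ),
  db = c("dbA", "dbA", "dbA", "dbB", "dbB"),
  stringsAsFactors = FALSE
)

# Keep one row per cell type and database, then count cell types per database.
celltype_db <- unique(markers[, c("celltype_full", "db")])
db_counts <- table(celltype_db$db)

db_counts
```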
 
+## Package version
+
 Check the package version since the database contents may change.
 
 ```{r package-version}

diff --git a/vignettes/example-bm-seurat.Rmd b/vignettes/example-bm-seurat.Rmd
@@ -14,10 +14,10 @@ options(pillar.min_chars = Inf)
 ## Introduction
 
 Assignment of cell type labels to scRNA-seq clusters is particularly difficult when unexpected or poorly described populations are present.
-You can often trust various fully automated algorithms for cell type annotation, but sometimes a more exploratory analysis is helpful in understanding the captured cells.
+There are fully automated algorithms for cell type annotation, but sometimes a more in-depth analysis is helpful in understanding the captured cells.
 This is an example of exploratory cell type analysis using clustermole, starting with a Seurat object.
 
-The dataset used here contains hematopoietic and stromal bone marrow populations ([Baccin et al.](https://doi.org/10.1038/s41556-019-0439-6)).
+The dataset used in this example contains hematopoietic and stromal bone marrow populations ([Baccin et al.](https://doi.org/10.1038/s41556-019-0439-6)).
 This experiment was selected because it includes both well-known and rare cell types.
 
 ## Load data
@@ -27,11 +27,13 @@ Load relevant packages.
 ```{r load-libraries, message=FALSE}
 library(Seurat)
 library(dplyr)
+library(ggplot2)
 library(ggsci)
+library(clustermole)
 ```
 
-Load the dataset, which is stored as a Seurat object.
-It was subset for this example to reduce the size and speed up processing.
+Download the dataset, which is stored as a Seurat object.
+It was subset for this tutorial to reduce the size and speed up processing.
 
 ```{r load-seurat-object, message=FALSE, warning=FALSE}
 so <- readRDS(url("https://osf.io/cvnqb/download"))
@@ -40,38 +42,32 @@ so
 
 Check the experiment labels on a tSNE visualization, as shown in the original publication ([original figure](https://www.nature.com/articles/s41556-019-0439-6/figures/1)).
 
-```{r tsne-exp}
-DimPlot(so, reduction = "tsne", group.by = "experiment", cells = sample(colnames(so))) +
+```{r tsne-experiment}
+DimPlot(so, reduction = "tsne", group.by = "experiment", shuffle = TRUE) +
+  theme(aspect.ratio = 1, legend.text = element_text(size = rel(0.7))) +
   scale_color_nejm()
 ```
 
 Check the cell type labels on a tSNE visualization.
 
 ```{r tsne-celltype}
-DimPlot(so, reduction = "tsne", group.by = "celltype", cells = sample(colnames(so))) +
+DimPlot(so, reduction = "tsne", group.by = "celltype", shuffle = TRUE) +
+  theme(aspect.ratio = 1, legend.text = element_text(size = rel(0.8))) +
   scale_color_igv()
 ```
 
-## Cell type annotation
-
-Since this is a clustermole tutorial, load clustermole.
-
-```{r load-clustermole, message=FALSE, warning=FALSE}
-library(clustermole)
-```
-
-Set the Seurat object cell identities to the predefined cell type labels and check what they are.
+Set the Seurat object cell identities to the predefined cell type labels for the next steps.
 
 ```{r set-idents}
 Idents(so) <- "celltype"
 levels(Idents(so))
 ```
 
-### Marker gene overlaps
+## Marker gene overlaps
 
 One type of analysis facilitated by clustermole is based on comparison of marker genes.
 
-We can start with the B-cells, which is well-defined population used in many studies.
+We can start with B-cells, a well-defined population used in many studies.
 
 Find markers for the B-cell cluster.
 
@@ -148,7 +144,7 @@ head(overlaps_tbl, 15)
 
 The top results are again more diverse than for B-cells, but the appropriate populations are listed.
 
-### Enrichment of markers
+## Enrichment of markers
 
 Rather than comparing marker genes, it's also possible to run enrichment of cell type signatures across all genes.
 This avoids having to define an optimal set of markers.
@@ -181,19 +177,25 @@ enrich_tbl <- clustermole_enrichment(expr_mat = avg_exp_mat, species = "mm")
 Check the most enriched cell types for the B-cell cluster.
 
 ```{r}
-enrich_tbl %>% filter(cluster == "B-cell") %>% head(15)
+enrich_tbl %>%
+  filter(cluster == "B-cell") %>%
+  head(15)
 ```
 
 As with the previous analysis, the top results are various B-cell populations.
 
 Check the most enriched cell types for the Adipo-CAR cluster.
 
 ```{r}
-enrich_tbl %>% filter(cluster == "Adipo-CAR") %>% head(15)
+enrich_tbl %>%
+  filter(cluster == "Adipo-CAR") %>%
+  head(15)
 ```
 
 Check the most enriched cell types for the Osteoblasts cluster.
 
 ```{r}
-enrich_tbl %>% filter(cluster == "Osteoblasts") %>% head(15)
+enrich_tbl %>%
+  filter(cluster == "Osteoblasts") %>%
+  head(15)
 ```
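
Note: the three per-cluster chunks above repeat the same filter-and-head pattern. A possible way to get the top rows for every cluster at once is sketched below in base R. The table here is a toy stand-in: only the `cluster` column name comes from the vignette, the other columns and all values are hypothetical, and the rows are assumed to be already sorted from most to least enriched.

```r
# Toy stand-in for the clustermole_enrichment() output (hypothetical values),
# assumed to be sorted by decreasing enrichment.
enrich_tbl <- data.frame(
  cluster = c("B-cell", "B-cell", "B-cell", "Osteoblasts", "Osteoblasts"),
  celltype_full = c("B cell A", "B cell B", "T cell", "Osteoblast", "Fibroblast"),
  score = c(0.9, 0.8, 0.1, 0.7, 0.4),
  stringsAsFactors = FALSE
)

# Split the table by cluster and keep the first rows of each piece.
top_n <- 2
top_by_cluster <- lapply(split(enrich_tbl, enrich_tbl$cluster), head, n = top_n)

top_by_cluster[["B-cell"]]
```

The result is a named list with one small data frame per cluster, which avoids writing a separate chunk for each cluster name.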