Update vignettes
igordot committed Jan 5, 2024
1 parent 69f1448 commit 3dffd15
Showing 5 changed files with 70 additions and 75 deletions.
23 changes: 3 additions & 20 deletions README.md
@@ -1,27 +1,10 @@
# clustermole: exploratory scRNA-seq cell type analysis

<!-- badges: start -->
[![CRAN](https://www.r-pkg.org/badges/version/clustermole)](https://cran.r-project.org/package=clustermole)
[![R-CMD-check](https://github.com/igordot/clustermole/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/igordot/clustermole/actions/workflows/R-CMD-check.yaml)
[![codecov](https://codecov.io/gh/igordot/clustermole/branch/master/graph/badge.svg)](https://codecov.io/gh/igordot/clustermole)
<!-- badges: end -->
The clustermole R package (available on [CRAN](https://cran.r-project.org/package=clustermole)) provides methods to query cell identity markers sourced from a variety of databases.
It includes three primary features:

![clustermole-book](https://user-images.githubusercontent.com/6363505/72761156-12414280-3ba9-11ea-87de-57ff6cd690bb.png)

## Overview

Assignment of cell type labels to single-cell RNA sequencing (scRNA-seq) clusters is often a time-consuming process that involves manual inspection of the cluster marker genes complemented with a detailed literature search.
This can be especially challenging when unexpected or poorly described populations are present.
The clustermole R package provides methods to query thousands of human and mouse cell identity markers sourced from a variety of databases.

The clustermole package provides three primary features:

* a database of markers for thousands of cell types
* a meta-database of human and mouse markers for thousands of cell types
* cell type prediction based on marker genes
* cell type prediction based on the full expression matrix

Check the [documentation website](https://igordot.github.io/clustermole/) for more information.

---

*Image credit: "A Child's Primer Of Natural History" by Oliver Herford*
15 changes: 15 additions & 0 deletions index.md
@@ -0,0 +1,15 @@

<!-- badges: start -->
[![CRAN](https://www.r-pkg.org/badges/version/clustermole)](https://cran.r-project.org/package=clustermole)
[![R-CMD-check](https://github.com/igordot/clustermole/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/igordot/clustermole/actions/workflows/R-CMD-check.yaml)
[![codecov](https://codecov.io/gh/igordot/clustermole/graph/badge.svg?token=YoTQTU1EDk)](https://codecov.io/gh/igordot/clustermole)
<!-- badges: end -->

Assignment of cell type labels to single-cell RNA sequencing (scRNA-seq) clusters is often a time-consuming process that involves manual inspection of the cluster marker genes complemented with a detailed literature search.
This can be especially challenging when unexpected or poorly described populations are present.
The clustermole R package provides methods to query cell identity markers sourced from a variety of databases.

---

![](https://user-images.githubusercontent.com/6363505/72761156-12414280-3ba9-11ea-87de-57ff6cd690bb.png)
*Image credit: "A Child's Primer Of Natural History" by Oliver Herford*
41 changes: 13 additions & 28 deletions vignettes/clustermole-intro.Rmd
@@ -20,18 +20,15 @@ options(pillar.min_title_chars = 10)

## Overview

A typical computational pipeline to process single-cell RNA sequencing (scRNA-seq) data includes clustering of cells as one of the steps.
Assignment of cell type labels to those clusters is often a time-consuming process that involves manual inspection of the cluster marker genes complemented with a detailed literature search.
This is especially challenging when unexpected or poorly described populations are present.
The clustermole R package provides methods to query thousands of human and mouse cell identity markers sourced from a variety of databases.
The clustermole R package is designed to simplify the assignment of cell type labels to scRNA-seq clusters.
It provides methods to query cell identity markers sourced from a variety of databases.
The package includes three primary features:

The clustermole package provides three primary features:
* a meta-database of human and mouse markers for thousands of cell types (`clustermole_markers()`)
* cell type prediction based on marker genes (`clustermole_overlaps()`)
* cell type prediction based on the full expression matrix (`clustermole_enrichment()`)

* a database of markers for thousands of cell types (`clustermole_markers`)
* cell type prediction based on marker genes (`clustermole_overlaps`)
* cell type prediction based on the full expression matrix (`clustermole_enrichment`)

## Usage
## Setup

You can install clustermole from [CRAN](https://cran.r-project.org/package=clustermole).

@@ -45,7 +42,7 @@ Load clustermole.
library(clustermole)
```

### Retrieve cell type markers
## Cell type markers

You can use clustermole as a simple database and get a data frame of all cell type markers.

@@ -57,36 +54,24 @@ markers
Each row contains a gene and a cell type associated with it.
The `gene` column is the gene symbol and the `celltype_full` column contains the full cell type string, including the species and the original database. Human or mouse versions can be retrieved.

Some tools require input as a list.
To convert the markers from a data frame to a list format, you can use `gene` as the values and `celltype_full` as the grouping variable.
Many tools that work with gene sets require input as a list.
To convert the markers from a data frame to a list, you can use `gene` as the values and `celltype_full` as the grouping variable.

```{r celltypes-list}
markers_list <- split(x = markers$gene, f = markers$celltype_full)
```
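
As a minimal sketch of what this conversion produces, the toy data frame below mimics the `gene` and `celltype_full` columns. The values are hypothetical and are not actual clustermole output.

```r
# Toy stand-in for the clustermole markers table (hypothetical values).
toy_markers <- data.frame(
  gene = c("CD19", "MS4A1", "CD3E", "CD3D", "CD79A"),
  celltype_full = c("B cell", "B cell", "T cell", "T cell", "B cell"),
  stringsAsFactors = FALSE
)

# split() groups the gene symbols by cell type and returns a named list
# with one character vector of genes per cell type.
toy_list <- split(x = toy_markers$gene, f = toy_markers$celltype_full)

names(toy_list)
lengths(toy_list)
```

Each list element is named after a cell type, which is the shape that gene set tools commonly expect.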

Check the number of cell types in the database.

```{r celltypes-count}
length(unique(markers$celltype_full))
```

Check the cell type source databases.

```{r celltypes-dbs}
sort(unique(markers$db))
```

### Cell types based on marker genes
## Cell types based on marker genes

If you have a character vector of genes, such as cluster markers, you can compare them to known cell type markers to check whether the overlap is larger than expected by chance (overrepresentation analysis).

```{r overlaps, eval=FALSE}
my_overlaps <- clustermole_overlaps(genes = my_genes_vec, species = "hs")
```
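
To illustrate the general idea behind overrepresentation analysis, here is a base R sketch using a hypergeometric test. The gene names, set sizes, and background are made up for illustration, and this is not necessarily how `clustermole_overlaps()` computes its statistics.

```r
# Hypothetical inputs: a background of 1,000 genes, one known marker set,
# and a vector of cluster markers that shares 10 genes with that set.
universe <- paste0("gene", 1:1000)
celltype_markers <- universe[1:50]
my_genes <- universe[c(1:10, 501:520)]

# Number of query genes that are also known markers.
n_overlap <- length(intersect(my_genes, celltype_markers))

# Probability of seeing at least n_overlap shared genes when drawing
# length(my_genes) genes at random from the background.
p_value <- phyper(
  q = n_overlap - 1,
  m = length(celltype_markers),
  n = length(universe) - length(celltype_markers),
  k = length(my_genes),
  lower.tail = FALSE
)

n_overlap
p_value
```

A small p-value suggests that the query genes overlap the marker set more than random sampling would explain.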

### Cell types based on expression matrix
## Cell types based on an expression matrix

If you have expression values, such as average expression across clusters, you can perform cell type enrichment based on a given gene expression matrix (log-transformed CPM/TPM/FPKM values).
If you have expression values, such as average expression for each cluster, you can perform cell type enrichment based on the full gene expression matrix (log-transformed CPM/TPM/FPKM values).
The matrix should have genes as rows and clusters/samples as columns.
The underlying enrichment method can be changed using the `method` parameter.
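
The expected input shape can be sketched with a toy matrix in base R. The counts, gene names, and cluster labels below are hypothetical, and the normalization shown (CPM followed by `log1p()`) is only one of the acceptable options mentioned above.

```r
# Hypothetical raw counts: 4 genes (rows) x 6 cells (columns).
counts <- matrix(
  1:24,
  nrow = 4,
  dimnames = list(paste0("Gene", 1:4), paste0("cell", 1:6))
)

# Hypothetical cluster assignment for each cell.
clusters <- c("A", "A", "B", "B", "C", "C")

# Scale each cell to counts per million, then log-transform.
cpm <- t(t(counts) / colSums(counts)) * 1e6
log_cpm <- log1p(cpm)

# Average the log-transformed values across the cells of each cluster,
# giving a matrix with genes as rows and clusters as columns.
avg_exp_mat <- sapply(
  split(seq_len(ncol(log_cpm)), clusters),
  function(idx) rowMeans(log_cpm[, idx, drop = FALSE])
)

dim(avg_exp_mat)
colnames(avg_exp_mat)
```

The resulting matrix has one column per cluster, so each cluster can be scored separately.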

20 changes: 15 additions & 5 deletions vignettes/db.Rmd
@@ -19,37 +19,47 @@ library(dplyr)
You can use clustermole as a simple database and get a table of all cell type markers.

```{r markers}
markers = clustermole_markers(species = "hs")
markers <- clustermole_markers(species = "hs")
markers
```

Each row contains a gene and a cell type associated with it.
The `gene` column is the gene symbol (human or mouse) and the `celltype_full` column contains the detailed cell type string including the species and the original database.

## Number of cell types

Check the total number of the available cell types.

```{r celltypes-length}
markers %>% distinct(celltype_full) %>% nrow()
length(unique(markers$celltype_full))
```

## Number of cell types by source database

Check the source databases and the number of cell types from each.

```{r count-db}
markers %>% distinct(celltype_full, db) %>% count(db)
distinct(markers, celltype_full, db) |> count(db)
```
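
The deduplicate-then-count logic can be cross-checked in base R on a toy table. The data frame below is hypothetical and only mimics the `celltype_full` and `db` columns; `unique()` and `table()` stand in for `distinct()` and `count()`.

```r
# Toy stand-in for the markers table (hypothetical values).
toy_markers <- data.frame(
  gene = c("CD19", "MS4A1", "CD3E", "CD3D", "ALB"),
  celltype_full = c(
    "B cell | db1", "B cell | db1",
    "T cell | db1", "T cell | db1",
    "Hepatocyte | db2"
  ),
  db = c("db1", "db1", "db1", "db1", "db2"),
  stringsAsFactors = FALSE
)

# Keep one row per cell type and database pair, so that cell types
# with many marker genes are not counted more than once.
celltype_db <- unique(toy_markers[, c("celltype_full", "db")])

# Count the cell types contributed by each database.
db_counts <- table(celltype_db$db)
db_counts
```

Skipping the deduplication step would count marker genes rather than cell types.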

## Number of cell types by species

Check the number of cell types per species (not available for all cell types).

```{r count-species}
markers %>% distinct(celltype_full, species) %>% count(species)
distinct(markers, celltype_full, species) |> count(species)
```

## Number of cell types by organ

Check the number of available cell types per organ (not available for all cell types).

```{r count-organ}
markers %>% distinct(celltype_full, organ) %>% count(organ, sort = TRUE)
distinct(markers, celltype_full, organ) |> count(organ, sort = TRUE)
```

## Package version

Check the package version since the database contents may change.

```{r package-version}
46 changes: 24 additions & 22 deletions vignettes/example-bm-seurat.Rmd
@@ -14,10 +14,10 @@ options(pillar.min_chars = Inf)
## Introduction

Assignment of cell type labels to scRNA-seq clusters is particularly difficult when unexpected or poorly described populations are present.
You can often trust various fully automated algorithms for cell type annotation, but sometimes a more exploratory analysis is helpful in understanding the captured cells.
There are fully automated algorithms for cell type annotation, but sometimes a more in-depth analysis is helpful in understanding the captured cells.
This is an example of exploratory cell type analysis using clustermole, starting with a Seurat object.

The dataset used here contains hematopoietic and stromal bone marrow populations ([Baccin et al.](https://doi.org/10.1038/s41556-019-0439-6)).
The dataset used in this example contains hematopoietic and stromal bone marrow populations ([Baccin et al.](https://doi.org/10.1038/s41556-019-0439-6)).
This experiment was selected because it includes both well-known and rare cell types.

## Load data
@@ -27,11 +27,13 @@ Load relevant packages.
```{r load-libraries, message=FALSE}
library(Seurat)
library(dplyr)
library(ggplot2)
library(ggsci)
library(clustermole)
```

Load the dataset, which is stored as a Seurat object.
It was subset for this example to reduce the size and speed up processing.
Download the dataset, which is stored as a Seurat object.
It was subset for this tutorial to reduce the size and speed up processing.

```{r load-seurat-object, message=FALSE, warning=FALSE}
so <- readRDS(url("https://osf.io/cvnqb/download"))
@@ -40,38 +42,32 @@ so

Check the experiment labels on a tSNE visualization, as shown in the original publication ([original figure](https://www.nature.com/articles/s41556-019-0439-6/figures/1)).

```{r tsne-exp}
DimPlot(so, reduction = "tsne", group.by = "experiment", cells = sample(colnames(so))) +
```{r tsne-experiment}
DimPlot(so, reduction = "tsne", group.by = "experiment", shuffle = TRUE) +
theme(aspect.ratio = 1, legend.text = element_text(size = rel(0.7))) +
scale_color_nejm()
```

Check the cell type labels on a tSNE visualization.

```{r tsne-celltype}
DimPlot(so, reduction = "tsne", group.by = "celltype", cells = sample(colnames(so))) +
DimPlot(so, reduction = "tsne", group.by = "celltype", shuffle = TRUE) +
theme(aspect.ratio = 1, legend.text = element_text(size = rel(0.8))) +
scale_color_igv()
```

## Cell type annotation

Since this is a clustermole tutorial, load clustermole.

```{r load-clustermole, message=FALSE, warning=FALSE}
library(clustermole)
```

Set the Seurat object cell identities to the predefined cell type labels and check what they are.
Set the Seurat object cell identities to the predefined cell type labels for the next steps.

```{r set-idents}
Idents(so) <- "celltype"
levels(Idents(so))
```

### Marker gene overlaps
## Marker gene overlaps

One type of analysis facilitated by clustermole is based on comparison of marker genes.

We can start with the B-cells, which is well-defined population used in many studies.
We can start with the B-cells, which is a well-defined population used in many studies.

Find markers for the B-cell cluster.

@@ -148,7 +144,7 @@ head(overlaps_tbl, 15)

The top results are again more diverse than for B-cells, but the appropriate populations are listed.

### Enrichment of markers
## Enrichment of markers

Rather than comparing marker genes, it's also possible to run enrichment of cell type signatures across all genes.
This avoids having to define an optimal set of markers.
@@ -181,19 +177,25 @@ enrich_tbl <- clustermole_enrichment(expr_mat = avg_exp_mat, species = "mm")
Check the most enriched cell types for the B-cell cluster.

```{r}
enrich_tbl %>% filter(cluster == "B-cell") %>% head(15)
enrich_tbl %>%
filter(cluster == "B-cell") %>%
head(15)
```

As with the previous analysis, the top results are various B-cell populations.

Check the most enriched cell types for the Adipo-CAR cluster.

```{r}
enrich_tbl %>% filter(cluster == "Adipo-CAR") %>% head(15)
enrich_tbl %>%
filter(cluster == "Adipo-CAR") %>%
head(15)
```

Check the most enriched cell types for the Osteoblasts cluster.

```{r}
enrich_tbl %>% filter(cluster == "Osteoblasts") %>% head(15)
enrich_tbl %>%
filter(cluster == "Osteoblasts") %>%
head(15)
```
