singleCell_MacoskoWorkflow_blank.Rmd

---
title: 'scRNA-seq workflow: Macosko et al.'
author: "Koen Van den Berge"
date: "11/16/2020"
output: 
  html_document:
    toc: true
    toc_float: true
---

# Import data

The `scRNAseq` package provides convenient access to several datasets. See the [package Bioconductor page](http://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) for more information.

```{r}
# install BiocManager package if not installed yet.
# BiocManager is the package installer for Bioconductor software.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# install packages if not yet installed.
pkgs <- c("SingleCellExperiment", "DropletUtils", "scRNAseq", "scater", "scuttle", "scran", "BiocSingular", "scDblFinder", "glmpca", "uwot")
notInstalled <- pkgs[!pkgs %in% installed.packages()[,1]]
if(length(notInstalled) > 0){
  BiocManager::install(notInstalled)
}

# Code below might ask you to create an ExperimentHub directory. 
# Type 'yes' and hit Enter, to allow this.
suppressPackageStartupMessages(library(scRNAseq))
sce <- MacoskoRetinaData()
```

# A `SingleCellExperiment` object


```{r}
sce
```

## Accessing data from a `SingleCellExperiment` object

Please see [Figure 4.1 in OSCA](http://bioconductor.org/books/release/OSCA/data-infrastructure.html) for an overview of a `SingleCellExperiment` object.

```{r}
# Data: assays
assays(sce)
assays(sce)$counts[1:5, 1:5]

# Feature metadata: rowData
rowData(sce) # empty for now

# Cell metadata: colData
colData(sce)

# Reduced dimensions: reducedDims
reducedDims(sce) # empty for now
```

## Creating a new `SingleCellExperiment` object

```{r}
sceNew <- SingleCellExperiment(assays = list(counts = assays(sce)$counts))
sceNew

rm(sceNew)
```

## Storing (meta)data in a `SingleCellExperiment` object

```{r}
fakeGeneNames <- paste0("gene", 1:nrow(sce))
rowData(sce)$fakeName <- fakeGeneNames
head(rowData(sce))
# Remove again by setting to NULL
rowData(sce)$fakeName <- NULL

assays(sce)$logCounts <- log1p(assays(sce)$counts)
assays(sce)
assays(sce)$logCounts[1:5, 1:5]
assays(sce)$logCounts <- NULL
```

# Filtering non-informative genes


# Quality control

## Calculate QC variables

Note that the character string "^MT-" in the rownames of the sce object indicates mitochondrial genes. 

## EDA

High-quality cells should have many features expressed, and a low contribution of mitochondrial genes. Here, we see that several cells have a very low number of expressed genes, and where most of the molecules are derived from mitochondrial genes. This indicates likely damaged cells, presumably because of loss of cytoplasmic RNA from perforated cells, so we'd want to remove these for the downstream analysis.


## QC using adaptive thresholds

Below, we remove cells that are outlying with respect to

 1. A low sequencing depth (number of UMIs);
 2. A low number of genes detected;
 3. A high percentage of reads from mitochondrial genes.

Also make plots to assess the impact of the removal of . 


## Identifying and removing empty droplets

Note that the removal of cells with low sequencing depth using the adaptive threshold procedure above is a way of removing empty droplets. 
Other approaches are possible, e.g., removing cells by statistical testing using `emtpyDrops`.
This does require us to specify a lower bound on the total number of UMIs, below which all cells are considered to correspond to empty droplets.
This lower bound may not be trivial to derive, but the `barcodeRanks` function can be useful to identify an elbow/knee point.


## Identifying and removing doublets

We will use [scDblFinder](https://bioconductor.org/packages/3.14/bioc/html/scDblFinder.html) to detect doublet cells.


# Normalization

For normalization, the size factors $s_i$ computed here are simply scaled library sizes:
\[ N_i = \sum_g Y_{gi} \]
\[ s_i = N_i / \bar{N}_i \]

# Feature selection


# Dimension reduction

Note that, below, we color the cells using the known, true cell type label as defined in the metadata, to empirically evaluate the dimension reduction. In reality, we don't know this yet at this stage.

## Linear dimension reduction: PCA

We are able to recover quite some structure. 
However, many cell populations remain obscure, and the plot is overcrowded.


### PCA without feature selection


## A generalization of PCA for exponential family distributions.


## Non-linear dimension reduction: UMAP


# Clustering

```{r}
# Build a shared nearest-neighbor graph from PCA space
#g <- buildSNNGraph(sce, use.dimred = 'PCA')

# Louvain clustering on the SNN graph, and add to sce
#colData(sce)$label <- factor(igraph::cluster_louvain(g)$membership)

# Visualization.
#plotUMAP(sce, colour_by="label")
```