generated from statOmics/Rmd-website
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathsingleCell_MacoskoWorkflow_blank.Rmd
165 lines (103 loc) · 4.63 KB
/
singleCell_MacoskoWorkflow_blank.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
---
title: 'scRNA-seq workflow: Macosko et al.'
author: "Koen Van den Berge"
date: "11/16/2020"
output:
html_document:
toc: true
toc_float: true
---
# Import data
The `scRNAseq` package provides convenient access to several datasets. See the [package Bioconductor page](http://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) for more information.
```{r}
# install BiocManager package if not installed yet.
# BiocManager is the package installer for Bioconductor software.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# install packages if not yet installed.
pkgs <- c("SingleCellExperiment", "DropletUtils", "scRNAseq", "scater", "scuttle", "scran", "BiocSingular", "scDblFinder", "glmpca", "uwot")
notInstalled <- pkgs[!pkgs %in% installed.packages()[,1]]
if(length(notInstalled) > 0){
BiocManager::install(notInstalled)
}
# Code below might ask you to create an ExperimentHub directory.
# Type 'yes' and hit Enter, to allow this.
suppressPackageStartupMessages(library(scRNAseq))
sce <- MacoskoRetinaData()
```
# A `SingleCellExperiment` object
```{r}
sce
```
## Accessing data from a `SingleCellExperiment` object
Please see [Figure 4.1 in OSCA](http://bioconductor.org/books/release/OSCA/data-infrastructure.html) for an overview of a `SingleCellExperiment` object.
```{r}
# Data: assays
assays(sce)
assays(sce)$counts[1:5, 1:5]
# Feature metadata: rowData
rowData(sce) # empty for now
# Cell metadata: colData
colData(sce)
# Reduced dimensions: reducedDims
reducedDims(sce) # empty for now
```
## Creating a new `SingleCellExperiment` object
```{r}
sceNew <- SingleCellExperiment(assays = list(counts = assays(sce)$counts))
sceNew
rm(sceNew)
```
## Storing (meta)data in a `SingleCellExperiment` object
```{r}
fakeGeneNames <- paste0("gene", 1:nrow(sce))
rowData(sce)$fakeName <- fakeGeneNames
head(rowData(sce))
# Remove again by setting to NULL
rowData(sce)$fakeName <- NULL
assays(sce)$logCounts <- log1p(assays(sce)$counts)
assays(sce)
assays(sce)$logCounts[1:5, 1:5]
assays(sce)$logCounts <- NULL
```
# Filtering non-informative genes
# Quality control
## Calculate QC variables
Note that the character string "^MT-" in the rownames of the sce object indicates mitochondrial genes.
## EDA
High-quality cells should have many features expressed, and a low contribution of mitochondrial genes. Here, we see that several cells have a very low number of expressed genes, and where most of the molecules are derived from mitochondrial genes. This indicates likely damaged cells, presumably because of loss of cytoplasmic RNA from perforated cells, so we'd want to remove these for the downstream analysis.
## QC using adaptive thresholds
Below, we remove cells that are outlying with respect to
1. A low sequencing depth (number of UMIs);
2. A low number of genes detected;
3. A high percentage of reads from mitochondrial genes.
Also make plots to assess the impact of the removal of .
## Identifying and removing empty droplets
Note that the removal of cells with low sequencing depth using the adaptive threshold procedure above is a way of removing empty droplets.
Other approaches are possible, e.g., removing cells by statistical testing using `emtpyDrops`.
This does require us to specify a lower bound on the total number of UMIs, below which all cells are considered to correspond to empty droplets.
This lower bound may not be trivial to derive, but the `barcodeRanks` function can be useful to identify an elbow/knee point.
## Identifying and removing doublets
We will use [scDblFinder](https://bioconductor.org/packages/3.14/bioc/html/scDblFinder.html) to detect doublet cells.
# Normalization
For normalization, the size factors $s_i$ computed here are simply scaled library sizes:
\[ N_i = \sum_g Y_{gi} \]
\[ s_i = N_i / \bar{N}_i \]
# Feature selection
# Dimension reduction
Note that, below, we color the cells using the known, true cell type label as defined in the metadata, to empirically evaluate the dimension reduction. In reality, we don't know this yet at this stage.
## Linear dimension reduction: PCA
We are able to recover quite some structure.
However, many cell populations remain obscure, and the plot is overcrowded.
### PCA without feature selection
## A generalization of PCA for exponential family distributions.
## Non-linear dimension reduction: UMAP
# Clustering
```{r}
# Build a shared nearest-neighbor graph from PCA space
#g <- buildSNNGraph(sce, use.dimred = 'PCA')
# Louvain clustering on the SNN graph, and add to sce
#colData(sce)$label <- factor(igraph::cluster_louvain(g)$membership)
# Visualization.
#plotUMAP(sce, colour_by="label")
```