The final session of the course will cover pre-processing and basic analysis of single-cell bisulfite sequencing data. We will assay two cell types, probably 16 cells in total.
Two main goals:
- Methylation profiles define cell type (i.e. cells will cluster apart by e.g. PCA)
- Context specificity of methylation variance. E.g. in mouse ES cells, CpG islands (CGIs) are homogeneous (and low in methylation), repeat elements are homogeneously high, and active enhancer elements are heterogeneous. This is interesting because enhancer elements are cell-type specific, so variation in their methylation levels implies plasticity in cell identity, which could be important for lineage formation.
Clone or download this repository so that you have the necessary code, data and materials to hand.
If you're familiar with git:
git clone https://github.com/davismcc/SingleCellOmics_Heidelberg_Apr2017.git
If not, you can download a zip file of the repository by clicking the green "Clone or download" button above.
We have two 1.5 hour sessions to work on single-cell methylation. Broadly, we will spend the first session on processing the raw sequence files to get summarized, annotated methylation results for genomic features of interest. In the second session we will analyze and plot these results to fulfill the goals above.
- We will use Bismark for alignments and methylation calling. For details, see this protocol paper.
- QC (also see protocol paper)
  - Negative controls should not align
  - bisulfite conversion efficiency (assessed using CHH methylation from Bismark reports) should be >95%
  - mapping efficiency (from Bismark reports) >10% (30-40% is normal here but may end up lower in these practicals)
  - number of CpG sites covered (I use 1M unique positions, but this depends on sequencing depth, so it may be better simply to exclude outliers)
- Preprocessing and annotation
  - Quantify methylation over regions of interest (promoters, gene bodies, enhancers, repeats, CpG islands).
    - mean methylation rate (each covered position counts once, i.e. do not give extra weight to positions with >1 read)
    - also record the coverage (number of CpG sites that were covered in that cell at that locus) for the purpose of assigning weights to each cell in downstream analyses
- Analysis
  - Mean methylation by feature / cell type
  - Variation by feature / cell type
  - Dimension reduction
  - Clustering
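The QC thresholds and the region-level quantification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the course pipeline: the function names, the threshold constants and the input layout (one tuple of chromosome, position, methylated read count and unmethylated read count per CpG site) are hypothetical, and each position is assumed to appear at most once in the input.

```python
from typing import Iterable, Optional, Tuple

# Thresholds taken from the QC list above; the constant names are hypothetical.
MIN_CONVERSION_EFFICIENCY = 0.95
MIN_MAPPING_EFFICIENCY = 0.10
MIN_CPG_SITES = 1_000_000  # depends on sequencing depth, adjust as needed


def passes_qc(chh_methylation: float,
              mapping_efficiency: float,
              n_cpg_sites: int,
              min_cpg_sites: int = MIN_CPG_SITES) -> bool:
    """Return True if a cell meets all three QC thresholds.

    Conversion efficiency is approximated as 1 - CHH methylation rate,
    with all rates given as fractions between 0 and 1.
    """
    conversion_efficiency = 1.0 - chh_methylation
    return (conversion_efficiency > MIN_CONVERSION_EFFICIENCY
            and mapping_efficiency > MIN_MAPPING_EFFICIENCY
            and n_cpg_sites >= min_cpg_sites)


# Assumed layout of one methylation call:
# (chromosome, position, methylated reads, unmethylated reads)
Call = Tuple[str, int, int, int]


def summarise_region(calls: Iterable[Call],
                     chrom: str,
                     start: int,
                     end: int) -> Tuple[Optional[float], int]:
    """Mean methylation rate and coverage for one region in one cell.

    Each covered position contributes exactly one value (its own rate),
    so positions with more than one read get no extra weight.
    Returns (mean_rate, n_sites); mean_rate is None if nothing is covered.
    """
    rates = []
    for call_chrom, pos, n_meth, n_unmeth in calls:
        total = n_meth + n_unmeth
        if call_chrom != chrom or not (start <= pos <= end) or total == 0:
            continue
        rates.append(n_meth / total)  # one rate per position
    n_sites = len(rates)
    mean_rate = sum(rates) / n_sites if n_sites else None
    return mean_rate, n_sites


# Toy example: two covered sites fall inside chr1:50-200.
example_calls = [("chr1", 100, 3, 0), ("chr1", 150, 0, 1),
                 ("chr1", 500, 1, 0), ("chr2", 120, 1, 0)]
example_summary = summarise_region(example_calls, "chr1", 50, 200)  # (0.5, 2)
```

In the toy example a read-weighted mean would be 3/4, because the first site has three reads; counting each position once gives 0.5. The second returned value is the per-cell, per-locus coverage that can later serve as a weight in downstream analyses.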
We will manage the data processing and analysis "pipeline" using snakemake. We will analyze our results in RStudio, using an R Markdown Notebook (see the notebooks folder in this repository for an example).
The aim will be for you to analyze the data you generate during the course in Heidelberg.
However, in case that data is unavailable for any reason and to have an alternative dataset that is processed and ready for analysis, we also have access to a small dataset from Stephen Clark and colleagues at the Babraham Institute, Cambridge. This dataset consists of 15 cells from mouse embryos.
- Raw fastq files are available at this link (6GB; password required, which will be shared on the course Slack channel). Only if you want to work from raw fastq files (substantial computation needed) and have a high-bandwidth connection, download the files at the link and save to data/fastq.
- Raw fastq files for a "test" dataset (sampling 500,000 reads from each of the above fastq files), smaller in size so a little more convenient, are available at this link (210MB; password required).
- Merged Bismark files are available at this link (76MB; password required). Download and copy these to data/bismark/merged.
- Summarized, annotated methylation results that we will use for analysis are available in the results folder of this repository (we will generate these ourselves during the course). A version of this file, results/all.tsv.gz, that has already been computed is available at this link (3.5MB) in case you wish to use it for the second part of the analysis.
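For anyone who wants to build a similarly reduced dataset from their own fastq files, one possible approach is reservoir sampling, sketched below. How the "test" dataset above was actually generated is not stated here, so this is an assumption: the function name and its parameters are hypothetical, and the code assumes plain four-line fastq records.

```python
import random
from itertools import islice
from typing import Iterable, List


def sample_fastq_records(lines: Iterable[str],
                         n_reads: int,
                         seed: int = 0) -> List[List[str]]:
    """Draw up to n_reads records uniformly at random from a fastq stream.

    Uses reservoir sampling, so the stream is read once and only n_reads
    records are held in memory. Each record is a list of four lines.
    """
    rng = random.Random(seed)
    reservoir: List[List[str]] = []
    line_iter = iter(lines)
    index = 0
    while True:
        record = list(islice(line_iter, 4))  # header, sequence, '+', qualities
        if len(record) < 4:
            break  # end of stream (or a truncated final record)
        if index < n_reads:
            reservoir.append(record)
        else:
            # Replace an existing record with probability n_reads / (index + 1).
            slot = rng.randint(0, index)
            if slot < n_reads:
                reservoir[slot] = record
        index += 1
    return reservoir
```

The function accepts any iterable of lines, so a compressed file can be passed in as a handle opened with `gzip.open(path, "rt")`. The sampled records are not returned in their original file order. For paired-end data, running the function with the same seed on both mate files (which must contain the same number of records) selects the same record indices from each, keeping mates in sync.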
- R >=3.3.0 with packages:
  - From CRAN: tidyverse, data.table, docopt
  - From Bioconductor: scater, scran, GenomicRanges, SC3, pcaMethods
- RStudio
- Python >=3.4 with packages: snakemake
- Trim Galore!, which requires Cutadapt
- Bowtie2
- Bismark
- FastQC
- MultiQC
- MethylQA
Many thanks to Stephen Clark and Ricard Argelaguet for help and advice. Stephen advised on the course aims and structure and directed generation of raw data. Ricard provided advice on analysis and provided data processing scripts and processed datasets for use.