Single Cell 'Omics: Analysis of single-cell methylation data

The final session of the course will cover pre-processing and basic analysis of single-cell bisulfite sequencing data. We will assay two cell types, probably 16 cells in total.

Goals

Two main goals:

1. Methylation profiles define cell type: cells of different types should separate in a dimension-reduction analysis such as PCA.
2. Methylation variance is context specific. For example, in mouse ES cells, CpG islands (CGIs) are homogeneously low in methylation, repeat elements are homogeneously high, and active enhancer elements are heterogeneous. This is interesting because enhancer elements are cell-type specific, so variation in their methylation levels implies plasticity in cell identity, which could be important for lineage formation.

First step

Clone or download this repository so that you have the necessary code, data and materials to hand.

If you're familiar with git:

git clone https://github.com/davismcc/SingleCellOmics_Heidelberg_Apr2017.git

If not, you can download a zip file of the repository by clicking the green "Clone or download" button on the repository's GitHub page.

Outline:

We have two 1.5-hour sessions to work on single-cell methylation. Broadly, we will spend the first session processing the raw sequence files to obtain summarized, annotated methylation results for genomic features of interest. In the second session we will analyze and plot these results to meet the goals above.

We will use Bismark for alignments and methylation calling. For details, see this protocol paper.
QC (also see the protocol paper)
1. Negative controls should not align.
2. Bisulfite conversion efficiency (assessed using CHH methylation from the Bismark reports) should be >95%.
3. Mapping efficiency (from the Bismark reports) should be >10% (30-40% is normal here, but it may end up lower in these practicals).
4. Number of CpG sites covered: I use a threshold of 1M unique positions, but this depends on sequencing depth, so it may be better simply to exclude outliers.
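The numeric QC criteria above can be sketched as a simple per-cell filter. This is a minimal illustration, not the course's actual code: the `CellQC` fields and function names are hypothetical, the metrics are assumed to have been read from the Bismark reports by hand or by another script, and conversion efficiency is assumed to be 100% minus the CHH methylation percentage (treating genuine CHH methylation as negligible). Negative controls are not handled here.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Per-cell QC metrics (hypothetical field names, not Bismark's own)."""
    cell_id: str
    chh_methylation_pct: float     # % methylated cytosines in CHH context
    mapping_efficiency_pct: float  # % of reads aligned
    cpg_sites_covered: int         # number of unique CpG positions covered


def conversion_efficiency(chh_methylation_pct: float) -> float:
    # Assumption: true CHH methylation is negligible, so any CHH
    # methylation observed reflects cytosines that failed to convert.
    return 100.0 - chh_methylation_pct


def passes_qc(cell: CellQC,
              min_conversion: float = 95.0,
              min_mapping: float = 10.0,
              min_cpgs: int = 1_000_000) -> bool:
    """Apply the three numeric thresholds from the QC list."""
    return (
        conversion_efficiency(cell.chh_methylation_pct) > min_conversion
        and cell.mapping_efficiency_pct > min_mapping
        and cell.cpg_sites_covered >= min_cpgs
    )


if __name__ == "__main__":
    cells = [
        CellQC("cell_01", 1.2, 35.0, 1_500_000),
        CellQC("cell_02", 1.0, 6.0, 1_200_000),   # low mapping efficiency
    ]
    for cell in cells:
        print(cell.cell_id, "pass" if passes_qc(cell) else "fail")
```

For shallow sequencing runs, `min_cpgs` would need lowering, or could be replaced by an outlier rule as suggested in item 4.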
Preprocessing and annotation
- Quantify methylation over regions of interest (promoters, gene bodies, enhancers, repeats, CpG islands).
  1. Compute the mean methylation rate, with each covered position counting once (i.e. do not give extra weight to positions with >1 read).
  2. Also record the coverage (the number of CpG sites covered in that cell at that locus), so that each cell can be weighted in downstream analyses.
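The quantification rule above can be illustrated with a short sketch. It assumes per-position calls for one cell are already available as methylated and unmethylated read counts; the `SiteCall` type and `summarize_region` function are hypothetical names for illustration and are not taken from the repository's scripts. Each covered position contributes one per-site rate, so a position with ten reads carries the same weight as a position with one.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class SiteCall(NamedTuple):
    """Methylation call at one CpG position in one cell (hypothetical type)."""
    chrom: str
    pos: int
    met_reads: int
    unmet_reads: int


def summarize_region(sites: Iterable[SiteCall], chrom: str,
                     start: int, end: int) -> Tuple[Optional[float], int]:
    """Return (mean methylation rate, number of covered sites) for a region.

    The region is taken as the closed interval [start, end]. The rate is the
    unweighted mean of per-site rates, so read depth does not add weight.
    """
    rates = [
        s.met_reads / (s.met_reads + s.unmet_reads)
        for s in sites
        if s.chrom == chrom
        and start <= s.pos <= end
        and (s.met_reads + s.unmet_reads) > 0
    ]
    if not rates:
        return None, 0  # region not covered in this cell
    return sum(rates) / len(rates), len(rates)


if __name__ == "__main__":
    calls = [
        SiteCall("chr1", 100, 10, 0),  # deep site, fully methylated
        SiteCall("chr1", 150, 0, 1),   # single read, unmethylated
        SiteCall("chr1", 500, 1, 0),   # outside the region below
    ]
    print(summarize_region(calls, "chr1", 50, 200))
```

In the toy data, a read-weighted estimate would be 10/11, whereas counting each position once gives 0.5 with a coverage of 2 sites.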
Analysis
1. Mean methylation by feature / cell type
2. Variation by feature / cell type
3. Dimension reduction
4. Clustering
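Items 1 and 2 of the analysis can be sketched as a coverage-weighted mean and variance computed across cells for each feature and cell type. This is only one possible weighting scheme and is an assumption here: each cell's rate for a feature is weighted by the number of CpG sites it covered there. The function names and the record layout `(feature, cell_type, rate, n_sites)` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def weighted_mean_var(rates: Sequence[float],
                      weights: Sequence[float]) -> Tuple[float, float]:
    """Weighted mean and (population-style) weighted variance."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mean = sum(w * r for r, w in zip(rates, weights)) / total
    var = sum(w * (r - mean) ** 2 for r, w in zip(rates, weights)) / total
    return mean, var


def summarize_by_group(
    records: Iterable[Tuple[str, str, float, int]]
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Group per-cell records by (feature, cell_type) and summarize each group.

    Each record is (feature, cell_type, rate, n_sites), where n_sites is the
    coverage used as that cell's weight.
    """
    grouped: Dict[Tuple[str, str], Tuple[List[float], List[float]]] = \
        defaultdict(lambda: ([], []))
    for feature, cell_type, rate, n_sites in records:
        rates, weights = grouped[(feature, cell_type)]
        rates.append(rate)
        weights.append(n_sites)
    return {key: weighted_mean_var(r, w) for key, (r, w) in grouped.items()}


if __name__ == "__main__":
    toy = [
        ("enhancer", "typeA", 0.2, 10),
        ("enhancer", "typeA", 0.8, 30),
        ("CGI", "typeA", 0.1, 5),
        ("CGI", "typeA", 0.1, 5),
    ]
    for key, (mean, var) in summarize_by_group(toy).items():
        print(key, round(mean, 4), round(var, 4))
```

In the toy data the enhancer group shows non-zero variance while the CGI group shows none, which is the kind of contrast the second goal is about.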

We will manage the data processing and analysis "pipeline" using snakemake. We will analyze our results in RStudio, using an R Markdown Notebook (see the notebooks folder in this repository for an example).

Data

The aim will be for you to analyze the data you generate during the course in Heidelberg.

However, in case that data is unavailable for any reason, and to provide an alternative dataset that is already processed and ready for analysis, we also have access to a small dataset from Stephen Clark and colleagues at the Babraham Institute, Cambridge. This dataset consists of 15 cells from mouse embryos.

Raw fastq files are available at this link (6GB; password required, which will be shared on the course Slack channel). Download these files and save them to data/fastq only if you want to work from raw fastq files (substantial computation needed) and have a high-bandwidth connection.
Raw fastq files for a smaller, more convenient "test" dataset (500,000 reads sampled from each of the above fastq files) are available at this link (210MB; password required).
Merged Bismark files are available at this link (76MB; password required). Download and copy these to data/bismark/merged.
Summarized, annotated methylation results that we will use for analysis are available in the results folder of this repository (we will generate these ourselves during the course). A precomputed version of this file, results/all.tsv.gz, is available at this link (3.5MB) in case you wish to use it for the second part of the analysis.
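As a quick check that a downloaded results file is intact, the sketch below reads a gzipped tab-separated file and reports its columns. It assumes only what the file name suggests (gzip compression, tab separation) plus a header row, which is an assumption; the column names in the actual results/all.tsv.gz are not specified here, and `read_tsv_gz` is a hypothetical helper. The course analysis itself is done in R.

```python
import csv
import gzip
from typing import Dict, List


def read_tsv_gz(path: str) -> List[Dict[str, str]]:
    """Read a gzipped TSV with a header row into a list of row dicts."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [dict(row) for row in reader]


if __name__ == "__main__":
    import os

    path = "results/all.tsv.gz"  # path as given in this README
    if os.path.exists(path):
        rows = read_tsv_gz(path)
        print(len(rows), "rows")
        if rows:
            print("columns:", list(rows[0].keys()))
    else:
        print(path, "not found; download it first")
```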

Software requirements:

R >=3.3.0 with packages:
- From CRAN: tidyverse, data.table, docopt
- From Bioconductor: scater, scran, GenomicRanges, SC3, pcaMethods
RStudio
Python >=3.4 with packages: snakemake
Trim Galore!, which requires Cutadapt
Bowtie2
Bismark
FastQC
MultiQC
MethylQA

Acknowledgements

Many thanks to Stephen Clark and Ricard Argelaguet for help and advice. Stephen advised on the course aims and structure and directed generation of raw data. Ricard provided advice on analysis and provided data processing scripts and processed datasets for use.

