
Material for the single-cell methylation analysis session at EMBL's Single-Cell 'Omics course in Heidelberg, April 2017.


davismcc/SingleCellOmics_Heidelberg_Apr2017


Single Cell 'Omics: Analysis of single-cell methylation data

The final session of the course will cover pre-processing and basic analysis of single-cell bisulfite sequencing data. We will assay two cell types, probably 16 cells in total.

Goals

Two main goals:

  1. Methylation profiles define cell type (i.e. cells of different types should separate under, e.g., PCA).
  2. Context specificity of methylation variance. E.g. in mouse ES cells, CpG islands (CGIs) are homogeneously low in methylation, repeat elements are homogeneously high, and active enhancer elements are heterogeneous. This is interesting because enhancer elements are cell-type specific, so variation in their methylation levels implies plasticity in cell identity, which could be important for lineage formation.

First step

Clone or download this repository so that you have the necessary code, data and materials to hand.

If you're familiar with git:

git clone https://github.com/davismcc/SingleCellOmics_Heidelberg_Apr2017.git

If not, you can download a zip file of the repository by clicking the green "Clone or download" button above.

Outline:

We have two 1.5-hour sessions to work on single-cell methylation. Broadly, we will spend the first session processing the raw sequence files to get summarized, annotated methylation results for genomic features of interest. In the second session we will analyze and plot these results to address the goals above.

  1. We will use Bismark for alignment and methylation calling. For details, see this protocol paper.
  2. QC (also see protocol paper)
    1. Negative controls should not align
    2. Bisulfite conversion efficiency (assessed using CHH methylation from the Bismark reports) should be >95%
    3. Mapping efficiency (from the Bismark reports) should be >10% (30-40% is normal here, but it may end up lower in these practicals)
    4. Number of CpG sites covered (I use 1M unique positions as a threshold, but this depends on sequencing depth, so it may be enough simply to exclude outliers)
  3. Preprocessing and annotation
    • Quantify methylation over regions of interest (promoters, gene bodies, enhancers, repeats, CpG islands).
      1. mean methylation rate (each covered position counts once – i.e. do not give extra weight to positions with >1 read)
      2. also record the coverage (the number of CpG sites covered in that cell at that locus) so that each cell can be assigned a weight in downstream analyses
  4. Analysis
    1. Mean methylation by feature / cell type
    2. Variation by feature / cell type
    3. Dimension reduction
    4. Clustering
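The QC thresholds and the per-feature quantification in steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration, not the course pipeline: the function names, the dictionary keys and the `(position, methylated_reads, unmethylated_reads)` tuple layout are hypothetical, and conversion efficiency is approximated here as 1 minus the CHH methylation rate, which assumes genuine non-CpG methylation is negligible.

```python
# Thresholds taken from the QC list above; all names are hypothetical.
MIN_CONVERSION = 0.95       # bisulfite conversion efficiency
MIN_MAPPING = 0.10          # mapping efficiency
MIN_CPG_SITES = 1_000_000   # unique CpG positions covered


def passes_qc(cell, min_sites=MIN_CPG_SITES):
    """Return True if a cell meets the three numeric QC thresholds.

    `cell` is a dict with the (hypothetical) keys 'chh_methylation' and
    'mapping_efficiency' as fractions and 'cpg_sites_covered' as a count.
    Assumption: conversion efficiency is estimated as 1 - CHH methylation.
    """
    conversion = 1.0 - cell["chh_methylation"]
    return (
        conversion > MIN_CONVERSION
        and cell["mapping_efficiency"] > MIN_MAPPING
        and cell["cpg_sites_covered"] >= min_sites
    )


def summarise_feature(calls, start, end):
    """Summarise one cell's methylation over the feature [start, end].

    `calls` is an iterable of (position, methylated_reads, unmethylated_reads)
    for a single chromosome. Each covered position contributes exactly one
    per-site rate, so positions with more reads get no extra weight.
    Returns (mean_rate, n_sites_covered); mean_rate is None if nothing is covered.
    """
    site_rates = []
    for position, meth, unmeth in calls:
        total = meth + unmeth
        if total == 0 or not (start <= position <= end):
            continue
        site_rates.append(meth / total)
    if not site_rates:
        return None, 0
    return sum(site_rates) / len(site_rates), len(site_rates)


# Toy example: three covered sites inside the feature, one outside it.
calls = [(100, 3, 0), (150, 0, 1), (200, 1, 1), (900, 5, 0)]
print(summarise_feature(calls, 100, 200))
```

For the toy input this prints `(0.5, 3)`: the per-site rates 1.0, 0.0 and 0.5 are averaged with equal weight, whereas pooling reads would give 4/6. The second value is the coverage recorded for later weighting.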

We will manage the data processing and analysis "pipeline" using snakemake. We will analyze our results in RStudio, using an R Markdown Notebook (see the notebooks folder in this repository for an example.)
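For the mean and variation steps of the analysis, one possible way to use the recorded coverage as per-cell weights is a coverage-weighted mean and variance across cells. The sketch below is an assumption about how such weighting could be done, not the method used in the course notebook, and it is written in Python purely for illustration even though the course analysis is done in R. The function name and inputs are hypothetical.

```python
def weighted_mean_and_variance(rates, weights):
    """Coverage-weighted mean and (population) variance across cells.

    `rates` holds one feature's mean methylation rate per cell and `weights`
    holds the matching coverage (number of CpG sites covered in that cell).
    Assumption: a cell's weight is simply proportional to its coverage.
    """
    if len(rates) != len(weights):
        raise ValueError("rates and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one cell must have non-zero coverage")
    mean = sum(w * r for r, w in zip(rates, weights)) / total
    variance = sum(w * (r - mean) ** 2 for r, w in zip(rates, weights)) / total
    return mean, variance


# Two cells: one unmethylated with 1 site covered, one methylated with 3 sites.
print(weighted_mean_and_variance([0.0, 1.0], [1, 3]))
```

With these weights the better-covered cell dominates, giving a mean of 0.75 rather than the unweighted 0.5. A low variance at CpG islands and a higher one at active enhancers would be the pattern described in the goals.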

Data

The aim will be for you to analyze the data you generate during the course in Heidelberg.

However, in case that data is unavailable for any reason and to have an alternative dataset that is processed and ready for analysis, we also have access to a small dataset from Stephen Clark and colleagues at the Babraham Institute, Cambridge. This dataset consists of 15 cells from mouse embryos.

  1. Raw fastq files are available at this link (6GB; password required, which will be shared on the course Slack channel). Download these and save them to data/fastq only if you want to work from the raw fastq files (which needs substantial computation) and have a high-bandwidth connection.
  2. Raw fastq files for a "test" dataset (sampling 500,000 reads from each of the above fastq files), smaller in size so a little more convenient, are available at this link (210MB; password required).
  3. Merged Bismark files are available at this link (76MB; password required). Download and copy these to data/bismark/merged.
  4. Summarized, annotated methylation results that we will use for analysis are available in the results folder of this repository (we will generate these ourselves during the course). A precomputed version of this file, results/all.tsv.gz, is available at this link (3.5MB) in case you wish to use it for the second part of the analysis.

Software requirements:

Acknowledgements

Many thanks to Stephen Clark and Ricard Argelaguet for help and advice. Stephen advised on the course aims and structure and directed generation of raw data. Ricard provided advice on analysis and provided data processing scripts and processed datasets for use.
