Skip to content

Latest commit

 

History

History
118 lines (76 loc) · 11 KB

README.md

File metadata and controls

118 lines (76 loc) · 11 KB

A collection of resources on analysis of scATAC+scRNA multi-omic data

Table of contents

scATAC analysis

Picking features

Because Tn5 insertions can take place anywhere in the genome, after alignment for each cell barcode we will have a set of fragments covering different genomic positions. To allow comparison between cells we need to define a common set of genomic features. Once a feature set is defined, we count the number of fragments overlapping each feature for each single-cell, obtaining a (quasi-)binary feature x cell count matrix.

Different pipelines/papers use different strategies to define a feature set.

  • Binning the genome into equally sized windows (5-10kb) (e.g. SnapATAC). Using a set of genomic bins is a good way to start exploring your data, however when working with a large genome (e.g. human and mouse) this generates a huge matrix with over 200k features, making many analysis steps much slower. In addition, epigenomics studies highlight that the tipical size of regulatory regions is closer to hundreds rather than thousands of bps.

  • Peak calling: take a pseudo-bulk coverage track of all the cells or a group of cells and identify peaks in this signal. The most used algorithm for peak calling is MACS.

    • Peak calling from CellRanger (big caveat: does peak calling for each sample independently)
    • Cusanovich2018 approach: starts by binning the genome into fixed-size windows and building a bin x cell binary matrix. Bins that overlap ENCODE-defined blacklist regions are filtered out, and the top 20,000 most commonly used bins are retained. Then, the bins-by-cells binary matrix is normalized and rescaled using the term frequency-inverse document frequency (TF-IDF) transformation. Next, singular value decomposition (SVD) is performed to generate a PCs-by-cells LSI score matrix, which is used to cluster cells. Within each cluster, peak calling is performed on the aggregated scATAC-seq profiles
  • Using a annotated set of enhancers/regulatory regions in your genome of interest (frequently seen for studies in Drosophila)

  • Other scATAC-specific methods for feature extraction:

    • Scregseg: uses an HMM with Dirichlet-Multinomial emission probabilities to segment the genome according to distinct relative cross-cell accessibility profiles (paper)
    • [BROCKMAN]: k-mer based accessibility (paper)

Quality control

  • Calculating QC metrics (Signac)(ArchR)

    • TSS enrichment: we expect an enrichment of fragments around transcription start sites
    • Fragment size distribution: we expect to see regular peaks, these correspond to fragments that span a different number of nucleosomes
  • Filtering/refining peaks: this is dependent on the peak calling algorithms, but you might want to filter some peaks out (also to make the size of the dataset more manageable):

    • Remove peaks overlapping with ENCODE blacklist (an R-ready annotation for model organism genomes can be found in Signac)
    • Remove peaks that are accessible in less than n% cells in any given cluster (strategy used in Cusanovish2018)
    • Filter by peak width: peaks that are exactly as wide as the peak-calling width threshold are usually just a few reads from a few cells. On the other side, keeping very wide peaks might introduce artifacts (bigger peak, more reads)
  • Filtering cells:

    • Basic filtering of cells with very low coverage (these often mess up dimensionality reduction if unfiltered)
    • ArchR doublet detection

Dimensionality reduction

Additional resources:

Defining gene-level features

Summarising accessibility over gene promoters/bodies is often useful for data exploration and for qualitative or quantitative comparison with RNA. Different pipelines/papers use different strategies to reduce accessibility signal to a gene x cell matrix.

Motif analysis

  • chromVAR: determine variations in chromatin accessibility across peaks containing a set of TF binding motifs

Data structures for multi-omics

Joint dimensionality reduction

Peak-gene matching

  • DORC analysis (example code)
  • SEAcells - aggregating at metacell level for regulatory interation analysis (code) (paper)

Publicly available multi-omic datasets

(add newest on top)