A pipeline for Oxford Nanopore Technologies single-cell transcriptomics (10x)

CooperStansbury/ont_10x_transcriptomics

ONT 10x Transcriptomics Pipeline

This pipeline processes single-cell transcriptomics data generated using Oxford Nanopore Technologies (ONT) sequencing with 10x Genomics GEM chips.

Overview

The pipeline assumes that the sequencing library was prepared using 10x Genomics technology and is designed to handle the specific requirements of ONT single-cell transcriptomics data.

Cloning the Pipeline Repository

To get started, first clone the pipeline repository from GitHub:

git clone https://github.com/CooperStansbury/ont_10x_transcriptomics.git
cd ont_10x_transcriptomics

This ensures you have the latest version of the pipeline.

Running the Pipeline

To run the pipeline, follow these steps:

  1. Set Up the Environment
    Ensure Conda is installed and set up correctly. The pipeline requires one top-level Conda environment:

    • Top-level environment: envs/workflow-env.yaml

    Create the environment by running:

    conda env create -f envs/workflow-env.yaml

    After installation, activate the top-level environment:

    conda activate workflow-env
  2. Prepare Configuration Files
    Update the config.yaml file with the correct paths to your input data, reference genome, and output directories.

  3. Execute the Pipeline
    Always run Snakemake using the --use-conda flag to ensure proper dependency management:

    snakemake --use-conda --cores <num_cores>

    Replace <num_cores> with the number of CPU cores Snakemake may use.

  4. Dry Run (Optional)
    To verify the workflow without executing commands:

    snakemake --use-conda --configfile config.yaml -n
  5. Cluster Execution (Optional)
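    If running on an HPC system with Slurm, submit jobs using the command below. The --cluster option is available in Snakemake 7 and earlier; Snakemake 8 removed it in favor of executor plugins.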

    snakemake --use-conda --cluster "sbatch --mem={resources.mem_mb}" --jobs 10
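
The dry run and the real run described in the steps above can also be scripted. The following is a minimal sketch, not part of the pipeline: the helper name build_snakemake_cmd is hypothetical, and it only assembles flags that already appear in this README (--use-conda, --cores, --configfile, -n).

```python
import shlex


def build_snakemake_cmd(cores, configfile=None, dry_run=False):
    """Assemble a Snakemake command line as a list of arguments.

    Hypothetical helper for illustration; it only uses flags shown in this README.
    """
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    cmd = ["snakemake", "--use-conda", "--cores", str(cores)]
    if configfile is not None:
        cmd += ["--configfile", configfile]
    if dry_run:
        cmd.append("-n")  # dry run: list the planned jobs without executing them
    return cmd


# Print the two commands in the order they would be run: dry run first.
for dry in (True, False):
    print(shlex.join(build_snakemake_cmd(4, "config.yaml", dry_run=dry)))
```

To execute a command rather than print it, the returned list can be passed to subprocess.run(cmd, check=True) from the repository root with workflow-env activated.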

Input Requirements

  • Raw FASTQ Files: Path specified in config.yaml and organized in a fastq_paths.txt file as described in the config file's readme.
  • Reference Genome: FASTA and GTF files for alignment and annotation, specified in config.yaml.
  • Configuration File: config.yaml contains parameters for alignment, filtering, and output directories.
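
Since a mistyped path can make a run fail, it can help to check the FASTQ list before launching Snakemake. The sketch below assumes, purely for illustration, that fastq_paths.txt is a delimited text file with one FASTQ path per row in a known column. The actual layout is defined in the config file's readme, so adjust delimiter and path_column to match it. The function name missing_fastq_paths is hypothetical.

```python
import csv
from pathlib import Path


def missing_fastq_paths(listing, delimiter=",", path_column=-1, skip_header=False):
    """Return the paths named in `listing` that do not exist on disk.

    ASSUMPTION: `listing` is a delimited text file with one FASTQ path per row
    in column `path_column` (default: the last column). This layout is a guess,
    not the format documented by the pipeline.
    """
    missing = []
    with open(listing, newline="") as handle:
        rows = csv.reader(handle, delimiter=delimiter)
        if skip_header:
            next(rows, None)
        for row in rows:
            if not row or not row[path_column].strip():
                continue  # skip blank lines and rows with an empty path field
            path = Path(row[path_column].strip())
            if not path.is_file():
                missing.append(str(path))
    return missing
```

Calling missing_fastq_paths("fastq_paths.txt") returns an empty list when every listed file exists; any entries it does return are worth fixing before starting the run.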

Output

The pipeline generates a comprehensive set of outputs organized into the following directories:

  • anndata: Contains the final annotated data matrix in h5ad format, ready for downstream analysis. This includes a compiled AnnData object combining data from all chromosomes.
  • config: Copies of the configuration files used in the pipeline, ensuring reproducibility.
  • counts: Contains chromosome-specific count matrices in h5ad format, generated by htseq-count.
  • demultiplex: Contains demultiplexing results. These files may be absent if demultiplexing was not performed:
    • .matched_reads.fastq.gz: FASTQ files containing reads that passed demultiplexing.
    • .summary.txt: Summary statistics of the demultiplexing process.
    • .knee_plot.png: Knee plot visualizing the distribution of barcodes.
  • fastq: Contains the initial, unprocessed FASTQ files.
  • logs: Contains detailed log files for various pipeline steps, including demultiplexing and mapping, crucial for monitoring and troubleshooting.
    • demultiplex: Demultiplexing-specific logs.
    • mapping: Mapping-specific logs.
  • mapping: Contains alignment files:
    • BAM files for each sample after alignment and tagging.
    • .records.csv: CSV files detailing barcode, UMI, and read name information for each alignment.
    • by_chrom: Chromosome-specific BAM files (sorted and indexed) to facilitate per-chromosome analysis.
  • references: Contains the indexed reference genome (.mmi) and other reference files used for mapping and quantification.
    • by_chrom: Chromosome-specific GTF files used by htseq-count for accurate gene quantification.
  • reports: Contains quality control reports at various stages:
    • alignment: Alignment quality metrics and summaries.
    • nanoqc: Quality control reports for raw FASTQ data generated by NanoQC, organized by sample (e.g., test1, test2).
    • nanostat: Summary statistics of raw FASTQ data generated by NanoStat.
    • seqkit_stats: Summary statistics of raw and potentially demultiplexed FASTQ files, generated by seqkit.
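
After a run finishes, a quick check that the expected top-level directories exist can catch an incomplete run early. The sketch below takes its directory names from the list above. The function name summarize_outputs is hypothetical, and the output root is passed as an argument because the actual location is whatever config.yaml specifies.

```python
from pathlib import Path

# Top-level output directories described in this README.
EXPECTED_DIRS = (
    "anndata", "config", "counts", "demultiplex", "fastq",
    "logs", "mapping", "references", "reports",
)


def summarize_outputs(output_root):
    """Map each expected directory name to its file count, or None if absent."""
    root = Path(output_root)
    summary = {}
    for name in EXPECTED_DIRS:
        directory = root / name
        if directory.is_dir():
            # Count regular files at any depth below the directory.
            summary[name] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            summary[name] = None
    return summary
```

A value of None marks a directory that was never created. A count of 0 for demultiplex is not necessarily a problem, since those files may be absent when demultiplexing was not performed.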

Troubleshooting

  • Ensure Conda is installed and environments are set up correctly.
  • Verify input file paths: Ensure all paths in config.yaml and fastq_paths.txt are correct.
  • Check for missing dependencies: Run snakemake --use-conda --conda-create-envs-only to create the Conda environments without executing any jobs.
  • Review logs: Check the output logs in the logs directory for error messages and troubleshooting hints.
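
Reviewing logs by hand gets tedious when many samples are processed. A minimal sketch for surfacing likely failures follows. It assumes that logs are plain-text files somewhere under the logs directory and that problems are flagged with words such as "error" or "Traceback"; neither is guaranteed for every tool the pipeline calls. The function name find_log_errors is hypothetical.

```python
import re
from pathlib import Path

# ASSUMPTION: these keywords mark problems; tools may report failures differently.
PATTERN = re.compile(r"error|exception|traceback", re.IGNORECASE)


def find_log_errors(log_dir):
    """Return (file, line_number, text) for each log line matching PATTERN."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if PATTERN.search(line):
                    hits.append((str(path), number, line.rstrip()))
    return hits
```

Each hit names the file and line number, so the matching log can be opened directly at the point of failure. An empty result does not prove the run succeeded; it only means none of the assumed keywords appeared.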

For further details and support, refer to the official Snakemake documentation:
https://snakemake.readthedocs.io
