Skip to content

parklab/HiScanner

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HiScanner (HIgh-resolution Single-Cell Allelic copy Number callER)

PyPI version License: MIT

HiScanner is a python package for high-resolution single-cell copy number analysis. It supports two modes of operation:

  1. Standard Pipeline (RDR + BAF): Full analysis using both read depth ratios (RDR) and B-allele frequencies (BAF)
  2. RDR-only Pipeline: Simplified analysis using only read depth ratios

We provide a demo dataset and tutorial to help you get started. After installation, see https://github.com/parklab/hiscanner_demo for instructions.

Table of Contents

Installation

# Create new conda environment with all dependencies
conda create -n hiscanner_test python=3.8
conda activate hiscanner_test
pip install hiscanner --no-cache-dir

Install R and required packages:

conda install -c conda-forge r-base  
conda install -c bioconda r-mgcv>=1.8

Install other dependencies:

conda install -c bioconda snakemake samtools bcftools

We tested with snakemake==7.32.4, samtools==1.15.1, bcftools==1.13.

Required External Files:

1) Mappability Track

hg19/GRCh37 (100mer)
wget https://www.math.pku.edu.cn/teachers/xirb/downloads/software/BICseq2/Mappability/hg19CRG.100bp.tar.gz --no-check-certificate
tar -xvzf hg19CRG.100bp.tar.gz
hg38/GRCh38 (150mer)

Download from: https://doi.org/10.6084/m9.figshare.28370357.v1

Other reference genomes: instructions on custom track generation

For other genomes/read length configurations, follow instructions at CGAP Annotations to generate mappability tracks.

IMPORTANT: Guidelines when choosing mappability tracks
  • Shorter mappability track (e.g., 100bp) with longer reads (e.g., 150bp): Valid but conservative (some uniquely mappable regions may be missed)
  • Longer mappability track (e.g., 150bp) with shorter reads (e.g., 100bp): Not valid, will cause false positives

2) Reference Genome

We require the reference genome fasta to be split into chromosomes, to allow for parallel processing. You can use the following command to split the reference genome:

samtools faidx /path/to/reference.fasta
mkdir /path/to/reference/split
awk '{print $1}' /path/to/reference.fasta.fai | xargs -I {} samtools faidx /path/to/reference.fasta {} > /path/to/reference/split/{}.fasta

Pipeline Options

HiScanner offers two pipeline options:

1. Standard Pipeline (RDR + BAF)

Uses both read depth ratios and B-allele frequencies for comprehensive CNV analysis.

Steps:

  1. SNP Calling (via SCAN2)
  2. Phasing & BAF Computation
  3. ADO Analysis
  4. Segmentation
  5. CNV Calling

Requirements:

  • SCAN2 output files
  • Bulk and single cell BAM files
  • Reference genome
  • Mappability tracks

2. RDR-only Pipeline

Uses only read depth ratios for simplified CNV analysis.

Steps:

  1. Segmentation
  2. CNV Calling

Requirements:

  • Single cell BAM files
  • Reference genome
  • Mappability tracks

Running the Pipeline

Option 1: Standard Pipeline (RDR + BAF)

Step 1: SCAN2 (Prerequisites)

SCAN2 needs to be run separately before using HiScanner.

Note: If you only need SCAN2 output for HiScanner (and are not interested in SNV calls), you can save time by running SCAN2 with:

scan2 run --joblimit 5000 --snakemake-args ' --until shapeit/phased_hets.vcf.gz --latency-wait 120'

This will stop at the phasing step, which is sufficient for HiScanner's requirements.

If you have already run SCAN2, ensure you have:

  • VCF file with raw variants (gatk/hc_raw.mmq60.vcf.gz)

  • Phased heterozygous variants (shapeit/phased_hets.vcf)

  • Additionally, we note that the phased genotype field in phased_hets.vcf should be named as phasedgt. This is the expected output from the SCAN2 pipeline that we have tested with. If your VCF file has a different field name, please manually rename it to phasedgt in the VCF.

The expected location is scan2_out/ in your project directory.

Steps 2-5: HiScanner Pipeline

1. Initialize HiScanner project
hiscanner init --output ./my_project
cd my_project
2. Edit config.yaml

Edit with your paths and parameters

3. Prepare metadata file

Must contain the following columns:

bamID    bam    singlecell
bulk1    /path/to/bulk.bam    N
cell1    /path/to/cell1.bam   Y 
cell2    /path/to/cell2.bam   Y
4. Validate configuration
hiscanner validate
5. Run the pipeline
hiscanner run --step snp      # Check SCAN2 results
hiscanner run --step phase    # Process SCAN2 results
hiscanner run --step ado      # ADO analysis to identify optimal bin size
hiscanner run --step normalize # Normalize read depth ratios
hiscanner run --step segment  # Segmentation
hiscanner run --step cnv      # CNV calling

# Or run all steps at once:
hiscanner run --step all

For normalize (the most time-consuming step), we provide an option to run with cluster, e.g.,

hiscanner --config config.yaml run --step normalize --use-cluster

Option 2: RDR-only Pipeline

1. Initialize project
hiscanner init --output ./my_project
cd my_project
2. Edit config.yaml
  • Set rdr_only: true
  • Configure paths and parameters
3. Prepare metadata file

Must contain the following columns (bulk samples are not required):

bamID    bam    singlecell
cell1    /path/to/cell1.bam   Y 
cell2    /path/to/cell2.bam   Y
4. Validate configuration
hiscanner validate
5. Run the pipeline
hiscanner run --step normalize # Normalize read depth ratios
hiscanner run --step segment  # Segmentation
hiscanner run --step cnv      # CNV calling

Output Structure

hiscanner_output/
├── phased_hets/       # Processed heterozygous SNPs (Standard pipeline only)
├── ado/               # ADO analysis results (Standard pipeline only)
├── bins/             # Binned read depth
├── segs/             # Segmentation results
└── final_calls/      # Final CNV calls

Cleaning Up

HiScanner creates several temporary directories during analysis. You can clean these up using the clean command:

hiscanner clean

Troubleshooting

Common issues:

  1. Missing SCAN2 results: Ensure scan2_output directory is correctly specified
  2. File permissions: Check access to BAM files and reference data
  3. Memory issues: Adjust batch_size in config.yaml

For more detailed information, check the log files in hiscanner_output/logs/

Support

HiScanner is currently under active development. For support or questions, please open an issue on our GitHub repository.

Citation

If you use HiScanner in your research, please cite:

Zhao, Y., Luquette, L. J., Veit, A. D., Wang, X., Xi, R., Viswanadham, V. V., ... & Park, P. J. (2024). High-resolution detection of copy number alterations in single cells with HiScanner. bioRxiv, 2024-04.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published