HiScanner is a python package for high-resolution single-cell copy number analysis. It supports two modes of operation:
- Standard Pipeline (RDR + BAF): Full analysis using both read depth ratios (RDR) and B-allele frequencies (BAF)
- RDR-only Pipeline: Simplified analysis using only read depth ratios
We provide a demo dataset and tutorial to help you get started. After installation, see https://github.com/parklab/hiscanner_demo for instructions.
- Installation
- Required External Files
- Pipeline Options
- Running the Pipeline
- Output Structure
- Troubleshooting
- Citation
# Create new conda environment with all dependencies
conda create -n hiscanner_test python=3.8
conda activate hiscanner_test
pip install hiscanner --no-cache-dir
Install R and required packages:
conda install -c conda-forge r-base
conda install -c bioconda r-mgcv>=1.8
Install other dependencies:
conda install -c bioconda snakemake samtools bcftools
We tested with snakemake==7.32.4, samtools==1.15.1, bcftools==1.13.
hg19/GRCh37 (100mer)
wget https://www.math.pku.edu.cn/teachers/xirb/downloads/software/BICseq2/Mappability/hg19CRG.100bp.tar.gz --no-check-certificate
tar -xvzf hg19CRG.100bp.tar.gz
hg38/GRCh38 (150mer)
Download from: https://doi.org/10.6084/m9.figshare.28370357.v1
Other reference genomes: instructions on custom track generation
For other genomes/read length configurations, follow instructions at CGAP Annotations to generate mappability tracks.
IMPORTANT: Guidelines when choosing mappability tracks
- Shorter mappability track (e.g., 100bp) with longer reads (e.g., 150bp): Valid but conservative (some uniquely mappable regions may be missed)
- Longer mappability track (e.g., 150bp) with shorter reads (e.g., 100bp): Not valid, will cause false positives
We require the reference genome fasta to be split into chromosomes, to allow for parallel processing. You can use the following command to split the reference genome:
samtools faidx /path/to/reference.fasta
mkdir /path/to/reference/split
awk '{print $1}' /path/to/reference.fasta.fai | xargs -I {} samtools faidx /path/to/reference.fasta {} > /path/to/reference/split/{}.fasta
HiScanner offers two pipeline options:
Uses both read depth ratios and B-allele frequencies for comprehensive CNV analysis.
Steps:
- SNP Calling (via SCAN2)
- Phasing & BAF Computation
- ADO Analysis
- Segmentation
- CNV Calling
Requirements:
- SCAN2 output files
- Bulk and single cell BAM files
- Reference genome
- Mappability tracks
Uses only read depth ratios for simplified CNV analysis.
Steps:
- Segmentation
- CNV Calling
Requirements:
- Single cell BAM files
- Reference genome
- Mappability tracks
SCAN2 needs to be run separately before using HiScanner.
Note: If you only need SCAN2 output for HiScanner (and are not interested in SNV calls), you can save time by running SCAN2 with:
scan2 run --joblimit 5000 --snakemake-args ' --until shapeit/phased_hets.vcf.gz --latency-wait 120'
This will stop at the phasing step, which is sufficient for HiScanner's requirements.
If you have already run SCAN2, ensure you have:
-
VCF file with raw variants (
gatk/hc_raw.mmq60.vcf.gz
) -
Phased heterozygous variants (
shapeit/phased_hets.vcf
) -
Additionally, we note that the phased genotype field in
phased_hets.vcf
should be named asphasedgt
. This is the expected output from the SCAN2 pipeline that we have tested with. If your VCF file has a different field name, please manually rename it tophasedgt
in the VCF.
The expected location is scan2_out/
in your project directory.
1. Initialize HiScanner project
hiscanner init --output ./my_project
cd my_project
2. Edit config.yaml
Edit with your paths and parameters
3. Prepare metadata file
Must contain the following columns:
bamID bam singlecell
bulk1 /path/to/bulk.bam N
cell1 /path/to/cell1.bam Y
cell2 /path/to/cell2.bam Y
4. Validate configuration
hiscanner validate
5. Run the pipeline
hiscanner run --step snp # Check SCAN2 results
hiscanner run --step phase # Process SCAN2 results
hiscanner run --step ado # ADO analysis to identify optimal bin size
hiscanner run --step normalize # Normalize read depth ratios
hiscanner run --step segment # Segmentation
hiscanner run --step cnv # CNV calling
# Or run all steps at once:
hiscanner run --step all
For normalize
(the most time-consuming step), we provide an option to run with cluster, e.g.,
hiscanner --config config.yaml run --step normalize --use-cluster
1. Initialize project
hiscanner init --output ./my_project
cd my_project
2. Edit config.yaml
- Set
rdr_only: true
- Configure paths and parameters
3. Prepare metadata file
Must contain the following columns (bulk samples are not required):
bamID bam singlecell
cell1 /path/to/cell1.bam Y
cell2 /path/to/cell2.bam Y
4. Validate configuration
hiscanner validate
5. Run the pipeline
hiscanner run --step normalize # Normalize read depth ratios
hiscanner run --step segment # Segmentation
hiscanner run --step cnv # CNV calling
hiscanner_output/
├── phased_hets/ # Processed heterozygous SNPs (Standard pipeline only)
├── ado/ # ADO analysis results (Standard pipeline only)
├── bins/ # Binned read depth
├── segs/ # Segmentation results
└── final_calls/ # Final CNV calls
HiScanner creates several temporary directories during analysis. You can clean these up using the clean command:
hiscanner clean
Common issues:
- Missing SCAN2 results: Ensure scan2_output directory is correctly specified
- File permissions: Check access to BAM files and reference data
- Memory issues: Adjust batch_size in config.yaml
For more detailed information, check the log files in hiscanner_output/logs/
HiScanner is currently under active development. For support or questions, please open an issue on our GitHub repository.
If you use HiScanner in your research, please cite:
Zhao, Y., Luquette, L. J., Veit, A. D., Wang, X., Xi, R., Viswanadham, V. V., ... & Park, P. J. (2024). High-resolution detection of copy number alterations in single cells with HiScanner. bioRxiv, 2024-04.