CalicoST is a probabilistic model that infers allele-specific copy number aberrations and tumor phylogeography from spatially resolved transcriptomics.
CalicoST has the following key features:
- Identifies allele-specific integer copy numbers for each transcribed region, revealing events such as copy neutral loss of heterozygosity (CNLOH) and mirrored subclonal CNAs that are invisible to total copy number analysis.
- Assigns each spot a clone label indicating whether the spot consists primarily of normal cells or of a cancer clone with an aberrant copy number profile.
- Infers a phylogeny relating the identified cancer clones as well as a phylogeography that combines genetic evolution and spatial dissemination of clones.
- Handles normal cell admixture in SRT technologies that are not single-cell resolution (e.g. 10x Genomics Visium) to infer more accurate allele-specific copy numbers and cancer clones.
- Simultaneously analyzes multiple regional or aligned SRT slices from the same tumor.
The package has been tested on the following Linux operating systems: SpringdaleOpenEnterprise 9.2 (Parma) and CentOS Linux 7 (Core).
First, set up a conda environment from the environment.yml file:
cd CalicoST
conda config --add channels conda-forge
conda config --add channels bioconda
conda env create -f environment.yml --name calicost_env
Next download Eagle2 by
wget https://storage.googleapis.com/broad-alkesgroup-public/Eagle/downloads/Eagle_v2.4.1.tar.gz
tar -xzf Eagle_v2.4.1.tar.gz
Then install Startle by
git clone --recurse-submodules https://github.com/raphael-group/startle.git
cd startle
mkdir build; cd build
cmake -DLIBLEMON_ROOT=<lemon path> \
    -DCPLEX_INC_DIR=<cplex include path> \
    -DCPLEX_LIB_DIR=<cplex lib path> \
    -DCONCERT_INC_DIR=<concert include path> \
    -DCONCERT_LIB_DIR=<concert lib path> \
    ..
make
Finally, install CalicoST using pip by
conda activate calicost_env
pip install -e .
Setting up the conda environments takes around 10 minutes on an HPC head node.
CalicoST requires the coordinate information of genes and SNPs. The information files for the GRCh38 genome are available from either of the example data tarballs. Specify the information file paths, your input SRT data paths, and running configurations in config.yaml, and then run CalicoST by
snakemake --cores <number threads> --configfile config.yaml --snakefile calicost.smk all
Check out our readthedocs for tutorials on the simulated data and prostate cancer data.
The simulated count matrices are available from examples/CalicoST_example.tar.gz.
CalicoST requires a reference SNP panel and phasing panel, which can be downloaded from
- SNP panel. You can also choose other SNP panels from the cellsnp-lite webpage.
- Phasing panel
Untar the downloaded example data. Replace the following paths in the example_config.yaml of the downloaded example data with paths on your machine:
- calicost_dir: the path to CalicoST git-cloned code.
- eagledir: the path to Eagle2 directory
- region_vcf: the path to the downloaded SNP panel.
- phasing_panel: the path to the downloaded and unzipped phasing panel.
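Before launching a long run, it can help to confirm that the four paths above actually exist. The following is a minimal sketch, not part of CalicoST: it assumes the config values have already been loaded into a flat Python dict, and the helper name `find_missing_paths` and the placeholder paths are hypothetical.

```python
import os

# Keys taken from the list above. Treating the config as a flat dict is an
# assumption of this sketch, not a description of how CalicoST reads it.
REQUIRED_PATH_KEYS = ("calicost_dir", "eagledir", "region_vcf", "phasing_panel")


def find_missing_paths(config, keys=REQUIRED_PATH_KEYS):
    """Return the keys whose value is absent, empty, or not present on disk."""
    missing = []
    for key in keys:
        value = config.get(key)
        if not value or not os.path.exists(value):
            missing.append(key)
    return missing


if __name__ == "__main__":
    # Hypothetical placeholder paths; replace with the paths on your machine.
    example_config = {
        "calicost_dir": "/path/to/CalicoST",
        "eagledir": "/path/to/Eagle_v2.4.1",
        "region_vcf": "/path/to/snp_panel.vcf.gz",
        "phasing_panel": "/path/to/phasing_panel",
    }
    print("Missing or unset paths:", find_missing_paths(example_config))
```

An empty list means every listed path was found; any returned key should be corrected in example_config.yaml before running snakemake.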
To avoid falling into local maxima of CalicoST's optimization objective, we recommend running CalicoST with multiple random initializations, using the list of random seeds specified by random_state in the example_config.yaml file. The provided file uses five random initializations.
Then run CalicoST by
cd <directory of downloaded example data>
snakemake --cores 5 --configfile example_config.yaml --snakefile <calicost_dir>/calicost.smk all
CalicoST takes about 69 minutes to finish on this example using 5 cores on an HPC.
The above snakemake run will create a folder calicost in the directory of downloaded example data. Within this folder, each random initialization of CalicoST generates a subdirectory calicost/clone*.
CalicoST generates the following key files for each random initialization:
- clone_labels.tsv: The inferred clone labels for each spot.
- cnv_seglevel.tsv: Allele-specific copy numbers for each clone for each genome segment.
- cnv_genelevel.tsv: The projected allele-specific copy numbers from genome segments to the covered genes.
- cnv_diploid_seglevel.tsv, cnv_triploid_seglevel.tsv, cnv_tetraploid_seglevel.tsv, cnv_diploid_genelevel.tsv, cnv_triploid_genelevel.tsv, cnv_tetraploid_genelevel.tsv: Allele-specific copy numbers when enforcing a ploidy for each genome segment or each gene.
See the following examples of the key files.
head -10 calicost/clone3_rectangle0_w1.0/clone_labels.tsv
BARCODES clone_label
spot_0 2
spot_1 2
spot_2 2
spot_3 2
spot_4 2
spot_5 2
spot_6 2
spot_7 2
spot_8 0
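As a quick sanity check on clone_labels.tsv, the number of spots assigned to each clone can be tallied with the standard library. This is a minimal sketch that assumes the file is tab-delimited with the header shown above; the function name `count_spots_per_clone` is hypothetical and not part of CalicoST.

```python
import csv
import io
from collections import Counter


def count_spots_per_clone(handle):
    """Count spots per clone label in a clone_labels.tsv-style text stream.

    Assumes tab-delimited input with a 'clone_label' column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["clone_label"] for row in reader)


if __name__ == "__main__":
    # The nine rows shown in the example output above, rebuilt as TSV text.
    rows = [("spot_%d" % i, "2") for i in range(8)] + [("spot_8", "0")]
    sample = "BARCODES\tclone_label\n" + "".join(
        "%s\t%s\n" % (barcode, label) for barcode, label in rows
    )
    for label, n_spots in sorted(count_spots_per_clone(io.StringIO(sample)).items()):
        print("clone %s: %d spots" % (label, n_spots))
```

To use it on real output, pass an open file instead of the in-memory sample, for example `open("calicost/clone3_rectangle0_w1.0/clone_labels.tsv", newline="")`.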
head -10 calicost/clone3_rectangle0_w1.0/cnv_seglevel.tsv
CHR START END clone0 A clone0 B clone1 A clone1 B clone2 A clone2 B
1 1001138 1616548 1 1 1 1 1 1
1 1635227 2384877 1 1 1 1 1 1
1 2391775 6101016 1 1 1 1 1 1
1 6185020 6653223 1 1 1 1 1 1
1 6785454 7780639 1 1 1 1 1 1
1 7784320 8020748 1 1 1 1 1 1
1 8026738 9271273 1 1 1 1 1 1
1 9292894 10375267 1 1 1 1 1 1
1 10398592 11922488 1 1 1 1 1 1
head -10 calicost/clone3_rectangle0_w1.0/cnv_genelevel.tsv
gene clone0 A clone0 B clone1 A clone1 B clone2 A clone2 B
A1BG 1 1 1 1 1 1
A1CF 1 1 1 1 1 1
A2M 1 1 1 1 1 1
A2ML1-AS1 1 1 1 1 1 1
AACS 1 1 1 1 1 1
AADAC 1 1 1 1 1 1
AADACL2-AS1 1 1 1 1 1 1
AAK1 1 1 1 1 1 1
AAMP 1 1 1 1 1 1
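The segment-level table can be scanned for segments where a clone departs from the normal (1, 1) state. The sketch below is illustrative only: it assumes a tab-delimited file with the column names shown above and integer copy numbers, and it uses a coarse labelling relative to a diploid baseline that is a simplification of this example, not CalicoST's own terminology. The names `classify_state` and `altered_segments` are hypothetical.

```python
import csv
import io


def classify_state(a, b):
    """Coarsely label an allele-specific state (a, b) against a diploid baseline."""
    total = a + b
    if a == 1 and b == 1:
        return "neutral"
    if total == 2:
        # (2, 0) or (0, 2): total copy number unchanged, one allele lost.
        return "CNLOH"
    if total < 2:
        return "loss"
    return "gain"


def altered_segments(handle, clone):
    """Return (CHR, START, END, label) for non-neutral segments of one clone."""
    reader = csv.DictReader(handle, delimiter="\t")
    altered = []
    for row in reader:
        a = int(row[clone + " A"])
        b = int(row[clone + " B"])
        label = classify_state(a, b)
        if label != "neutral":
            altered.append((row["CHR"], int(row["START"]), int(row["END"]), label))
    return altered


if __name__ == "__main__":
    header = "CHR\tSTART\tEND\tclone0 A\tclone0 B\tclone1 A\tclone1 B\n"
    # First row is from the example above; the second row is made up
    # for illustration and does not come from real CalicoST output.
    sample = header + "1\t1001138\t1616548\t1\t1\t1\t1\n" + "1\t1635227\t2384877\t1\t1\t2\t0\n"
    print(altered_segments(io.StringIO(sample), "clone1"))
```

A CNLOH segment keeps a total copy number of 2, which is why it would not show up in an analysis of total copy number alone.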
CalicoST generates the following plots for visualizing the inferred cancer clones in space and the allele-specific copy number profiles for each random initialization.
- plots/clone_spatial.pdf: The spatial distribution of inferred cancer clones and normal regions (grey color, clone 0 by default).
- plots/rdr_baf_defaultcolor.pdf: The read depth ratio (RDR) and B allele frequency (BAF) along the genome for each clone. Higher RDR indicates higher total copy numbers, and a deviation-from-0.5 BAF indicates allele imbalance due to allele-specific CNAs.
- plots/acn_genome.pdf: The default allele-specific copy numbers along the genome.
- plots/acn_genome_diploid.pdf, plots/acn_genome_triploid.pdf, plots/acn_genome_tetraploid.pdf: Allele-specific copy numbers when enforcing a ploidy.
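To build intuition for how RDR and BAF relate to allele-specific copy numbers, the following sketch computes idealized expected values for a spot that mixes tumor cells in state (A, B) with normal cells in state (1, 1). It is a simplified illustration and not CalicoST's model: it assumes read counts are directly proportional to copy number and measures RDR against a diploid baseline. The function names and the `tumor_fraction` parameter are hypothetical.

```python
def mixed_total(a, b, tumor_fraction=1.0):
    """Average total copy number of a tumor (a, b) and normal (1, 1) mixture."""
    return tumor_fraction * (a + b) + (1.0 - tumor_fraction) * 2.0


def expected_baf(a, b, tumor_fraction=1.0):
    """Idealized B allele frequency; NaN if no copies remain to be read."""
    total = mixed_total(a, b, tumor_fraction)
    if total == 0:
        return float("nan")
    b_copies = tumor_fraction * b + (1.0 - tumor_fraction) * 1.0
    return b_copies / total


def expected_rdr(a, b, tumor_fraction=1.0):
    """Idealized read depth ratio relative to a diploid (total = 2) baseline."""
    return mixed_total(a, b, tumor_fraction) / 2.0


if __name__ == "__main__":
    for state in [(1, 1), (2, 0), (2, 1), (1, 0)]:
        for fraction in (1.0, 0.5):
            print(
                state,
                "tumor_fraction=%.1f" % fraction,
                "RDR=%.3f" % expected_rdr(*state, tumor_fraction=fraction),
                "BAF=%.3f" % expected_baf(*state, tumor_fraction=fraction),
            )
```

Under these assumptions a CNLOH state (2, 0) has the same RDR as a normal segment but a BAF far from 0.5, and normal cell admixture pulls both RDR and BAF back toward the normal values.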
The allele-specific copy number plots have the following color legend.
CalicoST uses the following command-line tools and Python packages to extract the BAF information:
- samtools
- cellsnp-lite
- Eagle2
- pysam
- snakemake
CalicoST uses the following packages for the remaining steps to infer allele-specific copy numbers and cancer clones:
- numpy
- scipy
- pandas
- scikit-learn
- scanpy
- anndata
- numba
- tqdm
- statsmodels
- networkx
- matplotlib
- seaborn
- snakemake