Pipeline to predict HiC matrices from ATAC-seq fastq files using maxATAC and C.Origami
If you are working on Ultraviolet (aka BigPurple), you first need to set up an environment, which means installing snakemake and mamba (so a conda within a conda).
module load anaconda3/gpu
conda create -n snakemake -c conda-forge -c bioconda mamba snakemake
To configure Snakemake to work on Ultraviolet, edit the ultraviolet.yaml that comes with this repo to your liking and copy it to your .config directory as shown.
mkdir -p ~/.config/snakemake/ultraviolet/
cp ultraviolet.yaml ~/.config/snakemake/ultraviolet/config.yaml
To run the pipeline you need to:
- create a samplesheet with information about your samples
- set up a directory with the data needed to run C.Origami
- edit the config/config.yaml file to specify the paths to the relevant directories
The samplesheet should have information about sample and replicate names and the path to the fastq files.
See the config/sample_meta.csv that comes with this directory for an example.
Alternatively, you can specify a column called Run with the SRR IDs of the samples and the pipeline should automatically download them for you (NOT TESTED).
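Since the samplesheet points at fastq files on disk, it can help to confirm that those paths exist before launching anything. The following is a minimal sketch, not part of the pipeline: `check_samplesheet_paths` is a hypothetical helper, and the column name you pass must match whatever header config/sample_meta.csv actually uses. It assumes a plain comma-separated file without quoted commas.

```shell
# Hypothetical helper: report fastq paths listed in a samplesheet that do not exist.
# Usage: check_samplesheet_paths <samplesheet.csv> <column_name>
check_samplesheet_paths() {
    local sheet="$1" column="$2"
    local idx paths path missing=0

    # Find the 1-based position of the requested column in the header row.
    idx=$(head -n 1 "$sheet" | tr -d '\r' | tr ',' '\n' \
        | grep -n -x -F -- "$column" | cut -d: -f1 | head -n 1) || true
    if [ -z "$idx" ]; then
        echo "column not found: $column" >&2
        return 2
    fi

    # Extract that column from every data row (assumes no quoted commas).
    paths=$(tail -n +2 "$sheet" | tr -d '\r' | cut -d, -f"$idx")

    while IFS= read -r path; do
        [ -z "$path" ] && continue
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done <<EOF
$paths
EOF
    return "$missing"
}
```

The function returns 0 when every listed file exists, 1 when at least one is missing, and 2 when the column is not in the header.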
The C.Origami directory should look like this:
<corigami_base>/
├── data
│   ├── <genome>
│   │   ├── centrotelo.bed
│   │   └── dna_sequence
│   │       ├── chr10.fa.gz
│   │       ├── chr11.fa.gz
│   │       ├── ...
│   │       ├── chrX.fa.gz
│   │       └── chrY.fa.gz
│   └── <genome>_tiles.bed
└── model_weights
    └── <corigami_model>.ckpt
Where corigami_base, genome and corigami_model are specified in config/config.yaml.
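The expected layout can be checked with a few file tests before running the pipeline. This is a minimal sketch under stated assumptions: `check_corigami_layout` is a hypothetical helper that is not shipped with this repo, and it assumes that corigami_model is given without the `.ckpt` extension, as the tree suggests. Adjust it if your config/config.yaml stores the full file name.

```shell
# Hypothetical helper: verify that a C.Origami base directory has the expected files.
# Usage: check_corigami_layout <corigami_base> <genome> <corigami_model>
check_corigami_layout() {
    local base="$1" genome="$2" model="$3"
    local missing=0 path

    for path in \
        "$base/data/$genome/centrotelo.bed" \
        "$base/data/$genome/dna_sequence" \
        "$base/data/${genome}_tiles.bed" \
        "$base/model_weights/${model}.ckpt"; do
        # -e follows symlinks, so symlinked data or checkpoints pass the check.
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_corigami_layout /path/to/corigami_base hg38 my_model` prints every missing entry to stderr and returns a non-zero status if any are absent.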
The <corigami_base>/data directory can be:
- Downloaded from here (you will need to untar it)
- If you work from within Ultraviolet, symlinked from here:
/gpfs/data/tsirigoslab/home/jt3545/hic_prediction/C.Origami-release/corigami_data/
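On Ultraviolet, the symlink option might look like the sketch below. `link_corigami_data` is a hypothetical helper; its default source is the shared path quoted above, and the optional second argument lets you point it at a different copy.

```shell
# Hypothetical helper: symlink the shared C.Origami data into <corigami_base>/data.
# Usage: link_corigami_data <corigami_base> [source_dir]
link_corigami_data() {
    local base="$1"
    local src="${2:-/gpfs/data/tsirigoslab/home/jt3545/hic_prediction/C.Origami-release/corigami_data}"

    mkdir -p "$base"
    # Refuse to touch a real directory that is already in place.
    if [ -e "$base/data" ] && [ ! -L "$base/data" ]; then
        echo "refusing to replace existing directory: $base/data" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link, -n: do not follow an existing link.
    ln -sfn "$src" "$base/data"
}
```

Symlinking avoids duplicating the genome sequence files, at the cost of depending on the shared copy staying in place.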
To get the model weights you need to do one of the following:
- Train your own model and save the checkpoint. Ask Javier Rodriguez Hernaez for details.
- Download a pretrained hg38 model checkpoint created by Javier from here.
If you are on Ultraviolet, symlink or copy it from the following path:
/gpfs/home/rodrij92/PROJECTS/SHARE/epoch=53-step=64260.ckpt
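A minimal sketch of the symlink option follows. `link_corigami_model` is a hypothetical helper whose default source is the checkpoint path quoted above. It prints the checkpoint name without the `.ckpt` extension, on the assumption (taken from the directory tree) that this is the value corigami_model expects; check config/config.yaml before relying on it.

```shell
# Hypothetical helper: symlink a checkpoint into <corigami_base>/model_weights.
# Usage: link_corigami_model <corigami_base> [checkpoint_path]
link_corigami_model() {
    local base="$1"
    local ckpt="${2:-/gpfs/home/rodrij92/PROJECTS/SHARE/epoch=53-step=64260.ckpt}"
    local name
    name=$(basename "$ckpt")

    mkdir -p "$base/model_weights"
    # -s: symbolic link, -f: replace an existing link of the same name.
    ln -sf "$ckpt" "$base/model_weights/$name"

    # Print the name without the extension (assumed value for corigami_model).
    basename "$ckpt" .ckpt
}
```

Replace `ln -sf` with `cp` if you prefer a private copy of the checkpoint.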
The main parameters you may need to specify are:
- genome: either hg38 or mm10
- sample_meta: path to samplesheet
- corigami_base: directory with C.Origami data
- corigami_model: name of checkpoint file under <corigami_base>/model_weights
If you have specified everything correctly, you can launch the pipeline by executing the following commands on Ultraviolet:
conda activate snakemake # activate environment you created in Ultraviolet if you don't have snakemake
snakemake --profile ultraviolet