cd coalqc/scripts
pwd
copy the path where all the scripts are present and use export path to "coalpath" to save in your working environment. i.e. path which you will obtain after executing pwd in your coalqc/scripts folder.
export coalpath={paste the path here}
Now, you have setup the environment for the scripts. Add the location of the coalqc folder to your PATH variable, you can then execute coalqc from anywhere on your computer.
Prior to assessing the quality of genomic regions for performing coalescent analysis, we need to prepare an alignment of the short read data against a reference genome in the bam format. While this can be done using a short-read alignment program of your choice, we recommend using the coalmap command that uses the bwa read mapper to perform the read mapping. The following programs need to be installed and present in the PATH: bwa, samtools
coalqc map -g genome -f fastqread1 -r fastqread2 -p prefix -n threads
The coalmapscript will perform indexing of genome fasta file using the bwa index command and will also align the short read data to the genome using the bwa mem command. You may use the coalmapmulti.sh command to map the reads from multiple individuals.
As an example, we will download the genome assembly of the worm Caenorhabditis elegans from the NCBI website and use the publicly available short read data from the wild isolate PB306.
#Download genome reference file
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/985/GCF_000002985.6_WBcel235/GCF_000002985.6_WBcel235_genomic.fna.gz
#Uncompress the reference fasta file
gunzip GCF_000002985.6_WBcel235_genomic.fna.gz
#Download short read data files
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR306/006/ERR3063486/ERR3063486_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR306/006/ERR3063486/ERR3063486_2.fastq.gz
#Use coalmap command to index the genome and map the short read data.
coalqc map -g GCF_000002985.6_WBcel235_genomic.fna -f ERR3063486_1.fastq.gz -r ERR3063486_2.fastq.gz -p cel_PB306_ERR3063486 -n 10
The following information about the genome assembly used and percent of reads mapped to the genome are reported.
Percent of N's or hardmasked bases is 0.
Analysis will be performed by using this fasta file but make sure you are not using hardmasked genome.
Now, mapping the reads to the reference genome.
Mapping finished
The mapped percentage for this assembly is 98.91%.
Now, calculating coverage.
The mean coverage of reads is 27.1474
#Download the repeat masker output file from NCBI (or generate your own by running repeat masker).
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/985/GCF_000002985.6_WBcel235/GCF_000002985.6_WBcel235_rm.out.gz
Uncompress the repeat masker output and convert it to a BED format file. Note that the repeat type is encoded as a the fourth column.
gunzip GCF_000002985.6_WBcel235_rm.out.gz
cat GCF_000002985.6_WBcel235_rm.out|tail -n +4|awk '{print $5,$6,$7,$11}'|sed 's/ /\t/g' > GCF_000002985.6_WBcel235_rm.bed
Having created the read alignment file in bam format and the repeat location and type file in bed format, we are ready to start running the coalqc repeat module.
coalqc repeat -g GCF_000002985.6_WBcel235_genomic.fna -b cel_PB306_ERR3063486/cel_PB306_ERR3063486.sort.bam -p cel_PB306_ERR3063486_repeats -n 10 -m GCF_000002985.6_WBcel235_rm.bed -t 0.1 -u 2.7e-09 -P 1*20+10*5+5*2+1
Scripts:
coalqc - Wrapper Script.
coalmap - Single mapping module of coalqc.
coalmapmulti - Multiple mapping module of coalqc.
coalrep - Repeat effect module of coalqc.
coalcov - Coverage effect module of coalqc.
coalcall - Callability effect module of coalqc.