This pipeline was taylored to deal with whole-exome data, where I wanted to obtain alignments from exome target regions but preserving whole-genome coordinates. Furthermore, the fact that I have a different reference per species precludes the use of programs to manipulate vcf files, such as vcftools, that assume a common reference for all individuals.
There are some preliminary files that we should create to complete each species tree analysis.
1. Generate consensus fasta per individual and chromosome (as described here)
Obtain a file with the reference's chromosome sizes. Oryctolagus_cuniculus.OryCun2.0.dna.toplevel.fa
is the OryCun2.0 whole genome reference downloaded from ENSEMBL.
faidx --transform chromsizes --out Oryctolagus.chromsizes Oryctolagus_cuniculus.OryCun2.0.dna.toplevel.fa
Create 50 kb windows across the reference and store the coordinates in a bed file.
bedtools makewindows -g Oryctolagus.chromsizes -w 50000 > genome_50kbwindows_g.bed
Then, I separated this big bed file by chromosome (includes ALL chromosomes in the genome):
for chr in $(cut -f 1 genome_50kbwindows_g.bed | sort | uniq); do grep -w $chr genome_50kbwindows_g.bed > $chr"_output.bed"; done
The theory behind ASTRAL is described in Zhang et al 2018.
Assuming the previous steps were completed, here are the commands I used for the ASTRAL analysis.
- Create a folder for your ASTRAL analysis. Inside this, create folders for each chromosome.
parallel 'mkdir chr_{}_200' :::: target_chromosomes.txt
-
Place a file with the list of chromosomes
target_chromosomes.txt
and list of sample IDssamples.txt
inside main ASTRAL folder. -
Generate the 50 kb fasta alignments with
create_50kbw_200bp_align.sh
. This script callsbedtools
,faidx
,msa_split
(fromphas
),TriSeq
(from TriFusion) and the scriptseq2line.py
.
It will also need the following files and the paths to them:
samples.txt
with list of sample IDs;Oryctolagus.chromsizes
, a two column file with chromosome name and length;- Bed files with the target coordinates for each chromosome "$chromosome".bed
This script will call the consensus chromosome fasta for each of your individuals from step 1. It will create multi fasta alignments with all the individuals in samples.txt
. These alignments will have genome coordinates, but will be restricted to 50kb regions that overlap with regions that were in our initial capture extended by 200bp. It does this with a bed file with the coordinates of each target in the capture and using bedtools to extend the coordinates by 200bp on either side.
parallel -j 10 'create_50kbw_200bp_align.sh {}' :::: target_chromosomes.txt
The resulting files will have the extension *_new_filter.fasta
.
Using filter_fasta_by_length.py
, I gzipped the fasta files that are shorter than 1000 bp, since I wanted to exclude them from further analyses.
filter_fasta_by_length.py 1000 windows_gziped.txt
I obtained general information about these alignments (such as length) with AMAS
.
for i in $(ls *_new_filter.fas); do ~/my_programs/AMAS/amas/AMAS.py summary -f fasta -d dna -i $i -o "$i"_summary; done
I ran RAxML
on all alignments > 1000 bp (parallelize this step as much as possible).
for i in $(ls *_filter.fas); do o=${i/new_filter.fas/.out}; raxmlHPC-PTHREADS -T 8 -f a -N 100 -x 2408 -m GTRGAMMA -p 2308 -o OryCun,BID_RC_444 -s $i -n $o
After generating the gene trees, we will use the bootstrap files to obtain Tree Certainty scores for each one of them using RAxML
option -L MRE
.
for i in $(ls RAxML_bootstrap*); do
o=$(ls $i | cut -f2,3 -d'.' | sed 's/_new_filter.fas//');
output=$o'_100BS_TC_score.tre';
echo "My input is : $i; My output is: $output";
raxmlHPC-PTHREADS -T 8 -L MRE -z $i -m GTRGAMMA -n $output;
done
I filtered trees with Tree Certainty Score smaller than 5, resulting in a final dataset with 8889 gene trees.
Before running ASTRAL, I unrooted the gene trees using R package ape
.
library(ape)
trees_list<-list.files(pattern="*bestTree*")
for (i in 1:length(trees_list)){
tr_i<-read.tree(trees_list[i])
tr_unroot_i<-unroot(tr_i)
outtree2=paste(trees_list[i],"_unrooted.tree",sep="")
write.tree(tr_unroot_i,file=outtree2)
}
Concatenate the trees into a single file gene_trees_TCs5.tree
and run ASTRAL. I ran ASTRAL with and without assigning individuals to species with a map_file_astral.txt
, and for gene tress of only the autosomes and X chromosome.
java -jar ~/my_programs/ASTRAL/Astral/astral.5.6.3.jar -t 2 -i gene_trees_TCs5.tree -o astral5.6.3_species_TCs5_gene_trees -a map_file_astral.txt
java -jar ~/my_programs/ASTRAL/Astral/astral.5.6.3.jar -t 2 -i gene_trees_TCs5.tree -o astral5.6.3_lineage_TCs5_gene_trees
java -jar ~/my_programs/ASTRAL/Astral/astral.5.6.3.jar -t 2 -i chr_auto_gene_trees_TCs5.tree -a map_file_astral.txt -o astral5.6.3_chrauto_TCs5_gene_trees
java -jar ~/my_programs/ASTRAL/Astral/astral.5.6.3.jar -t 2 -i chr_X_gene_trees_TCs5.tree -a map_file_astral.txt -o astral5.6.3_chrX_TCs5_gene_trees
The theory behind SVDquartets is described in Chifman & Kubatko 2014.
If we were using whole genome alignments with the same reference for all species, we could simply concatenate the vcf files of all individuals and call snps. Unfortunately, this is not possible when each species has it's own reference. This is how I worked around this issue. The logic of the pipeline is similar to the previous one. We have a folder per chromosome where we create SNP alignments.
parallel 'mkdir chr_{}_200' :::: target_chromosomes.txt
This is the script we use to generate a SNP alignment for each chromosome. Check the script for more details about each step.
nohup parallel -j 10 '../specific_scripts/create_svdquartets_input.sh {}' :::: target_chromosomes.txt &
This script will call the consensus chromosome fasta for each individual from step 1. It will create multi-fasta alignments per chromosome. It will call SNPs inside the exome capture target regions extended by 200bp (as in ASTRAL). We first call SNPs for Lepus individuals/species and then obtain the ancestral states from Rabbit and pygmy rabbit; therefore our samples.txt
file only contains Lepus.
This script calls:
snp-sites
; converts a whole chromosome fasta alignment to vcf. Unfortunately, the version I used did not code missing data correctly in a way a program like vcftools would accept. Therefore, I had to write my own script to filter missing data and produce a consensus fasta for the chromosome.filter_missing_vcf.py
; filters SNPs that have too much missing data (adjustable threshold)faidx
;filter_bed.py
; limits the final alignment to 1 SNP per 10kb.bedtools
concatenate_snps.py
; concatenates the output ofbedtools getfasta
And some of the same files as above:
Oryctolagus.chromsizes
- Bed files with the target coordinates for each chromosome "$chromosome".bed
samples.txt
; a list of Lepus individuals.
The script will ignore empty vcf files after filtering for missing data.
After we have a snp alignment per chromosome, there's still some processing to do. I created the SVDquartets input in a new folder:
ln -s ../chr_*_200/*_outgroup.fa ./
# fix the headers.
for i in $(ls *_outgroup.fa); do o=${i/.fa/_new.fa}; sed "s/_[^_]*$//" $i > $o; done
for i in $(ls *_final_snps_OryBID_outgroup_new.fa); do o=${i/_final_snps_OryBID_outgroup_new.fa/.fa}; mv $i $o; done
# Substitute N by ?, since that is PAUPs prefered missing character.
for i in $(ls *.fa); do sed -i.tmp '/^>/! s/N/?/g' $i; done
# This is a bunch of code to remove chromosome files that are empty.
for i in $(ls *.fa); do samtools faidx $i; done
# Check if the smallest file size is empty and use it to find empty files.
ls -lhS *.fa
# Empty:
for i in $(find . -name "*.fa" -size 471c -type f -exec ls {} \;); do head -n1 "$i".fai ; done
# Not empty:
for i in $(find . -name "*.fa" -type f -exec ls {} \;); do head -n1 "$i".fai ; done
# Delete empty:
find . -name "*.fa" -size 471c -type f -delete
Then, use AMAS concat
to concate the different chromosome files:
AMAS.py concat -i *.fa -f fasta -d dna -t snp_alignment_33indvs_noX.nex -p partitions_33indvs_noX.nex -u nexus -y nexus
AMAS.py concat -i *.fa -f fasta -d dna -t snp_alignment_33indvs_WithX.nex -p partitions_33indvs_WithX.nex -u nexus -y nexus
AMAS.py concat -i chr_X.fa -f fasta -d dna -t snp_alignment_33indvs_chr_X.nex -p partitions_33indvs_chr_X.nex -u nexus -y nexus
Add the taxa block that assigns individuals to species.
Then run SVDquartets using PAUP*:
- Whole exome snps, by species:
begin paup;
log start file=snp_alignment_33indvs_WithX_May2020.nex.log replace;
execute snp_alignment_33indvs_WithX.nex;
SVDquartets evalQuartets=all bootstrap=yes nreps=1000 partition=lagomorphspecies speciesTree=yes nthreads=8 mrpFile=species_33indvs_WithX_qall_b1000_May2020;
savetrees file=species_33indvs_WithX_qall_b1000_May2020.tre format=Newick root=yes supportValues=nodeLabels;
end;
- Chromosome X snps, by species:
begin paup;
log start file=33indvs_chr_X.log replace;
execute snp_alignment_33indvs_chr_X.nex;
SVDquartets evalQuartets=all bootstrap=yes nreps=1000 partition=lagomorphspecies speciesTree=yes nthreads=8 mrpFile=species_33indvs_chr_X_qall_b1000;
savetrees file=species_33indvs_chr_X_qall_b1000.tre format=Newick root=yes supportValues=nodeLabels;
end;
- Whole exome snps, by individuals:
begin paup;
log start file=lineage_33indvs_WithX.nex.log replace;
execute snp_alignment_33indvs_WithX.nex;
SVDquartets evalQuartets=all bootstrap=yes nreps=1000 speciesTree=no nthreads=8 mrpFile=lineage_33indvs_WithX_qall_b1000;
savetrees file=lineage_33indvs_WithX_qall_b1000.tre format=Newick root=yes supportValues=nodeLabels;
end;
- Chromosome X snps, by individuals:
begin paup;
log start file=lineage_33indvs_chr_X.log replace;
execute snp_alignment_33indvs_chr_X.nex;
SVDquartets evalQuartets=all bootstrap=yes nreps=1000 speciesTree=no nthreads=8 mrpFile=lineage_33indvs_chr_X_qall_b1000;
savetrees file=lineage_33indvs_chr_X_qall_b1000.tre format=Newick root=yes supportValues=nodeLabels;
end;
Again using a folder per chromosome, get the consensus fasta alignments from step 1.
parallel --colsep '\t' 'ln -s /path/to/consensus/sequences/{1}/{1}_{2}.fasta.fa ./' :::: ../samples.txt ::: $chromosome
Retrieve Oryctolagus sequence and change its name.
faidx -x ~/reference/Oryctolagus_cuniculus.OryCun2.0.dna.toplevel.fa $chromosome
mv $chromosome.fa OryCun_$chromosome.fasta.fa
Change the fasta header to the name of the individual.
for i in $(ls *.fasta.fa); do header=${i/.fasta.fa/}; o=${i/.fasta.fa/.fasta.new.fa}; sed "s/$chromosome/$header/" "$i" > $o; done
Create whole chromosome alignment.
cat *.fasta.new.fa > chr_"$chromosome"_full_algn.fa
With TriSeq
, remove any position with missing data:
parallel -j 6 'i={1}; TriSeq -in $i -of fasta -o ${i/.fa/_f100} --upper-case --missing-filter 0 0' ::: $(ls *_full_algn.fa)
Exclude chromosomes with no positions. Concatenate chromosome alignments into a single alignment.
AMAS concat -i *.fas -f fasta -d dna -u fasta -y raxml -p exome_alignment_partitions_33indvs.txt -t exome_alignment_33indvs.fas
Then, run RAxML.
raxmlHPC-PTHREADS-SSE3 -T 6 -f a -N autoMRE -x 2408 -m GTRGAMMA -p 2330 -o OryCun,BID_RC_444 -s exome_alignment_33indvs_f100.fas -n exome_alignment_33indvs_f100.tree
Go back to main page