
Species tree

This pipeline was tailored to whole-exome data, where I wanted to obtain alignments from the exome target regions while preserving whole-genome coordinates. Furthermore, having a different reference per species precludes the use of programs that manipulate vcf files and assume a common reference for all individuals, such as vcftools.

There are some preliminary files we need to create before running each species tree analysis.

1. Generate consensus fasta per individual and chromosome (as described here)

2. Make 50 kb windows across the genome.

Obtain a file with the reference's chromosome sizes. Oryctolagus_cuniculus.OryCun2.0.dna.toplevel.fa is the OryCun2.0 whole genome reference downloaded from ENSEMBL.

faidx --transform chromsizes --out Oryctolagus.chromsizes Oryctolagus_cuniculus.OryCun2.0.dna.toplevel.fa

Create 50 kb windows across the reference and store the coordinates in a bed file.

bedtools makewindows -g Oryctolagus.chromsizes -w 50000 > genome_50kbwindows_g.bed

Then, I split this big bed file by chromosome (it includes ALL chromosomes in the genome):

for chr in $(cut -f 1 genome_50kbwindows_g.bed | sort | uniq); do grep -w $chr genome_50kbwindows_g.bed > $chr"_output.bed"; done

3. Split a bed file with coordinates for your capture targets by chromosome.
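A minimal sketch for this step, mirroring the per-chromosome split above (assuming the capture targets are in a single bed file, here called targets.bed):

for chr in $(cut -f 1 targets.bed | sort | uniq); do grep -w $chr targets.bed > $chr".bed"; done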

With these preliminary files in place, we can run three species tree analyses:

  1. ASTRAL;
  2. SVDquartets;
  3. Maximum likelihood tree of a concatenated whole exome alignment (RAxML).

ASTRAL

The theory behind ASTRAL is described in Zhang et al. 2018.

Assuming the previous steps were completed, here are the commands I used for the ASTRAL analysis.

  1. Create a folder for your ASTRAL analysis. Inside it, create folders for each chromosome.
parallel 'mkdir chr_{}_200' :::: target_chromosomes.txt
  2. Place a file with the list of chromosomes (target_chromosomes.txt) and a list of sample IDs (samples.txt) inside the main ASTRAL folder.

  3. Generate the 50 kb fasta alignments with create_50kbw_200bp_align.sh. This script calls bedtools, faidx, msa_split (from PHAST), TriSeq (from TriFusion) and the script seq2line.py.

It will also need the following files and the paths to them:

  • samples.txt with list of sample IDs;
  • Oryctolagus.chromsizes, a two column file with chromosome name and length;
  • Bed files with the target coordinates for each chromosome "$chromosome".bed

This script will use the consensus chromosome fasta of each of your individuals from step 1 and create multi-fasta alignments with all the individuals in samples.txt. These alignments keep genome coordinates, but are restricted to the 50 kb windows that overlap our initial capture targets extended by 200 bp. It does this with a bed file holding the coordinates of each capture target, using bedtools to extend the coordinates by 200 bp on either side.
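For illustration only, a minimal sketch of that selection step (this is not the actual contents of create_50kbw_200bp_align.sh; it assumes the per-chromosome target file "$chromosome".bed and window file "$chromosome"_output.bed created in the preliminary steps):

# extend each capture target by 200 bp on either side
bedtools slop -i "$chromosome".bed -g Oryctolagus.chromsizes -b 200 > "$chromosome"_targets_200bp.bed
# keep only the 50 kb windows that overlap an extended target
bedtools intersect -u -a "$chromosome"_output.bed -b "$chromosome"_targets_200bp.bed > "$chromosome"_windows_in_targets.bed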

parallel -j 10 'create_50kbw_200bp_align.sh {}' :::: target_chromosomes.txt

The resulting files will have the extension *_new_filter.fas.

Using filter_fasta_by_length.py, I gzipped the fasta files that are shorter than 1000 bp, since I wanted to exclude them from further analyses.

filter_fasta_by_length.py 1000 windows_gziped.txt

I obtained general information about these alignments (such as length) with AMAS.

for i in $(ls *_new_filter.fas); do ~/my_programs/AMAS/amas/AMAS.py summary -f fasta -d dna -i $i -o "$i"_summary; done

I ran RAxML on all alignments > 1000 bp (parallelize this step as much as possible).

for i in $(ls *_filter.fas); do o=${i/new_filter.fas/.out}; raxmlHPC-PTHREADS -T 8 -f a -N 100 -x 2408 -m GTRGAMMA -p 2308 -o OryCun,BID_RC_444 -s $i -n $o; done
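One way to parallelize this with GNU parallel (a sketch; the job and thread counts are examples to adjust to your machine, and the -n output names will differ slightly from the loop above):

parallel -j 10 'raxmlHPC-PTHREADS -T 4 -f a -N 100 -x 2408 -m GTRGAMMA -p 2308 -o OryCun,BID_RC_444 -s {} -n {.}.out' ::: *_new_filter.fas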

After generating the gene trees, we will use the bootstrap files to obtain Tree Certainty scores for each of them using the RAxML option -L MRE.

for i in $(ls RAxML_bootstrap*); do
	o=$(ls $i | cut -f2,3 -d'.' | sed 's/_new_filter.fas//');
	output=$o'_100BS_TC_score.tre';
	echo "My input is : $i; My output is: $output";	
	raxmlHPC-PTHREADS -T 8 -L MRE -z $i -m GTRGAMMA -n $output;
done

I filtered trees with Tree Certainty Score smaller than 5, resulting in a final dataset with 8889 gene trees.
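For instance, assuming the TC score of each window had been collected into a hypothetical two-column file tc_scores.txt (window name and score), the filtering could be as simple as:

awk '$2 >= 5 {print $1}' tc_scores.txt > windows_TC5_keep.txt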

Before running ASTRAL, I unrooted the gene trees using R package ape.

library(ape)

trees_list<-list.files(pattern="bestTree")

for (i in 1:length(trees_list)){
  tr_i<-read.tree(trees_list[i])
  tr_unroot_i<-unroot(tr_i)
  outtree2=paste(trees_list[i],"_unrooted.tree",sep="")
  write.tree(tr_unroot_i,file=outtree2)
}

Concatenate the trees into a single file gene_trees_TCs5.tree and run ASTRAL. I ran ASTRAL with and without assigning individuals to species (using map_file_astral.txt), and separately for gene trees from only the autosomes and from the X chromosome.
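The concatenation itself can be a simple cat over the unrooted trees that passed the TC filter, for example:

cat *_unrooted.tree > gene_trees_TCs5.tree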

java -jar ~/my_programs/ASTRAL/Astral/astral.5.6.3.jar -t 2 -i gene_trees_TCs5.tree -o astral5.6.3_species_TCs5_gene_trees -a map_file_astral.txt
java -jar ~/my_programs/ASTRAL/Astral/astral.5.6.3.jar -t 2 -i gene_trees_TCs5.tree -o astral5.6.3_lineage_TCs5_gene_trees
java -jar ~/my_programs/ASTRAL/Astral/astral.5.6.3.jar -t 2 -i chr_auto_gene_trees_TCs5.tree -a map_file_astral.txt -o astral5.6.3_chrauto_TCs5_gene_trees
java -jar ~/my_programs/ASTRAL/Astral/astral.5.6.3.jar -t 2 -i chr_X_gene_trees_TCs5.tree -a map_file_astral.txt -o astral5.6.3_chrX_TCs5_gene_trees

SVDquartets

The theory behind SVDquartets is described in Chifman & Kubatko 2014.

If we were using whole-genome alignments with the same reference for all species, we could simply concatenate the vcf files of all individuals and call SNPs. Unfortunately, this is not possible when each species has its own reference. This is how I worked around the issue. The logic of the pipeline is similar to the previous one: we have a folder per chromosome where we create SNP alignments.

parallel 'mkdir chr_{}_200' :::: target_chromosomes.txt

This is the script we use to generate a SNP alignment for each chromosome. Check the script for more details about each step.

nohup parallel -j 10 '../specific_scripts/create_svdquartets_input.sh {}' :::: target_chromosomes.txt &

This script will use the consensus chromosome fasta of each individual from step 1 and create multi-fasta alignments per chromosome. It will call SNPs inside the exome capture target regions extended by 200 bp (as in ASTRAL). We first call SNPs for the Lepus individuals/species and then obtain the ancestral states from rabbit and pygmy rabbit; therefore, our samples.txt file only contains Lepus.

This script calls:

  • snp-sites: converts a whole-chromosome fasta alignment to vcf. Unfortunately, the version I used did not encode missing data in a way that programs like vcftools accept, so I had to write my own script to filter missing data and produce a consensus fasta for the chromosome;
  • filter_missing_vcf.py: filters SNPs that have too much missing data (adjustable threshold);
  • faidx;
  • filter_bed.py: limits the final alignment to 1 SNP per 10 kb (a minimal sketch of this idea is shown below);
  • bedtools;
  • concatenate_snps.py: concatenates the output of bedtools getfasta.

And some of the same files as above:

  • Oryctolagus.chromsizes;
  • Bed files with the target coordinates for each chromosome, "$chromosome".bed;
  • samples.txt: a list of Lepus individuals.

The script will ignore empty vcf files after filtering for missing data.
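For illustration, a minimal sketch of the 1-SNP-per-10-kb idea (this is not the actual filter_bed.py; it keeps the first SNP in each 10 kb bin of a per-chromosome SNP bed file, whose name here is just a placeholder):

sort -k1,1 -k2,2n "$chromosome"_snps.bed | awk '{bin=int($2/10000)} !(($1,bin) in seen) {seen[$1,bin]=1; print}' > "$chromosome"_snps_thinned.bed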

After we have a SNP alignment per chromosome, there is still some processing to do. I created the SVDquartets input in a new folder:

ln -s ../chr_*_200/*_outgroup.fa ./ 

# Fix the headers: drop the trailing suffix after the last underscore so only the sample ID remains.
for i in $(ls *_outgroup.fa); do o=${i/.fa/_new.fa}; sed "s/_[^_]*$//" $i > $o; done
for i in $(ls *_final_snps_OryBID_outgroup_new.fa); do o=${i/_final_snps_OryBID_outgroup_new.fa/.fa}; mv $i $o; done

# Substitute N with ?, since ? is PAUP*'s preferred missing-data character.
for i in $(ls *.fa); do sed -i.tmp '/^>/! s/N/?/g' $i; done

# The following removes chromosome files whose alignments are empty.
for i in $(ls *.fa); do samtools faidx $i; done
# Sort the files by size, check whether the smallest ones (here, 471 bytes) are empty, and use that size to find all the empty files.
ls -lhS *.fa
# Empty:
for i in $(find . -name "*.fa" -size 471c -type f -exec ls {} \;); do head -n1 "$i".fai ; done
# Not empty:
for i in $(find . -name "*.fa" -type f -exec ls {} \;); do head -n1 "$i".fai ; done
# Delete empty:
find . -name "*.fa" -size 471c -type f -delete

Then, use AMAS concat to concatenate the different chromosome files:

AMAS.py concat -i *.fa -f fasta -d dna -t snp_alignment_33indvs_noX.nex -p partitions_33indvs_noX.nex -u nexus -y nexus
AMAS.py concat -i *.fa -f fasta -d dna -t snp_alignment_33indvs_WithX.nex -p partitions_33indvs_WithX.nex -u nexus -y nexus
AMAS.py concat -i chr_X.fa -f fasta -d dna -t snp_alignment_33indvs_chr_X.nex -p partitions_33indvs_chr_X.nex -u nexus -y nexus

Add the taxa block that assigns individuals to species.
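For example, a minimal sketch that appends a sets block to the nexus file (the species names and taxon ranges are placeholders to be replaced with the real assignments, every taxon in the matrix needs to be covered, and the taxpartition name must match the partition= option used below):

cat >> snp_alignment_33indvs_WithX.nex << 'EOF'
begin sets;
	taxpartition lagomorphspecies =
		species_A : 1-5,
		species_B : 6-10,
		species_C : 11-33;
end;
EOF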

Then run SVDquartets using PAUP*:

  1. Whole exome SNPs, by species:
begin paup;
	log start file=snp_alignment_33indvs_WithX_May2020.nex.log replace;
	execute snp_alignment_33indvs_WithX.nex; 
	SVDquartets evalQuartets=all bootstrap=yes nreps=1000 partition=lagomorphspecies speciesTree=yes nthreads=8 mrpFile=species_33indvs_WithX_qall_b1000_May2020; 
	savetrees file=species_33indvs_WithX_qall_b1000_May2020.tre format=Newick root=yes supportValues=nodeLabels;
end;
  2. Chromosome X SNPs, by species:
begin paup;
	log start file=33indvs_chr_X.log replace;
	execute snp_alignment_33indvs_chr_X.nex; 
	SVDquartets evalQuartets=all bootstrap=yes nreps=1000 partition=lagomorphspecies speciesTree=yes nthreads=8 mrpFile=species_33indvs_chr_X_qall_b1000; 
	savetrees file=species_33indvs_chr_X_qall_b1000.tre format=Newick root=yes supportValues=nodeLabels;
end;
  3. Whole exome SNPs, by individuals:
begin paup;
	log start file=lineage_33indvs_WithX.nex.log replace;
	execute snp_alignment_33indvs_WithX.nex; 
	SVDquartets evalQuartets=all bootstrap=yes nreps=1000 speciesTree=no nthreads=8 mrpFile=lineage_33indvs_WithX_qall_b1000; 
	savetrees file=lineage_33indvs_WithX_qall_b1000.tre format=Newick root=yes supportValues=nodeLabels;
end;
  4. Chromosome X SNPs, by individuals:
begin paup;
	log start file=lineage_33indvs_chr_X.log replace;
	execute snp_alignment_33indvs_chr_X.nex; 
	SVDquartets evalQuartets=all bootstrap=yes nreps=1000 speciesTree=no nthreads=8 mrpFile=lineage_33indvs_chr_X_qall_b1000; 
	savetrees file=lineage_33indvs_chr_X_qall_b1000.tre format=Newick root=yes supportValues=nodeLabels;
end;

Concatenated whole exome alignment

Again using a folder per chromosome, get the consensus fasta alignments from step 1.

parallel --colsep '\t' 'ln -s /path/to/consensus/sequences/{1}/{1}_{2}.fasta.fa ./' :::: ../samples.txt ::: $chromosome

Retrieve Oryctolagus sequence and change its name.

faidx -x ~/reference/Oryctolagus_cuniculus.OryCun2.0.dna.toplevel.fa $chromosome
mv $chromosome.fa OryCun_$chromosome.fasta.fa

Change the fasta header to the name of the individual.

for i in $(ls *.fasta.fa); do header=${i/.fasta.fa/}; o=${i/.fasta.fa/.fasta.new.fa}; sed "s/$chromosome/$header/" "$i" > $o; done

Create whole chromosome alignment.

cat *.fasta.new.fa > chr_"$chromosome"_full_algn.fa

With TriSeq, remove any position with missing data:

parallel -j 6 'i={1}; TriSeq -in $i -of fasta -o ${i/.fa/_f100} --upper-case --missing-filter 0 0' ::: $(ls *_full_algn.fa)

Exclude chromosomes with no positions left after filtering (a minimal sketch for finding them is below), then concatenate the chromosome alignments into a single alignment.
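A hedged sketch, assuming the TriSeq outputs end in _f100.fas (adjust the glob to the actual output names):

for i in *_f100.fas; do
	# count non-header characters; zero means no positions survived the filter
	if [ "$(grep -v '^>' "$i" | tr -d '[:space:]' | wc -c)" -eq 0 ]; then rm "$i"; fi
done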

AMAS concat -i *.fas -f fasta -d dna -u fasta -y raxml -p exome_alignment_partitions_33indvs.txt -t exome_alignment_33indvs.fas

Then, run RAxML.

raxmlHPC-PTHREADS-SSE3 -T 6 -f a -N autoMRE -x 2408 -m GTRGAMMA -p 2330 -o OryCun,BID_RC_444 -s exome_alignment_33indvs_f100.fas -n exome_alignment_33indvs_f100.tree

Go back to main page