

## What is this?

This repository contains scripts and data to analyze the evolution of the MADS box gene family in orchids (with special attention to the occurrence and expression of these genes in Erycina pusilla).

The basic outline of the workflow enabled by these scripts is as follows:

## Merging and aligning

The raw data files are FASTA files, one for each gene class. These are merged in various combinations and then aligned as codon alignments (i.e. exactly in-frame). To this end, the set of files to be merged should be placed in a single folder, which is provided as the argument to the shell script protaln.sh. This script removes any previous alignment from the files (script/unalign.pl) and translates them to amino acids (script/nuc2aa.pl).

For this to succeed it is essential that the unaligned sequences translate cleanly, without frame shifts. The sequences should start with the ORF (technically this is not a hard requirement as long as the sequences are in-frame). The stop codon must be omitted!

Subsequently, both the unaligned nucleotide sequences and the amino acid sequences are concatenated. The amino acid sequences are then aligned using MAFFT, using the algorithm that is recommended for proteins with multiple conserved domains. The nucleotide sequences are then reconciled with this amino acid alignment (script/protalign.pl).
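As a minimal sketch of the "unalign" step (script/unalign.pl is the real implementation; the file names and sequences below are invented for illustration), gap characters are stripped from every non-header line of a FASTA file:

```shell
# Create a small aligned FASTA file (invented example data).
cat > aligned.fasta <<'EOF'
>Epusilla_AB123456
ATG---GCTAAA
>Oncidium_XY654321
ATGAAAGCTAAA
EOF

# Strip alignment gap characters ('-') from sequence lines, leaving headers intact.
sed '/^>/!s/-//g' aligned.fasta > unaligned.fasta
cat unaligned.fasta
```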

## Phylogenetic inference

To build trees from the codon alignments we use PhyML. This requires PHYLIP-format input, which is created from the FASTA files. Because the PHYLIP format requires that sequence names are no longer than 10 characters, we need a string that is guaranteed to be distinct for each sequence. For this we use the GenBank accession number, which is expected to be the last string in the FASTA definition line, preceded by an underscore ('_').

The PHYLIP files are created and analyzed with PhyML using a shell script that has the search parameters embedded in it (phyml.sh, which internally calls a file conversion script). Inference is fast, taking at most a few minutes.
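The renaming convention can be sketched as follows (a minimal sketch, not the repository's conversion script; the input sequence is invented): the last '_'-separated field of the definition line — the GenBank accession — becomes the PHYLIP-safe name.

```shell
# Invented example: a definition line ending in an accession-like token.
cat > seqs.fasta <<'EOF'
>Erycina_pusilla_AP3-like_AB123456
ATGGCTAAA
EOF

# Keep only the last '_'-separated field of each definition line.
awk '/^>/ { n = split($1, a, "_"); print ">" a[n]; next } { print }' seqs.fasta > short.fasta
cat short.fasta
```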

## Speciation/duplication inference

There have been duplications within certain MADS classes. It would be valuable to have an objective, tree-based inference of where these duplications occurred, so that we can identify which sequences are orthologous to each other and which are paralogous. Given that the species tree is relatively uncontroversial, we should be able to classify nodes in the gene trees as either speciations or duplications using the (g)SDI algorithm.

This requires the preparation of the right input files, in phyloXML syntax. These files need unambiguous tags that identify the species to which each sequence belongs. This is done by a conversion script, which is called by a shell script that invokes the forester jar to run GSDI and produce annotated output tree files with the inferred duplications embedded in them.
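A minimal phyloXML input could look like the following (the names and the second species are invented for illustration); the `<taxonomy>` element is what unambiguously ties each sequence to a species:

```xml
<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <clade>
        <name>AB123456</name>
        <taxonomy>
          <scientific_name>Erycina pusilla</scientific_name>
        </taxonomy>
      </clade>
      <clade>
        <name>XY654321</name>
        <taxonomy>
          <scientific_name>Phalaenopsis equestris</scientific_name>
        </taxonomy>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>
```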

## dN/dS analysis

We analyze branch-specific variation in dN/dS ratios using the BranchSiteREL algorithm of HyPhy, as made available on the datamonkey.org cluster. To this end we must create a NEXUS file that has both the codon alignment and the gene tree embedded in it. We create this from the PHYLIP input file for PhyML and the resulting tree file (script/make_nexus.pl).
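A skeleton of such a NEXUS file (sequence names and sequences are placeholders, not real data) combines a DATA block holding the codon alignment with a TREES block holding the gene tree:

```
#NEXUS
BEGIN DATA;
    DIMENSIONS NTAX=2 NCHAR=9;
    FORMAT DATATYPE=DNA GAP=- MISSING=?;
    MATRIX
        AB123456  ATGGCTAAA
        XY654321  ATGAAAGCT
    ;
END;
BEGIN TREES;
    TREE genetree = (AB123456:0.1,XY654321:0.1);
END;
```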

Subsequently, we upload the produced NEXUS file and click through the datamonkey wizard for a BranchSiteREL analysis with the user-provided tree and the universal genetic code:

- go to http://datamonkey.org/dataupload.php
- upload the NEXUS file (datatype: codon, genetic code: universal)
- click the button "Proceed to the analysis menu"
- select "Branch-site REL" from the pulldown menu, click "User Tree(s)", and click "Run"

The analysis takes a few hours on the cluster, and multiple uploaded files are queued one after the other. The results are not stored indefinitely, so download them once the analysis has finished (or the next morning). What we need to download are the NEXUS output and the CSV output.

It is crucial to keep track of which output files belong with which upload. A reasonable way to do this is to record the Job ID that DataMonkey assigns to the upload, for example as a NEXUS comment inside the file you uploaded.
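For example, such a comment (square brackets delimit comments in NEXUS; the Job ID here is made up) could look like:

```
#NEXUS
[ DataMonkey Job ID: 8a2b4c6d9e0f ]
```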

## Post-processing and visualization

The result files from DataMonkey (NEXUS and CSV) need to be merged. Subsequently, this merged result needs to be combined with the (g)SDI gene duplication analysis in a visualization that in addition reconstructs the other member sequences for each of the Erycina pusilla target genes.

This two-step approach is implemented in a driver shell script, which executes the following steps:

1. a generic merge of the DataMonkey results, done by script/parse_datamonkey.pl
2. the visualization, to SVG format, done by script/draw_gsdi_bsr.pl
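The two steps above can be sketched as a small shell function (a minimal sketch: the two Perl scripts are the ones named in this README, but the wrapper function, its argument handling, and the intermediate/output file names are assumptions):

```shell
#!/bin/bash
# Hypothetical driver sketch for the post-processing pipeline.
run_postprocess() {
    if [ "$#" -ne 2 ]; then
        echo "usage: run_postprocess <datamonkey.nex> <datamonkey.csv>" >&2
        return 1
    fi
    # Step 1: generic merge of the DataMonkey NEXUS and CSV results.
    perl script/parse_datamonkey.pl "$1" "$2" > merged.txt
    # Step 2: combine with the (g)SDI analysis and render the SVG visualization.
    perl script/draw_gsdi_bsr.pl merged.txt > gsdi_bsr.svg
}
```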