Skip to content

Starting sequences for circular replicons

Ryan Wick edited this page Mar 15, 2021 · 7 revisions

A circular replicon doesn't have a first base, so a contig for this replicon could really start anywhere. However, it's nice to have consistent conventions for starting points, as it makes it easier to compare different genomes. The starting_genes.fasta file contains the sequences Trycycler will preferentially use to start a circular replicon, and this page describes how I made that file.

Finding the most common starting genes

I took an empirical approach to starting sequences: whatever genes tend to be used to start circular contigs in RefSeq assemblies are the ones I want Trycycler to use.

I used the ncbi-genome-download tool to download all completed RefSeq prokaryotic genomes in GenBank format:

ncbi-genome-download -p 4 --assembly-level complete bacteria,archaea

For each contig in each of these genomes, if there was an annotated coding sequence that began in the first 50 bases of the contig, I noted its name. Sorting these names from most-common to least-common revealed this:

  • chromosomal replication initiator protein dnaa (4820 instances)
  • hypothetical protein (2291 instances)
  • trna uridine-5-carboxymethylaminomethyl(34) synthesis enzyme mnmg (706 instances)
  • replication initiation protein (430 instances)
  • repb family plasmid replication initiator protein (404 instances)
  • aaa family atpase (236 instances)
  • incfii family plasmid replication initiator repa (200 instances)
  • plasmid partitioning protein repa (170 instances)
  • etc...

The words 'replication initiator/initiation' are clearly common in starting genes, along with related terms like 'repa'. So I manually curated the list to include any gene names that fit this mould, of which there were 27: chromosomal replication initiator protein dnaa, replication initiation protein, repb family plasmid replication initiator protein, incfii family plasmid replication initiator repa, plasmid partitioning protein repa, replication initiator protein a, replication protein repa, plasmid replication protein, replication regulatory protein repa, chromosomal replication initiation protein, repfib replication protein a, plasmid replication initiation protein, repa, repa protein, incfii family, orc1-type dna replication protein, chromosomal replication initiator dnaa, chromosome replication initiator dnaa, replication initiator protein repa, replication protein repa4, chromosomal replication initiation protein dnaa, plasmid replication initiator, dnaa chromosomal replication initiation protein, incfii repa family protein, chromosomal replication protein dnaa, replication associated protein repa1, plasmid replication protein repa and repa family protein.

I then extracted the corresponding sequences for any gene with a name in that list. This gave me 24512 sequences which would make a good place to start a circular contig.

Clusters and consensus

I perhaps could have stopped here and simply provided Trycycler with this file to use as starting sequences. However, I saw two downsides. First, there would be a lot of redundancy in that file, with many of those 24512 sequences being nearly identical to others. Second, I wanted to sort these sequences from most-common to least-common, so if Trycycler found two potential starting genes in a contig it could use the more common choice.

To cluster the sequences, I used minimap2 to find all pairs of sequences with 100% coverage and ≥95% identity. I then did complete-linkage hierarchical clustering with SciPy. This produced 7171 clusters which ranged from 1 sequence to 142 sequences. To transform each cluster into a single sequence, I did a multiple sequence alignment with MUSCLE, built a tree with FastTree and did an ancestral state reconstruction with TreeTime.

The final file of starting genes file contains these 7171 ancestral state reconstruction sequences (one per cluster), sorted from most-common to least-common.

Scripts

I haven't included the scripts I used to do this work in the Trycycler repo, as they are pretty clumsy and bespoke. But I'm happy to share them if anyone is actually interested – just let me know!

Clone this wiki locally