Maps BUSCO genes from a reference genome to a new assembly to crudely infer large-scale fissions and fusions. Telomere caps indicate whether a chromosome contains telomeric repeats at the ends of its sequence.
- Python 3
- (Python 3 module) Biopython
- (Python 3 module) seaborn
- (Python 3 module) reportlab
- (Python 3 module) docopt
Via conda, these can be installed with:
conda install -c anaconda biopython reportlab docopt seaborn
Run BUSCO (tested with version 5) on both a reference genome and a query genome. Ensure that chromosomes in the reference are labelled suitably; I used ChrXX.
In BUSCO 5.0.0 there is a formatting bug in the "Sequence" column: a few loci are reported in "Sequence:start-end" format. I use awk to filter out these lines before running BUSCO_2_Chrom.py.
For Pieris napi, I ran:
busco -i GCA_905163465.1_ilCraLigu1.1_genomic.fna --out ilCraLigu1.1_busco -m geno -l lepidoptera_odb10 -c 8
busco -i GCA_905231895.1_ilPieNapi4.1_alternate_haplotype_genomic.fna --out ilPieNapi4.1_busco -m geno -l lepidoptera_odb10 -c 8
awk 'BEGIN{FS="\t";OFS=FS}($3 !~ /:/){print}' ilCraLigu1.1_busco/run_lepidoptera_odb10/full_table.tsv > ilCraLigu1.1_busco.tsv
awk 'BEGIN{FS="\t";OFS=FS}($3 !~ /:/){print}' ilPieNapi4.1_busco/run_lepidoptera_odb10/full_table.tsv > ilPieNapi4.1_busco.tsv
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/905/163/465/GCA_905163465.1_ilCraLigu1.1/GCA_905163465.1_ilCraLigu1.1_assembly_structure/Primary_Assembly/assembled_chromosomes/chr2acc -O ilCraLigu1.1.chr2seqNames.tsv
while IFS=$'\t' read -r -a fNames; do
    sed "s/${fNames[1]}/Chr${fNames[0]}/" ilCraLigu1.1_busco.tsv > tmp
    mv tmp ilCraLigu1.1_busco.tsv
done < ilCraLigu1.1.chr2seqNames.tsv
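The awk filter and the sed renaming loop above can also be sketched in Python. This is only an illustration, not part of BUSCO_2_Chrom.py: the function name `filter_and_rename` is hypothetical, and unlike sed (which substitutes the first regex match anywhere on a line) it renames only exact matches in the third column.

```python
def filter_and_rename(busco_lines, chr2acc_lines):
    """Drop rows whose Sequence column contains ':' and rename accessions to Chr<name>.

    busco_lines: iterable of lines from a BUSCO full_table.tsv
    chr2acc_lines: iterable of lines from an NCBI chr2acc file
                   (chromosome name, tab, accession)
    """
    # Build an accession -> "Chr<name>" lookup, skipping the '#' header line.
    acc_to_chr = {}
    for line in chr2acc_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        acc_to_chr[fields[1]] = "Chr" + fields[0]

    kept = []
    for line in busco_lines:
        fields = line.rstrip("\n").split("\t")
        # Same test as the awk filter: keep the row unless column 3 contains ':'.
        if len(fields) >= 3 and ":" in fields[2]:
            continue
        # Rename the sequence only when column 3 is a known accession.
        if len(fields) >= 3 and fields[2] in acc_to_chr:
            fields[2] = acc_to_chr[fields[2]]
        kept.append("\t".join(fields))
    return kept
```

Comment lines and rows with fewer than three columns (for example, missing BUSCOs) pass through unchanged, which matches the behavior of the awk filter.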
The FASTA file for Pieris napi can be downloaded from NCBI (assembly GCA_905231895.1). The BUSCO 5 full table files can be found in the examples directory.
python Downloads/BUSCO_karyotyping/BUSCO_2_Chrom.py --fasta GCA_905231895.1_ilPieNapi4.1_alternate_haplotype_genomic.fna --busco ilPieNapi4.1_busco.tsv --ref ilCraLigu1.1_busco.tsv --title "Pieris napi painted by Craniophora ligustri"
Plot chromosomes colored by reference sets of orthologous loci (BUSCO's single-copy orthologs). A reference set can be the group of orthologs found on the same chromosome in a reference species, but it can also be a group of orthologs inferred to have stayed on the same chromosome over long evolutionary time.
The most frequent reference set on each chromosome is drawn in gray. Reference sets whose frequency exceeds the fractional cutoff each get a distinct color, while those below the cutoff are all drawn in the same color.
Tags indicate which reference set a color represents. A tag is shown whenever the reference set changes along the chromosome, and hidden when a neighboring locus belongs to the same set as the one previously tagged.
The telomeric repeat is searched for at both ends of each sequence. A sequence is considered capped by a telomere at a given end if the repeat occurs more than 10 times contiguously within the first or last 1000 nt. Ends with telomeres are drawn round, while ends lacking them are drawn blunt.
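As a rough illustration of the coloring rule, here is a minimal sketch for a single query chromosome. It is not taken from BUSCO_2_Chrom.py: the function name, the default cutoff, the palette, and the shared color for rare sets are all assumptions, and it assumes the cutoff is a fraction of the loci on that chromosome.

```python
from collections import Counter
from itertools import cycle


def assign_colors(ref_sets, cutoff=0.1,
                  palette=("red", "blue", "green", "orange"),
                  rare_color="black"):
    """Map each reference set seen on one chromosome to a color name.

    ref_sets: the reference set of every locus on the chromosome, in any order.
    cutoff, palette and rare_color are hypothetical defaults for illustration.
    """
    if not ref_sets:
        return {}
    counts = Counter(ref_sets)
    total = len(ref_sets)
    ranked = counts.most_common()      # most frequent reference set first
    colors = {ranked[0][0]: "gray"}    # the dominant set is drawn in gray
    distinct = cycle(palette)          # reuse colors if the palette runs out
    for ref, n in ranked[1:]:
        if n / total > cutoff:
            colors[ref] = next(distinct)   # frequent enough: its own color
        else:
            colors[ref] = rare_color       # below cutoff: one shared color
    return colors
```

For example, with six loci from Chr1, three from Chr2 and one from Chr3 and a cutoff of 0.2, Chr1 is gray, Chr2 gets its own color, and Chr3 falls into the shared color.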
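The tagging rule amounts to labelling only the first locus of each run of identical reference sets. A minimal sketch, assuming the loci are already sorted by position along the chromosome (the function name `tag_positions` is hypothetical and not part of the script):

```python
def tag_positions(ref_sets):
    """Return (index, reference set) pairs for the loci that would carry a tag.

    ref_sets: reference set of each locus, ordered along the chromosome.
    """
    tags = []
    previous = object()  # sentinel that never equals a real reference set
    for index, ref in enumerate(ref_sets):
        if ref != previous:
            # The reference set changed, so this locus gets a tag.
            tags.append((index, ref))
            previous = ref
    return tags
```

A set that reappears after an interruption is tagged again, since the rule depends only on the neighboring locus.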
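The telomere check can be sketched as follows. This is an illustration under stated assumptions rather than the script's actual code: the default motif TTAGG (the repeat commonly reported in insects) is an assumption, as is checking both the motif and its reverse complement at each end; the function names are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_run(window_seq, motif):
    """Longest contiguous run of motif in window_seq, counted in repeat units."""
    pattern = re.compile("(?:%s)+" % re.escape(motif))
    return max(
        (len(m.group(0)) // len(motif) for m in pattern.finditer(window_seq)),
        default=0,
    )


def telomere_caps(seq, motif="TTAGG", threshold=10, window=1000):
    """Return (start_capped, end_capped) for one sequence.

    An end counts as capped when the motif, or its reverse complement,
    repeats more than `threshold` times contiguously within `window` nt.
    The default motif is an assumption, not taken from the script.
    """
    seq = seq.upper()
    motifs = (motif.upper(), reverse_complement(motif.upper()))

    def capped(window_seq):
        return any(longest_run(window_seq, m) > threshold for m in motifs)

    return capped(seq[:window]), capped(seq[-window:])
```

With Biopython, a record's sequence could be passed as `telomere_caps(str(record.seq))`; a `True` value would correspond to a round end in the plot and `False` to a blunt end.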