Red oak parental genome versioning
This wiki contains code and documentation for the parental and grandparental genome versions of the reference genome provided by JGI. Some background information -
- SM1 - Female grandparent
- SM2 - Male grandparent
- SM1316 - Female parent
- SM1370 - Male parent
SM1 was crossed with SM2 to create the F1 individuals (SM1316 and SM1370). These F1s were then crossed with each other to create a pseudo-F2, which is our reference genome. All of this was done to reduce heterozygosity in the reference genome. The 4 individuals were sequenced (Oxford Nanopore reads) and assembled into chromosome-scale assemblies using the reference genome. Here, I provide the code and documentation for SM1; the workflow is identical for the other individuals.
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1
NextDenovo is a string graph-based de novo assembler for long reads (CLR, HiFi and ONT).
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.0.0/analyses/1_nextdenovo
Commands used -
# put fastq file path to a text file
ls SM1_adapter_free.fastq.gz > input.fofn
# copy the configuration file and adjust parameters
cp /home/bkapoor/NextDenovo/doc/run.cfg .
# activate paralleltask
conda activate paralleltask
# run nextdenovo
nohup /home/bkapoor/NextDenovo/nextDenovo run.cfg &
# give better name
mv nd.asm.fasta q_rubra_SM1_v1.0.0.fasta
Final assembly file - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.0.0/final_files/q_rubra_SM1_v1.0.0.fasta
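The parameters that were adjusted in run.cfg for SM1 are not recorded on this page. As a rough sketch only, a configuration for ONT reads modeled on the template shipped with NextDenovo might look like the following. Every value here (thread counts, `genome_size`, `workdir`) is an assumption to be tuned, not the setting used for this assembly.

```shell
# Hypothetical run.cfg for NextDenovo; all values below are placeholders.
cat > run.cfg <<'EOF'
[General]
job_type = local
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 10
input_type = raw
read_type = ont
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 750m
sort_options = -m 20g -t 10
minimap2_options_raw = -t 8
pa_correction = 3
correction_options = -p 10

[assemble_option]
minimap2_options_cns = -t 8
nextgraph_options = -a 1
EOF
```

With a `workdir` like the one assumed here, NextDenovo normally writes nd.asm.fasta under the 03.ctg_graph subdirectory of that directory, so the rename step has to be run from there or given the full path.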
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.1.0
NextPolish is used to fix base errors (SNV/Indel) in the genome generated by noisy long reads.
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.1.0/analyses/1_nextpolish
Commands used -
# soft link genome file
ln -s ../../input_files/q_rubra_SM1_v1.0.0.fasta .
# soft link long read file
ln -s ../../../../../analyses_parents/2_adapter_removal/1_SM1/SM1_adapter_free.fastq.gz .
ls SM1_adapter_free.fastq.gz > lgs.fofn
# create run.cfg
# activate paralleltask
conda activate paralleltask
# run nextpolish
/home/bkapoor/NextPolish/nextPolish run.cfg
# give proper names to output files
mv genome.nextpolish.fasta q_rubra_SM1_v1.1.0.fasta
mv genome.nextpolish.fasta.stat q_rubra_SM1_v1.1.0.fasta.stat
Final assembly file - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.1.0/final_files/q_rubra_SM1_v1.1.0.fasta
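The run.cfg created for this step is not shown above. A minimal sketch for polishing with long reads only, modeled on the NextPolish template, could look like this. The job counts, `rerun`, `workdir`, and the `lgs_options` thresholds are assumptions, not the values used for SM1.

```shell
# Hypothetical run.cfg for NextPolish (long reads only); values are placeholders.
cat > run.cfg <<'EOF'
[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 6
multithread_jobs = 5
genome = ./q_rubra_SM1_v1.0.0.fasta
genome_size = auto
workdir = ./01_rundir
polish_options = -p {multithread_jobs}

[lgs_option]
lgs_fofn = ./lgs.fofn
lgs_options = -min_read_len 1k -max_depth 100
lgs_minimap2_options = -x map-ont
EOF
```

The `genome` and `lgs_fofn` entries point at the soft-linked assembly and the lgs.fofn file created in the commands above.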
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.2.0
Purge dups is used to remove any overlaps and haplotypic duplication present in the assembly.
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.2.0/analyses/1_purge_dups
Commands used -
# activate minimap2
conda activate minimap2
minimap2 -x map-ont -t 10 q_rubra_SM1_v1.1.0.fasta SM1_adapter_free.fastq.gz | gzip -c - > SM1.paf.gz
/pickett_flora/software/purge_dups-1.2.5/bin/pbcstat SM1.paf.gz
/pickett_flora/software/purge_dups-1.2.5/bin/calcuts PB.stat > cutoffs 2>calcults.log
# split the assembly and do a self-self alignment
/pickett_flora/software/purge_dups-1.2.5/bin/split_fa q_rubra_SM1_v1.1.0.fasta > q_rubra_SM1_v1.1.0.fasta.split
minimap2 -xasm5 -DP q_rubra_SM1_v1.1.0.fasta.split q_rubra_SM1_v1.1.0.fasta.split | gzip -c - > q_rubra_asm.split.self.paf.gz
# purge haplotigs and overlaps
/pickett_flora/software/purge_dups-1.2.5/bin/purge_dups -2 -T cutoffs -c PB.base.cov q_rubra_asm.split.self.paf.gz > dups.bed 2> purge_dups.log
# get purged primary and haplotig sequences from draft assembly
/pickett_flora/software/purge_dups-1.2.5/bin/get_seqs -e dups.bed q_rubra_SM1_v1.1.0.fasta
mv purged.fa q_rubra_SM1_v1.2.0.fasta
Final file - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.2.0/final_files/q_rubra_SM1_v1.2.0.fasta
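A quick way to see how much sequence purging removed is to compare total assembly length before and after. The helper below is a small sketch that is not part of the original workflow; the function name `fasta_total_bp` is made up for illustration, and it assumes uncompressed FASTA input.

```shell
# Sum the sequence length (bp) of an uncompressed FASTA file.
# Header lines (starting with ">") are skipped; multi-line records are handled.
fasta_total_bp() {
    awk '!/^>/ { total += length($0) } END { print total + 0 }' "$1"
}

# Hypothetical usage on the files from this step:
# fasta_total_bp q_rubra_SM1_v1.1.0.fasta
# fasta_total_bp q_rubra_SM1_v1.2.0.fasta
```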
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.3.0
RagTag scaffolds the contig-level assembly using the reference genome.
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.3.0/analyses/1_ragtag
Commands used -
conda activate ragtag
nohup ragtag.py scaffold -u -t 10 Qrubra_687_v2.0.fa q_rubra_SM1_v1.2.0.fasta &
# proper name (RagTag writes its results to ragtag_output/ unless -o is given)
mv ragtag_output/ragtag.scaffold.fasta q_rubra_SM1_v1.3.0.fasta
Final file - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.3.0/final_files/q_rubra_SM1_v1.3.0.fasta
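The gap-closing step in version 1.4.0 takes a line-wrapped copy of this assembly (q_rubra_SM1_v1_3_0_multiline.fasta), and the command that produced it is not recorded here. One possible way to create such a file, assuming a width of 100 bases per line, is the awk sketch below. The function name `wrap_fasta` is hypothetical. `seqtk seq -l 100`, which is already used elsewhere in this workflow, should give an equivalent result.

```shell
# Re-wrap every sequence in a FASTA file to a fixed line width.
# $1 = input FASTA, $2 = bases per line (defaults to 100).
wrap_fasta() {
    awk -v width="${2:-100}" '
        # Print the buffered sequence in chunks of "width", then clear it.
        function flush_seq(    i, n) {
            n = length(seq)
            for (i = 1; i <= n; i += width) print substr(seq, i, width)
            seq = ""
        }
        /^>/ { flush_seq(); print; next }   # header: emit previous record first
        { seq = seq $0 }                    # sequence line: append to buffer
        END { flush_seq() }
    ' "$1"
}

# Hypothetical usage:
# wrap_fasta q_rubra_SM1_v1.3.0.fasta 100 > q_rubra_SM1_v1_3_0_multiline.fasta
```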
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.4.0
TGS-GapCloser fills the gaps (runs of N's) in the assembly using the long reads.
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.4.0/analyses/1_tgsgapcloser
Commands used -
# convert fastq gz to fasta
/sphinx_local/software/seqtk/seqtk seq -L 1000 -A -l 100 SM1_adapter_free.fastq.gz > SM1_adapter_free.fasta
conda activate tgsgapcloser
nohup tgsgapcloser \
--scaff q_rubra_SM1_v1_3_0_multiline.fasta \
--reads SM1_adapter_free.fasta \
--output q_rubra_SM1_v1_4_0 \
--ne \
--thread 30 &
# provide better name
mv q_rubra_SM1_v1_4_0.scaff_seqs q_rubra_SM1_v1.4.0.fasta
Final file - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.4.0/final_files/q_rubra_SM1_v1.4.0.fasta
This is our final chromosome level assembly for SM1. Here are the stats -
Main genome scaffold total: 182
Main genome contig total: 682
Main genome scaffold sequence total: 735.937 MB
Main genome contig sequence total: 735.887 MB 0.007% gap
Main genome scaffold N/L50: 5/60.207 MB
Main genome contig N/L50: 83/2.686 MB
Main genome scaffold N/L90: 11/45.325 MB
Main genome contig N/L90: 297/585.184 KB
Max scaffold length: 94.075 MB
Max contig length: 12.009 MB
Number of scaffolds > 50 KB: 131
% main genome in scaffolds > 50 KB: 99.79%
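In the stats above, a line such as "N/L50: 5/60.207 MB" reads as count/length: 5 scaffolds of at least 60.207 MB cover half of the assembly. The sketch below shows one way to recompute that pair from a FASTA file with standard tools. It is not the program that produced the table, and the function name `fasta_n50` is made up for illustration.

```shell
# Print "<count> <length>": the number of sequences needed to reach 50% of the
# total assembly length, and the length of the shortest sequence in that set.
fasta_n50() {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" \
    | sort -rn \
    | awk '{ lens[NR] = $1; total += $1 }
           END {
               half = total / 2
               for (i = 1; i <= NR; i++) {
                   running += lens[i]
                   if (running >= half) { print i, lens[i]; exit }
               }
           }'
}

# Hypothetical usage:
# fasta_n50 q_rubra_SM1_v1.4.0.fasta
```

Running this on scaffolds gives the scaffold values. The contig values would need the scaffolds split at gaps first.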
BUSCO score -
# BUSCO version is: 5.0.0
# The lineage dataset is: embryophyta_odb10 (Creation date: 2020-09-10, number of species: 50, number of BUSCOs: 1614)
# Summarized benchmarking in BUSCO notation for file /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.4.0/analyses/2_busco/q_rubra_SM1_v1.4.0.fasta
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk
***** Results: *****
C:97.6%[S:93.8%,D:3.8%],F:1.6%,M:0.8%,n:1614
1575 Complete BUSCOs (C)
1514 Complete and single-copy BUSCOs (S)
61 Complete and duplicated BUSCOs (D)
26 Fragmented BUSCOs (F)
13 Missing BUSCOs (M)
1614 Total BUSCO groups searched
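The percentages on the summary line are each category count divided by the 1614 groups searched. As a small check, they can be recomputed from the counts listed above. The helper name `busco_pct` is made up for illustration.

```shell
# Percentage of BUSCO groups in one category, rounded to one decimal place.
# $1 = count in the category, $2 = total BUSCO groups searched.
busco_pct() {
    awk -v n="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * n / total }'
}

busco_pct 1575 1614   # complete            -> 97.6
busco_pct 1514 1614   # single-copy         -> 93.8
busco_pct 61 1614     # duplicated          -> 3.8
busco_pct 26 1614     # fragmented          -> 1.6
busco_pct 13 1614     # missing             -> 0.8
```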
Base directory - /pickett_flora/projects/quercus_rubra/SM2_genome
Chromosome level assembly - /pickett_flora/projects/quercus_rubra/SM2_genome/version_1/1.4.0/final_files/q_rubra_SM2_v1.4.0.fasta
Base directory - /pickett_flora/projects/quercus_rubra/SM1316_genome
Chromosome level assembly - /pickett_flora/projects/quercus_rubra/SM1316_genome/version_1/1.4.0/final_files/q_rubra_SM1316_v1.4.0.fasta
Base directory - /pickett_flora/projects/quercus_rubra/SM1370_genome
Chromosome level assembly - /pickett_flora/projects/quercus_rubra/SM1370_genome/version_1/1.4.0/final_files/q_rubra_SM1370_v1.4.0.fasta
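The four final assemblies follow one path pattern, so a short loop can confirm that each file is present and non-empty. This is a convenience sketch rather than part of the original workflow, and the function name `final_assembly_path` is made up for illustration.

```shell
# Build the path of the v1.4.0 chromosome-level assembly for one individual.
# $1 = individual ID (SM1, SM2, SM1316 or SM1370)
final_assembly_path() {
    printf '/pickett_flora/projects/quercus_rubra/%s_genome/version_1/1.4.0/final_files/q_rubra_%s_v1.4.0.fasta\n' "$1" "$1"
}

# Report whether each assembly exists and is non-empty.
for id in SM1 SM2 SM1316 SM1370; do
    path=$(final_assembly_path "$id")
    if [ -s "$path" ]; then
        echo "found    $path"
    else
        echo "missing  $path"
    fi
done
```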