Red oak parental genome versioning

Beant Kapoor edited this page Oct 30, 2024 · 1 revision

This wiki contains the code and documentation for the versioned genome assemblies of the parents and grandparents of the red oak reference genome provided by JGI. Background on the four individuals -

  • SM1 - Female grandparent
  • SM2 - Male grandparent
  • SM1316 - Female parent
  • SM1370 - Male parent

SM1 was crossed with SM2 to produce the F1 individuals SM1316 and SM1370. These F1s were then crossed with each other to produce a pseudo-F2, which is our reference genome. The crosses were made to reduce heterozygosity in the reference genome. All four individuals were sequenced with Oxford Nanopore reads and assembled into chromosome-scale assemblies with the help of the reference genome. The code and documentation below are for SM1; the same workflow was applied to the other three individuals.

Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome

Version 1.0.0

Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1

1. NextDenovo

NextDenovo is a string graph-based de novo assembler for long reads (CLR, HiFi and ONT).
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.0.0/analyses/1_nextdenovo
Commands used -

# write the fastq file path to a file of file names (fofn)
ls SM1_adapter_free.fastq.gz > input.fofn

# copy the configuration file and adjust parameters
cp /home/bkapoor/NextDenovo/doc/run.cfg .

# activate paralleltask
conda activate paralleltask

# run nextdenovo
nohup /home/bkapoor/NextDenovo/nextDenovo run.cfg &

# give better name
mv nd.asm.fasta q_rubra_SM1_v1.0.0.fasta

Final assembly file - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.0.0/final_files/q_rubra_SM1_v1.0.0.fasta
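
The parameter adjustments made to run.cfg are not recorded above. As a rough sketch of what an adjusted config for ONT reads might look like, the hypothetical helper below writes one. The option names follow the template shipped with NextDenovo in doc/run.cfg, but every value (genome_size, thread counts and workdir in particular) is an assumption for illustration, not the setting used for SM1.

```shell
# Hypothetical helper: write an example NextDenovo config to the given path.
# All values are illustrative assumptions, not the settings used for SM1.
write_nextdenovo_cfg() {
    cat > "$1" <<'EOF'
[General]
job_type = local
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 10
input_type = raw
read_type = ont
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 750m
sort_options = -m 20g -t 10
minimap2_options_raw = -t 8
pa_correction = 3
correction_options = -p 10

[assemble_option]
minimap2_options_cns = -t 8
nextgraph_options = -a 1
EOF
}
```

Calling `write_nextdenovo_cfg run.cfg` in the analysis directory would produce a file that can then be edited by hand before running nextDenovo.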

Version 1.1.0

Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.1.0

2. NextPolish

NextPolish is used to fix base errors (SNV/Indel) in the genome generated by noisy long reads.
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.1.0/analyses/1_nextpolish
Commands used -

# soft link genome file
ln -s ../../input_files/q_rubra_SM1_v1.0.0.fasta .

# soft link long read file
ln -s ../../../../../analyses_parents/2_adapter_removal/1_SM1/SM1_adapter_free.fastq.gz .

ls SM1_adapter_free.fastq.gz > lgs.fofn

# create run.cfg

# activate paralleltask
conda activate paralleltask

# run nextpolish
/home/bkapoor/NextPolish/nextPolish run.cfg

# give proper names to output files
mv genome.nextpolish.fasta q_rubra_SM1_v1.1.0.fasta
mv genome.nextpolish.fasta.stat q_rubra_SM1_v1.1.0.fasta.stat

Final assembly file - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.1.0/final_files/q_rubra_SM1_v1.1.0.fasta
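
The contents of the NextPolish run.cfg are not shown above. The sketch below writes a long-read-only config of the kind described in the NextPolish documentation. The genome and lgs.fofn names come from the commands above; the remaining values (task, rerun, job counts, workdir, read filters) are assumptions for illustration, not the settings used for SM1.

```shell
# Hypothetical helper: write an example long-read-only NextPolish config.
# Values other than the genome and fofn names are illustrative assumptions.
write_nextpolish_cfg() {
    cat > "$1" <<'EOF'
[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 6
multithread_jobs = 5
genome = ./q_rubra_SM1_v1.0.0.fasta
genome_size = auto
workdir = ./01_rundir
polish_options = -p {multithread_jobs}

[lgs_option]
lgs_fofn = ./lgs.fofn
lgs_options = -min_read_len 1k -max_depth 100
lgs_minimap2_options = -x map-ont
EOF
}
```

Calling `write_nextpolish_cfg run.cfg` would create the file in the current directory, ready to be adjusted before running nextPolish.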

Version 1.2.0

Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.2.0

3. Purge dups

Purge dups is used to remove any overlaps and haplotypic duplication present in the assembly.
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.2.0/analyses/1_purge_dups
Commands used -

# activate minimap2
conda activate minimap2
minimap2 -x map-ont -t 10 q_rubra_SM1_v1.1.0.fasta SM1_adapter_free.fastq.gz | gzip -c - > SM1.paf.gz

/pickett_flora/software/purge_dups-1.2.5/bin/pbcstat SM1.paf.gz
/pickett_flora/software/purge_dups-1.2.5/bin/calcuts PB.stat > cutoffs 2>calcults.log

# split the assembly and do that split-split alignment
/pickett_flora/software/purge_dups-1.2.5/bin/split_fa q_rubra_SM1_v1.1.0.fasta > q_rubra_SM1_v1.1.0.fasta.split
minimap2 -xasm5 -DP q_rubra_SM1_v1.1.0.fasta.split q_rubra_SM1_v1.1.0.fasta.split | gzip -c - > q_rubra_asm.split.self.paf.gz

# purge haplotigs and overlaps
/pickett_flora/software/purge_dups-1.2.5/bin/purge_dups -2 -T cutoffs -c PB.base.cov q_rubra_asm.split.self.paf.gz > dups.bed 2> purge_dups.log

# get purged primary and haplotig sequences from draft assembly
/pickett_flora/software/purge_dups-1.2.5/bin/get_seqs -e dups.bed q_rubra_SM1_v1.1.0.fasta
mv purged.fa q_rubra_SM1_v1.2.0.fasta

Final file - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.2.0/final_files/q_rubra_SM1_v1.2.0.fasta
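
To see how much sequence was flagged before accepting the purged assembly, dups.bed can be summarised per duplication class. This is a minimal sketch, assuming dups.bed has whitespace-separated columns of contig, start, end, class (for example HAPLOTIG or OVLP) and best-matching contig; `summarize_dups` is a hypothetical helper name, not part of purge_dups.

```shell
# Sum the flagged bases per duplication class in a purge_dups BED file.
# Assumed columns: contig, start, end, class, best-matching contig.
summarize_dups() {
    awk '{ bp[$4] += $3 - $2 }
         END { for (cls in bp) print cls, bp[cls] }' "$1" | sort
}
```

Running `summarize_dups dups.bed` prints one line per class with the total number of bases assigned to it.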

Version 1.3.0

Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.3.0

4. RagTag

RagTag scaffolds the contig-level assembly using the reference genome.
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.3.0/analyses/1_ragtag
Commands used -

conda activate ragtag
nohup ragtag.py scaffold -u -t 10 Qrubra_687_v2.0.fa q_rubra_SM1_v1.2.0.fasta &

# proper name (RagTag writes its results to ragtag_output/ when no -o is given)
mv ragtag_output/ragtag.scaffold.fasta q_rubra_SM1_v1.3.0.fasta

Final file - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.3.0/final_files/q_rubra_SM1_v1.3.0.fasta
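
RagTag also writes an AGP file describing how contigs were placed into scaffolds. As a quick check of the scaffolding result, the sketch below counts the contigs joined into each scaffold. It assumes a standard tab-separated AGP file in which component type W marks a contig; `contigs_per_scaffold` is a hypothetical helper name.

```shell
# Count contig components (type "W") per scaffold in an AGP file.
# Comment lines starting with "#" are skipped; gap lines are ignored.
contigs_per_scaffold() {
    awk -F '\t' '!/^#/ && $5 == "W" { n[$1]++ }
                 END { for (s in n) print s, n[s] }' "$1" | sort
}
```

For example, `contigs_per_scaffold ragtag_output/ragtag.scaffold.agp` would list each scaffold name followed by the number of contigs it contains.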

Version 1.4.0

Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.4.0

5. Tgsgapcloser

TGS-GapCloser fills the gaps (runs of N's) in the assembly using the long reads.
Base directory - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.4.0/analyses/1_tgsgapcloser
Commands used -

# convert fastq.gz to fasta, dropping reads shorter than 1 kb (-L 1000) and wrapping lines at 100 bases (-l 100)
/sphinx_local/software/seqtk/seqtk seq -L 1000 -A -l 100 SM1_adapter_free.fastq.gz > SM1_adapter_free.fasta

conda activate tgsgapcloser

nohup tgsgapcloser \
	--scaff q_rubra_SM1_v1_3_0_multiline.fasta \
	--reads SM1_adapter_free.fasta \
	--output q_rubra_SM1_v1_4_0 \
	--ne \
	--thread 30 &

# provide better name
mv q_rubra_SM1_v1_4_0.scaff_seqs q_rubra_SM1_v1.4.0.fasta

Final file - /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.4.0/final_files/q_rubra_SM1_v1.4.0.fasta
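
One way to check what gap closing achieved is to count the gaps before and after this step. The sketch below counts runs of N in a FASTA file, treating a run that continues across a line break as a single gap. `count_gaps` is a hypothetical helper, not part of TGS-GapCloser.

```shell
# Count gaps (runs of N or n) in a FASTA file.
# A run that continues across a line break is counted once.
count_gaps() {
    awk '
        /^>/ { prev_n = 0; next }          # new sequence: reset the carry-over flag
        {
            line = toupper($0)
            if (line == "") next
            starts_n = (substr(line, 1, 1) == "N")
            ends_n = (substr(line, length(line), 1) == "N")
            n = gsub(/N+/, "", line)       # number of N runs on this line
            if (n > 0 && prev_n && starts_n) n--   # first run continues the previous line
            gaps += n
            prev_n = ends_n
        }
        END { print gaps + 0 }
    ' "$1"
}
```

Comparing `count_gaps q_rubra_SM1_v1.3.0.fasta` with `count_gaps q_rubra_SM1_v1.4.0.fasta` would show how many gaps were closed.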

This is our final chromosome level assembly for SM1. Here are the stats -

Main genome scaffold total:         	182
Main genome contig total:           	682
Main genome scaffold sequence total:	735.937 MB
Main genome contig sequence total:  	735.887 MB  	0.007% gap
Main genome scaffold N/L50:         	5/60.207 MB
Main genome contig N/L50:           	83/2.686 MB
Main genome scaffold N/L90:         	11/45.325 MB
Main genome contig N/L90:           	297/585.184 KB
Max scaffold length:                	94.075 MB
Max contig length:                  	12.009 MB
Number of scaffolds > 50 KB:        	131
% main genome in scaffolds > 50 KB: 	99.79%
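
In the N/L50 and N/L90 rows above, the first number is a sequence count and the second a length: sequences are sorted from longest to shortest, and the row reports how many are needed to cover 50% (or 90%) of the total and the length of the last one included. The sketch below reproduces that calculation for scaffolds with awk and sort. The function names are hypothetical, and this is not the tool that produced the table.

```shell
# Print the length of every sequence in a (possibly multi-line) FASTA file.
fasta_lengths() {
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Print "<count> <length>": the number of longest sequences needed to reach
# pct percent of the total length, and the length of the last one counted.
nx_stat() {
    fasta_lengths "$1" | sort -rn | awk -v pct="$2" '
        { lens[NR] = $1; total += $1 }
        END {
            target = total * pct / 100
            for (i = 1; i <= NR; i++) {
                cum += lens[i]
                if (cum >= target) { print i, lens[i]; exit }
            }
        }'
}
```

For example, `nx_stat q_rubra_SM1_v1.4.0.fasta 50` and `nx_stat q_rubra_SM1_v1.4.0.fasta 90` give the scaffold-level values, with lengths in bases rather than MB.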

BUSCO score -

# BUSCO version is: 5.0.0 
# The lineage dataset is: embryophyta_odb10 (Creation date: 2020-09-10, number of species: 50, number of BUSCOs: 1614)
# Summarized benchmarking in BUSCO notation for file /pickett_flora/projects/quercus_rubra/SM1_genome/version_1/1.4.0/analyses/2_busco/q_rubra_SM1_v1.4.0.fasta
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk

	***** Results: *****

	C:97.6%[S:93.8%,D:3.8%],F:1.6%,M:0.8%,n:1614	   
	1575	Complete BUSCOs (C)			   
	1514	Complete and single-copy BUSCOs (S)	   
	61	Complete and duplicated BUSCOs (D)	   
	26	Fragmented BUSCOs (F)			   
	13	Missing BUSCOs (M)			   
	1614	Total BUSCO groups searched
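
The percentages in the summary line follow directly from the counts: each category is its count divided by the 1614 BUSCO groups searched, and Complete is the sum of single-copy and duplicated. A small sketch of that arithmetic, using a hypothetical helper named `busco_pct`:

```shell
# Convert a BUSCO count into a percentage of the groups searched,
# rounded to one decimal place as in the BUSCO summary line.
busco_pct() {
    awk -v n="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * n / total }'
}

busco_pct 1575 1614            # complete
busco_pct $((1514 + 61)) 1614  # single-copy + duplicated, same as complete
busco_pct 13 1614              # missing
```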

SM2 genome

Base directory - /pickett_flora/projects/quercus_rubra/SM2_genome
Chromosome level assembly - /pickett_flora/projects/quercus_rubra/SM2_genome/version_1/1.4.0/final_files/q_rubra_SM2_v1.4.0.fasta

SM1316 genome

Base directory - /pickett_flora/projects/quercus_rubra/SM1316_genome
Chromosome level assembly - /pickett_flora/projects/quercus_rubra/SM1316_genome/version_1/1.4.0/final_files/q_rubra_SM1316_v1.4.0.fasta

SM1370 genome

Base directory - /pickett_flora/projects/quercus_rubra/SM1370_genome
Chromosome level assembly - /pickett_flora/projects/quercus_rubra/SM1370_genome/version_1/1.4.0/final_files/q_rubra_SM1370_v1.4.0.fasta
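
Because all four individuals follow the same directory and file naming scheme, the final assembly paths can be generated rather than typed. A minimal sketch, assuming the layout shown on this page holds for every sample and version; `final_assembly` is a hypothetical helper name.

```shell
# Print the path of the final assembly for a sample and version
# (version defaults to 1.4.0, the chromosome-level assembly).
final_assembly() {
    sample=$1
    version=${2:-1.4.0}
    printf '/pickett_flora/projects/quercus_rubra/%s_genome/version_1/%s/final_files/q_rubra_%s_v%s.fasta\n' \
        "$sample" "$version" "$sample" "$version"
}

# List the chromosome-level assembly of every individual.
for sample in SM1 SM2 SM1316 SM1370; do
    final_assembly "$sample"
done
```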