Complete Genomic Characterization of Global Pathogens, Respiratory Syncytial Virus (RSV), and Human Norovirus (HuNoV) Using Probe-based Capture Enrichment.
To evaluate the capability of the capture methodology to assemble full-length genomes, trimmed and host-filtered reads were processed through the VirMAP pipeline (https://github.com/cmmr/virmap) (24) to reconstruct complete RSV or HuNoV genomes. The VirMAP summary statistics include the reconstructed genome length, the number of reads mapped to the reconstruction, and the average coverage across the genome.
Viral read recovery efficiency. Percent of trimmed, non-human sequence reads (post-processing) that mapped to the target viral genome in pre-capture (circles) and post-capture (triangles) libraries. CT value range of samples: ‘CT < 20’ (red), ‘CT 20 to 30’ (light blue), ‘CT > 30’ (green), and ‘ND’ (not detected) (pink). A: Viral reads mapping to RSV genomes, split by subtype (RSV-A, RSV-B). B: Viral reads mapping to HuNoV genomes, split by genotype (GI.1, GII.4, Other GII).
Average genome coverage obtained in pre-capture (circles) and post-capture (triangles) samples. Genome reconstruction was classified as follows: ‘complete’ (within expected length range, >90% completeness, and >20x coverage), ‘complete with low coverage’ (within expected length range, >90% completeness, and <20x coverage), or ‘incomplete’ (below expected length range, <90% completeness, and <20x coverage). CT value range of samples: ‘CT < 20’ (red), ‘CT 20 to 30’ (light blue), ‘CT > 30’ (green), and ‘ND’ (not detected) (pink). A: RSV samples split by subtype (RSV-A or RSV-B). B: HuNoV samples split by genotype (GI.1, GII.4, Other GII).
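The three reconstruction classes in the caption above can be written as a small decision rule. The following is only a sketch: the function name classify_genome and its three arguments are hypothetical, and it assumes that any reconstruction not meeting the 'complete' or 'complete with low coverage' criteria is reported as 'incomplete'. The caption does not say how a coverage of exactly 20x is handled; here it falls into 'complete with low coverage'.

```shell
# Hypothetical helper (not part of VirMAP): classify a genome reconstruction.
#   $1: 1 if the reconstructed length is within the expected range, else 0
#   $2: completeness, as a percentage of the expected genome (0-100)
#   $3: average depth of coverage across the genome
classify_genome() {
    awk -v in_range="$1" -v completeness="$2" -v coverage="$3" 'BEGIN {
        if (in_range == 1 && completeness > 90 && coverage > 20)
            print "complete"
        else if (in_range == 1 && completeness > 90)
            print "complete with low coverage"
        else
            # Assumption: everything else is treated as incomplete.
            print "incomplete"
    }'
}

# Example call with made-up values: in range, 98.5% complete, 12x average coverage.
classify_genome 1 98.5 12
```

The completeness and coverage values would come from the VirMAP summary statistics; how those are parsed is left out here because it depends on the output format.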
To calculate the breadth of coverage, we first aligned the reads to a reference genome (see below) and then used samtools depth to calculate the coverage at each base across the genome.
For the alignments, we used bwa mem with a different reference genome depending on the virus. For RSV, we used the RSV/A and RSV/B reference genomes that were recently published by our group (see the link at the end of this section). For norovirus, we used the genome assembled from each sample (after probe capture) as the reference. To ensure quality, we applied a minimum quality threshold of Phred 20 (-q 20) when calculating the coverage. Note that in samtools depth, -q sets the minimum base quality; the minimum mapping quality is set with -Q.
Here’s the code we used for the alignment and coverage calculation:
# Align each sample. The samtools commands convert the output to BAM and immediately sort it into the final sorted file.
bwa mem -t 4 -T 0 reference read1 read2 | samtools view -hb - | samtools sort -o "$outputdir/${name}.sorted.bam" -
# Calculate the breadth of coverage at 20x and 30x (number of positions with depth >= 20 and >= 30).
# "count+0" makes awk print 0 instead of an empty string when no position passes the threshold.
cov20=$(samtools depth -q 20 "$outputdir/${name}.sorted.bam" | awk '$3 >= 20 {count++} END {print count+0}')
cov30=$(samtools depth -q 20 "$outputdir/${name}.sorted.bam" | awk '$3 >= 30 {count++} END {print count+0}')
Where:
reference: the reference genome;
read1: the FASTQ file containing read 1;
read2: the FASTQ file containing read 2;
outputdir: the output directory;
name: the sample name.
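The commands above give the number of positions covered at a given depth. To express this as a percentage of the genome, the count can be divided by the reference length. The following is a minimal sketch, not the code used for the paper: the function name breadth_of_coverage and the variable genome_length are hypothetical, and the function assumes three-column, tab-separated input in the format produced by samtools depth (sequence name, position, depth) for a single-sequence reference.

```shell
# Hypothetical helper: percentage of the genome covered at >= MIN_DEPTH.
# Usage: <depth lines on stdin> | breadth_of_coverage MIN_DEPTH GENOME_LENGTH
breadth_of_coverage() {
    awk -v min="$1" -v len="$2" '
        $3 >= min { count++ }          # count positions meeting the depth threshold
        END {
            if (len <= 0) { print "NA"; exit }
            printf "%.2f\n", 100 * count / len
        }'
}

# Example with the same depth command as above (genome_length is an assumed
# variable holding the length of the reference sequence in bases):
# samtools depth -q 20 "$outputdir/${name}.sorted.bam" | breadth_of_coverage 20 "$genome_length"
```

Dividing by the reference length rather than by the number of lines in the depth output matters because samtools depth omits positions with zero coverage unless it is run with -a.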
Information about the RSV reference genomes is available at: https://doi.org/10.1093/ve/vead086