This repository attempts to provide a stepwise tutorial on the Next Generation Sequence (NGS) analysis.
In biological contexts, sequencing refers to determining the order of nucleotides (A, T, G, C) in a DNA sample. The term next generation sequencing (NGS) refers to an advanced and high-throughput methodnology of determining nucleotides order.
What about sequencing of nucleotides from RNA sample?
Similar to the DNA sequencing, RNA sequencing also does exist. While dealing with RNA samples, for most biological interpretations RNA is converted into an intermediate form known as cDNA (complementary DNA). However, methods namely direct RNA sequencing (DRS) involves sequencing of RNA directly.
- Determining genes expression (a.k.a single cell RNA sequencing). Here, the complete mRNA pool of a cell is first converted into cDNAs using reverse transcriptase. Then cDNAs are sequenced to determine their counts. (cDNA counts directly correlates with gene expression levels).
- Identifying nucleotide mutations or variants.
- Generating a reference genome of a newly identified species. Upon encountering a novel species, NGS can be utilized to determine the complete genome sequence.
- In case of diseases, after symptoms based characterization, NGS can be utilized to identify and confirm the disease causing species or a species subtype.
In this section, we will prepare to begin the analysis.
To follow this tutorial, you will need a Linux terminal.
- If you are on mac, you only need to open command prompt.
- If you are on windows, you need to activate Windows subsystem for linux (WSL). To activate WSL, you can follow this youtube tutorial.
💡 If don’t have any prior experience with unix commands, I would suggest you to check out this article on Basic Linux commands. However, for this analysis, you would only need to know the
ls
,grep
,cd
,mkdir
commands.
To follow this tutorial, the best is to clone this repository in your local device.
git clone <>
mkdir ngs_analysis && cd ngs_analysis
This step generates an environment in which tools required to perform NGS analysis can be downloaded. Therefore, in future whenever you would be performing NGS analysis you will need to simply activate the environment and begin analysis
conda create --name NGS_analysis
conda activate NGS_analysis
conda config --add channels bioconda
conda config --add channels conda-forge
conda install -c bioconda bcftools bedtools blast bwa fastqc igv igvtools samtools sra-tools trim-galore vcftools
As sequence analysis is a multi-step procedure, we will be using a smaller dataset for learning purposes. I am using example data featured by Qiagen.
To download this dataset, copy and paste each links in the url.
https://zenodo.org/records/11791175/files/SRR396636.sra_1.fastq
https://zenodo.org/records/11791175/files/SRR396636.sra_2.fastq
https://zenodo.org/records/11791175/files/SRR396637.sra_1.fastq
https://zenodo.org/records/11791175/files/SRR396637.sra_2.fastq
Once the sequencing data has been downloaded, with the listing command the files can be seen in the current directory.
ls
zcat SRR28409626_1.fastq.gz | head -n 8 # To view top 8 lines or top 2 reads
zcat SRR28409626_2.fastq.gz | head -n 8 # To view top 8 lines or top 2 reads
You should see that both files have same reads ID, i.e SRR28409626_1
and SRR28409626_2
zcat SRR28409626_1.fastq.gz | grep @SRR | wc -l
Check yourself whether the other file have the same no. of reads.
During the sequecning process due to many possible reasons, the sequence process could be prone to errors. This would lead to
As the sequencing process is a time consuming process,
mkdir fastqc_results
fastqc SRR28409626_1.fastq.gz SRR28409626_2.fastq.gz -o /fastqc_results
Reading fastqc report: > - Per base sequence quality: > - Per base sequence content > - Per base GC content > - Per base N content > - Sequence Length distribution > - Sequence Duplication levels > - Overrepresented sequences > - Adapter content
While fastqc
generates report for each .fastq files, multiqc
takes fastqc reports as an input and generates one report.
mkdir multiqc_results
multiqc *.fastqc -o multiqc_results
This step could be done by any of these three methods.
Is it necessary? Removal of low quality reads. Reads containing adaptor sequences.
java -jar /trimmomatic-0.39.jar\\
PE -phred33\\
<input_file1_1> <input_file1_2>\\
<output_file1_1> <output_file1_2>\\
ILLUMINACLIP:adapter.fasta TRAILING:3 LEADING:3 MINLEN:36 SLIDINGWINDOW:4:15
cutadapt -a <adapter_seq_forw> -A <adapter_seq_rev> <file1_1.fastq.gz> <file1_2.fastq.gz> <file1_1.trimmed.fastq.gz> <file1_2.trimmed.fastq.gz>
The information on Mycobacterium tuberculosis reference genome is here.
If you simply want to download the reference genome and would liek to move on to the next step of the analysis, use this code.
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_genomic.fna.gz
However, if you want to learn the steps to find a reference genome assembly and download it, here below are the three methods.
Method 1: With no use of command
Method 2: Best practice
-
Go to NCBI homepage
-
Select assembly from the drop-down menu
-
Search for
Mycobacterium tuberculosis
-
Click on
ASM19595v2
assemblyRepeat the 1-4 steps of Method 1. -
Click on 'FTP directory for RefSeq assembly'
-
Right click on
GCF_000195955.2_ASM19595v2_genomic.fna.gz
and copy link. -
Download the refernece sequence data
wget
command.wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_genomic.fna.gz
Method 3: Downloading reference genome from NCBI genomes.
-
Go to NCBI genome collection
-
Select/search
Mycobacterium tuberculosis
-
Select the reference genome assembly
-
Now you have three ways of downloading reference genome data.
Using Download
- Click on download button
- Select
Genome sequences (FASTA)
Using dataset command
-
Installing datasets conda package
conda install -c conda-forge ncbi-datasets-cli
Now
datasets
function is ready to execute commands.-
Click on
datasets
-
Copy the command
datasets download genome accession GCF_000195955.2 --include gff3,rna,cds,protein,genome,seq-report
-
Paste it in command line to download refernce genome and some additonal files.
Using curl command
-
click on
curl
-
Copy the command
-
Download the dataset with
curl
command.curl https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_000195955.2/download?include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GFF&include_annotation_type=RNA_FASTA&include_annotation_type=CDS_FASTA&include_annotation_type=PROT_FASTA&include_annotation_type=SEQUENCE_REPORT&hydrated=FULLY_HYDRATED
This is a one-time step and can be performed using any of these tools.
bwa index <ref_genome>
This command generates:
.amb
.ann
.bwt: Binary file
.pac: Binary file
.sa: Binary file
bowtie2-build <ref_genome> <prefix>
This command generates 6 files
.1.bt2
.2.bt2
.3.bt2
.4.bt2
.rev.1.bt2
.rev.2.bt2
Using #samtools
samtools view -Sb <file.sam> > -o <file1.bam>
-Sb : Input in SAM format (S) and the output will be be BAM format(b)
Using #samtools
samtools sort <file1.bam> -o <file1_sorted.bam>
Sorting reads by read name
rather than chromosomal coordinates
samtools sort -n <file1.bam> -o <file1_namesort.bam>
Appending 'ms' and 'MC' tags
samtools fixmate -m <file1_namesort.bam> <file1_namesort_fixmate.bam>
Re-sort the BAM files by chromosomal coordinates
samtools sort <file1_namesort_fixmate.bam> -o <file1_namesort_fixmate_sort.bam>
Marking the duplicates.
samtools markdup -r <file1_namesort_fixmate_sort.bam> <file1_namesort_fixmate_sort_markdup.bam>
Using #samtools
samtools flagstat <file1_sorted.bam> -o <file1_sorted.txt>
Using #samtools
samtools view <file1_sorted.sam/bam>
Using #samtools
samtools index <file1_sorted.bam>
Using #samtools
samtools tview <file1_sorted.bam> <ref_genome>
qualimap bamqc -outdir <dir> \\
-bam <file1.bam> \\
-gff <file.gff>
- Estimate of variant frequency
- Measure of confidence
Using #freebayes
freebayes -f <ref_genome> <file1.bam> > <file1.vcf>
Using #bcftools Step1: Generating pileup file Generates genotype likelihoods at each genomic position with coverage.
bcftools mpileup -O b -f <ref_genome> <file1_sorted.bam> -o <file1_raw.bcf
Step2: Detecting SNVs
bcftools call -v <file1_raw.bcf> -o <file1_variants.vcf>
-v: outputs variant sites only.
In a single step
bcftools mpileup -Ou -f <ref_genome> <file1.bam> | bcftools call -mv -Ob -o <file1.bcf>
-O: output type b: compressed BCf, u: uncompressed BCf z: compressed VCF, v: uncompressed VCF
Determining the number of variant sites
grep -v -c "^#" <file1.vcf>
bcftools view <file1.vcf>
Useful commands *Extracting information from VCF file (can be saved as a #bedfile), ```
bcftools query -f '%COLa\\t%COLb\\t%COLc\\n' <file1.vcf>
Converting BCF file to VCF file
bcftools view <file1.bcf> > <file1.vcf>
Sorting VCF file by chromosome → coordinate
vcf-sort <file1.vcf> > <file1.sorted.vcf>
Converting VCF to BED file
Using #vcf2bed**
vcf2bed < <file1.vcf> > <file1.bed>
identify annotated genes which are not covered by reads across their full length.
bedtools coverage -a <ref_genome.gff> -b <file.bam> -o <file1_gene-coverage.txt>
[[Generating an assembly with unmapped reads]]