Tutorial: Next Generation Sequence Analysis - Stepwise

This repository attempts to provide a stepwise tutorial on the Next Generation Sequence (NGS) analysis.

In biological contexts, sequencing refers to determining the order of nucleotides (A, T, G, C) in a DNA sample. The term next generation sequencing (NGS) refers to an advanced and high-throughput methodnology of determining nucleotides order.

What about sequencing of nucleotides from RNA sample?

Similar to the DNA sequencing, RNA sequencing also does exist. While dealing with RNA samples, for most biological interpretations RNA is converted into an intermediate form known as cDNA (complementary DNA). However, methods namely direct RNA sequencing (DRS) involves sequencing of RNA directly.

The goal of NGS analyses can be

Determining genes expression (a.k.a single cell RNA sequencing). Here, the complete mRNA pool of a cell is first converted into cDNAs using reverse transcriptase. Then cDNAs are sequenced to determine their counts. (cDNA counts directly correlates with gene expression levels).
Identifying nucleotide mutations or variants.
Generating a reference genome of a newly identified species. Upon encountering a novel species, NGS can be utilized to determine the complete genome sequence.
In case of diseases, after symptoms based characterization, NGS can be utilized to identify and confirm the disease causing species or a species subtype.

0. Prerequisite for this tutorial

In this section, we will prepare to begin the analysis.

0.1. Linux architecture and knowledge of unix commands

To follow this tutorial, you will need a Linux terminal.

If you are on mac, you only need to open command prompt.
If you are on windows, you need to activate Windows subsystem for linux (WSL). To activate WSL, you can follow this youtube tutorial.

💡 If don’t have any prior experience with unix commands, I would suggest you to check out this article on Basic Linux commands. However, for this analysis, you would only need to know the ls, grep, cd, mkdir commands.

0.2. Follow along this tutorial

To follow this tutorial, the best is to clone this repository in your local device.

git clone <>

0.2. Creating a working directory to save the analysis progress

mkdir ngs_analysis && cd ngs_analysis

0.3. Creating a conda environment for this whole NGS analysis session and activating it

This step generates an environment in which tools required to perform NGS analysis can be downloaded. Therefore, in future whenever you would be performing NGS analysis you will need to simply activate the environment and begin analysis

conda create --name NGS_analysis
conda activate NGS_analysis

0.4.2. Add channels

conda config --add channels bioconda
conda config --add channels conda-forge

0.4.3. Installing required tools

conda install -c bioconda bcftools bedtools blast bwa fastqc igv igvtools samtools sra-tools trim-galore vcftools

1. Downloading sequencing data

As sequence analysis is a multi-step procedure, we will be using a smaller dataset for learning purposes. I am using example data featured by Qiagen.

To download this dataset, copy and paste each links in the url.

https://zenodo.org/records/11791175/files/SRR396636.sra_1.fastq
https://zenodo.org/records/11791175/files/SRR396636.sra_2.fastq
https://zenodo.org/records/11791175/files/SRR396637.sra_1.fastq
https://zenodo.org/records/11791175/files/SRR396637.sra_2.fastq

2. Information on .fastq files

Once the sequencing data has been downloaded, with the listing command the files can be seen in the current directory.

ls

2.1. Viewing .fastq files content

zcat SRR28409626_1.fastq.gz | head -n 8 # To view top 8 lines or top 2 reads

zcat SRR28409626_2.fastq.gz | head -n 8 # To view top 8 lines or top 2 reads

You should see that both files have same reads ID, i.e SRR28409626_1 and SRR28409626_2

2.2 Determining the number of reads

zcat SRR28409626_1.fastq.gz | grep @SRR | wc -l

Check yourself whether the other file have the same no. of reads.

2.3. fastqc

During the sequecning process due to many possible reasons, the sequence process could be prone to errors. This would lead to

As the sequencing process is a time consuming process,

mkdir fastqc_results
fastqc SRR28409626_1.fastq.gz SRR28409626_2.fastq.gz -o /fastqc_results

Reading fastqc report: > - Per base sequence quality: > - Per base sequence content > - Per base GC content > - Per base N content > - Sequence Length distribution > - Sequence Duplication levels > - Overrepresented sequences > - Adapter content

2.4 multiqc

While fastqc generates report for each .fastq files, multiqc takes fastqc reports as an input and generates one report.

mkdir multiqc_results
multiqc *.fastqc -o multiqc_results

3. Trimming reads

This step could be done by any of these three methods.

Is it necessary? Removal of low quality reads. Reads containing adaptor sequences.

Using Trimmomatic

Important resource

java -jar /trimmomatic-0.39.jar\\
 PE -phred33\\
 <input_file1_1> <input_file1_2>\\
 <output_file1_1> <output_file1_2>\\
 ILLUMINACLIP:adapter.fasta TRAILING:3 LEADING:3 MINLEN:36 SLIDINGWINDOW:4:15

Using Cutadapt

cutadapt -a <adapter_seq_forw> -A <adapter_seq_rev> <file1_1.fastq.gz> <file1_2.fastq.gz> <file1_1.trimmed.fastq.gz> <file1_2.trimmed.fastq.gz>

Using Trim Galore

trim_galore --paired --phred33 --gzip <file1_1.fastq.gz> <file1_2.fastq.gz>

4. Alignment of reads with the reference genome

4.1. Downloading reference genome

The information on Mycobacterium tuberculosis reference genome is here.

If you simply want to download the reference genome and would liek to move on to the next step of the analysis, use this code.

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_genomic.fna.gz

However, if you want to learn the steps to find a reference genome assembly and download it, here below are the three methods.

Method 1: With no use of command

Go to NCBI homepage
Select assembly from the drop-down menu
Search for Mycobacterium tuberculosis
Click on ASM19595v2 assembly
Click on Download assembly
Select Refseq and Genomic Fasta (.fna)

Method 2: Best practice

Go to NCBI homepage
Select assembly from the drop-down menu
Search for Mycobacterium tuberculosis
Click on ASM19595v2 assemblyRepeat the 1-4 steps of Method 1.
Click on 'FTP directory for RefSeq assembly'
Right click on GCF_000195955.2_ASM19595v2_genomic.fna.gz and copy link.

Download the refernece sequence data wget command.

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_genomic.fna.gz

Method 3: Downloading reference genome from NCBI genomes.

Go to NCBI genome collection
Select/search Mycobacterium tuberculosis
Select the reference genome assembly

Now you have three ways of downloading reference genome data.

Using Download

Click on download button
Select Genome sequences (FASTA)

Using dataset command

Installing datasets conda package

conda install -c conda-forge ncbi-datasets-cli

Now datasets function is ready to execute commands.

Click on datasets

Copy the command

datasets download genome accession GCF_000195955.2 --include gff3,rna,cds,protein,genome,seq-report

Paste it in command line to download refernce genome and some additonal files.

Using curl command

click on curl
Copy the command

Download the dataset with curl command.

curl https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_000195955.2/download?include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GFF&include_annotation_type=RNA_FASTA&include_annotation_type=CDS_FASTA&include_annotation_type=PROT_FASTA&include_annotation_type=SEQUENCE_REPORT&hydrated=FULLY_HYDRATED

4.2. Indexing reference genome: one-time step

This is a one-time step and can be performed using any of these tools.

Using samtools

samtools faidx <ref_genome>

Using bwa

bwa index <ref_genome>

This command generates:

.amb

.ann

.bwt: Binary file

.pac: Binary file

.sa: Binary file

Using bowtie2

bowtie2-build <ref_genome> <prefix>

This command generates 6 files

.1.bt2

.2.bt2

.3.bt2

.4.bt2

.rev.1.bt2

.rev.2.bt2

4.3. Performing alignment generating SAM file

Using samtools

samtools

Using bwa

bwa mem <ref_genome> <file1_1.fq> <file1_2.fq> > <read1.sam>

Using bowtie2

bowtie2 -x <prefix> -1 <file1_1.fq> -2 <file1_2.fq> -S <read1.sam>

4.4 Converting SAM into BAM file

Using #samtools

samtools view -Sb <file.sam> > -o <file1.bam>

-Sb : Input in SAM format (S) and the output will be be BAM format(b)

4.5 Sorting BAM file

Using #samtools

samtools sort <file1.bam> -o <file1_sorted.bam>

Samtools fixmate & Samtools markdup

Sorting reads by read name rather than chromosomal coordinates

samtools sort -n <file1.bam> -o <file1_namesort.bam>

Appending 'ms' and 'MC' tags

samtools fixmate -m <file1_namesort.bam> <file1_namesort_fixmate.bam>

Re-sort the BAM files by chromosomal coordinates

samtools sort <file1_namesort_fixmate.bam> -o <file1_namesort_fixmate_sort.bam>

Marking the duplicates.

samtools markdup -r <file1_namesort_fixmate_sort.bam> <file1_namesort_fixmate_sort_markdup.bam>

4.6 BAM file statistics

Using #samtools

samtools flagstat <file1_sorted.bam> -o <file1_sorted.txt>

4.7 Viewing SAM or BAM file

Using #samtools

samtools view <file1_sorted.sam/bam>

4.8 Creating a BAM file index to view in IGV

Using #samtools

samtools index <file1_sorted.bam>

4.9 Viewing whole alignment

Using #samtools

samtools tview <file1_sorted.bam> <ref_genome>

Qualimap

qualimap bamqc -outdir <dir> \\
-bam <file1.bam> \\
-gff <file.gff>

5. Generating VCF file for variant calling

Estimate of variant frequency
Measure of confidence

Using #freebayes

freebayes -f <ref_genome> <file1.bam> > <file1.vcf>

Using #bcftools Step1: Generating pileup file Generates genotype likelihoods at each genomic position with coverage.

bcftools mpileup -O b -f <ref_genome> <file1_sorted.bam> -o <file1_raw.bcf

Step2: Detecting SNVs

bcftools call -v <file1_raw.bcf> -o <file1_variants.vcf>

-v: outputs variant sites only.

In a single step

bcftools mpileup -Ou -f <ref_genome> <file1.bam> | bcftools call -mv -Ob -o <file1.bcf>

-O: output type b: compressed BCf, u: uncompressed BCf z: compressed VCF, v: uncompressed VCF

Determining the number of variant sites

grep -v -c "^#" <file1.vcf>

6. Viewing VCF file

bcftools view <file1.vcf>

6.1 Generating an index file for VCF to view in IGV

Useful commands *Extracting information from VCF file (can be saved as a #bedfile), ```

bcftools query -f '%COLa\\t%COLb\\t%COLc\\n' <file1.vcf>

Converting BCF file to VCF file

bcftools view <file1.bcf> > <file1.vcf>

Sorting VCF file by chromosome → coordinate

vcf-sort <file1.vcf> > <file1.sorted.vcf>

Converting VCF to BED file

Using #vcf2bed**

vcf2bed < <file1.vcf> > <file1.bed>

identify annotated genes which are not covered by reads across their full length.

bedtools coverage -a <ref_genome.gff> -b <file.bam> -o <file1_gene-coverage.txt>

[[Generating an assembly with unmapped reads]]

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
GCF_000195955.2_ASM19595v2_genomic.fna		GCF_000195955.2_ASM19595v2_genomic.fna
README.md		README.md
README2.md		README2.md
Reference_genome		Reference_genome

SlowSD/Tutorial-Next-Generation-Sequence-Analysis

Folders and files

Latest commit

History

Repository files navigation

Tutorial: Next Generation Sequence Analysis - Stepwise

The goal of NGS analyses can be

0. Prerequisite for this tutorial

0.1. Linux architecture and knowledge of unix commands

0.2. Follow along this tutorial

0.2. Creating a working directory to save the analysis progress

0.3. Creating a conda environment for this whole NGS analysis session and activating it

0.4.2. Add channels

0.4.3. Installing required tools

1. Downloading sequencing data

2. Information on .fastq files

2.1. Viewing .fastq files content

2.2 Determining the number of reads

2.3. fastqc

2.4 multiqc

3. Trimming reads

Using Trimmomatic

Using Cutadapt

Using Trim Galore

4. Alignment of reads with the reference genome

4.1. Downloading reference genome

4.2. Indexing reference genome: one-time step

Using samtools

Using bwa

Using bowtie2

4.3. Performing alignment generating SAM file

Using samtools

Using bwa

Using bowtie2

4.4 Converting SAM into BAM file

4.5 Sorting BAM file

Samtools fixmate & Samtools markdup

4.6 BAM file statistics

4.7 Viewing SAM or BAM file

4.8 Creating a BAM file index to view in IGV

4.9 Viewing whole alignment

Qualimap

5. Generating VCF file for variant calling

6. Viewing VCF file

6.1 Generating an index file for VCF to view in IGV

Some important concepts

Sequencing depth

About

Topics

Resources

Stars

Watchers

Forks

Releases