This repository provides a stepwise tutorial on Next-Generation Sequencing (NGS) analysis.
In biological contexts, sequencing refers to determining the order of nucleotides (A, T, G, C) in a DNA sample. The term next-generation sequencing (NGS) refers to an advanced, high-throughput methodology for determining nucleotide order.
What about sequencing nucleotides from an RNA sample?
Similar to DNA sequencing, RNA sequencing also exists. When dealing with RNA samples, for most biological interpretations the RNA is first converted into an intermediate form known as cDNA (complementary DNA). However, methods such as direct RNA sequencing (DRS) sequence the RNA directly.
- Determining gene expression (e.g., via bulk or single-cell RNA sequencing). Here, the complete mRNA pool of a cell is first converted into cDNA using reverse transcriptase. The cDNAs are then sequenced to determine their counts (cDNA counts correlate directly with gene expression levels).
- Identifying nucleotide mutations or variants.
- Generating a reference genome for a newly identified species. Upon encountering a novel species, NGS can be utilized to determine its complete genome sequence.
- In the case of disease, after symptom-based characterization, NGS can be utilized to identify and confirm the disease-causing species or subtype.
In this section, we will prepare to begin the analysis.
To follow this tutorial, you will need a Linux terminal.
- If you are on macOS, you only need to open the Terminal application.
- If you are on Windows, you need to activate the Windows Subsystem for Linux (WSL). To activate WSL, you can follow this YouTube tutorial.
💡 If you don't have any prior experience with Unix commands, I would suggest checking out this article on basic Linux commands. However, for this analysis, you only need to know the `ls`, `grep`, `cd`, and `mkdir` commands.
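If you want a quick warm-up first, here is a minimal sketch using only these four commands (plus `echo` to create a test file; the directory and file names are just examples):

mkdir ngs_practice #Create a practice directory
cd ngs_practice #Move into it
ls #List its (currently empty) contents
echo "ACGT" > demo.txt #Create a small test file
grep "ACGT" demo.txt #Search the file for a pattern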
The best way to follow this tutorial is to clone this repository onto your local machine.
git clone https://github.com/SlowSD/Tutorial-Next-Generation-Sequence-Analysis.git
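After cloning, move into the repository folder (the folder name follows the repository name):

cd Tutorial-Next-Generation-Sequence-Analysis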
This step creates a conda environment into which the tools required to perform the NGS analysis can be installed. In the future, you will only need to activate this conda environment to resume the current analysis or re-run it on different sequencing datasets.
conda deactivate #Deactivate any currently active environment
conda create --name NGS_analysis #Creating a new conda environment and naming it
conda activate NGS_analysis #Activating the conda environment
conda config --add channels bioconda #Adding the bioconda channel (bioinformatics tools)
conda config --add channels conda-forge #Adding the conda-forge channel (general dependencies)
This step installs the required tools into the `NGS_analysis` conda environment.
conda install -c bioconda bcftools bedtools blast bwa fastqc igv igvtools samtools sra-tools trim-galore vcftools trimmomatic bowtie2
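To verify that the installation worked, you can ask a couple of the tools for their version numbers:

fastqc --version #Prints the FastQC version if the tool is on your PATH
samtools --version #Prints the samtools version and build information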
Now, to avoid typing long code chunks, especially long directory paths and file names, we will use variables.
working_dir=$(pwd) #Save the current working directory in a variable
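To confirm the variable holds what you expect, you can expand it with `echo`:

echo $working_dir #Prints the stored path
ls $working_dir #The variable can be used anywhere a path is expected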
As complete sequence analysis is a multi-step and time-consuming procedure, we will use a smaller dataset for learning purposes. In this tutorial, we will use an example dataset featured by Qiagen.
The sequencing data comes from a sample of the species Pseudomonas aeruginosa.
mkdir -p $working_dir/fastq_reads #Creating directory to save fastq_reads
wget -O $working_dir/fastq_reads/SRR396636.sra_1.fastq "https://zenodo.org/records/11791175/files/SRR396636.sra_1.fastq?download=1"
wget -O $working_dir/fastq_reads/SRR396636.sra_2.fastq "https://zenodo.org/records/11791175/files/SRR396636.sra_2.fastq?download=1"
wget -O $working_dir/fastq_reads/SRR396637.sra_1.fastq "https://zenodo.org/records/11791175/files/SRR396637.sra_1.fastq?download=1"
wget -O $working_dir/fastq_reads/SRR396637.sra_2.fastq "https://zenodo.org/records/11791175/files/SRR396637.sra_2.fastq?download=1"
read1_F=$working_dir/fastq_reads/SRR396636.sra_1.fastq
read1_R=$working_dir/fastq_reads/SRR396636.sra_2.fastq
read2_F=$working_dir/fastq_reads/SRR396637.sra_1.fastq
read2_R=$working_dir/fastq_reads/SRR396637.sra_2.fastq
Once the sequencing data has been downloaded, the four `.fastq` files can be seen in the `fastq_reads` directory with the listing command:
ls -1 $working_dir/fastq_reads
There are multiple ways to view the content of a `.fastq` file. Use `head` to view the top 10 lines or `tail` to view the bottom 10 lines of the file.
cat $read1_F | head -n 10 #cat prints the file content to the terminal; head limits it to the first 10 lines
less $read1_F #less opens the file in a scrollable pager without dumping it to the terminal (press q to quit)
In FASTQ files, each read is represented by four lines: a header line starting with `@`, the nucleotide sequence, a separator line (`+`), and the per-base quality string.
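As an illustration, a single (hypothetical) read record looks like this:

@SRR396636.1 1 length=76 #Line 1: header with the read identifier
GATCTGGAAGCAACCTGAGCA... #Line 2: the nucleotide sequence (truncated here)
+ #Line 3: separator
IIIIHHGGFFEEDDCCBBAA... #Line 4: Phred quality scores, one ASCII character per base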
To determine the number of reads in a `.fastq` file, use the command below.
cat $read1_F | grep @SRR | wc -l #For compressed fastq files (such as file.fastq.gz or file.fq.gz), replace cat with zcat
Here,
- Both `cat` and `less` can be used to stream the file.
- The `grep` command works as a search tool and prints the lines containing `@SRR` in the `.fastq` file.
- `wc -l` counts those matching lines, i.e., the number of reads.
🔍 Check yourself whether the mate file (i.e., SRR396636.sra_2.fastq) has the same number of reads.
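If you want to verify it, the same counting approach works on the reverse file (using the variable defined earlier):

cat $read1_R | grep @SRR | wc -l #Should print the same count as for the forward file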
FastQC is a tool developed by Babraham Bioinformatics for assessing the quality of FASTQ reads.
mkdir -p $working_dir/fastq_reads/fastqc_results #Creates a folder to save the fastqc results
fastqc $working_dir/fastq_reads/*.fastq -o $working_dir/fastq_reads/fastqc_results #Here, *.fastq works as a wildcard matching all four files
Reading the FastQC report:
> - Per base sequence quality: distribution of quality scores at each position along the reads.
> - Per base sequence content: proportion of A/T/G/C at each position.
> - Per base GC content: GC percentage at each position.
> - Per base N content: fraction of uncalled bases (N) at each position.
> - Sequence Length distribution: spread of read lengths.
> - Sequence Duplication levels: degree of duplicated reads.
> - Overrepresented sequences: sequences appearing more often than expected (e.g., adapters or contamination).
> - Adapter content: cumulative proportion of adapter sequence along the reads.
While `fastqc` generates a report for each `.fastq` file, `multiqc` compiles the individual FastQC reports into a single report. MultiQC can be given either the individual FastQC reports or the directory containing them; in the latter case, it automatically searches the directory for reports.
mkdir -p $working_dir/fastq_reads/multiqc_results #Creates a folder to save multiqc_results
multiqc $working_dir/fastq_reads/fastqc_results/ -o $working_dir/fastq_reads/multiqc_results
In sequencing, reads are flanked by adapter sequences (at one or both ends). Being complementary to the adapter/oligo sequences attached to the flow cell, they help initiate sequencing-by-synthesis (by acting as primers) for the polymerase enzyme.
However, after sequencing, the adapter sequences should be removed for the following reason:
- In the next step of aligning reads to the reference genome, the quality of alignment may be hampered, as adapter sequences do not belong to the species' genome.
This step of trimming low-quality reads and removing adapters can be done with any of the following three tools.
Using Trimmomatic
mkdir -p trimmomatic_results
trimmomatic PE -phred33 \
$read1_F $read1_R \
trimmomatic_results/SRR396636.sra_1_trim.fastq trimmomatic_results/SRR396636.sra_1_unpaired.fastq \
trimmomatic_results/SRR396636.sra_2_trim.fastq trimmomatic_results/SRR396636.sra_2_unpaired.fastq \
ILLUMINACLIP:adapter.fasta:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Using Cutadapt
mkdir -p cutadapt_results
cutadapt -a <adapter_seq_forw> -A <adapter_seq_rev> \
-o cutadapt_results/SRR396636.sra_1_trim.fastq \
-p cutadapt_results/SRR396636.sra_2_trim.fastq \
$read1_F $read1_R
Use:
-a [adapter] : remove a 3' adapter sequence; -a file:[adapters.fasta] reads adapter sequences from a FASTA file
-g [adapter] : remove a 5' adapter sequence
-u [n] : remove n bases from the beginning of each read; use a negative value to remove bases from the end
-q [n] : trim the 3' end of a read when its quality drops below n
-m [n] and -M [n] : discard reads shorter than m or longer than M
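As a sketch combining several of these options (the adapter sequence shown is the common Illumina adapter prefix; substitute the adapters actually used in your library prep):

cutadapt -q 20 -m 36 \
-a AGATCGGAAGAGC -A AGATCGGAAGAGC \
-o cutadapt_results/SRR396636.sra_1_trim.fastq \
-p cutadapt_results/SRR396636.sra_2_trim.fastq \
$read1_F $read1_R #Quality-trims at Q20, removes adapters, and discards reads shorter than 36 bp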
Using Trim Galore
mkdir -p trim_galore_results
trim_galore --paired --phred33 --gzip \
$read1_F $read1_R -o trim_galore_results/

To also run FastQC on the trimmed reads, create an output directory and pass it to Trim Galore via `--fastqc_args`:

mkdir -p fastqc_after_trim_galore #Creating a directory for FastQC reports on the trimmed reads
trim_galore --paired --phred33 --gzip \
--fastqc_args "--outdir fastqc_after_trim_galore" \
$read1_F $read1_R -o trim_galore_results/
As the name indicates, a reference genome is the complete genome sequence of a species. Such a genome is annotated based on available evidence, and reference genomes are updated regularly as knowledge improves.
For our project, there are multiple ways to download the reference genome. If you already know how to download reference genomes, or are simply not in the mood, use the code below and move on to section 4.2.
mkdir -p $working_dir/reference_genome #Creating a directory to store reference genome
ref_genome=$working_dir/reference_genome/GCF_000006765.1_ASM676v1_genomic.fna.gz #Saving the full path of the reference genome file in a variable named `ref_genome` for ease
wget -O $ref_genome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/765/GCF_000006765.1_ASM676v1/GCF_000006765.1_ASM676v1_genomic.fna.gz #Downloading the reference genome into the directory we created
- Go to the NCBI homepage.
- Select `assembly` from the drop-down menu and search for `Pseudomonas aeruginosa`.
- From the filters, under `RefSeq category`, select `Reference`.
- Click on the `ASM676v1` assembly.
Now you have four methods to download the reference_genome.
Method 1: Using Download
- Click `Download`.
- Among file sources, select `RefSeq only`, and among file types, make sure `Genome sequences (FASTA)` is selected.
- This downloads a zip folder that needs to be moved to the current working directory (for convenience) and uncompressed.
Method 2: Using Datasets
- Install the `datasets` conda package: conda install -c conda-forge ncbi-datasets-cli
- Click on `datasets`.
- Copy the command:
datasets download genome accession GCF_000006765.1 --include genome #Here I have modified the copied command to download only the genome file
- Paste it into the command line to download the reference genome.
Method 3: Using URL
- This method allows you to simply copy the provided URL and paste it into your browser to download all the files.
- You will then need to unzip the folder to be able to use the files.
Method 4: Using FTP (best practice)
- Click on `FTP`.
- Right-click on the file `GCF_000006765.1_ASM676v1_genomic.fna.gz` and copy the link.
- Use the `wget` or `curl` command:

mkdir -p $working_dir/reference_genome #Create a directory to save the reference genome
wget -P $working_dir/reference_genome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/765/GCF_000006765.1_ASM676v1/GCF_000006765.1_ASM676v1_genomic.fna.gz #In place of wget, the curl command can also be used
ref_genome=$working_dir/reference_genome/GCF_000006765.1_ASM676v1_genomic.fna.gz #Save the full path in a variable
Indexing a reference genome is a one-time step and can be performed using any of the tools below. However, keep in mind that the tool used for generating the reference index should also be used for the alignment.
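One caveat, depending on your setup: some indexers (e.g., `hisat2-build`) expect an uncompressed FASTA, and `samtools faidx` requires bgzip rather than plain gzip compression. If needed, decompress first:

gunzip -k $ref_genome #-k keeps the original .gz file
ref_genome=${ref_genome%.gz} #Update the variable to point to the uncompressed file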
Using HISAT2
mkdir -p reference_genome/hisat2_index
hisat2-build $ref_genome reference_genome/hisat2_index/[prefix]
This command generates 8 files:
[prefix].1.ht2 through [prefix].8.ht2
Using samtools
samtools faidx $ref_genome #By default, the .fai index is written alongside the FASTA as $ref_genome.fai
Using bwa
mkdir -p reference_genome/bwa_index
bwa index $ref_genome
This command generates the following 5 files:
.amb
.ann
.bwt: Binary file
.pac: Binary file
.sa: Binary file
Using bowtie2
mkdir -p reference_genome/bowtie2_index
bowtie2-build $ref_genome reference_genome/bowtie2_index/<prefix>
This command generates 6 files
[prefix].1.bt2
[prefix].2.bt2
[prefix].3.bt2
[prefix].4.bt2
[prefix].rev.1.bt2
[prefix].rev.2.bt2
Using bwa
mkdir -p bwa_alignment
bwa mem $ref_genome $read1_F $read1_R > bwa_alignment/SRR396636.sam
Using bowtie2
mkdir -p alignment_results
bowtie2 -x reference_genome/bowtie2_index/<prefix> -1 $read1_F -2 $read1_R -S alignment_results/SRR396636.sam
The SAM file is a human-readable format:
samtools view alignment_results/SRR396636.sam
Understanding the SAM file format
A SAM file has three kinds of header lines, each starting with `@`:
- @HD: the header line
  - VN: SAM format version
  - SO: sorting order of alignments (sorted/unsorted)
  - GO: grouping of alignments
- @SQ: reference sequence dictionary
  - SN: reference sequence name/accession
  - LN: length of the reference sequence
- @PG: program used to generate the SAM file
  - ID: program ID
  - PN: program name
  - VN: program version
  - CL: command line used to generate the SAM file
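To inspect just the header of our alignment file:

samtools view -H alignment_results/SRR396636.sam #-H prints only the header lines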
Using samtools
samtools view -Sb alignment_results/SRR396636.sam -o alignment_results/SRR396636.bam
-Sb : input is in SAM format (S) and the output will be in BAM format (b)
Using samtools
samtools sort alignment_results/SRR396636.bam -o alignment_results/SRR396636.sort.bam
Sorting reads by read name rather than by chromosomal coordinates:
samtools sort -n <file1.bam> -o <file1_namesort.bam>
Appending the 'ms' (mate score) and 'MC' (mate CIGAR) tags, which `markdup` needs:
samtools fixmate -m <file1_namesort.bam> <file1_namesort_fixmate.bam>
Re-sort the BAM files by chromosomal coordinates
samtools sort <file1_namesort_fixmate.bam> -o <file1_namesort_fixmate_sort.bam>
Marking the duplicates (the `-r` flag removes them rather than only marking them):
samtools markdup -r <file1_namesort_fixmate_sort.bam> <file1_namesort_fixmate_sort_markdup.bam>
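Putting the four steps together for our sample (a sketch using the file names from this tutorial):

samtools sort -n alignment_results/SRR396636.bam -o alignment_results/SRR396636.namesort.bam #Sort by read name
samtools fixmate -m alignment_results/SRR396636.namesort.bam alignment_results/SRR396636.fixmate.bam #Add ms/MC tags
samtools sort alignment_results/SRR396636.fixmate.bam -o alignment_results/SRR396636.fixmate.sort.bam #Re-sort by coordinates
samtools markdup -r alignment_results/SRR396636.fixmate.sort.bam alignment_results/SRR396636.markdup.bam #Remove duplicates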
Using samtools
samtools flagstat alignment_results/SRR396636.sort.bam > alignment_results/alignment_summary.txt
Using samtools
samtools view <file1_sorted.sam/bam>
Using samtools
samtools index <file1_sorted.bam>
Using samtools
samtools tview <file1_sorted.bam> <ref_genome>
qualimap bamqc -outdir <dir> \
-bam <file1.bam> \
-gff <file.gff>
Variant calling produces, for each candidate variant:
- an estimate of the variant frequency
- a measure of confidence (quality score)
Using Freebayes
freebayes -f <ref_genome> <file1.bam> > <file1.vcf>
Using Bcftools
- Step 1: Generating a pileup file. This computes genotype likelihoods at each genomic position with coverage.
mkdir -p variant_call_results
bcftools mpileup -O b -f $ref_genome alignment_results/*.sort.bam -o variant_call_results/SRR396636.mpileup.bcf
- Step 2: Detecting SNVs
#-O b: the output file type will be BCF; -m: use the multiallelic caller; -v: output variant sites only
bcftools call -O b \
-mv \
--ploidy 1 \
variant_call_results/SRR396636.mpileup.bcf -o variant_call_results/SRR396636.call.bcf
- Step 3: Filtering variants
#-O v: the output file type will be VCF; -i keeps variants matching the expression (replace n with a quality threshold, e.g. 20)
bcftools filter -O v \
-i '%QUAL>=n' \
variant_call_results/SRR396636.call.bcf -o variant_call_results/SRR396636.call.filtered.vcf
In a single step
bcftools mpileup -Ou -f <ref_genome> <file1.bam> | bcftools call -mv -Ob -o <file1.bcf>
-O: output type
-b: compressed BCF, -u: uncompressed BCF
-z: compressed VCF, -v: uncompressed VCF
bcftools view <file1.vcf>
grep -v -c "^#" <file1.vcf> #Counts the variant records, i.e., all lines not starting with #
Useful commands
Extracting information from a VCF file (the output can be saved as a BED file):
bcftools query -f '%COLa\t%COLb\t%COLc\n' <file1.vcf> #Replace COLa/COLb/COLc with VCF fields such as CHROM, POS, ALT
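For example, to pull out the chromosome, position, and reference and alternate alleles of each variant in our filtered file:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' variant_call_results/SRR396636.call.filtered.vcf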
Converting BCF file to VCF file
bcftools view <file1.bcf> > <file1.vcf>
vcf-sort <file1.vcf> > <file1.sorted.vcf>
Using vcf2bed
vcf2bed < <file1.vcf> > <file1.bed>
Identify annotated genes that are not covered by reads across their full length.
bedtools coverage -a <ref_genome.gff> -b <file.bam> > <file1_gene-coverage.txt> #bedtools writes to standard output, so redirect it to a file