This repository contains a Next-Generation Sequencing (NGS) variant calling pipeline that processes raw sequencing data, identifies genetic variants (SNPs and indels), and annotates them for downstream analysis. The pipeline is modular: each stage runs as its own script, so individual stages can be customized or replaced. It consists of the following stages:
- Quality Control (QC)
- Read Trimming
- Read Alignment
- Post-Alignment Processing
- Variant Calling
- Variant Filtering
- Variant Annotation
Clone the repository:
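Install the required software. You can install the dependencies with a package manager such as conda:

    conda create -n ngs_pipeline fastqc fastp bwa samtools gatk picard bcftools annovar multiqc
    conda activate ngs_pipeline

Package names can differ between channels (for example, bioconda ships GATK 4 as `gatk4`), and ANNOVAR is licensed separately and may need to be downloaded from its own website.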
Ensure the required reference genome files (FASTA and index files) are available in the `reference/` directory. Index the reference genome if this has not already been done:

    bwa index reference/genome.fa
    samtools faidx reference/genome.fa
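To avoid rebuilding indexes that already exist, the two commands above could be wrapped in a small check. This is only a sketch: the function name `index_reference` is hypothetical, and the sequence dictionary step assumes GATK 4 is the installed GATK version (HaplotypeCaller expects a `.dict` file next to the FASTA).

```shell
set -euo pipefail

# Build whichever reference indexes are missing (hypothetical helper).
# Assumes an uncompressed FASTA such as reference/genome.fa.
index_reference() {
    local ref="$1"
    local dict="${ref%.*}.dict"   # genome.fa -> genome.dict

    # bwa index writes several files next to the FASTA; .bwt is one of them.
    [ -f "${ref}.bwt" ] || bwa index "$ref"

    # samtools faidx writes <ref>.fai.
    [ -f "${ref}.fai" ] || samtools faidx "$ref"

    # Assumption: GATK 4 command-line syntax.
    [ -f "$dict" ] || gatk CreateSequenceDictionary -R "$ref" -O "$dict"
}

# Example:
# index_reference reference/genome.fa
```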
Quality Control (FastQC): assess the quality of your raw sequencing data.
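The contents of `scripts/run_fastqc.sh` are not shown here, so the following is only a sketch of what this step might look like. The function name, the `THREADS` variable, and the example file names are assumptions for illustration.

```shell
set -euo pipefail

# Run FastQC on one or more FASTQ files (hypothetical helper).
# Usage: run_fastqc <output_dir> <fastq> [<fastq> ...]
run_fastqc() {
    local outdir="$1"
    shift
    # FastQC expects the output directory to exist already.
    mkdir -p "$outdir"
    fastqc --threads "${THREADS:-2}" --outdir "$outdir" "$@"
}

# Example (hypothetical paired-end file names):
# run_fastqc qc_reports data/sample_R1.fastq.gz data/sample_R2.fastq.gz
```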
Read Trimming (fastp): trim adapters and low-quality bases from the sequencing reads.
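A minimal sketch of a paired-end fastp call is given below. It assumes paired-end input; the `data/trimmed` output location, the `TRIMMED_DIR`, `REPORT_DIR`, and `THREADS` variables, and the file naming scheme are all hypothetical and may not match `scripts/trim_reads.sh`.

```shell
set -euo pipefail

# Trim one paired-end sample with fastp (hypothetical helper).
# Usage: trim_reads <R1.fastq.gz> <R2.fastq.gz> <sample_name>
trim_reads() {
    local r1="$1" r2="$2" sample="$3"
    local trimmed_dir="${TRIMMED_DIR:-data/trimmed}"   # assumed location
    local report_dir="${REPORT_DIR:-qc_reports}"
    mkdir -p "$trimmed_dir" "$report_dir"

    # --detect_adapter_for_pe: infer adapters from read overlap
    # --cut_tail: sliding-window trimming of low-quality bases at the 3' end
    fastp \
        --in1 "$r1" --in2 "$r2" \
        --out1 "$trimmed_dir/${sample}_R1.trimmed.fastq.gz" \
        --out2 "$trimmed_dir/${sample}_R2.trimmed.fastq.gz" \
        --detect_adapter_for_pe \
        --cut_tail \
        --json "$report_dir/${sample}.fastp.json" \
        --html "$report_dir/${sample}.fastp.html" \
        --thread "${THREADS:-2}"
}

# Example (hypothetical file names):
# trim_reads data/sample_R1.fastq.gz data/sample_R2.fastq.gz sample
```

Writing the JSON report into `qc_reports/` with a name ending in `fastp.json` should let MultiQC pick it up later.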
Read Alignment (BWA): align the trimmed reads to the reference genome.
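One common way to do this is to pipe `bwa mem` into `samtools view` so that no intermediate SAM file is written. The sketch below assumes paired-end reads; the function name, the `ALIGN_DIR` and `THREADS` variables, and the read-group values are placeholders rather than the repository's actual settings.

```shell
set -euo pipefail

# Align one paired-end sample and write an unsorted BAM (hypothetical helper).
# Usage: align_reads <reference.fa> <R1.fastq.gz> <R2.fastq.gz> <sample_name>
align_reads() {
    local ref="$1" r1="$2" r2="$3" sample="$4"
    local out_bam="${ALIGN_DIR:-alignment}/${sample}.bam"

    # GATK requires a read group; these ID/SM/PL values are placeholders.
    # bwa expands the literal \t sequences into tabs itself.
    local read_group="@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA"

    mkdir -p "$(dirname "$out_bam")"
    # With pipefail set, a failure in bwa mem also fails the pipeline.
    bwa mem -t "${THREADS:-2}" -R "$read_group" "$ref" "$r1" "$r2" \
        | samtools view -b -o "$out_bam" -
}

# Example (hypothetical file names):
# align_reads reference/genome.fa data/trimmed/sample_R1.trimmed.fastq.gz \
#     data/trimmed/sample_R2.trimmed.fastq.gz sample
```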
Post-Alignment Processing (Samtools/Picard): sort and index the BAM files, then mark duplicates.
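The three operations could be chained roughly as follows. This is a sketch, not the contents of `scripts/sort_bam.sh` or `scripts/mark_duplicates.sh`: the function name and output file suffixes are hypothetical, and it assumes the conda `picard` wrapper, using Picard's older `KEY=value` argument style.

```shell
set -euo pipefail

# Sort, index, and mark duplicates for one BAM file (hypothetical helper).
# Usage: process_alignment <input.bam>
process_alignment() {
    local in_bam="$1"
    local prefix="${in_bam%.bam}"   # alignment/sample.bam -> alignment/sample

    # Coordinate-sort, then index the sorted BAM.
    samtools sort -@ "${THREADS:-2}" -o "${prefix}.sorted.bam" "$in_bam"
    samtools index "${prefix}.sorted.bam"

    # Flag duplicate reads; the metrics file summarizes duplication rates.
    picard MarkDuplicates \
        I="${prefix}.sorted.bam" \
        O="${prefix}.dedup.bam" \
        M="${prefix}.dup_metrics.txt"

    # Downstream tools need an index for the final BAM as well.
    samtools index "${prefix}.dedup.bam"
}

# Example (hypothetical file name):
# process_alignment alignment/sample.bam
```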
Variant Calling (GATK HaplotypeCaller): call variants from the aligned reads.
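A single-sample call might look like the sketch below. It assumes GATK 4 command-line syntax, a duplicate-marked and indexed BAM, and a reference with `.fai` and `.dict` files; the function name, the `VARIANT_DIR` variable, and the output naming are hypothetical.

```shell
set -euo pipefail

# Call variants for one sample with HaplotypeCaller (hypothetical helper).
# Usage: call_variants <reference.fa> <input.bam> <sample_name>
call_variants() {
    local ref="$1" bam="$2" sample="$3"
    local out_vcf="${VARIANT_DIR:-variants}/${sample}.raw.vcf.gz"

    mkdir -p "$(dirname "$out_vcf")"
    # Assumption: GATK 4 syntax; a .vcf.gz extension yields compressed output.
    gatk HaplotypeCaller -R "$ref" -I "$bam" -O "$out_vcf"
}

# Example (hypothetical file names):
# call_variants reference/genome.fa alignment/sample.dedup.bam sample
```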
Variant Filtering (GATK/bcftools): filter out low-quality variants.
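One possible combination of the two tools is to flag failing records with GATK `VariantFiltration` and then keep only `PASS` records with `bcftools view`. The thresholds below are illustrative hard-filter values, not the repository's settings, and in practice SNPs and indels are often filtered with different cutoffs. The function name and file suffixes are hypothetical.

```shell
set -euo pipefail

# Flag low-quality variants, then keep only PASS records (hypothetical helper).
# Usage: filter_variants <reference.fa> <input.vcf.gz> <output_prefix>
filter_variants() {
    local ref="$1" in_vcf="$2" prefix="$3"

    # Step 1: records matching an expression get that filter name in FILTER;
    # records matching none are marked PASS. Thresholds are illustrative only.
    gatk VariantFiltration -R "$ref" -V "$in_vcf" -O "${prefix}.flagged.vcf.gz" \
        --filter-name "QD2" --filter-expression "QD < 2.0" \
        --filter-name "FS60" --filter-expression "FS > 60.0" \
        --filter-name "MQ40" --filter-expression "MQ < 40.0"

    # Step 2: write a compressed VCF containing only PASS records.
    bcftools view -f PASS -O z -o "${prefix}.filtered.vcf.gz" "${prefix}.flagged.vcf.gz"
}

# Example (hypothetical file names):
# filter_variants reference/genome.fa variants/sample.raw.vcf.gz variants/sample
```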
Variant Annotation (ANNOVAR): annotate the called variants.
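A gene-based annotation with ANNOVAR's `table_annovar.pl` could be sketched as below. Everything specific here is an assumption: the `humandb` database directory, the `hg38` build, the `refGene` protocol (which must already be downloaded), and the `ANNOVAR_DB`, `GENOME_BUILD`, and `ANNOT_DIR` variables. The script is also assumed to be on `PATH`.

```shell
set -euo pipefail

# Annotate a VCF with ANNOVAR gene-based annotation (hypothetical helper).
# Usage: annotate_variants <input.vcf> <sample_name>
annotate_variants() {
    local in_vcf="$1" sample="$2"
    local db_dir="${ANNOVAR_DB:-humandb}"     # assumed database directory
    local build="${GENOME_BUILD:-hg38}"       # assumed genome build
    local outdir="${ANNOT_DIR:-annotations}"

    mkdir -p "$outdir"
    # -protocol/-operation: refGene as a gene-based ("g") annotation
    # -remove: delete intermediate files; -vcfinput: read and write VCF
    table_annovar.pl "$in_vcf" "$db_dir" \
        -buildver "$build" \
        -out "$outdir/$sample" \
        -remove \
        -protocol refGene \
        -operation g \
        -nastring . \
        -vcfinput
}

# Example (hypothetical file names):
# annotate_variants variants/sample.filtered.vcf.gz sample
```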
MultiQC Report: aggregate the QC reports from FastQC, fastp, and other tools.
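MultiQC scans the directories it is given for recognized tool outputs, so this step can be a thin wrapper. The function name and the choice of directories to scan are assumptions; `scripts/run_multiqc.sh` may do this differently.

```shell
set -euo pipefail

# Aggregate QC outputs into one MultiQC report (hypothetical helper).
# Usage: run_multiqc <output_dir> <dir_to_scan> [<dir_to_scan> ...]
run_multiqc() {
    local outdir="$1"
    shift
    # --force overwrites a report left over from an earlier run.
    multiqc --force --outdir "$outdir" "$@"
}

# Example: scan FastQC/fastp reports and duplicate metrics.
# run_multiqc qc_reports qc_reports alignment
```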
.
├── data/ # Raw data files (FASTQ)
├── reference/ # Reference genome files (FASTA, index)
├── alignment/ # BAM files for aligned reads
├── variants/ # VCF files with called variants
├── annotations/ # Annotated variant files
├── qc_reports/ # QC reports generated by FastQC, MultiQC
├── scripts/ # Scripts for each step of the pipeline
│ ├── run_fastqc.sh # Script to run FastQC
│ ├── trim_reads.sh # Script for trimming reads with fastp
│ ├── align_reads.sh # Script for aligning reads with BWA
│ ├── sort_bam.sh # Script for sorting BAM files
│ ├── mark_duplicates.sh # Script for marking duplicates using Picard
│ ├── call_variants.sh # Script for calling variants with GATK
│ ├── filter_variants.sh # Script for filtering variants
│ ├── annotate_variants.sh # Script for annotating variants with ANNOVAR
│ └── run_multiqc.sh # Script to run MultiQC for QC aggregation
└── README.md # Project overview and instructions
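On a fresh clone, the data and output directories in the layout above may not exist yet. A small helper such as the following could create them; the function name `init_layout` is hypothetical, and the directory names are taken from the tree.

```shell
set -euo pipefail

# Create the working directories from the layout above (hypothetical helper).
# Usage: init_layout [<project_root>]   (defaults to the current directory)
init_layout() {
    local root="${1:-.}"
    local dir
    for dir in data reference alignment variants annotations qc_reports; do
        mkdir -p "$root/$dir"
    done
}

# Example:
# init_layout .
```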