This repository provides a workflow for performing variant calling from high-throughput sequencing data (e.g., Illumina or Oxford Nanopore).

Itsbosire/NGS_variant_calling

NGS variant calling pipeline

Overview

This repository contains a Next-Generation Sequencing (NGS) variant calling pipeline designed for processing raw sequencing data, identifying genetic variants (SNPs, indels), and annotating those variants for downstream analysis. The pipeline is modular and flexible, allowing users to customize each stage of the process.

Standard workflow of the pipeline

  • Quality Control (QC)
  • Read Trimming
  • Read Alignment
  • Post-Alignment Processing
  • Variant Calling
  • Variant Filtering
  • Variant Annotation

Installation

  1. Clone the repository:

  2. Install the required software. You can install the dependencies with a package manager such as conda:

conda create -n ngs_pipeline fastqc fastp bwa samtools gatk picard bcftools annovar multiqc
conda activate ngs_pipeline

  3. Ensure the required reference genome files (FASTA, index files) are available in the appropriate directory. Index the reference genome if not already done:

bwa index reference/genome.fa
samtools faidx reference/genome.fa
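
The "index if not already done" check in step 3 can be scripted. The following is a minimal sketch, not part of the repository's scripts: the function names `missing_indexes` and `index_if_needed` are hypothetical, and it assumes the usual output names of the two indexing commands (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa` from `bwa index`, and `.fai` from `samtools faidx`).

```shell
#!/bin/sh
set -eu

# List the index files that are still missing for a reference FASTA.
# bwa index writes .amb/.ann/.bwt/.pac/.sa; samtools faidx writes .fai.
missing_indexes() {
    for ext in amb ann bwt pac sa fai; do
        [ -e "$1.$ext" ] || echo "$1.$ext"
    done
}

# Build the indexes only when at least one of them is absent.
index_if_needed() {
    if [ -z "$(missing_indexes "$1")" ]; then
        echo "All index files present for $1"
    else
        bwa index "$1"
        samtools faidx "$1"
    fi
}
```

After loading these functions, `index_if_needed reference/genome.fa` would index the reference only when needed. GATK tools generally also expect a sequence dictionary (`reference/genome.dict`) next to the FASTA, which `gatk CreateSequenceDictionary -R reference/genome.fa` can create.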

Usage

  1. Quality Control (FastQC): assess the quality of your raw sequence data.

  2. Read Trimming (fastp): trim adapters and low-quality bases from the sequencing reads.

  3. Read Alignment (BWA): align the trimmed reads to the reference genome.

  4. Post-Alignment Processing (Samtools/Picard): sort and index the BAM files, then mark duplicates.

  5. Variant Calling (GATK HaplotypeCaller): call variants from the aligned reads.

  6. Variant Filtering (GATK/bcftools): filter out low-quality variants.

  7. Variant Annotation (ANNOVAR): annotate the called variants.

  8. MultiQC Report: aggregate the QC reports from FastQC, fastp, and other tools.
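
The eight steps above can be sketched as one script for a single paired-end Illumina sample. This is an illustration under stated assumptions, not the repository's actual scripts: the sample name, FASTQ file names, thread count, `QUAL<30` threshold, and the ANNOVAR database directory (`humandb/`) and build (`hg19`) are all placeholders. By default the script only prints the commands it would run; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Dry-run sketch of the usage steps for one paired-end sample.
# Sample name, file names, thresholds and ANNOVAR settings are assumptions.
set -eu

SAMPLE="${SAMPLE:-sample1}"          # hypothetical sample name
REF="${REF:-reference/genome.fa}"    # reference path used in this README
THREADS="${THREADS:-4}"
DRY_RUN="${DRY_RUN:-1}"              # 1 = print commands only, 0 = execute

# Print the command; execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

pipeline() {
    r1="data/${SAMPLE}_R1.fastq.gz"
    r2="data/${SAMPLE}_R2.fastq.gz"
    t1="data/${SAMPLE}_R1.trimmed.fastq.gz"
    t2="data/${SAMPLE}_R2.trimmed.fastq.gz"
    bam="alignment/${SAMPLE}"
    vcf="variants/${SAMPLE}"

    # 1. Quality control on the raw reads
    run fastqc -o qc_reports "$r1" "$r2"

    # 2. Adapter and quality trimming
    run fastp -i "$r1" -I "$r2" -o "$t1" -O "$t2" \
        --json "qc_reports/${SAMPLE}.fastp.json" \
        --html "qc_reports/${SAMPLE}.fastp.html"

    # 3. Alignment; the read group is included because GATK requires one
    run bwa mem -t "$THREADS" \
        -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" \
        -o "${bam}.sam" "$REF" "$t1" "$t2"

    # 4. Sort, mark duplicates, index
    run samtools sort -@ "$THREADS" -o "${bam}.sorted.bam" "${bam}.sam"
    run picard MarkDuplicates "I=${bam}.sorted.bam" "O=${bam}.dedup.bam" \
        "M=qc_reports/${SAMPLE}.dup_metrics.txt"
    run samtools index "${bam}.dedup.bam"

    # 5. Variant calling
    run gatk HaplotypeCaller -R "$REF" -I "${bam}.dedup.bam" -O "${vcf}.vcf"

    # 6. Filtering (illustrative threshold: drop records with QUAL < 30)
    run bcftools filter -e 'QUAL<30' -o "${vcf}.filtered.vcf" "${vcf}.vcf"

    # 7. Annotation (database directory and build are placeholders)
    run table_annovar.pl "${vcf}.filtered.vcf" humandb/ -buildver hg19 \
        -out "annotations/${SAMPLE}" -remove -protocol refGene \
        -operation g -nastring . -vcfinput

    # 8. Aggregate the QC reports
    run multiqc qc_reports -o qc_reports
}

pipeline
```

Printing the commands first makes it easy to check paths and options before any tool runs. Saved as, say, `pipeline.sh` (a hypothetical name), the real run would be `DRY_RUN=0 sh pipeline.sh`.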

Project structure
.
├── data/                             # Raw data files (FASTQ)
├── reference/                        # Reference genome files (FASTA, index)
├── alignment/                        # BAM files for aligned reads
├── variants/                         # VCF files with called variants
├── annotations/                      # Annotated variant files
├── qc_reports/                       # QC reports generated by FastQC, MultiQC
├── scripts/                          # Scripts for each step of the pipeline
│   ├── run_fastqc.sh                 # Script to run FastQC
│   ├── trim_reads.sh                 # Script for trimming reads with fastp
│   ├── align_reads.sh                # Script for aligning reads with BWA
│   ├── sort_bam.sh                   # Script for sorting BAM files
│   ├── mark_duplicates.sh            # Script for marking duplicates using Picard
│   ├── call_variants.sh              # Script for calling variants with GATK
│   ├── filter_variants.sh            # Script for filtering variants
│   ├── annotate_variants.sh          # Script for annotating variants with ANNOVAR
│   └── run_multiqc.sh                # Script to run MultiQC for QC aggregation
└── README.md                         # Project overview and instructions
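
Several tools write into these directories, and some (FastQC's `-o` option, for example) expect the output directory to exist already. A small sketch for creating the layout before a first run follows; the function name `init_layout` is hypothetical and not one of the repository's scripts.

```shell
#!/bin/sh
set -eu

# Create the top-level directories from the project tree under a root
# directory (default: current directory). Existing directories are kept.
init_layout() {
    root="${1:-.}"
    for d in data reference alignment variants annotations qc_reports scripts; do
        mkdir -p "$root/$d"
    done
}
```

Calling `init_layout` with no argument creates the directories in the current working directory.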