A simple Python implementation of Variant Call Format intersection and complements for identifying genetic mutations

brendancsmith/vcf-isec

vcf-isec

A simple Python implementation of Variant Call Format intersection and complements.

Background

Bioinformaticians store variants identified by next-generation sequencing in Variant Call Format (VCF) files. The VCF specification was originally maintained by the 1000 Genomes Project; it has since passed to the file formats team of the Global Alliance for Genomics and Health Data Working Group.

The format is described in full in the VCF v4.1 specification.

Essentially, each variant occupies its own tab-delimited line in the VCF. The chromosome, position, reference base(s), and alternate base(s) identified at that position are found in columns 1, 2, 4, and 5, respectively. Additional information pertaining to the variant is listed in the remaining fields.
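
As a minimal sketch of that column layout, the snippet below pulls the four identifying fields out of a single data line. The function name `variant_key` and the example line are illustrative assumptions, not part of this repository's code.

```python
def variant_key(line):
    """Return (chrom, pos, ref, alt) for one tab-delimited VCF data line.

    Hypothetical helper for illustration only.
    """
    fields = line.rstrip("\n").split("\t")
    # Columns 1, 2, 4, and 5 in the text are indices 0, 1, 3, and 4 here.
    chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
    return chrom, int(pos), ref, alt


# Made-up data line: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
example = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14"
print(variant_key(example))
```

Together, these four values can serve as a key for deciding whether two lines describe the same variant.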

Task

A common task for bioinformaticians is to compare variants, whether to compare VCF files generated by different analytical pipelines or to simply compare variants between related individuals.

This script takes two VCFs as input and compares the variants found in each file. It outputs three VCFs: one with the variants shared by both inputs, and one each with the variants unique to the first and to the second input.
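
The comparison can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual implementation: the names `variant_key`, `index_variants`, and `compare` are hypothetical, variants are matched on chromosome, position, reference, and alternate fields only, shared variants are reported using the first file's line, and header handling and duplicate records are ignored.

```python
def variant_key(line):
    """Key a data line by chromosome, position, ref, and alt (columns 1, 2, 4, 5)."""
    f = line.rstrip("\n").split("\t")
    return f[0], f[1], f[3], f[4]


def index_variants(lines):
    """Map each variant key to its data line, skipping blank and '#' header lines."""
    return {
        variant_key(line): line
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def compare(lines_a, lines_b):
    """Return (shared, only_a, only_b) as lists of data lines."""
    a = index_variants(lines_a)
    b = index_variants(lines_b)
    shared = [a[k] for k in a if k in b]      # intersection, using file A's line
    only_a = [a[k] for k in a if k not in b]  # complement: in A but not in B
    only_b = [b[k] for k in b if k not in a]  # complement: in B but not in A
    return shared, only_a, only_b


# Made-up input lines standing in for two small VCF files.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
vcf_a = [header, "1\t100\t.\tA\tG\t50\tPASS\t.\n", "1\t200\t.\tC\tT\t50\tPASS\t.\n"]
vcf_b = [header, "1\t200\t.\tC\tT\t30\tPASS\t.\n", "2\t300\t.\tG\tA\t30\tPASS\t.\n"]

shared, only_a, only_b = compare(vcf_a, vcf_b)
```

Indexing each file by key makes every membership check a constant-time dictionary lookup, which matters once the inputs reach millions of variants.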

NOTE: An example VCF is provided at tests/resources/sample.vcf. A VCF can contain up to 4 million variants, as in the case of whole genome sequencing.
