Showing 18 changed files with 499 additions and 782 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,20 @@ | ||
FLAGS = -g -std=c++11 -Igenfile/include -lz | ||
bgen_to_vcf: example/bgen_to_vcf.cpp | ||
FLAGS = -g -std=c++11 -lz \ | ||
-I genfile/include \ | ||
-I db/include \ | ||
-I 3rd_party/boost_1_55_0 \ | ||
-I 3rd_party/zstd-1.1.0 \ | ||
-I 3rd_party/zstd-1.1.0/lib \ | ||
-I 3rd_party/zstd-1.1.0/lib/common \ | ||
-I 3rd_party/zstd-1.1.0/lib/compress \ | ||
-I 3rd_party/zstd-1.1.0/lib/decompress \ | ||
-I 3rd_party/sqlite3 \ | ||
-I include/3rd_party/sqlite3 \ | ||
-D SQLITE_ENABLE_COLUMN_METADATA \ | ||
-D SQLITE_ENABLE_STAT4 \ | ||
-D SQLITE_MAX_EXPR_DEPTH=10000 \ | ||
-D SQLITE_USE_URI=1 \ | ||
-Wno-unused-local-typedefs \ | ||
-fPIC -O3 | ||
|
||
bgen_to_vcf: example/bgen_to_vcf.cpp $(wildcard 3rd_party/zstd-1.1.0/*.o) $(wildcard 3rd_party/sqlite3/*.o) | ||
g++ ${FLAGS} -o build/bgen_to_vcf example/bgen_to_vcf.cpp src/*.cpp |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,95 @@ | ||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" | ||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> | ||
|
||
<html xmlns="http://www.w3.org/1999/xhtml"> | ||
<head> | ||
<!-- Global Site Tag (gtag.js) - Google Analytics --> | ||
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-16521993-10"></script> | ||
<script> | ||
window.dataLayer = window.dataLayer || []; | ||
function gtag(){dataLayer.push(arguments);} | ||
gtag('js', new Date()); | ||
|
||
gtag('config', 'UA-16521993-10'); | ||
</script> | ||
<title>The BGEN format</title> | ||
<link href="style.css" rel="stylesheet" type="text/css" /> | ||
<script type="text/javascript" src="js/jquery-1.4.2.min.js"></script> | ||
<script type="text/javascript" src="js/local.js"></script> | ||
</head> | ||
|
||
<body> | ||
<div class="header"> | ||
<div class="header_text"> | ||
The BGEN format | ||
</div> | ||
|
||
<div class="header_subtext"> | ||
A compressed binary format for typed and imputed genotype data | ||
</div> | ||
<!-- navigation --> | ||
</div> | ||
<nav> | ||
<ul> | ||
<li><a href="index.html">home</a></li> | ||
<li><a href="software.html">software</a></li> | ||
<li><a href="history.html">history</a></li> | ||
<li><a href="paper.html">paper</a></li> | ||
<li><a href="spec/latest.html">specification</a></li> | ||
</ul> | ||
</nav> | ||
<div class = "boxed"> | ||
<h3>History of revisions</h3> | ||
<p> | ||
A history of revisions of the BGEN format specification is as follows: | ||
</p> | ||
<dl> | ||
<dt><b>BGEN v1.3</b> (January 2017): <a href="spec/v1.3.html">link to spec</a></dt> | ||
<dd> | ||
<ul> | ||
<li>Support for the <a href="http://zstd.net">zstandard</a> compression library. | ||
Tests indicate this performs better both in file size | ||
and in speed of reading and writing.</li> | ||
</ul> | ||
</dd> | ||
<dt><b>BGEN v1.2</b> (March 2016): <a href="spec/v1.2.html">link to spec</a></dt> | ||
<dd> | ||
Major update extending the BGEN format to add: | ||
<ul> | ||
<li>Support for variable ploidy and explicit missing data.</li> | ||
|
||
<li>Support for multi-allelic variants (e.g. complex structural | ||
variants).</li> | ||
|
||
<li>Allow for control over file size by supporting genotype probabilities | ||
stored at configurable precision.</li> | ||
|
||
<li>Support for storing sample identifiers.</li> | ||
</ul> | ||
A draft version of this spec was published beginning May 2015. The following changes have been made since the earlier draft: | ||
<ul> | ||
<li><b>2015-11-05</b> (<code>v1.2 beta1</code>): modified the treatment of missing data in Layout 2 (v1.2-style) variant data blocks.</li> | ||
<li><b>2016-03-21</b> (<code>v1.2 beta2</code>): modified the order of stored probabilities for samples with ploidy greater than 2; | ||
clarified specification of the <code>phased</code> flag for samples with ploidy less than 2.</li> | ||
</ul> | ||
|
||
</dd> | ||
<dt><b>BGEN v1.1</b> (March 2012): <a href="spec/v1.1.html">link to spec</a></dt> | ||
<dd> | ||
The first widely used version of the BGEN format. The UK Biobank interim imputed data was released in this format. | ||
Relative to v1.0, this version is designed to cope with the long alleles present at indels and | ||
structural variants in recent releases of the 1000 Genomes Project. Features | ||
of this version are: | ||
<ul> | ||
<li>Support for biallelic SNPs and indels with alleles of arbitrary length (up to 2<sup>32</sup>-1).</li> | ||
<li>Probabilities stored to at least 4 decimal places' worth of accuracy.</li> | ||
</ul> | ||
</dd> | ||
<dt><b>BGEN v1.0</b> (2009):</dt> | ||
<dd> | ||
The original BGEN format. <em>This version is now unsupported.</em> | ||
</dd> | ||
</dl> | ||
</div> | ||
|
||
</body> | ||
</html> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" | ||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> | ||
|
||
<html xmlns="http://www.w3.org/1999/xhtml"> | ||
<head> | ||
<!-- Global Site Tag (gtag.js) - Google Analytics --> | ||
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-16521993-10"></script> | ||
<script> | ||
window.dataLayer = window.dataLayer || []; | ||
function gtag(){dataLayer.push(arguments);} | ||
gtag('js', new Date()); | ||
|
||
gtag('config', 'UA-16521993-10'); | ||
</script> | ||
<title>The BGEN format</title> | ||
<link href="style.css" rel="stylesheet" type="text/css" /> | ||
<script type="text/javascript" src="js/jquery-1.4.2.min.js"></script> | ||
<script type="text/javascript" src="js/local.js"></script> | ||
</head> | ||
|
||
<body> | ||
<div class="header"> | ||
<div class="header_text"> | ||
The BGEN format | ||
</div> | ||
|
||
<div class="header_subtext"> | ||
A compressed binary format for typed and imputed genotype data | ||
</div> | ||
<!-- navigation --> | ||
</div> | ||
<nav> | ||
<ul> | ||
<li><a href="index.html">home</a></li> | ||
<li><a href="software.html">software</a></li> | ||
<li><a href="history.html">history</a></li> | ||
<li><a href="paper.html">paper</a></li> | ||
<li><a href="spec/latest.html">specification</a></li> | ||
</ul> | ||
</nav> | ||
<div id = "introduction" class="section"> | ||
<h3>Introduction</h3> | ||
|
||
<p> | ||
Modern genetic association studies routinely employ data on tens to hundreds of thousands | ||
of individuals, genotyped or imputed at tens of millions of markers genome-wide. Traditional | ||
data formats based on text representation of these data - such as | ||
the <a href="http://www.stats.ox.ac.uk/%7Emarchini/software/gwas/file_format.html">GEN</a> | ||
format output by <a href="https://mathgen.stats.ox.ac.uk/impute/impute_v2.html">IMPUTE</a>, | ||
or the <a href="http://ga4gh.org/#/fileformats-team">Variant Call Format</a> | ||
- are sometimes not well suited to these data quantities. Indeed, for simple programs the time spent parsing | ||
these formats can dominate program execution time. | ||
</p> | ||
<p> | ||
This page describes a binary GEN file format (the "BGEN" format) which aims to address these problems. | ||
BGEN is a robust format that has been designed to have a specific blend of features that we believe make it | ||
useful for this type of study. It is targeted for use with large, potentially imputed genetic datasets. | ||
Key features include: | ||
<ul> | ||
<li>The ability to store both directly typed and imputed data.</li> | ||
<li>The ability to store both unphased genotypes and phased haplotype data.</li> | ||
<li>Small file sizes through the use of efficient, variable-precision packed bit representations and compression.</li> | ||
<li>The use of per-variant compression makes the format simple to index and easy to catalogue.</li> | ||
</ul> | ||
</p> | ||
<p> | ||
For example, the following plot shows the time taken to list variant identifying data - i.e. the genomic | ||
position, ID fields and alleles - for various common formats (Y-axis), against file size (X-axis), for a dataset | ||
of 18,496 samples typed at 121,668 SNPs on chromosome 1. Both variants of BGEN defined below are shown. | ||
</p> | ||
<div class="embedded_plot"> | ||
<center> | ||
<img width="400" height="300" src="images/bgen_comparison.png" alt="Time taken to list variant identifying data against file size, for various formats" /> | ||
</center> | ||
</div> | ||
<p> | ||
For <a href="https://www.cog-genomics.org/plink2/input#bed">PLINK binary (<code>.bed</code>) files</a>, identifying data is | ||
stored in a separate file (the <span class="monospace"><code>.bim</code></span> file) so the time is effectively zero. | ||
For text-based formats there is a significant trade-off between the use of file compression and | ||
read performance. BGEN stores | ||
the entire dataset of 2,250 million genotypes in 334 MB, slightly over one bit per genotype, and listing its variants in this test took 1.5 s. | ||
</p> | ||
<p> | ||
(Performance optimisation of all formats may of course be possible, so the above plot | ||
will not represent the best possible timings, but should be regarded as illustrative.) | ||
</p> | ||
<p> | ||
The BGEN format has been used in several major projects, including the | ||
<a href="http://www.wtccc.org.uk/ccc2/">Wellcome Trust Case-Control Consortium 2</a>, | ||
the <a href="https://www.malariagen.net/projects/host">MalariaGEN</a> project, and the <a href="http://www.bristol.ac.uk/alspac/">ALSPAC study</a>. | ||
It has been adopted as the release format for genome-wide imputed genotypes | ||
for the <a href="http://www.ukbiobank.ac.uk">UK Biobank</a>. | ||
</p> | ||
</div> | ||
<div id = "contributors"> | ||
<strong>Acknowledgements.</strong> The following people contributed to the design and implementation of the BGEN format: | ||
<ul> | ||
<li> | ||
<a href="http://www.well.ox.ac.uk/~gav/">Gavin Band</a> | ||
</li> | ||
<li> | ||
<a href="http://www.stats.ox.ac.uk/~marchini/">Jonathan Marchini</a> | ||
</li> | ||
</ul> | ||
</div> | ||
</body> | ||
</html> |
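
Review note: the storage figure quoted in the introduction page (2,250 million genotypes in 334 megabytes, "slightly over one bit per genotype") can be checked from the sample and SNP counts given in the same text. The sketch below does that arithmetic. It assumes the quoted file size means decimal megabytes (10^6 bytes), which the page does not state, and the function and constant names are illustrative only.

```python
# Figures quoted in the introduction text of the added page.
N_SAMPLES = 18_496
N_VARIANTS = 121_668
FILE_SIZE_MEGABYTES = 334  # assumption: decimal megabytes (10**6 bytes)


def bits_per_genotype(n_samples: int, n_variants: int, size_bytes: int) -> float:
    """Average number of stored bits per genotype (one genotype per sample per variant)."""
    n_genotypes = n_samples * n_variants
    return size_bytes * 8 / n_genotypes


if __name__ == "__main__":
    n_genotypes = N_SAMPLES * N_VARIANTS
    rate = bits_per_genotype(N_SAMPLES, N_VARIANTS, FILE_SIZE_MEGABYTES * 10**6)
    print(f"{n_genotypes:,} genotypes, {rate:.2f} bits per genotype")
```

Under this assumption the genotype count comes to roughly 2,250 million and the rate to a little under 1.2 bits per genotype. Reading the size as binary mebibytes instead gives a slightly higher rate, still between one and two bits.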