Merged from master branch
gavinband committed Jan 15, 2020
2 parents 5c80f0c + 281e9a6 commit 986ac3d
Showing 18 changed files with 499 additions and 782 deletions.
25 changes: 25 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,31 @@
History
====

15 January 2020
----
v1.1.5 release. Changes are:

- incorporate fix from Maarten Kooyman to build using Python 3.
- fix issue #39 <https://bitbucket.org/gavinband/bgen/issues/39/rbgen-segfault-when-samples-are-given-in>

7 August 2018
----
v1.1.4 release. Update to fix sample subset issue with BGEN v1.1.

2 May 2015
-----
v1.1.3 release. The main changes are:

- The rbgen R package, which gets data from indexed BGEN files into R, has several improvements: it is easier to install and has additional features (see below).
- New, improved bgenix vcf output - now up to 50X faster.
- Further performance improvements and resolution of a number of issues across the library.

To accompany this we have written a paper, which is now available on bioRxiv: https://doi.org/10.1101/308296.

13 July 2017
----
v1.0 release

7 July 2016
----

21 changes: 19 additions & 2 deletions Makefile
@@ -1,3 +1,20 @@
- FLAGS = -g -std=c++11 -Igenfile/include -lz
- bgen_to_vcf: example/bgen_to_vcf.cpp
+ FLAGS = -g -std=c++11 -lz \
+ -I genfile/include \
+ -I db/include \
+ -I 3rd_party/boost_1_55_0 \
+ -I 3rd_party/zstd-1.1.0 \
+ -I 3rd_party/zstd-1.1.0/lib \
+ -I 3rd_party/zstd-1.1.0/lib/common \
+ -I 3rd_party/zstd-1.1.0/lib/compress \
+ -I 3rd_party/zstd-1.1.0/lib/decompress \
+ -I 3rd_party/sqlite3 \
+ -I include/3rd_party/sqlite3 \
+ -D SQLITE_ENABLE_COLUMN_METADATA \
+ -D SQLITE_ENABLE_STAT4 \
+ -D SQLITE_MAX_EXPR_DEPTH=10000 \
+ -D SQLITE_USE_URI=1 \
+ -Wno-unused-local-typedefs \
+ -fPIC -O3
+
+ bgen_to_vcf: example/bgen_to_vcf.cpp $(wildcard 3rd_party/zstd-1.1.0/*.o) $(wildcard 3rd_party/sqlite3/*.o)
g++ ${FLAGS} -o build/bgen_to_vcf example/bgen_to_vcf.cpp src/*.cpp
2 changes: 1 addition & 1 deletion R/package/src/load.cpp
@@ -67,7 +67,7 @@ namespace {

// Called once per sample to determine whether we want data for this sample
bool set_sample( std::size_t i ) {
- if( m_requested_sample_i->first == i ) {
+ if( m_requested_sample_i != m_requested_samples.end() && m_requested_sample_i->first == i ) {
m_storage_i = m_requested_sample_i->second ;
++m_requested_sample_i ;
#if DEBUG
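
The guard added to `set_sample` above checks the iterator against `end()` before dereferencing it. A minimal, self-contained sketch of that pattern is given here. The class name `SampleSelector`, the use of `std::map` for `m_requested_samples`, and the `storage_index()` accessor are assumptions made for illustration; the actual types in `load.cpp` are not visible in this hunk.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

// Hypothetical stand-in for the sample setter in R/package/src/load.cpp.
// Keys are sample indices in the file; values are slots in the output storage.
class SampleSelector {
public:
    explicit SampleSelector(std::map<std::size_t, std::size_t> requested)
        : m_requested_samples(std::move(requested)),
          m_requested_sample_i(m_requested_samples.begin()),
          m_storage_i(0) {}

    // The stored iterator points into this object's own map, so copying is disabled.
    SampleSelector(const SampleSelector&) = delete;
    SampleSelector& operator=(const SampleSelector&) = delete;

    // Called once per sample, with i increasing from 0.
    // Returns true if data for sample i was requested.
    bool set_sample(std::size_t i) {
        // Assumption of this sketch: no requested sample has been skipped over.
        assert(m_requested_sample_i == m_requested_samples.end()
               || m_requested_sample_i->first >= i);
        // Test for end() first: dereferencing a past-the-end iterator is undefined
        // behaviour, which is what the unguarded comparison risked once every
        // requested sample had been consumed.
        if (m_requested_sample_i != m_requested_samples.end()
            && m_requested_sample_i->first == i) {
            m_storage_i = m_requested_sample_i->second;
            ++m_requested_sample_i;
            return true;
        }
        return false;
    }

    std::size_t storage_index() const { return m_storage_i; }

private:
    std::map<std::size_t, std::size_t> m_requested_samples;
    std::map<std::size_t, std::size_t>::const_iterator m_requested_sample_i;
    std::size_t m_storage_i;
};
```

With requested samples 1 and 3, a call to `set_sample(4)` made after both have been consumed returns `false` rather than reading through the exhausted iterator.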
2 changes: 1 addition & 1 deletion apps/bgenix.cpp
@@ -816,7 +816,7 @@ struct IndexBgenApplication: public appcontext::ApplicationContext
assert( alleles.size() > 1 ) ;
std::cout << chromosome
<< "\t" << position
- << "\t" << rsid << "," << SNPID
+ << "\t" << rsid << ( (SNPID == rsid) ? "" : (";" + SNPID))
<< "\t" << alleles[0]
<< "\t" ;
for( std::size_t j = 1; j < alleles.size(); ++j ) {
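
The changed line in this hunk writes the rsid alone when the two identifiers are equal, and otherwise joins them with a semicolon instead of a comma (semicolons are the separator that VCF uses for multiple IDs). The same expression can be sketched as a standalone function. The name `format_vcf_id` is hypothetical and does not appear in the repository, and the tab check is an assumption of this sketch, not something the commit enforces.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper mirroring the expression added in apps/bgenix.cpp.
// Builds the text for the ID column of a tab-delimited VCF record.
std::string format_vcf_id(const std::string& rsid, const std::string& SNPID) {
    // Assumption of this sketch: identifiers never contain the column delimiter.
    assert(rsid.find('\t') == std::string::npos);
    assert(SNPID.find('\t') == std::string::npos);
    // Emit the second identifier only when it adds information.
    return rsid + ((SNPID == rsid) ? "" : (";" + SNPID));
}
```

Under this sketch, `format_vcf_id("rs123", "rs123")` yields `rs123`, while two different identifiers are joined as `first;second`.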
1 change: 0 additions & 1 deletion doc/html/bgen_format.html

This file was deleted.

106 changes: 0 additions & 106 deletions doc/html/bgen_format_v1.0.html

This file was deleted.

95 changes: 95 additions & 0 deletions doc/html/history.html
@@ -0,0 +1,95 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<!-- Global Site Tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-16521993-10"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());

gtag('config', 'UA-16521993-10');
</script>
<title>The BGEN format</title>
<link href="style.css" rel="stylesheet" type="text/css" />
<script type="text/javascript" src="js/jquery-1.4.2.min.js"></script>
<script type="text/javascript" src="js/local.js"></script>
</head>

<body>
<div class="header">
<div class="header_text">
The BGEN format
</div>

<div class="header_subtext">
A compressed binary format for typed and imputed genotype data
</div>
<!-- navigation -->
</div>
<nav>
<ul>
<li><a href="index.html">home</a></li>
<li><a href="software.html">software</a></li>
<li><a href="history.html">history</a></li>
<li><a href="paper.html">paper</a></li>
<li><a href="spec/latest.html">specification</a></li>
</ul>
</nav>
<div class = "boxed">
<h3>History of revisions</h3>
<dl>
<p>
A history of revisions of the BGEN format specification is as follows:
</p>
<dt><b>BGEN v1.3</b> (January 2017): <a href="spec/v1.3.html">link to spec</a></dt>
<dd>
<ul>
<li>Support for the <a href="http://zstd.net">zstandard</a> compression library.
Tests indicate this has better performance both in terms of file size
and speed of reading and writing.</li>
</ul>
</dd>
<dt><b>BGEN v1.2</b> (March 2016): <a href="spec/v1.2.html">link to spec</a></dt>
<dd>
Major update extending the BGEN format to add:
<ul>
<li>Support for variable ploidy and explicit missing data.</li>

<li>Support for multi-allelic variants (e.g. complex structural
variants).</li>

<li>Allow for control over file size by supporting genotype probabilities
stored at configurable precision.</li>

<li>Support for storing sample identifiers.</li>
</ul>
A draft version of this spec was published beginning May 2015. The following changes have been made since the earlier draft:
<ul>
<li><b>2015-11-05</b> (<code>v1.2 beta1</code>): modified the treatment of missing data in Layout 2 (v1.2-style) variant data blocks.</li>
<li><b>2016-03-21</b> (<code>v1.2 beta2</code>): modified the order of stored probabilities for samples with ploidy greater than 2;
clarified specification of the <code>phased</code> flag for samples with ploidy less than 2.</li>
</ul>

</dd>
<dt><b>BGEN v1.1</b> (March 2012): <a href="spec/v1.1.html">link to spec</a></dt>
<dd>
The first widely used version of the BGEN format. The UK Biobank interim imputed data was released in this format.
Relative to v1.0, this version is designed to cope with the long alleles present at indels and
structural variants in recent releases of the 1000 Genomes Project. Features
of this version are:
<ul>
<li>Support for biallelic SNPs and indels with alleles of arbitrary length (up to 2<sup>32</sup>-1).</li>
<li>Store probabilities to at least 4 decimal places' worth of accuracy.</li>
</ul>
</dd>
<dt><b>BGEN v1.0</b> (2009):</dt>
<dd>
The original BGEN format. <em>This version is now unsupported.</em>
</dd>
</dl>
</div>

</body>
</html>
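
The v1.2 entry in the history page above mentions genotype probabilities stored at configurable precision. One way to picture this is as scaling each probability to an unsigned integer of B bits. The sketch below assumes that a stored integer x represents the probability x / (2^B - 1). The function names are hypothetical, and the rounding rule that keeps a sample's probabilities summing to one is not reproduced here, so the linked spec/v1.2.html remains the authoritative description.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: map a probability in [0, 1] to a B-bit unsigned integer.
// Assumes the stored value x represents x / (2^B - 1).
std::uint32_t quantise_probability(double p, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    assert(p >= 0.0 && p <= 1.0);
    const double scale = std::ldexp(1.0, static_cast<int>(bits)) - 1.0;  // 2^bits - 1
    return static_cast<std::uint32_t>(std::llround(p * scale));
}

// Hypothetical inverse: recover the approximate probability from the stored integer.
double dequantise_probability(std::uint32_t x, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    const double scale = std::ldexp(1.0, static_cast<int>(bits)) - 1.0;
    return static_cast<double>(x) / scale;
}
```

Fewer bits give smaller files at the cost of a larger rounding error. Under this scheme the error per probability is at most half of one step, that is 0.5 / (2^B - 1).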
1 change: 0 additions & 1 deletion doc/html/index.html

This file was deleted.

107 changes: 107 additions & 0 deletions doc/html/index.html
@@ -0,0 +1,107 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<!-- Global Site Tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-16521993-10"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());

gtag('config', 'UA-16521993-10');
</script>
<title>The BGEN format</title>
<link href="style.css" rel="stylesheet" type="text/css" />
<script type="text/javascript" src="js/jquery-1.4.2.min.js"></script>
<script type="text/javascript" src="js/local.js"></script>
</head>

<body>
<div class="header">
<div class="header_text">
The BGEN format
</div>

<div class="header_subtext">
A compressed binary format for typed and imputed genotype data
</div>
<!-- navigation -->
</div>
<nav>
<ul>
<li><a href="index.html">home</a></li>
<li><a href="software.html">software</a></li>
<li><a href="history.html">history</a></li>
<li><a href="paper.html">paper</a></li>
<li><a href="spec/latest.html">specification</a></li>
</ul>
</nav>
<div id = "introduction" class="section">
<h3>Introduction</h3>

<p>
Modern genetic association studies routinely employ data on tens to hundreds of thousands
of individuals, genotyped or imputed at tens of millions of markers genome-wide. Traditional
data formats based on text representation of these data - such as
the <a href="http://www.stats.ox.ac.uk/%7Emarchini/software/gwas/file_format.html">GEN</a>
format output by <a href="https://mathgen.stats.ox.ac.uk/impute/impute_v2.html">IMPUTE</a>,
or the <a href="http://ga4gh.org/#/fileformats-team">Variant Call Format</a>
- are sometimes not well suited to these data quantities. Indeed, for simple programs the time spent parsing
these formats can dominate program execution time.
</p>
<p>
This page describes a binary GEN file format (the "BGEN" format) that aims to address these problems.
BGEN is a robust format designed with a specific blend of features that we believe make it
useful for this type of study. It is targeted at large, potentially imputed genetic datasets.
Key features include:
<ul>
<li>The ability to store both directly typed and imputed data.</li>
<li>The ability to store both unphased genotypes and phased haplotype data.</li>
<li>Small file sizes through the use of efficient, variable-precision packed bit representations and compression.</li>
<li>The use of per-variant compression makes the format simple to index and easy to catalogue.</li>
</ul>
</p>
<p>
For example, the following plot shows the time taken to list variant identifying data - i.e. the genomic
position, ID fields and alleles - for various common formats (y-axis), against file size (x-axis), for a dataset
of 18,496 samples typed at 121,668 SNPs on chromosome 1. Both variants of BGEN defined below are shown.
</p>
<div class="embedded_plot">
<center>
<img width = 400 height = 300 src="images/bgen_comparison.png"></img>
</center>
</div>
<p>
For <a href="https://www.cog-genomics.org/plink2/input#bed">PLINK binary (<code>.bed</code>) files</a>, identifying data is
stored in a separate file (the <span class="monospace"><code>.bim</code></span> file) so the time is effectively zero.
For text-based formats there is a significant trade-off between the use of file compression and
read performance. BGEN stores
the entire dataset of 2,250 million genotypes in 334 MB, slightly over one bit per genotype, and in this test the listing took 1.5 s.
</p>
<p>
(Performance optimisation of all formats may of course be possible, so the above plot
will not represent the best possible timings, but should be regarded as illustrative.)
</p>
<p>
The BGEN format has been used in several major projects, including the
<a href="http://www.wtccc.org.uk/ccc2/">Wellcome Trust Case-Control Consortium 2</a>,
the <a href="https://www.malariagen.net/projects/host">MalariaGEN</a> project, and the <a href="http://www.bristol.ac.uk/alspac/">ALSPAC study</a>.
It has been adopted as the release format for genome-wide imputed genotypes
for the <a href="http://www.ukbiobank.ac.uk">UK Biobank</a>.
</p>
</div>
<div id = "contributors">
<strong>Acknowledgements.</strong> The following people contributed to the design and implementation of the BGEN format:
<ul>
<li>
<a href="http://www.well.ox.ac.uk/~gav/">Gavin Band</a>
</li>
<li>
<a href="http://www.stats.ox.ac.uk/~marchini/">Jonathan Marchini</a>
</li>
</ul>
</div>
</body>
</html>
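
The "slightly over one bit per genotype" figure quoted in the index page above can be checked with a few lines of arithmetic. The sketch below assumes that the quoted file size means 334 × 10^6 bytes; the function name `bits_per_genotype` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: average number of bits of file used per stored genotype.
double bits_per_genotype(std::uint64_t samples, std::uint64_t variants,
                         std::uint64_t file_bytes) {
    assert(samples > 0 && variants > 0);
    // One genotype per sample per variant.
    const std::uint64_t genotypes = samples * variants;
    return 8.0 * static_cast<double>(file_bytes) / static_cast<double>(genotypes);
}
```

For the dataset described, 18,496 samples × 121,668 SNPs gives 2,250,371,328 genotypes, and `bits_per_genotype(18496, 121668, 334000000)` comes to roughly 1.19 bits. Reading the size as 334 × 2^20 bytes instead gives about 1.25 bits, so the statement holds under either interpretation.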