Merged from master branch

huntdatacenter · Jan 15, 2020 · 986ac3d
2 parents 5c80f0c + 281e9a6
commit 986ac3d

Showing 18 changed files with 499 additions and 782 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,31 @@
 History
 ====
 
+15 January 2020
+----
+v1.1.5 release.  Changes are:
+
+- incorporate fix from Maarten Kooyman to build using Python 3.
+- fix issue #39 <https://bitbucket.org/gavinband/bgen/issues/39/rbgen-segfault-when-samples-are-given-in>
+
+7 August 2018
+----
+v1.1.4 release.  Update to fix sample subset issue with BGEN v1.1.
+
+2 May 2015
+-----
+v1.1.3 release.  The main changes are:
+
+- The rbgen R package, which gets data from indexed BGEN files into R, has several improvements: it's easier to install and has additional features (see below).
+- New, improved bgenix vcf output - now up to 50X faster.
+- Further performance improvements and resolution of a number of issues across the library.
+
+To accompany this we have written a paper, which is now available on bioRxiv: https://doi.org/10.1101/308296.
+
+13 July 2017
+----
+v1.0 release
+
 7 July 2016
 ----
 

diff --git a/Makefile b/Makefile
@@ -1,3 +1,20 @@
-FLAGS = -g -std=c++11 -Igenfile/include -lz
-bgen_to_vcf: example/bgen_to_vcf.cpp
+FLAGS = -g -std=c++11 -lz \
+-I genfile/include \
+-I db/include \
+-I 3rd_party/boost_1_55_0 \
+-I 3rd_party/zstd-1.1.0 \
+-I 3rd_party/zstd-1.1.0/lib \
+-I 3rd_party/zstd-1.1.0/lib/common \
+-I 3rd_party/zstd-1.1.0/lib/compress \
+-I 3rd_party/zstd-1.1.0/lib/decompress \
+-I 3rd_party/sqlite3 \
+-I include/3rd_party/sqlite3 \
+-D SQLITE_ENABLE_COLUMN_METADATA \
+-D SQLITE_ENABLE_STAT4 \
+-D SQLITE_MAX_EXPR_DEPTH=10000 \
+-D SQLITE_USE_URI=1 \
+-Wno-unused-local-typedefs \
+-fPIC -O3
+
+bgen_to_vcf: example/bgen_to_vcf.cpp $(wildcard 3rd_party/zstd-1.1.0/*.o) $(wildcard 3rd_party/sqlite3/*.o)
 	g++ ${FLAGS} -o build/bgen_to_vcf example/bgen_to_vcf.cpp src/*.cpp
diff --git a/R/package/src/load.cpp b/R/package/src/load.cpp
@@ -67,7 +67,7 @@ namespace {
 
 		// Called once per sample to determine whether we want data for this sample
 		bool set_sample( std::size_t i ) {
-			if( m_requested_sample_i->first == i ) {
+			if( m_requested_sample_i != m_requested_samples.end() && m_requested_sample_i->first == i ) {
 				m_storage_i = m_requested_sample_i->second ;
 				++m_requested_sample_i ;
 #if DEBUG
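
Note on the hunk above: the old condition dereferenced `m_requested_sample_i` even after it had advanced past the last requested sample, which is undefined behaviour for a standard container iterator. The new condition checks against `end()` first. The following is a minimal, self-contained sketch of that pattern. It assumes `m_requested_samples` is a `std::map` from sample index in the file to storage slot; `SampleFilter` and its constructor are hypothetical stand-ins, not the actual class in `load.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical stand-in for the sample-selection state in load.cpp.
// Keys are sample indices in the file; values are slots in the output storage.
struct SampleFilter {
	typedef std::map< std::size_t, std::size_t > SampleMap ;

	SampleMap m_requested_samples ;
	SampleMap::const_iterator m_requested_sample_i ;
	std::size_t m_storage_i ;

	explicit SampleFilter( SampleMap const& requested ):
		m_requested_samples( requested ),
		m_requested_sample_i( m_requested_samples.begin() ),
		m_storage_i( 0 )
	{}

	// Called once per sample, in increasing order of i.
	// Returns true if data for sample i was requested.
	bool set_sample( std::size_t i ) {
		// Test against end() before dereferencing: once every requested
		// sample has been matched, the iterator must not be read.
		if( m_requested_sample_i != m_requested_samples.end()
			&& m_requested_sample_i->first == i ) {
			m_storage_i = m_requested_sample_i->second ;
			++m_requested_sample_i ;
			return true ;
		}
		return false ;
	}
} ;
```

With samples 1 and 3 requested, a call for sample 4 arrives after the iterator has reached `end()`; the guard makes that call return `false` instead of reading through an invalid iterator.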

diff --git a/apps/bgenix.cpp b/apps/bgenix.cpp
@@ -816,7 +816,7 @@ struct IndexBgenApplication: public appcontext::ApplicationContext
 				assert( alleles.size() > 1 ) ;
 				std::cout << chromosome
 					<< "\t" << position
-					<< "\t" << rsid << "," << SNPID
+					<< "\t" << rsid << ( (SNPID == rsid) ? "" : (";" + SNPID))
 					<< "\t" << alleles[0]
 					<< "\t" ;
 				for( std::size_t j = 1; j < alleles.size(); ++j ) {
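
Note on the hunk above: the VCF ID column is a semicolon-separated list of unique identifiers, so the new expression switches the separator from `,` to `;` and omits `SNPID` when it equals `rsid`. A minimal sketch of the same expression as a free function follows; `format_vcf_id` is a hypothetical helper introduced here for illustration and does not exist in `bgenix.cpp`.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper wrapping the expression from the diff.
// Returns rsid alone when both identifiers match, otherwise "rsid;SNPID".
std::string format_vcf_id( std::string const& rsid, std::string const& SNPID ) {
	return rsid + ( ( SNPID == rsid ) ? "" : ( ";" + SNPID ) ) ;
}
```

For example, `format_vcf_id( "rs1", "rs1" )` yields `rs1`, while `format_vcf_id( "rs1", "1:100_A_G" )` yields `rs1;1:100_A_G`.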

diff --git a/doc/html/bgen_format.html b/doc/html/bgen_format.html
diff --git a/doc/html/bgen_format_v1.0.html b/doc/html/bgen_format_v1.0.html
diff --git a/doc/html/history.html b/doc/html/history.html
@@ -0,0 +1,95 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+	<!-- Global Site Tag (gtag.js) - Google Analytics -->
+	<script async src="https://www.googletagmanager.com/gtag/js?id=UA-16521993-10"></script>
+	<script>
+	  window.dataLayer = window.dataLayer || [];
+	  function gtag(){dataLayer.push(arguments);}
+	  gtag('js', new Date());
+
+	  gtag('config', 'UA-16521993-10');
+	</script>
+	<title>The BGEN format</title>
+	<link href="style.css" rel="stylesheet" type="text/css" />
+	<script type="text/javascript" src="js/jquery-1.4.2.min.js"></script>
+	<script type="text/javascript" src="js/local.js"></script>
+</head>
+
+<body>
+  <div class="header">
+    <div class="header_text">
+      The BGEN format
+    </div>
+
+    <div class="header_subtext">
+      A compressed binary format for typed and imputed genotype data
+    </div>
+	<!-- navigation -->
+  </div>
+	<nav>
+		<ul>
+			<li><a href="index.html">home</a></li>
+			<li><a href="software.html">software</a></li>
+			<li><a href="history.html">history</a></li>
+			<li><a href="paper.html">paper</a></li>
+			<li><a href="spec/latest.html">specification</a></li>
+		</ul>
+	</nav>
+	  <div class = "boxed">
+			<h3>History of revisions</h3>
+	  <dl>
+	  <p>
+		  A history of revisions of the BGEN format specification is as follows:
+	  </p>
+	  <ul>
+		  <dt><b>BGEN v1.3</b> (January 2017): <a href="spec/v1.3.html">link to spec</a></dt>
+		  <dd>
+			  <li>Support for the <a href="http://zstd.net">zstandard</a> compression library. 
+				 Tests indicate this has better performance both in terms of file size
+				 and speed of reading and writing.</li>
+		  </dd>
+		  <dt><b>BGEN v1.2</b> (March 2016): <a href="spec/v1.2.html">link to spec</a></dt>
+		  <dd>
+			  Major update extending the BGEN format to add:
+              <ul>
+                <li>Support for variable ploidy and explicit missing data.</li>
+
+                <li>Support for multi-allelic variants (e.g. complex structural
+                variants).</li>
+
+                <li>Allow for control over file size by supporting genotype probabilities
+                stored at configurable precision.</li>
+
+                <li>Support for storing sample identifiers.</li>
+              </ul>
+			  A draft version of this spec was published beginning May 2015.  The following changes have been made since the earlier draft:
+			  <ul>
+				  <li><b>2015-11-05</b> (<code>v1.2 beta1</code>): modified the treatment of missing data in Layout 2 (v1.2-style) variant data blocks.</li>
+				  <li><b>2016-03-21</b> (<code>v1.2 beta2</code>): modified the order of stored probabilities for samples with ploidy greater than 2;
+					  clarified specification of the <code>phased</code> flag for samples with ploidy less than 2.</li>
+				  </ul>
+
+		  </dd>
+		  <dt><b>BGEN v1.1</b> (March 2012): <a href="spec/v1.1.html">link to spec</a></dt>
+		  <dd>
+              The first widely used version of the BGEN format.  The UK Biobank interim imputed data was released in this format.
+			  Relative to v1.0, this version is designed to cope with the long alleles present at indels and
+              structural variants in recent releases of the 1000 genomes project.  Features
+			  of this version are:
+              <ul>
+                <li>Support for biallelic SNPs and indels with alleles of arbitrary length (up to 2<sup>32</sup>-1).</li>
+                <li>Store probabilities to at least 4 decimal places' worth of accuracy.</li>
+              </ul>
+		  </dd>
+		  <dt><b>BGEN v1.0</b> (2009):</dt>
+		  <dd>
+			  The original BGEN format.  <em>This version is now unsupported.</em>
+		  </dd>
+		</dl>
+	</div>
+
+</body>
+</html>
diff --git a/doc/html/index.html b/doc/html/index.html
@@ -0,0 +1,107 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<!-- Global Site Tag (gtag.js) - Google Analytics -->
+<script async src="https://www.googletagmanager.com/gtag/js?id=UA-16521993-10"></script>
+<script>
+  window.dataLayer = window.dataLayer || [];
+  function gtag(){dataLayer.push(arguments);}
+  gtag('js', new Date());
+
+  gtag('config', 'UA-16521993-10');
+</script>
+	<title>The BGEN format</title>
+	<link href="style.css" rel="stylesheet" type="text/css" />
+	<script type="text/javascript" src="js/jquery-1.4.2.min.js"></script>
+	<script type="text/javascript" src="js/local.js"></script>
+</head>
+
+<body>
+  <div class="header">
+    <div class="header_text">
+      The BGEN format
+    </div>
+
+    <div class="header_subtext">
+      A compressed binary format for typed and imputed genotype data
+    </div>
+	<!-- navigation -->
+  </div>
+	<nav>
+		<ul>
+			<li><a href="index.html">home</a></li>
+			<li><a href="software.html">software</a></li>
+			<li><a href="history.html">history</a></li>
+			<li><a href="paper.html">paper</a></li>
+			<li><a href="spec/latest.html">specification</a></li>
+		</ul>
+	</nav>
+	<div id = "introduction" class="section">
+		<h3>Introduction</h3>
+
+		<p>
+			Modern genetic association studies routinely employ data on tens to hundreds of thousands
+			of individuals, genotyped or imputed at tens of millions of markers genome-wide.  Traditional
+			data formats based on text representation of these data - such as
+			the <a href="http://www.stats.ox.ac.uk/%7Emarchini/software/gwas/file_format.html">GEN</a>
+			format output by <a href="https://mathgen.stats.ox.ac.uk/impute/impute_v2.html">IMPUTE</a>,
+			or the <a href="http://ga4gh.org/#/fileformats-team">Variant Call Format</a>
+			- are sometimes not well suited to these data quantities.  Indeed, for simple programs the time spent parsing
+			these formats can dominate program execution time.
+		</p>
+		<p>
+			This page describes a binary GEN file format (the "BGEN" format) which aims to address these problems.
+			BGEN is a robust format that has been designed to have a specific blend of features that we believe make it
+			useful for this type of study.  It is targeted for use with large, potentially imputed genetic datasets.
+			Key features include:
+			<ul>
+				<li>The ability to store both directly typed and imputed data.</li>
+				<li>The ability to store both unphased genotypes and phased haplotype data.</li>
+				<li>Small file sizes through the use of efficient, variable-precision packed bit representations and compression.</li>
+				<li>The use of per-variant compression makes the format simple to index and easy to catalogue.</li>
+			</ul>
+		</p>
+		<p>
+			For example, the following plot shows the time taken to list variant identifying data - i.e. the genomic
+			position, ID fields and alleles - for various common formats (Y-axis), against file size (X-axis), for a dataset
+			of 18,496 samples typed at 121,668 SNPs on chromosome 1.  Both variants of BGEN defined below are shown.
+		</p>
+		<div class="embedded_plot">
+			<center>
+				<img width = 400 height = 300 src="images/bgen_comparison.png"></img>
+			</center>
+		</div>
+		<p>
+			For <a href="https://www.cog-genomics.org/plink2/input#bed">PLINK binary (<code>.bed</code>) files</a>, identifying data is
+			stored in a separate file (the <span class="monospace"><code>.bim</code></span> file) so the time is effectively zero.
+			For text-based formats there is a significant trade-off between the use of file compression and
+			read performance.  BGEN stores
+			the entire dataset of 2,250 million genotypes in 334 MB, slightly over one bit per genotype, and in this test took 1.5 s.
+		</p>
+		<p>
+			(Performance optimisation of all formats may of course be possible, so the above plot
+			will not represent the best possible timings, but should be regarded as illustrative.)
+		</p>
+		<p>
+			The BGEN format has been used in several major projects, including the
+			<a href="http://www.wtccc.org.uk/ccc2/">Wellcome Trust Case-Control Consortium 2</a>,
+			the <a href="https://www.malariagen.net/projects/host">MalariaGEN</a> project, and the <a href="http://www.bristol.ac.uk/alspac/">ALSPAC study</a>.
+			It has been adopted as the release format for genome-wide imputed genotypes
+			for the <a href="http://www.ukbiobank.ac.uk">UK Biobank</a>.
+		</p>
+	</div>
+	<div id = "contributors">
+		<strong>Acknowledgements.</strong> The following people contributed to the design and implementation of the BGEN format:
+        <ul>
+                <li>
+                    <a href="http://www.well.ox.ac.uk/~gav/">Gavin Band</a>
+                </li>
+                <li>
+                    <a href="http://www.stats.ox.ac.uk/~marchini/">Jonathan Marchini</a>
+                </li>
+			</ul>
+	</div>
+</body>
+</html>