-
Managing Variant Calling Files the Big Data Way: Using HDFS and Apache Parquet (2017, 9 citations). An early tool to convert VCF to JSON (and then natively to Parquet), plus some basic experiments on computing allele frequencies with varying cluster sizes and replication factors. Focuses on doing this for small genome regions on around one hundred samples. Performance is surprisingly slow: with a cluster of 150 nodes it takes ~100 seconds to compute allele frequencies for only 100 samples. Parquet files were a ~4-fold reduction in size from VCF (unclear whether the VCF was uncompressed?). Results not all that impressive, actually.
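To make the workload concrete: the allele-frequency computation is essentially a group-by aggregation once the data is in Parquet. A minimal PySpark sketch, assuming a hypothetical long-format table with one row per (variant, sample) call and an `n_alt` column counting ALT alleles — not the paper's actual code or schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("allele-freq").getOrCreate()

# Hypothetical long-format table: one row per (variant, sample) call,
# with n_alt = number of ALT alleles carried (0, 1, or 2 for diploids).
calls = spark.read.parquet("/data/calls.parquet")

# Allele frequency per variant = sum(ALT alleles) / (2 * number of calls).
af = (
    calls.groupBy("chrom", "pos")
    .agg((F.sum("n_alt") / (2 * F.count("n_alt"))).alias("allele_freq"))
)
af.show()
```

That this kind of one-line aggregation needs ~100 seconds on 150 nodes for 100 samples is what makes the reported numbers so underwhelming.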
-
High-performance genetic datastore on AWS S3 using Parquet and Arrow (2021). Blog post by 23andMe engineers.
Interesting that they can (apparently) get Parquet working well with 1M samples, perhaps by transposing the matrix? But they would still have tens of millions of columns. I guess this example only has 1000 markers, so maybe that's all they are demonstrating?
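For what it's worth, here's a sketch of what a transposed ("sample-major") Parquet layout might look like with PyArrow: one row per sample, one column per marker. The column names and the 1000-marker scale are assumptions based on the post, not their actual schema — the point is that the column count tracks markers, not samples, so adding samples just appends rows:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n_samples, n_markers = 1000, 1000  # toy scale; the post implies ~1M samples x 1000 markers

# One column per marker, one row per sample. Transposing puts the small, fixed
# axis (markers) on columns, which is the shape Parquet is comfortable with.
rng = np.random.default_rng(42)
columns = {
    f"marker_{j}": pa.array(rng.integers(0, 3, size=n_samples, dtype=np.int8))
    for j in range(n_markers)
}
columns["sample_id"] = pa.array([f"S{i}" for i in range(n_samples)])

table = pa.table(columns)
pq.write_table(table, "genotypes_sample_major.parquet")
```

This only works while the marker panel stays small; at WGS scale the transposed matrix hits the tens-of-millions-of-columns problem from the other direction.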
-
Genomic data in Parquet format on Azure (2022). Blog post from Microsoft.
Shows examples of running SQL queries on the datasets. gnomAD doesn't include sample-level data, so it's not especially interesting or challenging here. The 1000G example uses Glow to convert the VCF data.
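The queries shown are plain SQL over a sites-only table. A minimal sketch of the pattern, with hypothetical column names (`chrom`, `pos`, `ref`, `alt`, `AF`) rather than the exact gnomAD schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gnomad-query").getOrCreate()

# Sites-only table: one row per variant, no per-sample genotypes,
# so this is just an ordinary (long and narrow) analytics workload.
df = spark.read.parquet("/data/gnomad_sites.parquet")
df.createOrReplaceTempView("variants")

# Example: rare variants in a region. Column names are assumed for illustration.
spark.sql("""
    SELECT chrom, pos, ref, alt, AF
    FROM variants
    WHERE chrom = 'chr1' AND pos BETWEEN 1000000 AND 2000000 AND AF < 0.01
    ORDER BY pos
""").show()
```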
-
Google's method for ingesting VCFs into BigQuery. Not directly connected to the Hadoop ecosystem really, but it uses Apache Beam.
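Roughly, the pattern is a Beam pipeline that parses VCF records into row dicts and loads them into BigQuery. A much-simplified sketch of that shape — `parse_vcf_line` and the paths are illustrative, the real tool handles headers, INFO/FORMAT parsing and schema generation, and it writes to BigQuery rather than text:

```python
import json
import apache_beam as beam

def parse_vcf_line(line):
    """Parse a minimal set of fixed VCF fields into a BigQuery-style row dict."""
    fields = line.split("\t")
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "ref": fields[3],
        "alt": fields[4].split(","),
    }

with beam.Pipeline() as p:
    (
        p
        | "ReadVCF" >> beam.io.ReadFromText("gs://bucket/sample.vcf")
        | "DropHeaders" >> beam.Filter(lambda line: not line.startswith("#"))
        | "ParseRecords" >> beam.Map(parse_vcf_line)
        # The real tool writes rows to BigQuery; JSON lines keep this sketch self-contained.
        | "ToJson" >> beam.Map(json.dumps)
        | "WriteOut" >> beam.io.WriteToText("gs://bucket/variants")
    )
```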
-
(0 citations.) Based on Apache Kudu. Explicitly discusses the need for scalability, and uses distributed bitmap indexes to make queries efficient. Interesting comparison of different technology mixes: SparkSQL+Parquet and HBase+HDFS. Only tested on 1000 Genomes data (so very few samples).
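The bitmap-index idea itself is easy to sketch: keep, for each (variant, genotype) value, a bitmap over sample indices, and answer queries with bitwise ops. A toy pure-Python version using ints as bitsets (not their implementation — a real system would use compressed/distributed bitmaps, e.g. roaring bitmaps):

```python
from collections import defaultdict

class GenotypeBitmapIndex:
    """Toy bitmap index: one bitset (a Python int) per (variant, genotype) value."""

    def __init__(self):
        self.bitmaps = defaultdict(int)

    def add_call(self, variant, genotype, sample_idx):
        # Set the bit for this sample in the (variant, genotype) bitmap.
        self.bitmaps[(variant, genotype)] |= 1 << sample_idx

    def samples_with(self, variant, genotype):
        """Sample indices carrying `genotype` at `variant`."""
        bits = self.bitmaps[(variant, genotype)]
        return [i for i in range(bits.bit_length()) if bits >> i & 1]

    def samples_with_both(self, q1, q2):
        """Conjunctive query: intersect two bitmaps with a single bitwise AND."""
        bits = self.bitmaps[q1] & self.bitmaps[q2]
        return [i for i in range(bits.bit_length()) if bits >> i & 1]

idx = GenotypeBitmapIndex()
idx.add_call("1:12345", 1, sample_idx=0)
idx.add_call("1:12345", 1, sample_idx=2)
idx.add_call("1:99999", 2, sample_idx=2)
print(idx.samples_with_both(("1:12345", 1), ("1:99999", 2)))  # [2]
```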
-
Glow
Seems to be mainly based on the idea of using SparkSQL to query the data. No sign of an academic paper anywhere.
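For context, the usage pattern Glow documents is to register its functions on a SparkSession and read VCF through a Spark data source, after which everything is ordinary DataFrame/SparkSQL code. A minimal sketch along the lines of their docs (paths are placeholders):

```python
import glow
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("glow-demo").getOrCreate()
spark = glow.register(spark)  # registers Glow's SQL functions and data sources

# VCF rows come back as a DataFrame with a per-sample `genotypes` array column.
df = spark.read.format("vcf").load("/data/example.vcf.gz")

# From here it's plain SparkSQL work, e.g. filter by position.
df.filter(F.col("start") < 1_000_000).select("contigName", "start", "genotypes").show()
```

Note the genotype matrix ends up packed into a single nested array column per row, which sidesteps the wide-schema problem but pushes all per-sample work into array manipulation.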
-
So the meta-question I'd like to ask (and hopefully @hammer and @tomwhite can answer) is why trying to use general tools like Parquet+SparkSQL doesn't work that well for genetic variation data. My understanding was that the overall ecosystem is optimised for "long and narrow" data, whereas large-scale VCF data is "longish and super wide", and it's just not designed for that. Is that a fair characterisation?
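To put some shape on that characterisation, here's a sketch contrasting the two layouts with PyArrow schemas (column names are made up). In the wide layout the Parquet schema, and the per-column-chunk footer metadata with it, grows with the sample count; in the long layout the schema is fixed and only the row count grows, which is the case the ecosystem is built for:

```python
import pyarrow as pa

n_samples = 1_000_000

# Wide layout: one genotype column per sample -> a million-field schema,
# with Parquet footer and column-chunk metadata scaling along with it.
wide_schema = pa.schema(
    [("chrom", pa.string()), ("pos", pa.int64())]
    + [(f"sample_{i}", pa.int8()) for i in range(n_samples)]
)

# Long ("narrow") layout: fixed 4-column schema, one row per (variant, sample)
# call -- but now a biobank-scale dataset explodes to trillions of rows.
long_schema = pa.schema([
    ("chrom", pa.string()),
    ("pos", pa.int64()),
    ("sample_id", pa.string()),
    ("n_alt", pa.int8()),
])

print(len(wide_schema), "columns vs", len(long_schema))  # 1000002 columns vs 4
```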
-
An important section of the paper explains and discusses the prior art for distributed computation on genetic variation data. This discussion collects various resources together in preparation for the actual paper narrative.
This started out as a discussion about Parquet, but Kudu also appears, and generally I think our goal is to describe the attempts to use the original big data platform for VCF processing.
[Note: converted from an issue by @jeromekelleher]