-
Managing Variant Calling Files the Big Data Way: Using HDFS and Apache Parquet (2017, 9 citations). An early tool to convert VCF to JSON (and then natively to Parquet), plus some basic experiments on computing allele frequencies with varying cluster sizes and replication factors. Focuses on doing this for small genome regions on around one hundred samples. Performance is surprisingly slow: with a cluster of 150 nodes it takes ~100 seconds to compute allele frequencies for only 100 samples. Parquet files were a ~4-fold reduction in size from VCF (unclear whether the VCF was uncompressed?). Results not all that impressive, actually.
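To make the workload concrete: the allele-frequency computation is essentially a group-by aggregation once the data is in Parquet. A minimal PySpark sketch, assuming a hypothetical long-format table with one row per (variant, sample) call and an `n_alt` column counting ALT alleles — not the paper's actual code or schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("allele-freq").getOrCreate()

# Hypothetical long-format table: one row per (variant, sample) call,
# with n_alt = number of ALT alleles carried (0, 1, or 2 for diploids).
calls = spark.read.parquet("/data/calls.parquet")

# Allele frequency per variant = sum(ALT alleles) / (2 * number of calls).
af = (
    calls.groupBy("chrom", "pos")
    .agg((F.sum("n_alt") / (2 * F.count("n_alt"))).alias("allele_freq"))
)
af.show()
```

That this kind of one-line aggregation needs ~100 seconds on 150 nodes for 100 samples is what makes the reported numbers so underwhelming.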
-
High-performance genetic datastore on AWS S3 using Parquet and Arrow (2021). Blog post by 23andMe engineers.
Interesting that they can (apparently) get Parquet working well with 1M samples, perhaps by transposing the matrix? But they would still have tens of millions of columns. I guess this example only has 1000 markers, so maybe that's all they are demonstrating?
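For what it's worth, here's a sketch of what a transposed ("sample-major") Parquet layout might look like with PyArrow: one row per sample, one column per marker. The column names and the 1000-marker scale are assumptions based on the post, not their actual schema — the point is that the column count tracks markers, not samples, so adding samples just appends rows:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n_samples, n_markers = 1000, 1000  # toy scale; the post implies ~1M samples x 1000 markers

# One column per marker, one row per sample. Transposing puts the small, fixed
# axis (markers) on columns, which is the shape Parquet is comfortable with.
rng = np.random.default_rng(42)
columns = {
    f"marker_{j}": pa.array(rng.integers(0, 3, size=n_samples, dtype=np.int8))
    for j in range(n_markers)
}
columns["sample_id"] = pa.array([f"S{i}" for i in range(n_samples)])

table = pa.table(columns)
pq.write_table(table, "genotypes_sample_major.parquet")
```

This only works while the marker panel stays small; at WGS scale the transposed matrix hits the tens-of-millions-of-columns problem from the other direction.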
-
Genomic data in Parquet format on Azure (2022). Blog post from Microsoft.
Shows examples of running SQL queries on the datasets. gnomAD doesn't include sample-level data, so it's not especially interesting or challenging here. The 1000G example uses Glow to convert the VCF data.
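The queries shown are plain SQL over a sites-only table. A minimal sketch of the pattern, with hypothetical column names (`chrom`, `pos`, `ref`, `alt`, `AF`) rather than the exact gnomAD schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gnomad-query").getOrCreate()

# Sites-only table: one row per variant, no per-sample genotypes,
# so this is just an ordinary (long and narrow) analytics workload.
df = spark.read.parquet("/data/gnomad_sites.parquet")
df.createOrReplaceTempView("variants")

# Example: rare variants in a region. Column names are assumed for illustration.
spark.sql("""
    SELECT chrom, pos, ref, alt, AF
    FROM variants
    WHERE chrom = 'chr1' AND pos BETWEEN 1000000 AND 2000000 AND AF < 0.01
    ORDER BY pos
""").show()
```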
-
Google's method for ingesting VCFs into BigQuery. Not directly connected to the Hadoop ecosystem really, but it uses Apache Beam.
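Roughly, the pattern is a Beam pipeline that parses VCF records into row dicts and loads them into BigQuery. A much-simplified sketch of that shape — `parse_vcf_line` and the paths are illustrative, the real tool handles headers, INFO/FORMAT parsing and schema generation, and it writes to BigQuery rather than text:

```python
import json
import apache_beam as beam

def parse_vcf_line(line):
    """Parse a minimal set of fixed VCF fields into a BigQuery-style row dict."""
    fields = line.split("\t")
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "ref": fields[3],
        "alt": fields[4].split(","),
    }

with beam.Pipeline() as p:
    (
        p
        | "ReadVCF" >> beam.io.ReadFromText("gs://bucket/sample.vcf")
        | "DropHeaders" >> beam.Filter(lambda line: not line.startswith("#"))
        | "ParseRecords" >> beam.Map(parse_vcf_line)
        # The real tool writes rows to BigQuery; JSON lines keep this sketch self-contained.
        | "ToJson" >> beam.Map(json.dumps)
        | "WriteOut" >> beam.io.WriteToText("gs://bucket/variants")
    )
```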
-
(0 citations.) Based on Apache Kudu. Explicitly discusses the need for scalability, and uses distributed bitmap indexes to make queries efficient. Interesting comparison of different technology mixes: SparkSQL+Parquet and HBase+HDFS. Only tested on 1000 Genomes data (so very few samples).
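The bitmap-index idea itself is easy to sketch: keep, for each (variant, genotype) value, a bitmap over sample indices, and answer queries with bitwise ops. A toy pure-Python version using ints as bitsets (not their implementation — a real system would use compressed/distributed bitmaps, e.g. roaring bitmaps):

```python
from collections import defaultdict

class GenotypeBitmapIndex:
    """Toy bitmap index: one bitset (a Python int) per (variant, genotype) value."""

    def __init__(self):
        self.bitmaps = defaultdict(int)

    def add_call(self, variant, genotype, sample_idx):
        # Set the bit for this sample in the (variant, genotype) bitmap.
        self.bitmaps[(variant, genotype)] |= 1 << sample_idx

    def samples_with(self, variant, genotype):
        """Sample indices carrying `genotype` at `variant`."""
        bits = self.bitmaps[(variant, genotype)]
        return [i for i in range(bits.bit_length()) if bits >> i & 1]

    def samples_with_both(self, q1, q2):
        """Conjunctive query: intersect two bitmaps with a single bitwise AND."""
        bits = self.bitmaps[q1] & self.bitmaps[q2]
        return [i for i in range(bits.bit_length()) if bits >> i & 1]

idx = GenotypeBitmapIndex()
idx.add_call("1:12345", 1, sample_idx=0)
idx.add_call("1:12345", 1, sample_idx=2)
idx.add_call("1:99999", 2, sample_idx=2)
print(idx.samples_with_both(("1:12345", 1), ("1:99999", 2)))  # [2]
```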
-
Glow
Seems to be mainly based on the idea of using SparkSQL to query the data. No sign of an academic paper anywhere.
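For context, the usage pattern Glow documents is to register its functions on a SparkSession and read VCF through a Spark data source, after which everything is ordinary DataFrame/SparkSQL code. A minimal sketch along the lines of their docs (paths are placeholders):

```python
import glow
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("glow-demo").getOrCreate()
spark = glow.register(spark)  # registers Glow's SQL functions and data sources

# VCF rows come back as a DataFrame with a per-sample `genotypes` array column.
df = spark.read.format("vcf").load("/data/example.vcf.gz")

# From here it's plain SparkSQL work, e.g. filter by position.
df.filter(F.col("start") < 1_000_000).select("contigName", "start", "genotypes").show()
```

Note the genotype matrix ends up packed into a single nested array column per row, which sidesteps the wide-schema problem but pushes all per-sample work into array manipulation.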
-
So the meta-question I'd like to ask (and hopefully @hammer and @tomwhite can answer) is why trying to use general tools like Parquet+SparkSQL doesn't work that well for genetic variation data. My understanding was that the overall ecosystem is optimised for "long and narrow" data, whereas large-scale VCF data is "longish and super wide", and it's just not designed for that. Is that a fair characterisation?
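To put some shape on that characterisation, here's a sketch contrasting the two layouts with PyArrow schemas (column names are made up). In the wide layout the Parquet schema, and the per-column-chunk footer metadata with it, grows with the sample count; in the long layout the schema is fixed and only the row count grows, which is the case the ecosystem is built for:

```python
import pyarrow as pa

n_samples = 1_000_000

# Wide layout: one genotype column per sample -> a million-field schema,
# with Parquet footer and column-chunk metadata scaling along with it.
wide_schema = pa.schema(
    [("chrom", pa.string()), ("pos", pa.int64())]
    + [(f"sample_{i}", pa.int8()) for i in range(n_samples)]
)

# Long ("narrow") layout: fixed 4-column schema, one row per (variant, sample)
# call -- but now a biobank-scale dataset explodes to trillions of rows.
long_schema = pa.schema([
    ("chrom", pa.string()),
    ("pos", pa.int64()),
    ("sample_id", pa.string()),
    ("n_alt", pa.int8()),
])

print(len(wide_schema), "columns vs", len(long_schema))  # 1000002 columns vs 4
```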
-
An important section of the paper explains and discusses the prior art for distributed computation on genetic variation data. This discussion collects various resources together in preparation for the actual paper narrative.
This started out as a discussion about Parquet, but Kudu also appears, and generally I think our goal is to describe the attempts to use the original big data platform for VCF processing.
[Note: converted from an issue by @jeromekelleher]