Parquet, editable packages, de-duplication

Pre-release
kevinemoore released this 09 Jun 00:22
· 4741 commits to master since this release

Version 2.5 includes the following:

  • Ability to edit packages and build packages from in-memory DataFrames or from source files
  • The Quilt store now serializes all structured data to Parquet (instead of HDF5). Parquet opens the door to high-performance querying over Quilt packages with tools like Spark and AWS Athena.
  • Preliminary Spark support for Quilt packages through pyspark
  • Content-aware file de-duplication: all upload and download fragments are de-duplicated in the registry as well as in the local quilt_packages directory. Data fragments that already exist on the server or on local disk are skipped, saving you time, bandwidth, and storage.
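
The content-aware de-duplication described above can be illustrated with a small content-addressed store. This is a minimal sketch, not Quilt's implementation: the function names `fragment_hash` and `store_fragment`, the `objs` directory, and the choice of SHA-256 are all assumptions made for illustration. The idea is that a fragment is identified by a hash of its bytes, so a fragment whose hash is already present can be skipped.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Tuple


def fragment_hash(data: bytes) -> str:
    """Return a SHA-256 hex digest, used here as the fragment's identity.

    SHA-256 is an assumption for this sketch; the release notes do not
    say which hash Quilt uses.
    """
    return hashlib.sha256(data).hexdigest()


def store_fragment(store_dir: Path, data: bytes) -> Tuple[str, bool]:
    """Write `data` under its content hash (hypothetical helper).

    Returns (digest, was_written). If a fragment with the same digest
    already exists in `store_dir`, nothing is written.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    digest = fragment_hash(data)
    target = store_dir / digest
    if target.exists():
        # Identical content is already stored: skip the write.
        return digest, False
    target.write_bytes(data)
    return digest, True


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "objs"  # hypothetical local store location
    first = store_fragment(store, b"col_a,col_b\n1,2\n")
    second = store_fragment(store, b"col_a,col_b\n1,2\n")  # same bytes
    third = store_fragment(store, b"col_a,col_b\n3,4\n")   # different bytes
    stored_count = len(list(store.iterdir()))
    print(first[1], second[1], third[1], stored_count)  # True False True 2
```

The same check works in both directions: an uploader can ask whether a digest already exists remotely before sending bytes, and a downloader can check the local store before fetching.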