Parquet, editable packages, de-duplication
Pre-release
Version 2.5 includes the following:
- Ability to edit packages and build packages from in-memory DataFrames or from source files
- The Quilt store now serializes all structured data to Parquet (instead of HDF5). Parquet opens the door to high-performance querying over Quilt packages with tools like Spark and AWS Athena.
- Preliminary Spark support for Quilt packages through pyspark
- Content-aware file de-duplication: all upload and download fragments are de-duplicated in the registry as well as in the local `quilt_packages` directory. Data fragments that already exist on the server, or on local disk, are skipped, saving you time, bandwidth, and storage.
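
To illustrate the idea behind content-aware de-duplication, here is a minimal sketch of a content-addressed fragment store. It is not Quilt's implementation: the function names `fragment_hash` and `store_fragment` are hypothetical, and the choice of SHA-256 as the content hash is an assumption, since these notes do not say which hash Quilt uses.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Tuple


def fragment_hash(data: bytes) -> str:
    """Identify a fragment by its content (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


def store_fragment(store_dir: Path, data: bytes) -> Tuple[str, bool]:
    """Store data under its content hash (hypothetical helper).

    Returns (digest, written). written is False when a fragment with the
    same content already exists, so the write is skipped.
    """
    digest = fragment_hash(data)
    target = store_dir / digest
    if target.exists():
        # Same content is already stored: nothing to write or transfer.
        return digest, False
    target.write_bytes(data)
    return digest, True


# Storing the same bytes twice writes them only once.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    _, first_written = store_fragment(store, b"col_a,col_b\n1,2\n")
    _, second_written = store_fragment(store, b"col_a,col_b\n1,2\n")
    print(first_written, second_written)  # True False
```

Because a fragment's name is derived from its bytes, checking whether it already exists is a single lookup, and the same check can in principle be made against a remote server before uploading or downloading.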