Add intelligent sampling in `ParquetSource` #778

dsmilkov · 2023-10-23T13:22:33Z

The parquet reader can now read from local files, S3 or GCS. If the dataset is sharded, the reader takes a glob pattern to load multiple files.

The reader now takes `shuffle_before_sampling` in addition to `sample_size`. When
`shuffle_before_sampling` is `True`, the reader shuffles the entire dataset before sampling, which
requires fetching the entire dataset. If your dataset is massive and you only want to load the
first `sample_size` rows, set `shuffle_before_sampling` to `False`. When there are many shards and
`shuffle_before_sampling` is `False`, the reader tries to sample a few rows from each shard to
avoid shard skew.
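To illustrate what "a few rows from each shard" could mean in practice, here is a minimal sketch of splitting `sample_size` evenly across shards. The helper name `allocate_per_shard` and the redistribution rule are assumptions made for illustration; this is not the code in `parquet_source.py`.

```python
from typing import List


def allocate_per_shard(sample_size: int, shard_rows: List[int]) -> List[int]:
    """Split `sample_size` across shards as evenly as possible.

    Hypothetical helper, not the PR's implementation. A shard with fewer rows
    than its share contributes all of its rows, and the shortfall is
    redistributed among the shards that still have rows left.
    """
    quotas = [0] * len(shard_rows)
    # Never ask for more rows than the dataset holds.
    remaining = min(sample_size, sum(shard_rows))
    # Indices of shards that still have unallocated rows.
    open_shards = [i for i, n in enumerate(shard_rows) if n > 0]
    while remaining > 0 and open_shards:
        share, extra = divmod(remaining, len(open_shards))
        next_open = []
        for pos, i in enumerate(open_shards):
            # The first `extra` shards absorb the remainder, one row each.
            want = share + (1 if pos < extra else 0)
            take = min(want, shard_rows[i] - quotas[i])
            quotas[i] += take
            remaining -= take
            if quotas[i] < shard_rows[i]:
                next_open.append(i)
        open_shards = next_open
    return quotas


if __name__ == "__main__":
    # Four equally sized shards, sample_size=1000.
    print(allocate_per_shard(1000, [5000, 5000, 5000, 5000]))
```

With equal shards each one contributes the same number of rows; with a tiny shard, the rows it cannot supply are taken from the larger shards instead.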

I tested this by creating 4 shards of 180MB each (a 720MB dataset) and uploading them to s3://lilac-public-data/test-*.parquet. Loading with `sample_size=1000` and `shuffle_before_sampling=True` takes ~2 minutes, since it has to download the entire dataset.
However, with `shuffle_before_sampling=False` it takes 4 seconds, since it uses range requests to read only part of each file.

Fixes #770
Fixes #779

brilee · 2023-10-23T14:26:46Z

When I call sample(), I expect a uniform random sample, whereas this seems heavily implementation-dependent: smaller shards will be overrepresented, you'll get multiple samples from one shard before moving on to the next, etc.

Still, this is probably a good enough implementation to move ahead for now.
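A small worked example of the overrepresentation concern, assuming each shard contributes an equal quota of rows (an assumption for illustration; the actual per-shard logic in the PR may differ). The function names are hypothetical.

```python
from typing import List


def per_shard_row_probability(sample_size: int, shard_rows: List[int]) -> List[float]:
    """Chance that a given row is picked, per shard, under equal quotas.

    Assumes every shard contributes sample_size / num_shards rows, chosen
    uniformly within the shard. Hypothetical model, not the PR's code.
    """
    quota = sample_size / len(shard_rows)
    return [min(1.0, quota / n) for n in shard_rows]


def uniform_row_probability(sample_size: int, shard_rows: List[int]) -> float:
    """Chance that a given row is picked under a true uniform sample."""
    return min(1.0, sample_size / sum(shard_rows))


if __name__ == "__main__":
    shards = [100, 10_000]  # one small shard, one large shard
    print(per_shard_row_probability(100, shards))
    print(uniform_row_probability(100, shards))
```

Under this model, a row in the 100-row shard is selected with probability 0.5, while a row in the 10,000-row shard is selected with probability 0.005. A uniform sample would give every row the same probability, 100 / 10,100.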

docs/datasets/dataset_load.md

lilac/sources/parquet_source.py

nsthorat

wooop nice work!

lilac/sources/parquet_source.py

docs/datasets/dataset_load.md

dsmilkov added 7 commits October 21, 2023 18:47

save

7947e56

save

3ad5a96

save

c535fed

save

3031aa4

save

294c907

save

28b6ff5

save

8a87102

dsmilkov requested a review from nsthorat October 23, 2023 13:22

github-actions bot added the backend label Oct 23, 2023

dsmilkov requested a review from brilee October 23, 2023 13:23

dsmilkov mentioned this pull request Oct 23, 2023

Intelligent sampling for large datasets #770

Closed

brilee approved these changes Oct 23, 2023


nsthorat approved these changes Oct 23, 2023


dsmilkov added 7 commits October 23, 2023 11:35

save

1d1c772

save

db03e7a

Merge remote-tracking branch 'origin' into ds-shard

e109c57

save

a16d721

save

d01dec4

save

a1e6a90

save

0313681

dsmilkov merged commit 31a49dc into main Oct 23, 2023
4 checks passed

dsmilkov deleted the ds-shard branch October 23, 2023 16:39
