Add intelligent sampling in ParquetSource #778

Merged
merged 14 commits into from
Oct 23, 2023

@dsmilkov (Collaborator) commented Oct 23, 2023

The Parquet reader can now read from local files, S3, or GCS. If the dataset is sharded, the reader accepts a glob pattern to load multiple files.

The reader now takes `shuffle_before_sampling` in addition to `sample_size`. When
`shuffle_before_sampling` is `True`, the reader shuffles the entire dataset before sampling, which
requires fetching the entire dataset. If your dataset is massive and you only want to load the
first `sample_size` rows, set `shuffle_before_sampling` to `False`. When you have many shards and
`shuffle_before_sampling` is `False`, the reader tries to sample a few rows from each shard to
avoid shard skew.
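As a rough illustration of the "a few rows from each shard" idea, here is a minimal sketch of how a row budget could be split across shards. This is not the code in this PR: the helper name `rows_per_shard` and the rollover rule are assumptions made for the example.

```python
def rows_per_shard(shard_sizes, sample_size):
    """Hypothetical helper: split `sample_size` across shards as evenly as possible.

    A shard never contributes more rows than it holds; quota that a small
    shard cannot fill rolls over to the larger shards.
    """
    remaining = sample_size
    quotas = [0] * len(shard_sizes)
    # Visit shards from smallest to largest so leftover quota can roll over.
    order = sorted(range(len(shard_sizes)), key=lambda i: shard_sizes[i])
    for pos, i in enumerate(order):
        shards_left = len(order) - pos
        fair_share = -(-remaining // shards_left)  # ceiling division
        quotas[i] = min(shard_sizes[i], fair_share)
        remaining -= quotas[i]
    return quotas


# Four equally sized shards share the budget evenly.
print(rows_per_shard([1000, 1000, 1000, 1000], 1000))
# A tiny shard contributes everything it has; the rest is split between the others.
print(rows_per_shard([10, 1000, 1000], 100))
```

Spreading the budget this way keeps any single shard from supplying the whole sample, at the cost of the sample no longer being uniform over rows when shard sizes differ.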

I tested this by creating 4 shards of 180 MB each (720 MB total) and uploading them to `s3://lilac-public-data/test-*.parquet`. Loading with `sample_size=1000` and `shuffle_before_sampling=True` takes ~2 minutes, since it has to download the entire dataset.
With `shuffle_before_sampling=False`, however, it takes 4 seconds, since it uses range requests to read the files partially.
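For context on why partial reads are cheap: a Parquet file ends with its footer metadata, followed by a 4-byte little-endian footer length and the magic bytes `PAR1`, so a reader can locate the metadata with two small reads from the end of the file. The sketch below shows this on a local file-like object, where each `seek` plus `read` stands in for one range request. It only illustrates the file layout and is not how this PR's reader is implemented; `read_footer_bytes` is a hypothetical name.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"


def read_footer_bytes(f):
    """Hypothetical helper: return the raw Parquet footer using two small reads."""
    # Read 1: the last 8 bytes hold the footer length and the magic marker.
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file")
    footer_len = struct.unpack("<I", tail[:4])[0]
    # Read 2: the footer sits immediately before those 8 bytes.
    f.seek(-(8 + footer_len), io.SEEK_END)
    return f.read(footer_len)


# A fake file with the Parquet framing; the payload is filler, not real column data.
fake = PARQUET_MAGIC + b"x" * 1000 + b"FOOTER" + struct.pack("<I", 6) + PARQUET_MAGIC
print(read_footer_bytes(io.BytesIO(fake)))
```

Once the footer is known, a reader can fetch only the row groups and columns it needs rather than the whole file.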

Fixes #770
Fixes #779

@brilee (Contributor) commented Oct 23, 2023

When I call `sample()`, I expect to get a uniform random sample, whereas this seems to be heavily implementation-dependent: smaller shards will be overrepresented, you'll get multiple samples from one shard before moving on to the next, etc.

Still, this is probably a good enough implementation to move ahead for now.
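To make the skew concern concrete, here is a small sketch that computes the chance of any given row being picked when the same number of rows is taken from every shard. The function name and the equal-rows-per-shard assumption are illustrative only and do not describe the merged implementation.

```python
def inclusion_probability(shard_sizes, rows_taken_per_shard):
    """Hypothetical helper: per-row probability of being sampled, for each shard."""
    return [min(1.0, rows_taken_per_shard / size) for size in shard_sizes]


# Taking 50 rows from a 100-row shard and from a 10,000-row shard.
print(inclusion_probability([100, 10_000], 50))
```

Under a uniform random sample every row would have the same probability, so the gap between the two values is a direct measure of how much the smaller shard is overrepresented.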

@nsthorat (Contributor) left a comment

wooop nice work!

@dsmilkov dsmilkov merged commit 31a49dc into main Oct 23, 2023
4 checks passed
@dsmilkov dsmilkov deleted the ds-shard branch October 23, 2023 16:39
Successfully merging this pull request may close these issues.

Support private datasets in HuggingFaceSource
Intelligent sampling for large datasets
3 participants