Add intelligent sampling in ParquetSource
#778
Conversation
When I call sample(), I think I'd expect a uniform random sample, whereas this seems to be heavily implementation-dependent: smaller shards will be overrepresented, and you'll get multiple samples from one shard before moving on to the next. Still, this is probably a good enough implementation to move ahead for now.
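(For context, a uniform sample across shards of unknown total size could be drawn with reservoir sampling over a streamed row iterator. The sketch below is illustrative only, not what this PR implements; the row iterator is assumed to stream rows shard by shard.)

```python
import random
from typing import Iterator, TypeVar

T = TypeVar('T')

def reservoir_sample(rows: Iterator[T], k: int, seed: int = 42) -> list[T]:
  """Uniformly sample k rows from a stream of unknown length (Algorithm R)."""
  rng = random.Random(seed)
  reservoir: list[T] = []
  for i, row in enumerate(rows):
    if i < k:
      reservoir.append(row)
    else:
      # Keep the incoming row with probability k / (i + 1), evicting a
      # uniformly chosen resident. Every row ends up equally likely to stay.
      j = rng.randint(0, i)
      if j < k:
        reservoir[j] = row
  return reservoir
```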
wooop nice work!
The Parquet reader can now read from local files, S3, or GCS. If the dataset is sharded, the reader takes a glob pattern to load multiple files.
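A rough usage sketch (the module path and the `filepaths` argument name are assumptions, not confirmed by this PR):

```python
from lilac.sources.parquet_source import ParquetSource  # module path is an assumption

# One glob pattern matches all shards, whether local, on S3, or on GCS.
source = ParquetSource(filepaths=['s3://lilac-public-data/test-*.parquet'])
```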
The reader now takes `shuffle_before_sampling` in addition to `sample_size`. When `shuffle_before_sampling` is `True`, the reader will shuffle the entire dataset before sampling, but this requires fetching the entire dataset. If your dataset is massive and you only want to load the first `sample_size` rows, set `shuffle_before_sampling` to `False`. When you have many shards and `shuffle_before_sampling` is `False`, the reader will try to sample a few rows from each shard to avoid any shard skew, as sketched below.
I tested this by making 4 shards of 180MB each (a 720MB dataset) and uploading them to `s3://lilac-public-data/test-*.parquet`. Loading with `sample_size=1000` and `shuffle_before_sampling=True` takes ~2 min since it has to download the entire dataset. However, with `shuffle_before_sampling=False`, it takes ~4 s since it uses range requests to partially read the files.
Fixes #770
Fixes #779