Intelligent sampling for large datasets #770
We've definitely been focusing on the single-node solution but we'll start thinking about larger datasets soon. How big is one of your shards and in what format? We can currently read just a single shard (e.g. if it's parquet) if you use the parquet source. We have sampling for the HuggingFace source, but we can also add a sampling mechanism for arbitrary data.
We try to keep shards small, maybe <200 MB. The main issue with taking a single shard is skew if the dataset isn't properly shuffled. An ideal solution would sample across all shards of a parquet dataset, but I don't know off-hand how feasible that is.
Is your data in HF datasets format, or just raw parquet files? Also, we are in a Slack room with HuggingFace, so I just asked for their advice on how best to load a single shard, and ideally a random sample of many shards. Will follow up shortly.
Also, I'm assuming the S3 buckets are private? We are using duckdb for really fast parquet reads and will use it for sampling. It supports S3 reads directly: https://duckdb.org/docs/guides/import/s3_import.html. You will just need to set some env variables to authorize with S3:
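For illustration, a minimal sketch of what that could look like, assuming the standard AWS environment variables; the bucket, file path, and sample size are placeholders, and this is not Lilac's actual loading code:

```python
import os
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Authorize DuckDB against S3 using the usual AWS environment variables.
con.execute(f"SET s3_region='{os.environ.get('AWS_DEFAULT_REGION', 'us-east-1')}'")
con.execute(f"SET s3_access_key_id='{os.environ['AWS_ACCESS_KEY_ID']}'")
con.execute(f"SET s3_secret_access_key='{os.environ['AWS_SECRET_ACCESS_KEY']}'")

# Reservoir-sample 1,000 rows from a single (placeholder) shard.
sample = con.execute(
    "SELECT * FROM read_parquet('s3://my-bucket/dataset/shard-000.parquet') "
    "USING SAMPLE 1000 ROWS"
).df()
print(len(sample))
```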
Let me know if this would work for you.
- Add sampling for parquet files via duckdb
- Improve memory usage by reading smaller batches from huge parquet files
- Add support for parquet files on S3

This isn't yet addressing the problem of having a large number of shards and thus needing to sample from random shards, but that's a follow-up. Towards #770
We are using GCP, actually! And we have our own custom datasets, usually in parquet format (some .jsonl and .csv, but parquet is the main one). Does duckdb enable smart sampling of parquet files within S3/GCS buckets, so that the sample is taken from many shards?
I just sent #778 for review. In summary, when you have many shards, and you don't want to fully shuffle a dataset, we fetch a few rows from each shard. This is done by creating a batched reader for each shard and flipping a coin to select the next reader.
I tested this on 4 shards. How many shards does your large dataset have? 100s, 1,000s, 10,000s? The current implementation will open a reader for each shard, and I'm not sure how the OS will handle having many readers open at once.
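For intuition, a rough sketch of that coin-flip mechanism, using pyarrow rather than Lilac's actual reader; the helper name, shard filenames, and batch size are hypothetical:

```python
import random
import pyarrow.parquet as pq

def sample_across_shards(shard_paths, sample_size, batch_size=256):
  """Interleave batched readers: randomly pick which shard yields the next batch."""
  readers = [pq.ParquetFile(path).iter_batches(batch_size=batch_size) for path in shard_paths]
  rows = []
  while readers and len(rows) < sample_size:
    i = random.randrange(len(readers))  # the "coin flip" over the open readers
    try:
      batch = next(readers[i])
    except StopIteration:
      readers.pop(i)  # this shard is exhausted
      continue
    rows.extend(batch.to_pylist())
  return rows[:sample_size]

# Hypothetical shard filenames, purely for illustration.
sample = sample_across_shards(['shard-0.parquet', 'shard-1.parquet'], sample_size=1000)
```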
The parquet reader can now read from local files, S3, or GCS. If the dataset is sharded, the reader takes a glob pattern to load multiple files.

The reader now takes `shuffle_before_sampling` in addition to `sample_size`. When `shuffle_before_sampling` is `True`, the reader shuffles the entire dataset before sampling, but this requires fetching the entire dataset. If your dataset is massive and you only want to load the first `sample_size` rows, set `shuffle_before_sampling` to `False`. When you have many shards and `shuffle_before_sampling` is `False`, the reader will try to sample a few rows from each shard to avoid any shard skew.

I tested this by making 4 shards of 180 MB each (a 720 MB dataset) and uploading them to `s3://lilac-public-data/test-*.parquet`. Loading with `sample_size=1000` and `shuffle_before_sampling=True` takes ~2 min since it has to download the entire dataset. With `shuffle_before_sampling=False`, it takes ~4 s since it uses range requests to partially read the files.

Fixes #770 Fixes #779
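To see why the `shuffle_before_sampling=False` path avoids downloading whole shards, here is a hedged sketch (not the PR's code) of reading only the first batch of each remote shard through an fsspec filesystem, so pyarrow fetches roughly the parquet footer and the first row group via range requests. Anonymous access to the test bucket and the `rows_per_shard` value are assumptions for illustration.

```python
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem(anon=True)  # assumes the test bucket is publicly readable
shard_paths = fs.glob('lilac-public-data/test-*.parquet')

rows_per_shard = 250  # arbitrary illustrative value
rows = []
for path in shard_paths:
  with fs.open(path, 'rb') as f:
    reader = pq.ParquetFile(f)
    for batch in reader.iter_batches(batch_size=rows_per_shard):
      rows.extend(batch.to_pylist())
      break  # only the first batch per shard is ever fetched
print(f'sampled {len(rows)} rows without downloading full shards')
```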
Lilac currently does not work well with large datasets living in cloud storage. We often need to inspect samples of multi-GB/TB, partitioned parquet datasets, but we aren't able to load the entire dataset locally. Ideally this would mean sampling a few rows from each shard, but a simpler version would be taking a single shard.

If Lilac adds better support for sampling, it could be used to share/inspect/evaluate large pre-training datasets.