
Intelligent sampling for large datasets #770

Closed
jtalmi opened this issue Oct 19, 2023 · 6 comments · Fixed by #778


jtalmi commented Oct 19, 2023

Lilac currently does not work well with large datasets living in cloud storage. We often need to inspect samples of multi-GB/TB partitioned parquet datasets, but we aren't able to load the entire dataset locally. Ideally this would mean sampling a few rows from each shard, but a simpler version would be taking a single shard.

If Lilac adds better support for sampling, it could be used to share/inspect/evaluate large pre-training datasets.

@nsthorat
Contributor

We've definitely been focusing on the single-node solution but we'll start thinking about larger datasets soon.

How big is one of your shards and in what format? We can currently read just a single shard (e.g. if it's parquet) if you use the parquet source. We have sampling for the HuggingFace source, but we can also add a sampling mechanism for arbitrary data.


jtalmi commented Oct 20, 2023

We try to keep shards small, maybe <200 MB. The main issue with taking a single shard is skew if the dataset isn't properly shuffled. An ideal solution would sample across all shards of a parquet dataset, but I don't know off-hand how feasible that is.

@dsmilkov
Collaborator

Is your data in HF datasets format, or just raw parquet files? Also, we are in a Slack room with HuggingFace, so I just asked for their advice on how best to load a single shard, and ideally a random sample across many shards. Will follow up shortly.

@dsmilkov
Collaborator

Also, I'm assuming the S3 buckets are private? We are using duckdb for really fast parquet reads and will use it for sampling. It supports S3 reads directly: https://duckdb.org/docs/guides/import/s3_import.html

You will just need to set some env variables to authorize with S3:

S3_REGION=
S3_ENDPOINT=
S3_ACCESS_KEY=
S3_SECRET_KEY=  
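
For reference, a minimal sketch (not Lilac's implementation) of reading and sampling a parquet file in a private S3 bucket with duckdb's Python API; the bucket path and sample size are placeholders, and the env variable names are the ones above:

import os
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
# Map the env variables above onto duckdb's S3 settings.
con.execute(f"SET s3_region='{os.environ['S3_REGION']}';")
con.execute(f"SET s3_access_key_id='{os.environ['S3_ACCESS_KEY']}';")
con.execute(f"SET s3_secret_access_key='{os.environ['S3_SECRET_KEY']}';")

# Sample 1000 rows directly from the bucket without downloading the full files.
sample = con.execute(
  "SELECT * FROM read_parquet('s3://my-bucket/data-*.parquet') USING SAMPLE 1000 ROWS"
).df()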

Let me know if this would work for you

dsmilkov added a commit that referenced this issue Oct 20, 2023
- Add sampling for parquet files via duckdb
- Improve memory usage by reading smaller batches from huge parquet files
- Add support for parquet files on S3

This isn't yet addressing the problem of having a large number of shards and thus needing to sample from random shards, but that's a follow-up.

Towards #770
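
The memory improvement amounts to streaming record batches instead of materializing the whole table. A rough sketch of that pattern with duckdb's Python API (the file pattern, batch size, and `process` are illustrative, not the commit's actual code):

import duckdb

con = duckdb.connect()
# Stream the result as pyarrow record batches instead of one giant table.
reader = con.execute(
  "SELECT * FROM read_parquet('huge-*.parquet')"
).fetch_record_batch(10_000)  # pyarrow.RecordBatchReader yielding ~10k-row batches

for batch in reader:
  # Each `batch` is a pyarrow.RecordBatch; process and discard to keep memory flat.
  print(batch.num_rows)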

jtalmi commented Oct 23, 2023

We are using GCP actually! And we usually have our own custom datasets in parquet format (some .jsonl and .csv, but parquet is the main one).

Does duckdb enable smart sampling of parquet files within S3/GCS buckets, so that the sample is taken from many shards?


dsmilkov commented Oct 23, 2023

I just sent #778 for review. In summary, when you have many shards and you don't want to fully shuffle the dataset, we fetch a few rows from each shard. This is done by creating a batched reader for each shard and flipping a coin to select the next reader.

while ...:
  index = random.randint(0, len(readers) - 1)
  reader = readers[index]
  batch = reader.read_next_batch() # Fetch a batch of rows.
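
Expanded into a self-contained sketch (the pyarrow-based readers here are an assumption for illustration, not necessarily what #778 uses):

import random
import pyarrow.parquet as pq

def sample_across_shards(shard_paths, sample_size, batch_size=256):
  # One batched reader per shard, so no shard has to be read in full.
  readers = [pq.ParquetFile(path).iter_batches(batch_size=batch_size) for path in shard_paths]
  rows = []
  while readers and len(rows) < sample_size:
    # Flip a coin: pick a random shard and pull its next batch of rows.
    index = random.randint(0, len(readers) - 1)
    try:
      batch = next(readers[index])
    except StopIteration:
      readers.pop(index)  # This shard is exhausted; drop its reader.
      continue
    rows.extend(batch.to_pylist())
  return rows[:sample_size]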

I tested this on 4 shards. How many shards does your large dataset have? 100s, 1,000s, 10,000s? Because the current implementation will open a reader for each shard, and I'm not sure how the OS will handle having many readers open at once.

dsmilkov added a commit that referenced this issue Oct 23, 2023
The parquet reader can now read from local files, S3 or GCS. If the dataset is sharded, the reader takes a glob pattern to load multiple files.

The reader now takes `shuffle_before_sampling` in addition to `sample_size`. When `shuffle_before_sampling` is `True`, the reader will shuffle the entire dataset before sampling, but this requires fetching the entire dataset. If your dataset is massive and you only want to load the first `sample_size` rows, set `shuffle_before_sampling` to `False`. When you have many shards and `shuffle_before_sampling` is `False`, the reader will try to sample a few rows from each shard, to avoid any shard skew.
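
As a usage illustration, a hedged sketch of configuring the parquet source this way; only `sample_size`, `shuffle_before_sampling`, and the glob pattern come from the description above, while the import path, class name, and `filepaths` argument are assumptions:

import lilac as ll

# Hypothetical configuration sketch: ParquetSource and filepaths are assumed
# names; sample_size and shuffle_before_sampling are the options described above.
source = ll.sources.ParquetSource(
  filepaths=['s3://my-bucket/shard-*.parquet'],  # glob pattern across shards (placeholder)
  sample_size=1000,
  shuffle_before_sampling=False,  # False: pull a few rows per shard via range requests
)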

I tested this by making 4 shards of 180MB each (a 720MB dataset) and uploading them to `s3://lilac-public-data/test-*.parquet`. Loading with sample_size=1000 and shuffle_before_sampling=True takes ~2 min since it has to download the entire dataset. However, with shuffle_before_sampling=False it takes 4 secs since it uses range requests to partially read the files.

Fixes #770
Fixes #779