Intelligent sampling for large datasets #770
We've definitely been focusing on the single-node solution but we'll start thinking about larger datasets soon. How big is one of your shards and in what format? We can currently read just a single shard (e.g. if it's parquet) if you use the parquet source. We have sampling for the HuggingFace source, but we can also add a sampling mechanism for arbitrary data.
We try to keep shards small, maybe <200 MB. The main issue with taking a single shard is skew if the dataset isn't properly shuffled. An ideal solution would sample across all shards of a parquet dataset, but I don't know off-hand how feasible that is.
Is your data in HF datasets format, or just raw parquet files? Also, we are in a Slack room with HuggingFace, so I just asked for their advice on how best to load a single shard, and ideally a random sample of many shards. Will follow up shortly.
Also, I'm assuming the S3 buckets are private? We are using duckdb for really fast parquet reads and will use it for sampling. It supports S3 reads directly: https://duckdb.org/docs/guides/import/s3_import.html. You will just need to set some env variables to authorize with S3:
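For illustration, a minimal sketch of what that could look like, assuming the standard AWS environment variables; the bucket, file path, and sample size are placeholders, and this is not Lilac's actual loading code:

```python
import os
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Authorize DuckDB against S3 using the usual AWS environment variables.
con.execute(f"SET s3_region='{os.environ.get('AWS_DEFAULT_REGION', 'us-east-1')}'")
con.execute(f"SET s3_access_key_id='{os.environ['AWS_ACCESS_KEY_ID']}'")
con.execute(f"SET s3_secret_access_key='{os.environ['AWS_SECRET_ACCESS_KEY']}'")

# Reservoir-sample 1,000 rows from a single (placeholder) shard.
sample = con.execute(
    "SELECT * FROM read_parquet('s3://my-bucket/dataset/shard-000.parquet') "
    "USING SAMPLE 1000 ROWS"
).df()
print(len(sample))
```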
Let me know if this would work for you.
- Add sampling for parquet files via duckdb
- Improve memory usage by reading smaller batches from huge parquet files
- Add support for parquet files on S3

This isn't yet addressing the problem of having a large number of shards and thus needing to sample from random shards, but that's a follow-up. Towards #770
We are using GCP, actually! And we have our own custom datasets, usually in parquet format (some .jsonl and .csv, but parquet is the main one). Does duckdb enable smart sampling of parquet files within S3/GCS buckets, so that the sample is taken from many shards?
I just sent #778 for review. In summary, when you have many shards, and you don't want to fully shuffle a dataset, we fetch a few rows from each shard. This is done by creating a batched reader for each shard and flipping a coin to select the next reader.
I tested this on 4 shards. How many shards does your large dataset have? 100s, 1,000s, 10,000s? The current implementation will open a reader for each shard, and I'm not sure how the OS will handle having many readers open at once.
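For intuition, a rough sketch of that coin-flip mechanism, using pyarrow rather than Lilac's actual reader; the helper name, shard filenames, and batch size are hypothetical:

```python
import random
import pyarrow.parquet as pq

def sample_across_shards(shard_paths, sample_size, batch_size=256):
  """Interleave batched readers: randomly pick which shard yields the next batch."""
  readers = [pq.ParquetFile(path).iter_batches(batch_size=batch_size) for path in shard_paths]
  rows = []
  while readers and len(rows) < sample_size:
    i = random.randrange(len(readers))  # the "coin flip" over the open readers
    try:
      batch = next(readers[i])
    except StopIteration:
      readers.pop(i)  # this shard is exhausted
      continue
    rows.extend(batch.to_pylist())
  return rows[:sample_size]

# Hypothetical shard filenames, purely for illustration.
sample = sample_across_shards(['shard-0.parquet', 'shard-1.parquet'], sample_size=1000)
```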
The parquet reader can now read from local files, S3, or GCS. If the dataset is sharded, the reader takes a glob pattern to load multiple files.

The reader now takes `shuffle_before_sampling` in addition to `sample_size`. When `shuffle_before_sampling` is `True`, the reader shuffles the entire dataset before sampling, but this requires fetching the entire dataset. If your dataset is massive and you only want to load the first `sample_size` rows, set `shuffle_before_sampling` to `False`. When you have many shards and `shuffle_before_sampling` is `False`, the reader will try to sample a few rows from each shard to avoid any shard skew.

I tested this by making 4 shards of 180 MB each (a 720 MB dataset) and uploading them to `s3://lilac-public-data/test-*.parquet`. Loading with `sample_size=1000` and `shuffle_before_sampling=True` takes ~2 min since it has to download the entire dataset. With `shuffle_before_sampling=False`, it takes ~4 s since it uses range requests to partially read the files.

Fixes #770 Fixes #779
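To see why the `shuffle_before_sampling=False` path avoids downloading whole shards, here is a hedged sketch (not the PR's code) of reading only the first batch of each remote shard through an fsspec filesystem, so pyarrow fetches roughly the parquet footer and the first row group via range requests. Anonymous access to the test bucket and the `rows_per_shard` value are assumptions for illustration.

```python
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem(anon=True)  # assumes the test bucket is publicly readable
shard_paths = fs.glob('lilac-public-data/test-*.parquet')

rows_per_shard = 250  # arbitrary illustrative value
rows = []
for path in shard_paths:
  with fs.open(path, 'rb') as f:
    reader = pq.ParquetFile(f)
    for batch in reader.iter_batches(batch_size=rows_per_shard):
      rows.extend(batch.to_pylist())
      break  # only the first batch per shard is ever fetched
print(f'sampled {len(rows)} rows without downloading full shards')
```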
Lilac currently does not work well with large datasets living in cloud storage. We often need to inspect samples of multi-GB/TB, partitioned parquet datasets, but we aren't able to load the entire dataset locally. Ideally this would mean sampling a few rows from each shard, but a simpler version would be taking a single shard.

If Lilac adds better support for sampling, it could be used to share/inspect/evaluate large pre-training datasets.