Add sampling to our `ParquetSource` #773

dsmilkov · 2023-10-20T15:16:43Z

Add sampling for parquet files via duckdb
Improve memory usage by reading smaller batches from huge parquet files
Add support for parquet files on S3

This doesn't yet address the problem of having a large number of shards, which requires sampling from random shards, but that's a follow-up.

Towards #770

brilee · 2023-10-20T16:04:26Z

lilac/sources/parquet_source.py

+    """)
+    res = self._con.execute('SELECT COUNT(*) FROM t').fetchone()
+    num_items = cast(tuple[int], res)[0]
+    self._reader = self._con.execute('SELECT * from t').fetch_record_batch(rows_per_batch=10_000)


these computations seem to belong in process(), not in setup()

great q.

self._reader = self._con.execute('SELECT * from t').fetch_record_batch(rows_per_batch=10_000)

returns a lazy iterator, so no data is being read yet, but we found that executing this in setup catches a lot of "setup" bugs like file not found, unrecognized parquet format (broken head), unauthorized S3/GCS bucket read etc.

In addition to this, once you have a reader, you can read the inferred schema before reading the data, and our sources need the schema before process() so they can set up a parquet writer with a buffer ahead of time.

.env

nsthorat · 2023-10-20T16:42:44Z

lilac/sources/json_source.py

@@ -62,7 +61,7 @@ def setup(self) -> None:
  @override
  def source_schema(self) -> SourceSchema:


not for now but maybe we should make schemas optional and let duckdb infer types to reduce cognitive overhead of both sources and signals

then signals are very close to a map

It's our pq.ParquetWriter that needs a schema ahead of time to set up a writer, before writing a single row to disk. And that schema needs to be consistent with 100% of the rows that are going into that writer to avoid write errors. That means we need to see the entire data in order to correctly infer the schema, if it's not provided by the user. Or we circumvent our writer and get duckdb to read the format and dump to parquet directly.

Note that pq.ParquetWriter also doesn't hold everything in memory; it dumps to parquet every 128MB (row_group_buffer_size), with 10k items per row group.
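To make the schema-consistency point above concrete, here is a small pure-Python sketch. It is not how lilac or pyarrow infer schemas: `infer_schema` is a hypothetical helper that records the Python type names seen per field, which is enough to show why a schema inferred from a prefix of the data can disagree with later rows.

```python
from typing import Any, Iterable


def infer_schema(rows: Iterable[dict[str, Any]]) -> dict[str, set[str]]:
  """Hypothetical helper: map each field to the set of type names observed for it."""
  schema: dict[str, set[str]] = {}
  for row in rows:
    for field, value in row.items():
      schema.setdefault(field, set()).add(type(value).__name__)
  return schema


# 1000 well-behaved rows, then one late row with a float score and an extra field.
rows: list[dict[str, Any]] = [{'id': i, 'score': i} for i in range(1000)]
rows.append({'id': 1000, 'score': 0.5, 'note': 'late field'})

# Inferring from the first 100 rows misses both the float and the new field.
prefix_schema = infer_schema(rows[:100])
# Only a full pass sees every type that the writer would have to accept.
full_schema = infer_schema(rows)
```

A writer configured from `prefix_schema` would treat `score` as an integer and know nothing about `note`, so the last row would not fit. That is the reason a full pass, or a user-provided schema, is needed.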

lilac/sources/parquet_source.py

dsmilkov added 3 commits October 20, 2023 10:50

save

45bde94

save

44eef5d

save

3d570ea

dsmilkov requested review from nsthorat and brilee October 20, 2023 15:16

github-actions bot added the backend label Oct 20, 2023

brilee approved these changes Oct 20, 2023


dsmilkov changed the title ~~Add parquet sampling and use duckdb to read files~~ Add sampling to our ParquetSource Oct 20, 2023

Merge branch 'main' into ds-smaple

6fd600d

nsthorat approved these changes Oct 20, 2023


save

ebcbe93

dsmilkov merged commit ca44094 into main Oct 20, 2023
4 checks passed

dsmilkov deleted the ds-smaple branch October 20, 2023 17:28
