I’ve been forced into an extremely slow route to upload my dataset as-is, and I’d really like to find a workaround.
To explain:
The dataset starts with a subset of 300M rows; the 2nd subset is 1.8B rows.
I find there is roughly a 10x size difference between the Parquet files and the Arrow files generated from them.
Due to the large size of the subsets in this dataset (from the 2nd one onwards), I find that while the intermediate Parquet files fit on my internal SSD, the Arrow table files generated from them do not, so they have to be moved to my external HDD.
For some reason (which I assume is the relatively slow read speed over USB 3.0 compared to the internal SSD), the “Loading shards” step (in which, as I understand it from reading the source code, the shard indices are used to select data that is serialized into an in-memory Parquet buffer and sent via HTTP to the Hub) is taking an exorbitant amount of time: we are talking about maybe a 20-30x slowdown.
Since this slowdown does not occur on the internal SSD, it would be desirable to push smaller Arrow tables, but the only way I know of to reduce the size of the Arrow tables generated from my source Parquet files is to split the dataset subset up (and this wouldn’t be quite correct).
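For reference, the only relevant knob I’m aware of is the shard-size control on push_to_hub itself, and as far as I can tell it only changes how the upload is chunked, not the size of the local Arrow cache. A minimal sketch, assuming a local directory of Parquet files and a placeholder repo id:

```python
import glob

from datasets import Dataset

# Placeholder local path and repo id, for illustration only.
parquet_files = sorted(glob.glob("parquet_subset/*.parquet"))
ds = Dataset.from_parquet(parquet_files)

# max_shard_size / num_shards control how the upload is chunked;
# as far as I can tell they do not shrink the Arrow cache written locally.
ds.push_to_hub(
    "me/my_dataset",
    config_name="my_subset",
    max_shard_size="500MB",
)
```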
Ideas
One way I would like to work around this limitation is if I could somehow make multiple Datasets and append them rather than overwrite. Is this possible?
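For concreteness, the closest thing I can come up with is pushing each chunk as its own split of the same config and stitching them back together when loading; a sketch, where the chunk layout, repo id, and split naming are all my own placeholders:

```python
import glob

from datasets import Dataset, concatenate_datasets, load_dataset

# Hypothetical layout: one directory of Parquet files per chunk, e.g.
# parquet_subset/chunk_000/, parquet_subset/chunk_001/, ...
def iter_chunks():
    for chunk_dir in sorted(glob.glob("parquet_subset/chunk_*")):
        yield Dataset.from_parquet(sorted(glob.glob(f"{chunk_dir}/*.parquet")))

# Pushing to the same config under different split names avoids overwriting
# earlier uploads (re-pushing the same split name replaces it).
for i, chunk in enumerate(iter_chunks()):
    chunk.push_to_hub(
        "me/my_dataset",
        config_name="my_subset",
        split=f"part_{i:03d}",
    )

# Later, the parts can be stitched back together client-side.
parts = load_dataset("me/my_dataset", "my_subset")
full = concatenate_datasets([parts[name] for name in sorted(parts)])
```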
Alternatively, could I upload multiple subsets (my_subset_part-001-of-099, …) and then just use the API to merge them remotely?
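I haven’t found a true server-side merge, so the best I can picture is merging at load time by globbing over the parts’ Parquet files; a sketch, where the file layout is only my guess at how push_to_hub names things:

```python
from datasets import load_dataset

# The glob below assumes each part config ends up as a directory of Parquet
# files at the repo root (my_subset_part-001-of-099/..., etc.); adjust to the
# actual layout in the repo.
merged = load_dataset(
    "me/my_dataset",
    data_files={"train": "my_subset_part-*-of-099/*.parquet"},
)
```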
Alternatively, am I missing something obvious where I can just upload my Parquet files and avoid the Arrow step entirely? I presume this is not valid, as otherwise why would Dataset.from_parquet().push_to_hub() exist?
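For example, would pushing the raw Parquet files straight into the dataset repo with huggingface_hub be a legitimate path? A sketch, with placeholder repo id and paths:

```python
from huggingface_hub import HfApi

api = HfApi()

# Upload the raw Parquet files as-is; no local Arrow conversion is involved.
# Repo id, local folder, and path_in_repo are placeholders.
api.upload_folder(
    repo_id="me/my_dataset",
    repo_type="dataset",
    folder_path="parquet_subset",
    path_in_repo="data/my_subset",
)
```

I gather newer versions of huggingface_hub also have an upload_large_folder helper aimed at large multi-file uploads, though I haven’t tried it.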
Should I be using IterableDataset here? Would that avoid materialising the entire Arrow table? My understanding is that this would avoid ingesting the source Parquet files in their entirety (but I have not tried it yet to find out!)
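What I have in mind is roughly this (paths are placeholders); streaming should skip the Arrow cache and decode rows on the fly:

```python
from datasets import load_dataset

# Streaming avoids writing an Arrow cache: rows are decoded from the Parquet
# files lazily as you iterate. Paths are placeholders.
streamed = load_dataset(
    "parquet",
    data_files={"train": "parquet_subset/*.parquet"},
    streaming=True,
)["train"]

for example in streamed.take(5):
    print(example)
```

Whether an IterableDataset built this way can then be pushed to the Hub without materializing it presumably depends on the datasets version, which is part of what I’m asking.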
Any help would be much appreciated!