Iterating over values of a column in the IterableDataset #7381

Open
TopCoder2K opened this issue Jan 28, 2025 · 6 comments

TopCoder2K commented Jan 28, 2025

Feature request

I would like to be able to iterate (and re-iterate if needed) over a column of an IterableDataset instance. The following example shows the proposed API:

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)
texts = ds["text"]

for v in texts:
    print(v)  # Prints "Good" and "Bad"

for v in texts:
    print(v)  # Prints "Good" and "Bad" again

Motivation

In real-world problems, huge neural networks like Transformers are not always the best option, so there is a need to experiment with different methods. While 🤗Datasets is perfectly adapted to 🤗Transformers, it can be inconvenient when used with other libraries, and retrieving a particular column is one such case (e.g., gensim's FastText requires only lists of strings, not dictionaries).
While there are ways to achieve the desired functionality, they are not satisfactory (forum). It would be great if there were a built-in solution.

Your contribution

Theoretically, I can submit a PR, but I have very little knowledge of the internal structure of 🤗Datasets, so some help may be needed.
Moreover, I can only work on weekends, since I have a full-time job. However, the feature does not seem to be in high demand, so there is no need to implement it as fast as possible.

@TopCoder2K TopCoder2K added the enhancement New feature or request label Jan 28, 2025
lhoestq (Member) commented Feb 3, 2025

I'd be in favor of that! I've seen many people implement their own iterables that wrap a dataset just to iterate over a single column; this would make things more practical.

Kinda related: #5847

TopCoder2K (Author) commented Feb 18, 2025

(For anyone's information, I'm going on vacation for the next 3 weeks, so the work is postponed. If anyone can implement this feature within the next 4 weeks, go ahead :) )

UPD from 04/06/25:
I'm planning to start work on the feature in early May.

TopCoder2K (Author) commented:

#self-assign

TopCoder2K (Author) commented May 2, 2025

Preliminary discussion

Ideally, I would like to be able to operate on a column with map, filter, batch, and probably other IterableDataset methods. However, the same results can be achieved by calling those methods on the IterableDataset object and applying __getitem__() afterwards. Thus, we may skip supporting these methods at first and keep the implementation as simple as possible.
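The equivalence above can be illustrated with a minimal, self-contained sketch (ToyIterableDataset and ToyColumn are stand-ins invented here, not the real 🤗Datasets classes): mapping the whole dataset and then selecting the column gives the same result as a hypothetical column-level map would.

```python
# Minimal sketch (NOT the real 🤗 Datasets API): column-level map deferred
# to the dataset level, with the column extracted afterwards.
class ToyIterableDataset:
    def __init__(self, gen_fn):
        self._gen_fn = gen_fn

    def __iter__(self):
        # A fresh generator on every iteration makes the dataset re-iterable.
        yield from self._gen_fn()

    def map(self, fn):
        # Lazily apply `fn` to every example, like IterableDataset.map.
        return ToyIterableDataset(lambda: map(fn, self._gen_fn()))

    def __getitem__(self, column_name):
        return ToyColumn(self, column_name)


class ToyColumn:
    def __init__(self, dataset, column_name):
        self.dataset = dataset
        self.column_name = column_name

    def __iter__(self):
        for example in self.dataset:
            yield example[self.column_name]


def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}


ds = ToyIterableDataset(gen)
# Equivalent to a hypothetical column-level ds["text"].map(str.upper):
upper_texts = ds.map(lambda ex: {**ex, "text": ex["text"].upper()})["text"]
print(list(upper_texts))  # ['GOOD', 'BAD']
```

Since the column only wraps the (re-iterable) dataset, it can itself be iterated any number of times.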

Implementation

Based on the preliminary discussion, one can do the following:

from typing import Any, Iterator


class IterableColumn:
    def __init__(self, dataset: "IterableDataset", column_name: str):
        self.dataset = dataset
        self.column_name = column_name

    def __iter__(self) -> Iterator[Any]:
        for example in self.dataset:
            yield example[self.column_name]


class IterableDataset(DatasetInfoMixin):
    ...
    def __getitem__(self, column_name: str) -> IterableColumn:
        return IterableColumn(self, column_name)
    ...

Testing

It works as expected in our simple test:

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)

texts = ds["text"]  # `texts` is an IterableColumn object
for v in texts:
    print(v)  # Prints "Good" and "Bad"
for v in texts:
    print(v)  # Prints "Good" and "Bad" again

Questions

  1. What do you think about the implementation, @lhoestq?
  2. How should the implementation be tested? I've found test_iterable_dataset.py, but 1) I haven't found any testing guidelines, and 2) the script tests a lot of things, while I'd like to test only my feature.

lhoestq (Member) commented May 3, 2025

Sounds great !

Regarding testing, it's actually possible to have your test function in test_iterable_dataset.py, which you can run using

pytest tests/test_iterable_dataset.py::my_function
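A focused test for the feature could be sketched along these lines (illustrative only: the real test would import IterableDataset from datasets, build it with from_generator, and live in tests/test_iterable_dataset.py; a stand-in column class keeps the snippet self-contained):

```python
# Stand-in for the IterableColumn proposed above (assumption: mirrors the
# real class, but avoids importing the datasets dev install).
class IterableColumn:
    def __init__(self, dataset, column_name):
        self.dataset = dataset
        self.column_name = column_name

    def __iter__(self):
        for example in self.dataset:
            yield example[self.column_name]


def test_iterable_dataset_column():
    # In the real test: ds = IterableDataset.from_generator(gen).
    examples = [
        {"text": "Good", "label": 0},
        {"text": "Bad", "label": 1},
    ]
    texts = IterableColumn(examples, "text")
    assert list(texts) == ["Good", "Bad"]
    # Re-iterating the same column object yields the values again:
    assert list(texts) == ["Good", "Bad"]
```

It could then be run in isolation with `pytest tests/test_iterable_dataset.py::test_iterable_dataset_column` (the function name here is an assumption).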

TopCoder2K (Author) commented:

> Regarding testing, it's actually possible to have your test function in test_iterable_dataset.py, which you can run using

I had hoped not to run pip install -e ".[dev]", but your answer implies that I should. The problem is that I was unable to install the dependencies with Python 3.13 (due to tensorflow) or with Python 3.11-3.12 (there are no matching versions of pyav) [¬º-°]¬ Therefore, I had to test in a separate script file to avoid importing optional dependencies. Anyway, I've opened a PR: #7564. Please take a look (there are questions about the documentation).

Moreover, I want to note that make style and pre-commit give different results for test_iterable_dataset.py (and a couple of other files). Example:

    assert skip_ex_iterable.shuffle_data_sources(np.random.default_rng(42)) is skip_ex_iterable, (
        "skip examples makes the shards order fixed"
    )

vs

    assert (
        skip_ex_iterable.shuffle_data_sources(np.random.default_rng(42)) is skip_ex_iterable
    ), "skip examples makes the shards order fixed"

¯\_(ツ)_/¯

> Kinda related: #5847

I had forgotten about it, but I've now taken a look. This comment implies that IterableColumn should support chained indexing, so thank you for pointing it out! Did you mean anything else by referencing the issue?
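For illustration, chained indexing could be supported by having IterableColumn itself define __getitem__, so that indexing a column of nested dicts wraps it one level deeper. This is purely a sketch of the idea, not the design from the PR:

```python
# Hypothetical extension of IterableColumn with chained indexing,
# e.g. ds["answers"]["text"] for nested dict columns (a sketch only).
class IterableColumn:
    def __init__(self, source, column_name):
        self.source = source          # an iterable of dicts
        self.column_name = column_name

    def __iter__(self):
        for example in self.source:
            yield example[self.column_name]

    def __getitem__(self, column_name):
        # Each yielded value is itself a dict, so indexing one level
        # deeper just wraps this column in another IterableColumn.
        return IterableColumn(self, column_name)


examples = [
    {"answers": {"text": "Paris", "start": 3}},
    {"answers": {"text": "Rome", "start": 7}},
]
texts = IterableColumn(examples, "answers")["text"]
print(list(texts))  # ['Paris', 'Rome']
```

Because each level only wraps the one below it, the chained column stays lazy and re-iterable.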
