Iterating over values of a column in the IterableDataset #7381

Open
TopCoder2K opened this issue Jan 28, 2025 · 6 comments

TopCoder2K commented Jan 28, 2025

Feature request

I would like to be able to iterate (and re-iterate if needed) over a column of an IterableDataset instance. The following example shows the proposed API:

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)
texts = ds["text"]

for v in texts:
    print(v)  # Prints "Good" and "Bad"

for v in texts:
    print(v)  # Prints "Good" and "Bad" again

Motivation

In real-world problems, huge neural networks like Transformers are not always the best option, so there is a need to experiment with different methods. While 🤗Datasets is perfectly adapted to 🤗Transformers, it can be inconvenient when used with other libraries, and retrieving a particular column is one such case (e.g., gensim's FastText requires only lists of strings, not dictionaries).
While there are ways to achieve the desired functionality, they are not satisfactory (forum). It would be great if there were a built-in solution.

Your contribution

Theoretically, I can submit a PR, but I have very little knowledge of the internal structure of 🤗Datasets, so some help may be needed.
Moreover, I can only work on weekends, since I have a full-time job. However, the feature does not seem to be in high demand, so there is no need to implement it as fast as possible.

@TopCoder2K TopCoder2K added the enhancement New feature or request label Jan 28, 2025
lhoestq (Member) commented Feb 3, 2025

I'd be in favor of that! I've seen many people implement their own iterables that wrap a dataset just to iterate over a single column; this would make things more practical.

Kinda related: #5847

TopCoder2K (Author) commented Feb 18, 2025

(For anyone's information, I'm going on vacation for the next 3 weeks, so the work is postponed. If anyone can implement this feature within the next 4 weeks, go ahead :) )

UPD from 04/06/25:
I'm planning to start work on the feature in early May.

TopCoder2K (Author) commented:

#self-assign

TopCoder2K (Author) commented May 2, 2025

Preliminary discussion

Ideally, I would like to be able to operate on a column with map, filter, batch, and probably other IterableDataset methods. However, the same results can be achieved by calling those methods on the IterableDataset object and applying __getitem__() afterwards. Thus, we may skip supporting these methods at first and keep the implementation as simple as possible.
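The equivalence above can be illustrated with a minimal, self-contained sketch (ToyIterableDataset and ToyColumn are stand-ins invented here, not the real 🤗Datasets classes): mapping the whole dataset and then selecting the column gives the same result as a hypothetical column-level map would.

```python
# Minimal sketch (NOT the real 🤗 Datasets API): column-level map deferred
# to the dataset level, with the column extracted afterwards.
class ToyIterableDataset:
    def __init__(self, gen_fn):
        self._gen_fn = gen_fn

    def __iter__(self):
        # A fresh generator on every iteration makes the dataset re-iterable.
        yield from self._gen_fn()

    def map(self, fn):
        # Lazily apply `fn` to every example, like IterableDataset.map.
        return ToyIterableDataset(lambda: map(fn, self._gen_fn()))

    def __getitem__(self, column_name):
        return ToyColumn(self, column_name)


class ToyColumn:
    def __init__(self, dataset, column_name):
        self.dataset = dataset
        self.column_name = column_name

    def __iter__(self):
        for example in self.dataset:
            yield example[self.column_name]


def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}


ds = ToyIterableDataset(gen)
# Equivalent to a hypothetical column-level ds["text"].map(str.upper):
upper_texts = ds.map(lambda ex: {**ex, "text": ex["text"].upper()})["text"]
print(list(upper_texts))  # ['GOOD', 'BAD']
```

Since the column only wraps the (re-iterable) dataset, it can itself be iterated any number of times.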

Implementation

Based on the preliminary discussion, one can do the following:

from typing import Any, Iterator


class IterableColumn:
    def __init__(self, dataset: "IterableDataset", column_name: str):
        self.dataset = dataset
        self.column_name = column_name

    def __iter__(self) -> Iterator[Any]:
        for example in self.dataset:
            yield example[self.column_name]


class IterableDataset(DatasetInfoMixin):
    ...
    def __getitem__(self, column_name: str) -> IterableColumn:
        return IterableColumn(self, column_name)
    ...

Testing

It works as expected in our simple test:

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)

texts = ds["text"]  # `texts` is an IterableColumn object
for v in texts:
    print(v)  # Prints "Good" and "Bad"
for v in texts:
    print(v)  # Prints "Good" and "Bad" again

Questions

  1. What do you think about the implementation, @lhoestq?
  2. How should the implementation be tested? I've found test_iterable_dataset.py, but 1) I haven't found any testing guidelines, and 2) the script tests a lot of things, while I'd like to test only my feature.

lhoestq (Member) commented May 3, 2025

Sounds great !

Regarding testing, it's actually possible to have your test function in test_iterable_dataset.py, which you can run using

pytest tests/test_iterable_dataset.py::my_function
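A focused test for the feature could be sketched along these lines (illustrative only: the real test would import IterableDataset from datasets, build it with from_generator, and live in tests/test_iterable_dataset.py; a stand-in column class keeps the snippet self-contained):

```python
# Stand-in for the IterableColumn proposed above (assumption: mirrors the
# real class, but avoids importing the datasets dev install).
class IterableColumn:
    def __init__(self, dataset, column_name):
        self.dataset = dataset
        self.column_name = column_name

    def __iter__(self):
        for example in self.dataset:
            yield example[self.column_name]


def test_iterable_dataset_column():
    # In the real test: ds = IterableDataset.from_generator(gen).
    examples = [
        {"text": "Good", "label": 0},
        {"text": "Bad", "label": 1},
    ]
    texts = IterableColumn(examples, "text")
    assert list(texts) == ["Good", "Bad"]
    # Re-iterating the same column object yields the values again:
    assert list(texts) == ["Good", "Bad"]
```

It could then be run in isolation with `pytest tests/test_iterable_dataset.py::test_iterable_dataset_column` (the function name here is an assumption).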

TopCoder2K (Author) commented:

> Regarding testing, it's actually possible to have your test function in test_iterable_dataset.py, which you can run using

I had hoped not to run pip install -e ".[dev]", but your answer implies that I should. The problem is that I was unable to install the dependencies with Python 3.13 (due to tensorflow) or with Python 3.11-3.12 (there are no matching versions of pyav) [¬º-°]¬ Therefore, I had to test in a separate script file to avoid importing optional dependencies. Anyway, I've opened a PR: #7564. Please take a look (there are questions about the documentation).

Moreover, I want to note that make style and pre-commit give different results for test_iterable_dataset.py (and a couple of other files). Example:

    assert skip_ex_iterable.shuffle_data_sources(np.random.default_rng(42)) is skip_ex_iterable, (
        "skip examples makes the shards order fixed"
    )

vs

    assert (
        skip_ex_iterable.shuffle_data_sources(np.random.default_rng(42)) is skip_ex_iterable
    ), "skip examples makes the shards order fixed"

¯\_(ツ)_/¯

> Kinda related: #5847

I had forgotten about it, but I've now taken a look. This comment implies that IterableColumn should support chained indexing, so thank you for pointing it out! Did you mean anything else by referencing the issue?
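For illustration, chained indexing could be supported by having IterableColumn itself define __getitem__, so that indexing a column of nested dicts wraps it one level deeper. This is purely a sketch of the idea, not the design from the PR:

```python
# Hypothetical extension of IterableColumn with chained indexing,
# e.g. ds["answers"]["text"] for nested dict columns (a sketch only).
class IterableColumn:
    def __init__(self, source, column_name):
        self.source = source          # an iterable of dicts
        self.column_name = column_name

    def __iter__(self):
        for example in self.source:
            yield example[self.column_name]

    def __getitem__(self, column_name):
        # Each yielded value is itself a dict, so indexing one level
        # deeper just wraps this column in another IterableColumn.
        return IterableColumn(self, column_name)


examples = [
    {"answers": {"text": "Paris", "start": 3}},
    {"answers": {"text": "Rome", "start": 7}},
]
texts = IterableColumn(examples, "answers")["text"]
print(list(texts))  # ['Paris', 'Rome']
```

Because each level only wraps the one below it, the chained column stays lazy and re-iterable.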
