Iterating over values of a column in the IterableDataset #7381
Comments
I'd be in favor of that! I've seen many people implement their own iterables that wrap a dataset just to iterate over a single column, so this would make things more practical. Kinda related: #5847
(For everyone's information: I'm going on vacation for the next 3 weeks, so the work is postponed. If anyone can implement this feature within the next 4 weeks, go ahead :) ) UPD from 04/06/25:
#self-assign |
Preliminary discussion
Ideally, I would like to be able to operate on a column with map, filter, batch and probably some other
Implementation
Based on the preliminary discussion, one can do the following:

    class IterableColumn:
        def __init__(self, dataset: "IterableDataset", column_name: str):
            self.dataset = dataset
            self.column_name = column_name

        def __iter__(self) -> Iterator[Any]:
            for example in self.dataset:
                yield example[self.column_name]

    class IterableDataset(DatasetInfoMixin):
        ...
        def __getitem__(self, column_name: str) -> IterableColumn:
            return IterableColumn(self, column_name)
        ...

Testing
It works as expected in our simple test:

    def gen():
        yield {"text": "Good", "label": 0}
        yield {"text": "Bad", "label": 1}

    ds = IterableDataset.from_generator(gen)
    texts = ds["text"]  # `texts` is an IterableColumn object

    for v in texts:
        print(v)  # Prints "Good" and "Bad"

    for v in texts:
        print(v)  # Prints "Good" and "Bad" again

Questions
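For readers who want to try the wrapper idea from the Implementation section without touching the library internals, here is a minimal standalone sketch. `ToyIterableDataset` is a hypothetical stand-in written only for this illustration; it is not the real `IterableDataset` class, and it assumes nothing beyond "iterating the dataset yields dictionaries". The sketch also contrasts the wrapper with a plain generator expression, which is exhausted after a single pass.

```python
from typing import Any, Callable, Dict, Iterator


class IterableColumn:
    """Re-iterable view over one column of a dict-yielding dataset."""

    def __init__(self, dataset: "ToyIterableDataset", column_name: str):
        self.dataset = dataset
        self.column_name = column_name

    def __iter__(self) -> Iterator[Any]:
        # Every call starts a fresh pass over the underlying dataset.
        for example in self.dataset:
            yield example[self.column_name]


class ToyIterableDataset:
    """Hypothetical stand-in for IterableDataset (NOT the library class)."""

    def __init__(self, generator: Callable[[], Iterator[Dict[str, Any]]]):
        self._generator = generator

    @classmethod
    def from_generator(cls, generator: Callable[[], Iterator[Dict[str, Any]]]):
        return cls(generator)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        # Calling the generator function again returns a new iterator.
        return self._generator()

    def __getitem__(self, column_name: str) -> IterableColumn:
        return IterableColumn(self, column_name)


def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}


ds = ToyIterableDataset.from_generator(gen)

# The wrapper object can be iterated as many times as needed.
texts = ds["text"]
first_pass = list(texts)
second_pass = list(texts)

# A bare generator expression, by contrast, can only be consumed once.
one_shot = (example["text"] for example in ds)
one_shot_first = list(one_shot)
one_shot_second = list(one_shot)
```

The design point is that `IterableColumn` stores a reference to the dataset rather than an iterator over it, so re-iteration only requires the dataset itself to be re-iterable.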
Sounds great! Regarding testing, it's actually possible to put your test function in test_iterable_dataset.py, which you can then run on its own using pytest tests/test_iterable_dataset.py::my_function
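As a rough idea of what such a test function could look like, here is a sketch. The function name `test_iterable_column_is_reiterable` is made up for illustration, and the `IterableColumn` class below is a local stand-in mirroring the wrapper proposed in this thread, so the snippet runs without the library. A real test in the repository would presumably build the dataset with `IterableDataset.from_generator` instead of a list.

```python
from typing import Any, Dict, Iterable, Iterator


class IterableColumn:
    """Local stand-in for the proposed wrapper (assumption, not library code)."""

    def __init__(self, dataset: Iterable[Dict[str, Any]], column_name: str):
        self.dataset = dataset
        self.column_name = column_name

    def __iter__(self) -> Iterator[Any]:
        for example in self.dataset:
            yield example[self.column_name]


def test_iterable_column_is_reiterable():
    # A list is re-iterable, so it plays the role of the dataset here.
    examples = [{"text": "Good", "label": 0}, {"text": "Bad", "label": 1}]

    texts = IterableColumn(examples, "text")
    assert list(texts) == ["Good", "Bad"]
    # A second pass must give the same values, not an empty sequence.
    assert list(texts) == ["Good", "Bad"]

    labels = IterableColumn(examples, "label")
    assert list(labels) == [0, 1]
```

With the function placed in that file, the single-test invocation mentioned above would become pytest tests/test_iterable_dataset.py::test_iterable_column_is_reiterable.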
I hoped not to run
Moreover, I want to note that

    assert skip_ex_iterable.shuffle_data_sources(np.random.default_rng(42)) is skip_ex_iterable, (
        "skip examples makes the shards order fixed"
    )

vs

    assert (
        skip_ex_iterable.shuffle_data_sources(np.random.default_rng(42)) is skip_ex_iterable
    ), "skip examples makes the shards order fixed"

¯\_(ツ)_/¯
I had forgotten about this, but I've looked at it by now. This comment implies that |
Feature request
I would like to be able to iterate (and re-iterate if needed) over a column of an IterableDataset instance. The following example shows the proposed API:
Motivation
In real-world problems, huge NNs like Transformers are not always the best option, so there is a need to experiment with different methods. While 🤗Datasets is perfectly adapted to 🤗Transformers, it can be inconvenient to use with other libraries. Retrieving a particular column is one such case (e.g., gensim's FastText requires only lists of strings, not dictionaries).
While there are ways to achieve the desired functionality, they are not good (forum). It would be great if there were a built-in solution.
Your contribution
Theoretically, I can submit a PR, but I have very little knowledge of the internal structure of 🤗Datasets, so some help may be needed.
Moreover, I can only work on weekends, since I have a full-time job. However, the feature does not seem to be popular, so there is no need to implement it as fast as possible.