pytorch · andrewkho · Nov 6, 2024 · Nov 6, 2024
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
@@ -1,7 +1,8 @@
 Please read through our [contribution guide](https://github.com/pytorch/data/blob/main/CONTRIBUTING.md) prior to
 creating your pull request.
 
-- Note that there is a section on requirements related to adding a new DataPipe.
+- If you are adding a new node, ensure you read that section in the contribution guide, as it includes requirements for
+  functionality and testing.
 
 Fixes #{issue number}
 

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -35,6 +35,16 @@ description is clear and has sufficient instructions to be able to reproduce the
 For question related to the usage of this library, please post a question on the
 [PyTorch forum, under the "data" category](https://discuss.pytorch.org/c/data/37).
 
+## Pull Requests
+
+We actively welcome your pull requests.
+
+1. Fork the repo and create your branch from `main`.
+2. If you've added code that should be tested, add tests.
+3. If you've changed APIs, update the documentation and examples.
+4. Ensure the test suite passes.
+5. If you haven't already, complete the Contributor License Agreement ("CLA").
+
 ## Development installation
 
 ### Install PyTorch Nightly
@@ -46,66 +56,87 @@ conda install pytorch -c pytorch-nightly
 # pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
 ```
 
-### Install TorchData
+### Install TorchData and Test Requirements
 
 ```bash
 git clone https://github.com/pytorch/data.git
 cd data
 pip install -e .
-pip install flake8 typing mypy pytest expecttest
+pip install -r test/requirements.txt
 ```
 
-## Pull Requests
-
-We actively welcome your pull requests.
-
-1. Fork the repo and create your branch from `main`.
-2. If you've added code that should be tested, add tests.
-3. If you've changed APIs, update the documentation and examples.
-4. Ensure the test suite passes.
-5. If you haven't already, complete the Contributor License Agreement ("CLA").
-
 ### Code style
 
 `torchdata` enforces a fairly strict code format through [`pre-commit`](https://pre-commit.com). You can install it with
 
-```shell
-pip install pre-commit
+```bash
+conda install -c conda-forge pre-commit
+# or pip install pre-commit
+cd data
+with-proxy conda install pre-commit
+pre-commit install --install-hooks
 ```
 
-or
+### Running mypy and unit-tests locally
 
-```shell
-conda install -c conda-forge pre-commit
+Currently we don't run mypy as part of pre-commit hooks
+
+```bash
+mypy --config-file=mypy.ini
+```
+
+```bash
+pytest --durations=0 --no-header -v test/nodes/
 ```
 
-To check and in most cases fix the code format, stage all your changes (`git add`) and run `pre-commit run`. To perform
-the checks automatically before every `git commit`, you can install them with `pre-commit install`.
-
-### Adding a new DataPipe
-
-When adding a new DataPipe, there are few things that need to be done to ensure it is working and documented properly.
-
-1. Naming - please follow the naming convention as
-   [described here](https://pytorch.org/data/main/dp_tutorial.html#naming).
-2. Testing - please add unit tests to ensure that the DataPipe is functioning properly. Here are the
-   [test requirements](https://github.com/pytorch/data/issues/106) that we have.
-   - One test that is commonly missed is the serialization test. Please add the new DataPipe to
-     [`test_serialization.py`](https://github.com/pytorch/data/blob/main/test/test_serialization.py).
-   - If your test requires interacting with files in the file system (e.g. opening a `csv` or `tar` file, we prefer
-     those files to be generated during the test (see `test_local_io.py`). If the file is on a remote server, see
-     `test_remote_io.py`.
-3. Documentation - ensure that the DataPipe has docstring, usage example, and that it is added to the right category of
-   the right RST file to be rendered.
-   - If your DataPipe has a functional form (i.e. `@functional_datapipe(...)`), include at the
-     [end of the first sentence](https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/util/combining.py#L25)
-     of your docstring. This will make sure it correctly shows up in the
-     [summary table](https://pytorch.org/data/main/torchdata.datapipes.iter.html#archive-datapipes) of our
-     documentation.
-4. Import - import the DataPipe in the correct `__init__.py` file.
-5. Interface - if the DataPipe has a functional form, make sure that is generated properly by `gen_pyi.py` into the
-   relevant interface file.
-   - You can re-generate the pyi files by re-running `pip install -e .`, then you can examine the new outputs.
+### Adding a new Node
+
+When adding a new Node, there are few things that need to be done to ensure it is working and documented properly.
+
+The following simplifying assumptions are made of node implementations:
+
+- state is managed solely by the BaseNode, not through any iterators returned from them.
+- state_dict() returns the state of the most recently requested iterator.
+- load_state_dict() will set the state for the next iterator.
+
+1. Functionality - Nodes must subclass BaseNode and implement the required methods.
+
+   - `.iterator(self, initial_state: Optional[Dict[str, Any]])` - return a new iterator/generator that is properly
+     initialized with the optional initial_state
+   - `.get_state(self) -> Dict[str, Any]` - return a dictionary representing the state of the most recently returned
+     iterator, or if not yet requested, the initial state.
+   - ensure you're calling `state_dict()/load_state_dict()` on ancestor BaseNodes. Here is a simple example of a pretty
+     useless node:
+
+   ```python
+   class MyNode(BaseNode[T]):
+     def __init__(self, parent: BaseNode[T]):
+       self.parent = parent
+       self.idx = 0  # not very useful state
+
+     def iterator(self, initial_state: Optional[Dict[str, Any]]) -> Iterator[T]
+       if initial_state is not None:
+         self.parent.load_state_dict(initial_state["parent"])
+         self.idx = initial_state["idx"]
+
+       for item in self.parent:
+         self.idx += 1
+         yield item
+
+     def get_state(self) -> Dict[str, Any]:
+       return {
+         "parent": self.parent.state_dict(),  # note we call state_dict() and not get_state() here
+         "idx": self.idx,
+       }
+   ```
+
+2. Typing - Include type-hints for all public functions and methods
+3. Testing - please add unit tests to ensure that the Node is functioning properly.
+   - In addition to testing basic functionatity, state management must also be tested.
+   - For basic state testing, you may use `test.nodes.utils.run_test_save_load_state`. See `test/nodes/test_batch.py`
+     for an example.
+4. Documentation - ensure that the Node has a docstring, and a usage example.
+5. Import - import the Node in the correct `__init__.py` file.
 
 ## Contributor License Agreement ("CLA")