Fix pip extras (#4)
fcogidi authored Aug 19, 2024
1 parent 9d52a41 commit 71e6838
Showing 7 changed files with 374 additions and 633 deletions.
11 changes: 5 additions & 6 deletions .pre-commit-config.yaml
@@ -22,7 +22,7 @@ repos:
args: [--lock]

- repo: https://github.com/astral-sh/ruff-pre-commit
-rev: v0.5.7
+rev: v0.6.1
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
@@ -40,9 +40,10 @@ repos:
rev: v1.11.1
hooks:
- id: mypy
-entry: python3 -m mypy --show-error-codes --pretty --config-file pyproject.toml
-types: [python]
-exclude: "tests"
+entry: mypy
+args: ["--config-file=pyproject.toml", "--show-error-codes", "--pretty"]
+types_or: [python, pyi]
+exclude: tests|projects

- repo: https://github.com/nbQA-dev/nbQA
rev: 1.8.7
@@ -58,5 +59,3 @@
entry: python3 -m pytest -m "not integration_test"
pass_filenames: false
always_run: true

exclude: "projects"
101 changes: 81 additions & 20 deletions README.md
@@ -16,6 +16,63 @@ python3 -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
```

<details>
<summary><b>Installation Options</b></summary>
You can install optional dependencies to enable additional features. Use one or more of the pip extras listed below to
install the desired dependencies.

<table>
<tr>
<th style="text-align: left; width: 150px"> pip extra </th>
<th style="text-align: center"> Dependencies </th>
<th style="text-align: center"> Notes </th>
</tr>

<tr>
<td>
vision
</td>
<td>
"torchvision", "opencv-python", "timm"
</td>
<td>
Enables image processing and vision tasks.
</td>
</tr>

<tr>
<td>
audio
</td>
<td>
"torchaudio"
</td>
<td>
Enables audio processing and audio-related tasks.
</td>
</tr>

<tr>
<td>
peft
</td>
<td>
"peft"
</td>
<td>
Uses the <a href="https://huggingface.co/docs/peft/index">PEFT</a> library to enable parameter-efficient fine-tuning.
</td>
</tr>

</table>

For example, to install the library with the `vision` and `audio` extras, run:
```bash
python3 -m pip install mmlearn[vision,audio]
```

</details>

#### Installing binaries
To install the pre-built binaries, run:
```bash
@@ -32,25 +89,31 @@ python3 -m pip install -e .
```

### Running Experiments
-To run an experiment, create a folder with a similar structure as the [`configs`](configs/) folder.
-Then, use the `mmlearn_run` command to run the experiment as defined in a `.yaml` file under the `experiment` folder, like so:
+We use [Hydra](https://hydra.cc/docs/intro/) and [hydra-zen](https://mit-ll-responsible-ai.github.io/hydra-zen/) to manage configurations
+in the library.

+For new experiments, it is recommended to create a new directory to store the configuration files. The directory should
+have an `__init__.py` file to make it a Python package and an `experiment` folder to store the experiment configuration files.
+This format allows the use of `.yaml` configuration files as well as Python modules (using [structured configs](https://hydra.cc/docs/tutorials/structured_config/intro/) or [hydra-zen](https://mit-ll-responsible-ai.github.io/hydra-zen/)) to define the experiment configurations.

+To run an experiment, use the following command:
```bash
-mmlearn_run --config-dir /path/to/config/dir +experiment=<name_of_experiment_config> experiment=your_experiment_name
+mmlearn_run 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> experiment=your_experiment_name
```
-Notice that the config directory refers to the top-level directory containing the `experiment` folder. The experiment
-name is the name of the `.yaml` file under the `experiment` folder, without the extension.
+Hydra will compose the experiment configuration from all the configurations in the specified directory as well as all the
+configurations in the `mmlearn` package. *Note the dot-separated path to the directory containing the experiment configuration
+files.*
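As a rough illustration of the Python-module route described in the added text, the following hypothetical `__init__.py` registers a structured config under the `experiment` group with hydra-zen. The field names are placeholders and the exact experiment schema and discovery mechanics in mmlearn may differ; this is only a sketch of the hydra-zen registration pattern.
```python
# path/to/config/directory/__init__.py -- hypothetical user config package.
# Minimal hydra-zen sketch; the config fields below are placeholders and do
# not reflect mmlearn's actual experiment schema.
from hydra_zen import make_config, store

# A toy structured config with two placeholder fields.
MyExperimentConf = make_config(seed=42, max_epochs=10)

# Register it under the "experiment" group; the name is what you would pass
# to `+experiment=<name>` on the command line.
store(MyExperimentConf, group="experiment", name="my_experiment")

# Copy the registrations into Hydra's global ConfigStore.
store.add_to_hydra_store()
```
Assuming the package is imported so the registration runs, `+experiment=my_experiment` should then refer to this config by name.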

-We use [Hydra](https://hydra.cc/docs/intro/) to manage configurations, so you can override any configuration parameter
-from the command line. To see the available options and other information, run:
+Hydra also allows overriding configuration parameters from the command line. To see the available options and other information, run:
```bash
-mmlearn_run --config-dir /path/to/config/dir +experiment=<name_of_experiment> --help
+mmlearn_run 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> --help
```

By default, the `mmlearn_run` command will run the experiment locally. To run the experiment on a SLURM cluster, we use
the [submitit launcher](https://hydra.cc/docs/plugins/submitit_launcher/) plugin built into Hydra. The following is an example
of how to run an experiment on a SLURM cluster:
```bash
-mmlearn_run --multirun hydra.launcher.mem_gb=32 hydra.launcher.qos=your_qos hydra.launcher.partition=your_partition hydra.launcher.gres=gpu:4 hydra.launcher.cpus_per_task=8 hydra.launcher.tasks_per_node=4 hydra.launcher.nodes=1 hydra.launcher.stderr_to_stdout=true hydra.launcher.timeout_min=60 '+hydra.launcher.additional_parameters={export: ALL}' --config-dir /path/to/config/dir +experiment=<name_of_experiment_config> experiment=your_experiment_name
+mmlearn_run --multirun hydra.launcher.mem_gb=32 hydra.launcher.qos=your_qos hydra.launcher.partition=your_partition hydra.launcher.gres=gpu:4 hydra.launcher.cpus_per_task=8 hydra.launcher.tasks_per_node=4 hydra.launcher.nodes=1 hydra.launcher.stderr_to_stdout=true hydra.launcher.timeout_min=60 '+hydra.launcher.additional_parameters={export: ALL}' 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> experiment=your_experiment_name
```
This will submit a job to the SLURM cluster with the specified resources.

@@ -93,20 +156,18 @@ using recall@k metric. This is applicable to any number of pairs of modalities a

## Components
### Datasets
-Every dataset object must return an instance of [`Example`](mmlearn/datasets/core/example.py) with one or more keys/attributes
-corresponding to a modality name as specified in the [`Modalities registry`](mmlearn/datasets/core/modalities.py).
-The `Example` object must also include an `example_index` attribute/key, which is used, in addition to the dataset index,
-to uniquely identify the example.
+Every dataset object must return an instance of `Example` with one or more keys/attributes corresponding to a modality name
+as specified in the `Modalities` registry. The `Example` object must also include an `example_index` attribute/key, which
+is used, in addition to the dataset index, to uniquely identify the example.
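For concreteness, here is a minimal sketch of a map-style dataset that yields `Example` objects. The constructor call and the `"text"` modality key are assumptions; check `mmlearn/datasets/core/example.py` and the `Modalities` registry for the exact API and registered names.
```python
# Minimal sketch of a map-style dataset that yields `Example` objects.
# The Example constructor call and the "text" modality key are assumptions.
from torch.utils.data import Dataset

from mmlearn.datasets.core.example import Example


class ToyTextDataset(Dataset):
    """Returns one text example per index."""

    def __init__(self, sentences: list[str]) -> None:
        self.sentences = sentences

    def __len__(self) -> int:
        return len(self.sentences)

    def __getitem__(self, idx: int) -> Example:
        # One key per modality, plus `example_index`, which together with the
        # dataset index uniquely identifies the example.
        return Example({"text": self.sentences[idx], "example_index": idx})
```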

<details>
<summary><b>CombinedDataset</b></summary>

-The [`CombinedDataset`](mmlearn/datasets/core/combined_dataset.py) object is used to combine multiple datasets into one. It
-accepts an iterable of `torch.utils.data.Dataset` and/or `torch.utils.data.IterableDataset` objects and returns an `Example`
-object from one of the datasets, given an index. Conceptually, the `CombinedDataset` object is a concatenation of the
-datasets in the input iterable, so the given index can be mapped to a specific dataset based on the size of the datasets.
-As iterable-style datasets do not support random access, the examples from these datasets are returned in order as they
-are iterated over.
+The `CombinedDataset` object is used to combine multiple datasets into one. It accepts an iterable of `torch.utils.data.Dataset`
+and/or `torch.utils.data.IterableDataset` objects and returns an `Example` object from one of the datasets, given an index.
+Conceptually, the `CombinedDataset` object is a concatenation of the datasets in the input iterable, so the given index
+can be mapped to a specific dataset based on the size of the datasets. As iterable-style datasets do not support random access,
+the examples from these datasets are returned in order as they are iterated over.
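A small usage sketch, reusing the hypothetical `ToyTextDataset` from the dataset sketch above; the import path and the positional constructor argument are inferred from the description rather than taken from the actual API.
```python
# Sketch: concatenating two datasets and indexing into the result.
from mmlearn.datasets.core.combined_dataset import CombinedDataset

text_ds = ToyTextDataset(["a photo of a cat", "a photo of a dog"])
more_text_ds = ToyTextDataset(["a recording of rain"])

combined = CombinedDataset([text_ds, more_text_ds])

# Index 2 is past the first dataset (size 2), so it maps to more_text_ds[0].
example = combined[2]
```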

The `CombinedDataset` object also adds a `dataset_index` attribute to the `Example` object, corresponding to the index of
the dataset in the input iterable. Every example returned by the `CombinedDataset` will have an `example_ids` attribute,
@@ -116,7 +177,7 @@ which is instance of `Example` containing the same keys/attributes as the origin

### Dataloading
When dealing with multiple datasets with different modalities, the default `collate_fn` of `torch.utils.data.DataLoader`
-may not work, as it assumes that all examples have the same keys/attributes. In that case, the [`collate_example_list`](mmlearn/datasets/core/example.py)
+may not work, as it assumes that all examples have the same keys/attributes. In that case, the `collate_example_list`
function can be used as the `collate_fn` argument of `torch.utils.data.DataLoader`. This function takes a list of `Example`
objects and returns a dictionary of tensors, with all the keys/attributes of the `Example` objects.
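Continuing the sketch above, wiring this into a `DataLoader` might look like the following; the import path for `collate_example_list` is inferred from `mmlearn/datasets/core/example.py` and may differ if the function is re-exported elsewhere.
```python
# Sketch: using collate_example_list as the collate_fn of a DataLoader.
from torch.utils.data import DataLoader

from mmlearn.datasets.core.example import collate_example_list

loader = DataLoader(combined, batch_size=2, collate_fn=collate_example_list)

# Each batch is a dictionary whose keys are the union of the Example
# keys/attributes across the batch.
batch = next(iter(loader))
```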

2 changes: 0 additions & 2 deletions mmlearn/datasets/__init__.py
@@ -1,7 +1,6 @@
"""Datasets."""

from mmlearn.datasets.chexpert import CheXpert
-from mmlearn.datasets.ego4d import Ego4DDataset
from mmlearn.datasets.imagenet import ImageNet
from mmlearn.datasets.librispeech import LibriSpeech
from mmlearn.datasets.llvip import LLVIPDataset
@@ -12,7 +11,6 @@

__all__ = [
"CheXpert",
"Ego4DDataset",
"ImageNet",
"LibriSpeech",
"LLVIPDataset",
126 changes: 0 additions & 126 deletions mmlearn/datasets/ego4d.py

This file was deleted.

