Fix pip extras (#4)
fcogidi authored Aug 19, 2024
1 parent 9d52a41 commit 71e6838
Showing 7 changed files with 374 additions and 633 deletions.
11 changes: 5 additions & 6 deletions .pre-commit-config.yaml
@@ -22,7 +22,7 @@ repos:
args: [--lock]

- repo: https://github.com/astral-sh/ruff-pre-commit
-rev: v0.5.7
+rev: v0.6.1
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
@@ -40,9 +40,10 @@ repos:
rev: v1.11.1
hooks:
- id: mypy
-entry: python3 -m mypy --show-error-codes --pretty --config-file pyproject.toml
-types: [python]
-exclude: "tests"
+entry: mypy
+args: ["--config-file=pyproject.toml", "--show-error-codes", "--pretty"]
+types_or: [python, pyi]
+exclude: tests|projects

- repo: https://github.com/nbQA-dev/nbQA
rev: 1.8.7
@@ -58,5 +59,3 @@
entry: python3 -m pytest -m "not integration_test"
pass_filenames: false
always_run: true

exclude: "projects"
101 changes: 81 additions & 20 deletions README.md
@@ -16,6 +16,63 @@ python3 -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
```

<details>
<summary><b>Installation Options</b></summary>
You can install optional dependencies to enable additional features. Use one or more of the pip extras listed below to
install the desired dependencies.

<table>
<tr>
<th style="text-align: left; width: 150px"> pip extra </th>
<th style="text-align: center"> Dependencies </th>
<th style="text-align: center"> Notes </th>
</tr>

<tr>
<td>
vision
</td>
<td>
"torchvision", "opencv-python", "timm"
</td>
<td>
Enables image processing and vision tasks.
</td>
</tr>

<tr>
<td>
audio
</td>
<td>
"torchaudio"
</td>
<td>
Enables audio processing and audio-related tasks.
</td>
</tr>

<tr>
<td>
peft
</td>
<td>
"peft"
</td>
<td>
Uses the <a href="https://huggingface.co/docs/peft/index">PEFT</a> library to enable parameter-efficient fine-tuning.
</td>
</tr>

</table>

For example, to install the library with the `vision` and `audio` extras, run:
```bash
python3 -m pip install mmlearn[vision,audio]
```

</details>

#### Installing binaries
To install the pre-built binaries, run:
```bash
@@ -32,25 +89,31 @@ python3 -m pip install -e .
```

### Running Experiments
-To run an experiment, create a folder with a similar structure as the [`configs`](configs/) folder.
-Then, use the `mmlearn_run` command to run the experiment as defined in a `.yaml` file under the `experiment` folder, like so:
+We use [Hydra](https://hydra.cc/docs/intro/) and [hydra-zen](https://mit-ll-responsible-ai.github.io/hydra-zen/) to manage configurations
+in the library.

+For new experiments, it is recommended to create a new directory to store the configuration files. The directory should
+have an `__init__.py` file to make it a Python package and an `experiment` folder to store the experiment configuration files.
+This format allows the use of `.yaml` configuration files as well as Python modules (using [structured configs](https://hydra.cc/docs/tutorials/structured_config/intro/) or [hydra-zen](https://mit-ll-responsible-ai.github.io/hydra-zen/)) to define the experiment configurations.

+To run an experiment, use the following command:
```bash
-mmlearn_run --config-dir /path/to/config/dir +experiment=<name_of_experiment_config> experiment=your_experiment_name
+mmlearn_run 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> experiment=your_experiment_name
```
-Notice that the config directory refers to the top-level directory containing the `experiment` folder. The experiment
-name is the name of the `.yaml` file under the `experiment` folder, without the extension.
+Hydra will compose the experiment configuration from all the configurations in the specified directory as well as all the
+configurations in the `mmlearn` package. *Note the dot-separated path to the directory containing the experiment configuration
+files.*
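As a rough illustration of the Python-module route described in the added text, the following hypothetical `__init__.py` registers a structured config under the `experiment` group with hydra-zen. The field names are placeholders and the exact experiment schema and discovery mechanics in mmlearn may differ; this is only a sketch of the hydra-zen registration pattern.
```python
# path/to/config/directory/__init__.py -- hypothetical user config package.
# Minimal hydra-zen sketch; the config fields below are placeholders and do
# not reflect mmlearn's actual experiment schema.
from hydra_zen import make_config, store

# A toy structured config with two placeholder fields.
MyExperimentConf = make_config(seed=42, max_epochs=10)

# Register it under the "experiment" group; the name is what you would pass
# to `+experiment=<name>` on the command line.
store(MyExperimentConf, group="experiment", name="my_experiment")

# Copy the registrations into Hydra's global ConfigStore.
store.add_to_hydra_store()
```
Assuming the package is imported so the registration runs, `+experiment=my_experiment` should then refer to this config by name.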

-We use [Hydra](https://hydra.cc/docs/intro/) to manage configurations, so you can override any configuration parameter
-from the command line. To see the available options and other information, run:
+Hydra also allows overriding configuration parameters from the command line. To see the available options and other information, run:
```bash
-mmlearn_run --config-dir /path/to/config/dir +experiment=<name_of_experiment> --help
+mmlearn_run 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> --help
```

By default, the `mmlearn_run` command will run the experiment locally. To run the experiment on a SLURM cluster, we use
the [submitit launcher](https://hydra.cc/docs/plugins/submitit_launcher/) plugin built into Hydra. The following is an example
of how to run an experiment on a SLURM cluster:
```bash
-mmlearn_run --multirun hydra.launcher.mem_gb=32 hydra.launcher.qos=your_qos hydra.launcher.partition=your_partition hydra.launcher.gres=gpu:4 hydra.launcher.cpus_per_task=8 hydra.launcher.tasks_per_node=4 hydra.launcher.nodes=1 hydra.launcher.stderr_to_stdout=true hydra.launcher.timeout_min=60 '+hydra.launcher.additional_parameters={export: ALL}' --config-dir /path/to/config/dir +experiment=<name_of_experiment_config> experiment=your_experiment_name
+mmlearn_run --multirun hydra.launcher.mem_gb=32 hydra.launcher.qos=your_qos hydra.launcher.partition=your_partition hydra.launcher.gres=gpu:4 hydra.launcher.cpus_per_task=8 hydra.launcher.tasks_per_node=4 hydra.launcher.nodes=1 hydra.launcher.stderr_to_stdout=true hydra.launcher.timeout_min=60 '+hydra.launcher.additional_parameters={export: ALL}' 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> experiment=your_experiment_name
```
This will submit a job to the SLURM cluster with the specified resources.

@@ -93,20 +156,18 @@ using recall@k metric. This is applicable to any number of pairs of modalities a

## Components
### Datasets
-Every dataset object must return an instance of [`Example`](mmlearn/datasets/core/example.py) with one or more keys/attributes
-corresponding to a modality name as specified in the [`Modalities registry`](mmlearn/datasets/core/modalities.py).
-The `Example` object must also include an `example_index` attribute/key, which is used, in addition to the dataset index,
-to uniquely identify the example.
+Every dataset object must return an instance of `Example` with one or more keys/attributes corresponding to a modality name
+as specified in the `Modalities` registry. The `Example` object must also include an `example_index` attribute/key, which
+is used, in addition to the dataset index, to uniquely identify the example.
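For concreteness, here is a minimal sketch of a map-style dataset that yields `Example` objects. The constructor call and the `"text"` modality key are assumptions; check `mmlearn/datasets/core/example.py` and the `Modalities` registry for the exact API and registered names.
```python
# Minimal sketch of a map-style dataset that yields `Example` objects.
# The Example constructor call and the "text" modality key are assumptions.
from torch.utils.data import Dataset

from mmlearn.datasets.core.example import Example


class ToyTextDataset(Dataset):
    """Returns one text example per index."""

    def __init__(self, sentences: list[str]) -> None:
        self.sentences = sentences

    def __len__(self) -> int:
        return len(self.sentences)

    def __getitem__(self, idx: int) -> Example:
        # One key per modality, plus `example_index`, which together with the
        # dataset index uniquely identifies the example.
        return Example({"text": self.sentences[idx], "example_index": idx})
```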

<details>
<summary><b>CombinedDataset</b></summary>

-The [`CombinedDataset`](mmlearn/datasets/core/combined_dataset.py) object is used to combine multiple datasets into one. It
-accepts an iterable of `torch.utils.data.Dataset` and/or `torch.utils.data.IterableDataset` objects and returns an `Example`
-object from one of the datasets, given an index. Conceptually, the `CombinedDataset` object is a concatenation of the
-datasets in the input iterable, so the given index can be mapped to a specific dataset based on the size of the datasets.
-As iterable-style datasets do not support random access, the examples from these datasets are returned in order as they
-are iterated over.
+The `CombinedDataset` object is used to combine multiple datasets into one. It accepts an iterable of `torch.utils.data.Dataset`
+and/or `torch.utils.data.IterableDataset` objects and returns an `Example` object from one of the datasets, given an index.
+Conceptually, the `CombinedDataset` object is a concatenation of the datasets in the input iterable, so the given index
+can be mapped to a specific dataset based on the size of the datasets. As iterable-style datasets do not support random access,
+the examples from these datasets are returned in order as they are iterated over.
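A small usage sketch, reusing the hypothetical `ToyTextDataset` from the dataset sketch above; the import path and the positional constructor argument are inferred from the description rather than taken from the actual API.
```python
# Sketch: concatenating two datasets and indexing into the result.
from mmlearn.datasets.core.combined_dataset import CombinedDataset

text_ds = ToyTextDataset(["a photo of a cat", "a photo of a dog"])
more_text_ds = ToyTextDataset(["a recording of rain"])

combined = CombinedDataset([text_ds, more_text_ds])

# Index 2 is past the first dataset (size 2), so it maps to more_text_ds[0].
example = combined[2]
```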

The `CombinedDataset` object also adds a `dataset_index` attribute to the `Example` object, corresponding to the index of
the dataset in the input iterable. Every example returned by the `CombinedDataset` will have an `example_ids` attribute,
@@ -116,7 +177,7 @@ which is instance of `Example` containing the same keys/attributes as the origin

### Dataloading
When dealing with multiple datasets with different modalities, the default `collate_fn` of `torch.utils.data.DataLoader`
-may not work, as it assumes that all examples have the same keys/attributes. In that case, the [`collate_example_list`](mmlearn/datasets/core/example.py)
+may not work, as it assumes that all examples have the same keys/attributes. In that case, the `collate_example_list`
function can be used as the `collate_fn` argument of `torch.utils.data.DataLoader`. This function takes a list of `Example`
objects and returns a dictionary of tensors, with all the keys/attributes of the `Example` objects.
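Continuing the sketch above, wiring this into a `DataLoader` might look like the following; the import path for `collate_example_list` is inferred from `mmlearn/datasets/core/example.py` and may differ if the function is re-exported elsewhere.
```python
# Sketch: using collate_example_list as the collate_fn of a DataLoader.
from torch.utils.data import DataLoader

from mmlearn.datasets.core.example import collate_example_list

loader = DataLoader(combined, batch_size=2, collate_fn=collate_example_list)

# Each batch is a dictionary whose keys are the union of the Example
# keys/attributes across the batch.
batch = next(iter(loader))
```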

2 changes: 0 additions & 2 deletions mmlearn/datasets/__init__.py
@@ -1,7 +1,6 @@
"""Datasets."""

from mmlearn.datasets.chexpert import CheXpert
-from mmlearn.datasets.ego4d import Ego4DDataset
from mmlearn.datasets.imagenet import ImageNet
from mmlearn.datasets.librispeech import LibriSpeech
from mmlearn.datasets.llvip import LLVIPDataset
@@ -12,7 +11,6 @@

__all__ = [
"CheXpert",
"Ego4DDataset",
"ImageNet",
"LibriSpeech",
"LLVIPDataset",
126 changes: 0 additions & 126 deletions mmlearn/datasets/ego4d.py

This file was deleted.

