mmlearn aims at enabling the evaluation of existing multimodal representation learning methods, as well as facilitating experimentation and research for new techniques.
The library requires Python 3.10 or later. We recommend using a virtual environment to manage dependencies. You can create and activate one using the following commands:
python3 -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
To install the pre-built binaries, run:
python3 -m pip install mmlearn
Installation Options
You can install optional dependencies to enable additional features. Use one or more of the pip extras listed below to install the desired dependencies.

pip extra | Dependencies | Notes |
---|---|---|
vision | "torchvision", "opencv-python", "timm" | Enables image processing and vision tasks. |
audio | "torchaudio" | Enables audio processing and tasks. |
peft | "peft" | Uses the PEFT library to enable parameter-efficient fine-tuning. |
For example, to install the library with the `vision` and `audio` extras, run:
python3 -m pip install "mmlearn[vision,audio]"
To install the library from source, run:
git clone https://github.com/VectorInstitute/mmlearn.git
cd mmlearn
python3 -m pip install -e .
We use Hydra and hydra-zen to manage configurations in the library.
For new experiments, it is recommended to create a new directory to store the configuration files. The directory should have an `__init__.py` file to make it a Python package and an `experiment` folder to store the experiment configuration files.
This format allows the use of `.yaml` configuration files as well as Python modules (using structured configs or hydra-zen) to define the experiment configurations.
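
The layout described above can be scaffolded with a few lines of Python. This is a minimal sketch, not part of mmlearn: the helper name `scaffold_config_package` and the `my_project/configs` and `my_experiment` names are hypothetical, and the `# @package _global_` header is the directive commonly used in Hydra's experiment pattern, assumed here to suit your configuration.

```python
import tempfile
from pathlib import Path


def scaffold_config_package(root: Path, experiment_name: str) -> Path:
    """Create <root>/__init__.py and <root>/experiment/<experiment_name>.yaml."""
    experiment_dir = root / "experiment"
    experiment_dir.mkdir(parents=True, exist_ok=True)

    # An __init__.py (even an empty one) makes the directory a Python package.
    (root / "__init__.py").touch()

    experiment_file = experiment_dir / f"{experiment_name}.yaml"
    if not experiment_file.exists():
        # Assumption: the experiment config is merged at the root of the
        # composed config, which Hydra expresses with this package directive.
        experiment_file.write_text("# @package _global_\n")
    return experiment_file


# Demonstration in a throwaway directory (hypothetical project layout).
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_config_package(Path(tmp) / "my_project" / "configs", "my_experiment")
    print(created.relative_to(tmp))
```

With this layout, the dot-separated path passed to `pkg://` would be `my_project.configs`, provided `my_project` is importable from where the command is run.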
To run an experiment, use the following command:
mmlearn_run 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> experiment_name=your_experiment_name
Hydra will compose the experiment configuration from all the configurations in the specified directory as well as all the configurations in the `mmlearn` package. Note the dot-separated path to the directory containing the experiment configuration files.
A path can be added to `hydra.searchpath` either as a package (`pkg://path.to.config.directory`) or as a file system path (`file://path/to/config/directory`). However, new configs in `mmlearn` are added to Hydra's external store inside `path/to/config/directory/__init__.py`, which is only interpreted when the config directory is added as a package. For this reason, please avoid the `file://` notation.
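
Because the `pkg://` form only works when the config directory is importable, it can help to check that before launching a run. The sketch below is illustrative rather than part of mmlearn: `pkg_searchpath_override` is a hypothetical helper, and the standard-library `json` package merely stands in for a real config package.

```python
import importlib.util


def pkg_searchpath_override(dotted_path: str) -> str:
    """Return a hydra.searchpath override for an importable config package."""
    try:
        # Note: for dotted paths, find_spec imports the parent packages.
        spec = importlib.util.find_spec(dotted_path)
    except ModuleNotFoundError:
        # Raised when a parent package in the dotted path cannot be imported.
        spec = None
    if spec is None:
        raise ValueError(
            f"{dotted_path!r} is not importable; check for a missing __init__.py "
            "or run the command from the project root."
        )
    return f"hydra.searchpath=[pkg://{dotted_path}]"


# 'json' is only a stand-in for a package such as my_project.configs.
print(pkg_searchpath_override("json"))
```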
Hydra also allows for overriding configuration parameters from the command line. To see the available options and other information, run:
mmlearn_run 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> --help
By default, the `mmlearn_run` command runs the experiment locally. To run the experiment on a SLURM cluster, we use Hydra's submitit launcher plugin. The following is an example of how to run an experiment on a SLURM cluster:
mmlearn_run --multirun \
hydra.launcher.mem_per_cpu=5G \
hydra.launcher.qos=your_qos \
hydra.launcher.partition=your_partition \
hydra.launcher.gres=gpu:4 \
hydra.launcher.cpus_per_task=8 \
hydra.launcher.tasks_per_node=4 \
hydra.launcher.nodes=1 \
hydra.launcher.stderr_to_stdout=true \
hydra.launcher.timeout_min=720 \
'hydra.searchpath=[pkg://path.to.my_project.configs]' \
+experiment=my_experiment \
experiment_name=my_experiment_name
This will submit a job to the SLURM cluster with the specified resources.
Note: After the job is submitted, it is okay to cancel the program with `Ctrl+C`. The job will continue running on the cluster. You can also add `&` at the end of the command to run it in the background.
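
When the same resources are requested repeatedly, the overrides can be assembled programmatically instead of typed by hand. The following is a rough sketch under stated assumptions: `build_slurm_command` is a hypothetical helper (not an mmlearn API), and the resource values simply mirror the example command above.

```python
import shlex


def build_slurm_command(launcher, searchpath_pkg, experiment, experiment_name):
    """Assemble the mmlearn_run argument list for a SLURM submission."""
    args = ["mmlearn_run", "--multirun"]
    # Each launcher setting becomes a hydra.launcher.<key>=<value> override.
    args += [f"hydra.launcher.{key}={value}" for key, value in launcher.items()]
    args += [
        f"hydra.searchpath=[pkg://{searchpath_pkg}]",
        f"+experiment={experiment}",
        f"experiment_name={experiment_name}",
    ]
    return args


command = build_slurm_command(
    {
        "mem_per_cpu": "5G",
        "qos": "your_qos",
        "partition": "your_partition",
        "gres": "gpu:4",
        "cpus_per_task": 8,
        "tasks_per_node": 4,
        "nodes": 1,
        "stderr_to_stdout": "true",
        "timeout_min": 720,
    },
    searchpath_pkg="path.to.my_project.configs",
    experiment="my_experiment",
    experiment_name="my_experiment_name",
)

# shlex.join quotes arguments (such as the bracketed searchpath) for the shell.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command)` directly, which avoids shell quoting altogether.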
Pretraining Methods | Notes |
---|---|
Contrastive Pretraining | Uses the contrastive loss to align the representations from N modalities. Supports sharing of encoders, projection heads or postprocessing modules (e.g. logit/temperature scaling) across modalities. Also supports multi-task learning with auxiliary unimodal tasks applied to specific modalities. |
I-JEPA | The Image-based Joint-Embedding Predictive Architecture (I-JEPA) is a unimodal non-generative self-supervised learning method that predicts the representations of several target blocks of an image given a context block from the same image. This task can be combined with the contrastive pretraining task to learn multimodal representations from paired and unpaired data. |

Evaluation Methods | Notes |
---|---|
Zero-shot Cross-modal Retrieval | Evaluates the quality of the learned representations in retrieving the k most similar examples from a different modality, using the recall@k metric. This is applicable to any number of pairs of modalities at once, depending on memory constraints. |
Zero-shot Classification | Evaluates the ability of a pre-trained encoder-based multimodal model to predict classes that were not explicitly seen during training. The new classes are given as text prompts, and the query modality can be any of the supported modalities. Binary and multi-class classification tasks are supported. |
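
To make the recall@k metric used for cross-modal retrieval concrete, here is a small standalone sketch. It is not mmlearn's implementation: it assumes cosine similarity, a one-to-one pairing where `queries[i]` matches `candidates[i]`, and a single retrieval direction; the toy embeddings are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate ranks among the top k."""
    hits = 0
    for i, query in enumerate(queries):
        sims = [cosine_similarity(query, c) for c in candidates]
        # Indices of the k candidates most similar to this query.
        top_k = sorted(range(len(candidates)), key=lambda j: sims[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(queries)


# Toy embeddings: e.g. three text queries and their three paired images.
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3], [0.0, 0.9, 0.2]]

for k in (1, 2):
    print(f"recall@{k}:", recall_at_k(text_embeddings, image_embeddings, k))
```

A full evaluation would typically compute the metric in both directions (for example text-to-image and image-to-text) and use batched tensor operations rather than Python loops.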
If you are interested in contributing to the library, please see CONTRIBUTING.MD. This file contains many details about contributing to the code base, including our development practices, code checks, tests, and more.