Add Composition Support to LoRA and (IA)³ (#598)
Follow-up to #591.

This PR provides initial support for adapter composition in LoRA & (IA)³
modules, which so far don't support composition at all. With this PR, the
following blocks are supported: **Stack, BatchSplit, Average, Parallel**.

Additionally, the LoRA implementation is refactored a bit in an effort
to make it cleaner.
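
For context, here is a minimal usage sketch of the newly supported blocks (the model name, adapter names, and configs below are illustrative, not taken from this PR):

```python
from transformers import AutoModel

import adapters
import adapters.composition as ac
from adapters import LoRAConfig

# Backbone that is not affected by the LoRAMergedLinear limitation below.
model = AutoModel.from_pretrained("roberta-base")
adapters.init(model)

# Two independent LoRA adapters on the same backbone.
model.add_adapter("lora_a", config=LoRAConfig())
model.add_adapter("lora_b", config=LoRAConfig())

# Any of the newly supported blocks can now wrap LoRA / (IA)³ adapters:
model.set_active_adapters(ac.Stack("lora_a", "lora_b"))
# model.set_active_adapters(ac.Parallel("lora_a", "lora_b"))
# model.set_active_adapters(ac.BatchSplit("lora_a", "lora_b", batch_sizes=[4, 4]))
# model.set_active_adapters(ac.Average("lora_a", "lora_b"))
```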

### Limitations
- Split & Fuse compositions are **not** supported
- LoRA / (IA)³ composition is **not** supported for models using the
`LoRAMergedLinear` implementation. These currently are: **GPT-2, DeBERTa
(v1)**
calpt authored Nov 18, 2023
1 parent d6d44cd commit 42fff1e
Showing 35 changed files with 344 additions and 143 deletions.
10 changes: 6 additions & 4 deletions docs/adapter_composition.md
@@ -42,14 +42,16 @@ The following table gives an overview on the supported composition blocks and th

| Block | Bottleneck<br> Adapters | Prefix<br> Tuning | Compacter | LoRA | (IA)³ |
| --- | --- | --- | --- | --- | --- |
| [`Stack`](#stack) | ✅ | ✅ | ✅ | ❌ | ❌ |
| [`Stack`](#stack) | ✅ | ✅ | ✅ | ✅(*) | ✅(*) |
| [`Fuse`](#fuse) | ✅ | ❌ | ✅ | ❌ | ❌ |
| [`Split`](#split) | ✅ | ❌ | ✅ | ❌ | ❌ |
| [`BatchSplit`](#batchsplit) | ✅ | ✅ | ✅ | ❌ | ❌ |
| [`Parallel`](#parallel) | ✅ | ✅ | ✅ | ❌ | ❌ |
| [Output averaging](#output-averaging) | ✅ | ❌ | ✅ | ❌ | ❌ |
| [`BatchSplit`](#batchsplit) | ✅ | ✅ | ✅ | ✅(*) | ✅(*) |
| [`Parallel`](#parallel) | ✅ | ✅ | ✅ | ✅(*) | ✅(*) |
| [Output averaging](#output-averaging) | ✅ | ❌ | ✅ | ✅(*) | ✅(*) |
| [Parameter averaging](#parameter-averaging) | ✅ | ✅ | ✅ | ✅ | ✅ |

(*) except for Deberta-v1, GPT-2.

Next, we present all composition blocks in more detail.

## `Stack`
1 change: 0 additions & 1 deletion docs/index.rst
@@ -94,7 +94,6 @@ Currently, we support the PyTorch versions of all models as listed on the `Model

classes/adapter_config
classes/model_adapters_config
classes/adapter_modules
classes/adapter_layer
classes/model_mixins
classes/adapter_training
17 changes: 16 additions & 1 deletion src/adapters/composition.py
@@ -1,6 +1,8 @@
import itertools
from collections.abc import Sequence
from typing import List, Optional, Set, Union
from typing import List, Optional, Set, Tuple, Union

import torch


class AdapterCompositionBlock(Sequence):
@@ -242,3 +244,16 @@ def adjust_tensors_for_parallel_(hidden_states, *tensors):
repeats[0] = hidden_states.shape[0] // tensor.shape[0]
new_tensor = tensor.repeat(*repeats)
tensor.set_(new_tensor)


def match_attn_matrices_for_parallel(query, key, value) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""
Matches the shapes of query, key and value matrices for parallel composition.
"""
max_bsz = max(query.shape[0], key.shape[0], value.shape[0])

query = query.repeat(max_bsz // query.shape[0], *([1] * len(query.shape[1:])))
key = key.repeat(max_bsz // key.shape[0], *([1] * len(key.shape[1:])))
value = value.repeat(max_bsz // value.shape[0], *([1] * len(value.shape[1:])))

return query, key, value
17 changes: 14 additions & 3 deletions src/adapters/methods/adapter_layer_base.py
@@ -150,10 +150,13 @@ class ComposableAdapterLayerBase(AdapterLayerBase):
Base class for all adapter methods that support composition.
Make sure the 'adapter_modules_name' and 'supported_compositions' attributes as well as all abstract methods are
overridden in derived classes.
overridden in derived classes. 'allow_multi_parallelize' can be set to True to allow inputs to be parallelized
independently multiple times. This is useful when there are multiple parallel input flows through an adapter layer
(e.g. in LoRA).
"""

supported_compositions = []
allow_multi_parallelize = False

def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
@@ -382,15 +385,23 @@ def compose_parallel(self, adapter_setup: Parallel, state: NamedTuple, lvl: int
orig_batch_size = self._bsz(state)
state = self.repeat(state, adapter_setup.parallel_channels)
context.adapters_parallelized = True
context.original_batch_size = orig_batch_size
else:
bsz = self._bsz(state)
# If the input was already parallelized, we can parallelize it again.
# This is useful e.g. for LoRA, where attention matrices are parallelized independently.
if self.allow_multi_parallelize and bsz == getattr(context, "original_batch_size", -1):
state = self.repeat(state, adapter_setup.parallel_channels)
orig_batch_size = bsz
# The base model should handle replication of input.
# Therefore, we assume the (replicated) input batch to be divisible by the number of parallel channels.
if self._bsz(state) % adapter_setup.parallel_channels != 0:
elif bsz % adapter_setup.parallel_channels != 0:
raise ValueError(
"The total input batch size in a Parallel adapter block must be divisible by the number of"
" parallel channels."
)
orig_batch_size = self._bsz(state) // adapter_setup.parallel_channels
else:
orig_batch_size = bsz // adapter_setup.parallel_channels

state = self.pre_block(adapter_setup, state)
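
To make the batch-size bookkeeping in the branches above concrete, a small worked sketch with assumed numbers (not code from the PR):

```python
# Assumed setup: a Parallel block with two channels and an original batch of 4.
parallel_channels = 2
original_batch_size = 4

# First layer that sees the Parallel block: the state is repeated once per
# channel and the original batch size is remembered on the forward context.
replicated_bsz = original_batch_size * parallel_channels  # 8

# A layer with allow_multi_parallelize=True (e.g. LoRA in attention) that
# receives another input flow still at the original size (4) repeats it
# again instead of rejecting it.

# Any layer that receives the already replicated batch instead recovers the
# per-channel batch size.
assert replicated_bsz % parallel_channels == 0
per_channel_bsz = replicated_bsz // parallel_channels  # 4
```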
