diff --git a/README.md b/README.md
index a3efa17e85..ae6855350d 100644
--- a/README.md
+++ b/README.md
@@ -60,7 +60,7 @@ To get started with adapters, refer to these locations:
 
 - **[Colab notebook tutorials](https://github.com/Adapter-Hub/adapter-transformers/tree/master/notebooks)**, a series of notebooks providing an introduction to all the main concepts of (adapter-)transformers and AdapterHub
 - **https://docs.adapterhub.ml**, our documentation on training and using adapters with _adapter-transformers_
 - **https://adapterhub.ml** to explore available pre-trained adapter modules and share your own adapters
-- **[Examples folder](https://github.com/Adapter-Hub/adapter-transformers/tree/master/examples)** of this repository containing HuggingFace's example training scripts, many adapted for training adapters
+- **[Examples folder](https://github.com/Adapter-Hub/adapter-transformers/tree/master/examples/pytorch)** of this repository containing HuggingFace's example training scripts, many adapted for training adapters
 
 ## Implemented Methods
diff --git a/adapter_docs/adapter_composition.md b/adapter_docs/adapter_composition.md
index 7f06d2a28c..2bd472658c 100644
--- a/adapter_docs/adapter_composition.md
+++ b/adapter_docs/adapter_composition.md
@@ -14,7 +14,7 @@ model.active_adapters = "adapter_name"
 
 Note that we could also have used `model.set_active_adapters("adapter_name")`, which does the same.
 
-```eval_rst
+```{eval-rst}
 .. important::
     ``active_adapters`` defines which of the available adapters are used in each forward and backward pass through the model. This means:
@@ -39,7 +39,7 @@ They are presented in more detail in the following.
 
 ## `Stack`
 
-```eval_rst
+```{eval-rst}
 .. figure:: img/stacking_adapters.png
     :height: 300
     :align: center
@@ -71,7 +71,7 @@ For backwards compatibility, you can still do this, although it is recommended t
 
 ## `Fuse`
 
-```eval_rst
+```{eval-rst}
 .. figure:: img/Fusion.png
     :height: 300
     :align: center
@@ -98,7 +98,7 @@ model.add_adapter_fusion(["d", "e", "f"])
 model.active_adapters = ac.Fuse("d", "e", "f")
 ```
 
-```eval_rst
+```{eval-rst}
 .. important::
     Fusing adapters with the ``Fuse`` block only works successfully if an adapter fusion layer combining all of the adapters listed in the ``Fuse`` has been added to the model.
     This can be done either using ``add_adapter_fusion()`` or ``load_adapter_fusion()``.
@@ -111,7 +111,7 @@ For backwards compatibility, you can still do this, although it is recommended t
 
 ## `Split`
 
-```eval_rst
+```{eval-rst}
 .. figure:: img/splitting_adapters.png
     :height: 300
     :align: center
@@ -159,7 +159,7 @@ model.active_adapters = ac.BatchSplit("i", "k", "l", batch_sizes=[2, 1, 2])
 
 ## `Parallel`
 
-```eval_rst
+```{eval-rst}
 .. figure:: img/parallel.png
     :height: 300
     :align: center
@@ -206,7 +206,7 @@ model.active_adapters = ac.Stack("a", ac.Split("b", "c", split_index=60))
 ```
 
 However, combinations of adapter composition blocks cannot be arbitrarily deep. All currently supported possibilities are visualized in the figure below.
 
-```eval_rst
+```{eval-rst}
 .. figure:: img/adapter_blocks_nesting.png
     :height: 300
     :align: center
diff --git a/adapter_docs/classes/adapter_config.rst b/adapter_docs/classes/adapter_config.rst
index 29479ca1fc..dfc56300a2 100644
--- a/adapter_docs/classes/adapter_config.rst
+++ b/adapter_docs/classes/adapter_config.rst
@@ -28,6 +28,12 @@ Single (bottleneck) adapters
 .. autoclass:: transformers.ParallelConfig
     :members:
 
+.. autoclass:: transformers.CompacterConfig
+    :members:
+
+.. autoclass:: transformers.CompacterPlusPlusConfig
+    :members:
+
 Prefix Tuning
 ~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/adapter_docs/conf.py b/adapter_docs/conf.py
index a98323399a..e636f6d5a9 100644
--- a/adapter_docs/conf.py
+++ b/adapter_docs/conf.py
@@ -6,8 +6,6 @@
 import os
 import sys
 
-from recommonmark.transform import AutoStructify
-
 
 # -- Path setup --------------------------------------------------------------
@@ -90,5 +88,4 @@ def setup(app):
     app.add_config_value("recommonmark_config", {"enable_eval_rst": True}, True)
-    app.add_transform(AutoStructify)
     app.add_css_file("custom.css")
diff --git a/adapter_docs/contributing.md b/adapter_docs/contributing.md
index 904efd6e82..3dcc36e5b2 100644
--- a/adapter_docs/contributing.md
+++ b/adapter_docs/contributing.md
@@ -1,6 +1,6 @@
 # Contributing to AdapterHub
 
-```eval_rst
+```{eval-rst}
 .. note::
     This document describes how to contribute adapters via the AdapterHub `Hub repository <https://github.com/Adapter-Hub/Hub>`_. See `Integration with HuggingFace's Model Hub <huggingface_hub.html>`_ for uploading adapters via the HuggingFace Model Hub.
 ```
@@ -49,7 +49,7 @@ Let's go through the upload process step by step:
     ```
     `adapter-hub-cli` will search for available adapters in the path you specify and interactively lead you through the packing process.
 
-    ```eval_rst
+    ```{eval-rst}
     .. note::
         The configuration of the adapter is specified by an identifier string in the YAML file. This string should refer to an adapter architecture available in the Hub. If you use a new or custom architecture, make sure to also `add an entry for your architecture <#add-a-new-adapter-architecture>`_ to the repo.
     ```
diff --git a/adapter_docs/huggingface_hub.md b/adapter_docs/huggingface_hub.md
index e387ea4799..86311ce6e3 100644
--- a/adapter_docs/huggingface_hub.md
+++ b/adapter_docs/huggingface_hub.md
@@ -1,6 +1,6 @@
 # Integration with HuggingFace's Model Hub
 
-```eval_rst
+```{eval-rst}
 .. figure:: img/hfhub.svg
     :align: center
     :alt: HuggingFace Hub logo.
@@ -53,7 +53,7 @@ For more options and information, e.g. for managing models via the CLI and Git, 
     This will create a repository `my-awesome-adapter` under your username, generate a default adapter card as `README.md` and upload the adapter named `awesome_adapter` together with the adapter card to the new repository. `adapterhub_tag` and `datasets_tag` provide additional information for categorization.
 
-    ```eval_rst
+    ```{eval-rst}
     .. important::
         All adapters uploaded to HuggingFace's Model Hub are automatically also listed on AdapterHub.ml. Thus, for better categorization, either ``adapterhub_tag`` or ``datasets_tag`` is required when uploading a new adapter to the Model Hub.
diff --git a/adapter_docs/installation.md b/adapter_docs/installation.md
index 585f8cc1f3..5364bcee95 100644
--- a/adapter_docs/installation.md
+++ b/adapter_docs/installation.md
@@ -3,7 +3,7 @@
 Our *adapter-transformers* package is a drop-in replacement for HuggingFace's *transformers* library.
 It currently supports Python 3.6+ and PyTorch 1.3.1+. You will have to [install PyTorch](https://pytorch.org/get-started/locally/) first.
 
-```eval_rst
+```{eval-rst}
 .. important::
     ``adapter-transformers`` is a direct fork of ``transformers``. This means our package includes all the awesome features of HuggingFace's original package plus the adapter implementation.
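As a usage reference for the `adapter_docs/adapter_composition.md` hunks above: composition blocks are activated by assigning them to `model.active_adapters`. A minimal sketch based on the snippets quoted in the hunks — the checkpoint and adapter names are placeholders, not part of this diff:

```python
import transformers.adapters.composition as ac
from transformers import AutoAdapterModel

model = AutoAdapterModel.from_pretrained("bert-base-uncased")
for name in ["a", "b", "c"]:
    model.add_adapter(name)

# Stack: the output of adapter "a" feeds into "b", whose output feeds into "c"
model.active_adapters = ac.Stack("a", "b", "c")
```

`ac.Fuse`, `ac.Split`, `ac.BatchSplit` and `ac.Parallel` are activated the same way, as shown in the documentation hunks above.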
diff --git a/adapter_docs/loading.md b/adapter_docs/loading.md
index 16dbb21376..9a44af5bf9 100644
--- a/adapter_docs/loading.md
+++ b/adapter_docs/loading.md
@@ -117,7 +117,7 @@ The identifier string used to find a matching adapter follows a format consistin
 
 An example of a full identifier following this format might look like `qa/squad1.1@example-org`.
 
-```eval_rst
+```{eval-rst}
 .. important::
     In many cases, you don't have to give the full string identifier with all three components to successfully load an adapter from the Hub. You can drop the ``<username>`` if you don't care about the uploader of the adapter. Also, if the resulting identifier is still unique, you can drop the ``<task>`` or the ``<subtask>``. So, ``qa/squad1.1``, ``squad1.1`` or ``squad1.1@example-org`` all may be valid identifiers.
 ```
diff --git a/adapter_docs/model_overview.md b/adapter_docs/model_overview.md
index 23ec0673a4..81173a3346 100644
--- a/adapter_docs/model_overview.md
+++ b/adapter_docs/model_overview.md
@@ -3,7 +3,7 @@
 This page gives an overview of the Transformer models currently supported by `adapter-transformers`.
 The table below further shows which model architectures support which adaptation methods and which features of `adapter-transformers`.
 
-```eval_rst
+```{eval-rst}
 .. note::
     Each supported model architecture X typically provides a class ``XAdapterModel`` for usage with ``AutoAdapterModel``.
     Additionally, it is possible to use adapters with the model classes already shipped with HuggingFace Transformers.
diff --git a/adapter_docs/overview.md b/adapter_docs/overview.md
index 3fb7d45533..9350818868 100644
--- a/adapter_docs/overview.md
+++ b/adapter_docs/overview.md
@@ -35,7 +35,7 @@ config = ...  # config class deriving from AdapterConfigBase
 model.add_adapter("name", config=config)
 ```
 
-```eval_rst
+```{eval-rst}
 .. important::
     In the literature, different terms are used to refer to efficient fine-tuning methods.
     The term "adapter" is usually only applied to bottleneck adapter modules.
@@ -67,7 +67,7 @@ $$
 
 A visualization of further configuration options related to the adapter structure is given in the figure below.
 For more details, refer to the documentation of [`AdapterConfig`](transformers.AdapterConfig).
 
-```eval_rst
+```{eval-rst}
 .. figure:: img/architecture.png
     :width: 350
     :align: center
@@ -120,7 +120,7 @@ model.add_adapter("lang_adapter", config=config)
 
 _Papers:_
 - [MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer](https://arxiv.org/pdf/2005.00052.pdf) (Pfeiffer et al., 2020)
 
-```eval_rst
+```{eval-rst}
 .. note::
     V1.x of adapter-transformers made a distinction between task adapters (without invertible adapters) and language adapters (with invertible adapters) with the help of the ``AdapterType`` enumeration.
     This distinction was dropped with v2.x.
@@ -171,7 +171,7 @@ for a PHM layer by specifying `use_phm=True` in the config.
 
 The PHM layer has the following additional properties: `phm_dim`, `shared_phm_rule`, `factorized_phm_rule`, `learn_phm`, `factorized_phm_W`, `shared_W_phm`, `phm_c_init`, `phm_init_range`, `hypercomplex_nonlinearity`.
 
-For more information check out the [AdapterConfig](classes/adapter_config.html#transformers.AdapterConfig) class.
+For more information, check out the [`AdapterConfig`](transformers.AdapterConfig) class.
 
 To add a Compacter to your model, you can use the predefined configs:
 ```python
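For the two config classes newly documented in `adapter_config.rst`, usage mirrors the other predefined configs. A sketch assuming the top-level exports shown in the autodoc entries above (checkpoint and adapter names are placeholders):

```python
from transformers import AutoAdapterModel, CompacterConfig, CompacterPlusPlusConfig

model = AutoAdapterModel.from_pretrained("bert-base-uncased")

# Compacter: a bottleneck adapter whose down/up projections are
# PHM layers with weights shared across layers
model.add_adapter("compacter", config=CompacterConfig())

# Compacter++: same idea, but with adapter modules only after the feed-forward block
model.add_adapter("compacterpp", config=CompacterPlusPlusConfig())
```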
diff --git a/adapter_docs/prediction_heads.md b/adapter_docs/prediction_heads.md
index 7552b9fdd3..c975bc3f78 100644
--- a/adapter_docs/prediction_heads.md
+++ b/adapter_docs/prediction_heads.md
@@ -3,7 +3,7 @@
 This section gives an overview of how different prediction heads can be used together with adapter modules and how pre-trained adapters can be distributed side-by-side with matching prediction heads in AdapterHub.
 We will take a look at the `AdapterModel` classes (e.g. `BertAdapterModel`) introduced by adapter-transformers, which provide **flexible** support for prediction heads, as well as models with **static** heads provided out-of-the-box by HuggingFace Transformers (e.g. `BertForSequenceClassification`).
 
-```eval_rst
+```{eval-rst}
 .. tip::
     We recommend using the `AdapterModel classes <#adaptermodel-classes>`_ whenever possible.
     They have been created specifically for working with adapters and provide more flexibility.
@@ -37,7 +37,7 @@ Since we gave the task adapter the same name as our head, we can easily identify
 The call to `set_active_adapters()` in the second line tells our model to use the adapter - head configuration we specified by default in a forward pass.
 At this point, we can start to [train our setup](training.md).
 
-```eval_rst
+```{eval-rst}
 .. note::
     ``set_active_adapters()`` will search for an adapter and a prediction head with the given name to be activated.
     Alternatively, prediction heads can also be activated explicitly (i.e. without adapter modules).
@@ -87,7 +87,7 @@ In case the classes match, our prediction head weights will be automatically loa
 
 ## Automatic conversion
 
-```eval_rst
+```{eval-rst}
 .. important::
     Although the two prediction head implementations serve the same use case, their weights are *not* directly compatible, i.e. you cannot load a head created with ``AutoAdapterModel`` into a model of type ``AutoModelForSequenceClassification``.
     There is, however, an automatic conversion to model classes with flexible heads.
diff --git a/adapter_docs/quickstart.md b/adapter_docs/quickstart.md
index 6a7da87bc1..e510f0d11d 100644
--- a/adapter_docs/quickstart.md
+++ b/adapter_docs/quickstart.md
@@ -6,7 +6,7 @@ Currently, *adapter-transformers* adds adapter components to the PyTorch impleme
 For working with adapters, a couple of methods for creation (`add_adapter()`), loading (`load_adapter()`), storing (`save_adapter()`) and deletion (`delete_adapter()`) are added to the model classes.
 In the following, we will briefly go through some examples.
 
-```eval_rst
+```{eval-rst}
 .. note::
     This document focuses on the adapter-related functionalities added by *adapter-transformers*. For a more general overview of the *transformers* library, visit
diff --git a/adapter_docs/training.md b/adapter_docs/training.md
index c005aa918a..e444c4ac69 100644
--- a/adapter_docs/training.md
+++ b/adapter_docs/training.md
@@ -47,7 +47,7 @@ if task_name not in model.config.adapters:
 model.train_adapter(task_name)
 ```
 
-```eval_rst
+```{eval-rst}
 .. important::
     The most crucial step when training an adapter module is to freeze all weights in the model except for those of the adapter.
     In the previous snippet, this is achieved by calling the ``train_adapter()`` method which disables training
@@ -90,12 +90,12 @@ python run_glue.py \
 
 The important flag here is `--train_adapter`, which switches from fine-tuning the full model to training an adapter module for the given GLUE task.
 
-```eval_rst
+```{eval-rst}
 .. tip::
     Adapter weights are usually initialized randomly, which is why we require a higher learning rate. We have found that a default adapter learning rate of ``1e-4`` works well for most settings.
 ```
 
-```eval_rst
+```{eval-rst}
 .. tip::
     Depending on your dataset size, you might also need to train longer than usual. To avoid overfitting, you can evaluate the adapters after each epoch on the development set and only save the best model.
 ```
@@ -129,7 +129,7 @@ python run_mlm.py \
 
 We provide an example for training _AdapterFusion_ ([Pfeiffer et al., 2020](https://arxiv.org/pdf/2005.00247)) on the GLUE dataset: [run_fusion_glue.py](https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/adapterfusion/run_fusion_glue.py).
 You can adapt this script to train AdapterFusion with different pre-trained adapters on your own dataset.
 
-```eval_rst
+```{eval-rst}
 .. important::
     AdapterFusion on a target task is trained in a second training stage, after independently training adapters on individual tasks.
     When setting up a fusion architecture on your model, make sure to load the pre-trained adapter modules to be fused using ``model.load_adapter()`` before adding a fusion layer.
@@ -180,7 +180,7 @@ trainer = AdapterTrainer(
     data_collator=data_collator,
 )
 ```
-```eval_rst
+```{eval-rst}
 .. tip::
     When you migrate from previous versions, which used the ``Trainer`` class both for adapter training and full fine-tuning, note that the specialized ``AdapterTrainer`` class does not have the parameters ``do_save_full_model``, ``do_save_adapters`` and ``do_save_adapter_fusion``.
diff --git a/adapter_docs/v2_transition.md b/adapter_docs/v2_transition.md
index 171dbdeffb..91d85a1ac2 100644
--- a/adapter_docs/v2_transition.md
+++ b/adapter_docs/v2_transition.md
@@ -106,7 +106,7 @@ model.active_adapters = "awesome_adapter"
 model(**input_data)
 ```
 
-```eval_rst
+```{eval-rst}
 .. note::
     Version 2.0.0 temporarily removed the ``adapter_names`` parameter entirely.
     Due to user feedback regarding limitations of the ``active_adapters`` property in multi-threaded contexts,
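For the `AdapterTrainer` hunk in `adapter_docs/training.md` above, a fuller sketch of the intended setup may help. This is a minimal sketch, not the script from the diff: the task name, head, and hyperparameters are illustrative, and `train_dataset`/`eval_dataset` are assumed to be pre-tokenized datasets defined elsewhere:

```python
from transformers import AutoAdapterModel, AdapterTrainer, TrainingArguments

model = AutoAdapterModel.from_pretrained("bert-base-uncased")
model.add_adapter("sst-2")
model.add_classification_head("sst-2", num_labels=2)
model.train_adapter("sst-2")  # freeze all weights except the adapter (and head)

training_args = TrainingArguments(
    output_dir="./out",
    learning_rate=1e-4,  # the higher adapter learning rate recommended in the docs
    num_train_epochs=10,
)
# train_dataset / eval_dataset: placeholder pre-tokenized datasets
trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```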
diff --git a/examples/README.md b/examples/README.md
deleted file mode 100644
index 603eb564c2..0000000000
--- a/examples/README.md
+++ /dev/null
@@ -1,80 +0,0 @@
-
-# Examples
-
-We host a wide range of example scripts for multiple learning frameworks. Simply choose your favorite: [TensorFlow](https://github.com/huggingface/transformers/tree/master/examples/tensorflow), [PyTorch](https://github.com/huggingface/transformers/tree/master/examples/pytorch) or [JAX/Flax](https://github.com/huggingface/transformers/tree/master/examples/flax).
-
-We also have some [research projects](https://github.com/huggingface/transformers/tree/master/examples/research_projects), as well as some [legacy examples](https://github.com/huggingface/transformers/tree/master/examples/legacy). Note that unlike the main examples these are not actively maintained, and may require specific older versions of dependencies in order to run.
-
-While we strive to present as many use cases as possible, the example scripts are just that - examples. It is expected that they won't work out-of-the-box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, most of the examples fully expose the preprocessing of the data, allowing you to tweak and edit them as required.
-
-Please discuss on the [forum](https://discuss.huggingface.co/) or in an [issue](https://github.com/huggingface/transformers/issues) a feature you would like to implement in an example before submitting a PR; we welcome bug fixes, but since we want to keep the examples as simple as possible it's unlikely that we will merge a pull request adding more functionality at the cost of readability.
-
-## Important note
-
-**Important**
-
-To make sure you can successfully run the latest versions of the example scripts, you have to **install the library from source** and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
-```bash
-git clone https://github.com/huggingface/transformers
-cd transformers
-pip install .
-```
-Then cd into the example folder of your choice and run
-```bash
-pip install -r requirements.txt
-```
-
-To browse the examples corresponding to released versions of 🤗 Transformers, click on the line below and then on your desired version of the library:
-
-<details>
-  <summary>Examples for older versions of 🤗 Transformers</summary>
-</details>
-
-Alternatively, you can switch your cloned 🤗 Transformers to a specific version (for instance with v3.5.1) with
-```bash
-git checkout tags/v3.5.1
-```
-and run the example command as usual afterward.
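The `reduction_factor` mapping described in the reflowed `AdapterConfig` docstring below can be illustrated as follows. A sketch only — the mapping values come from the docstring's own example, the other arguments are illustrative:

```python
from transformers import AdapterConfig

# Layer 1 uses a bottleneck of hidden_size/8, layer 6 uses hidden_size/32,
# and all remaining layers fall back to the "default" entry, hidden_size/16.
config = AdapterConfig(
    mh_adapter=True,
    output_adapter=True,
    reduction_factor={"1": 8, "6": 32, "default": 16},
    non_linearity="relu",
)
```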
diff --git a/src/transformers/adapters/configuration.py b/src/transformers/adapters/configuration.py
index 6cd9a8b77b..5ccb9ffc99 100644
--- a/src/transformers/adapters/configuration.py
+++ b/src/transformers/adapters/configuration.py
@@ -136,17 +136,17 @@ class AdapterConfig(AdapterConfigBase):
     Args:
         mh_adapter (:obj:`bool`): If True, add adapter modules after the multi-head attention block of each layer.
         output_adapter (:obj:`bool`): If True, add adapter modules after the output FFN of each layer.
-        reduction_factor (:
-            obj:`int` or :obj:`Mapping`): Either an integer specifying the reduction factor for all layers or a mapping
-            specifying the reduction_factor for individual layers. If not all layers are represented in the mapping a
-            default value should be given e.g. {'1': 8, '6': 32, 'default': 16}
+        reduction_factor (:obj:`int` or :obj:`Mapping`):
+            Either an integer specifying the reduction factor for all layers or a mapping specifying the
+            reduction_factor for individual layers. If not all layers are represented in the mapping, a default value
+            should be given, e.g. {'1': 8, '6': 32, 'default': 16}.
         non_linearity (:obj:`str`): The activation function to use in the adapter bottleneck.
-        original_ln_before (:
-            obj:`bool`, optional): If True, apply layer pre-trained normalization and residual connection before the
-            adapter modules. Defaults to False. Only applicable if :obj:`is_parallel` is False.
-        original_ln_after (:
-            obj:`bool`, optional): If True, apply pre-trained layer normalization and residual connection after the
-            adapter modules. Defaults to True.
+        original_ln_before (:obj:`bool`, optional):
+            If True, apply pre-trained layer normalization and residual connection before the adapter modules.
+            Defaults to False. Only applicable if :obj:`is_parallel` is False.
+        original_ln_after (:obj:`bool`, optional):
+            If True, apply pre-trained layer normalization and residual connection after the adapter modules. Defaults
+            to True.
         ln_before (:obj:`bool`, optional): If True, add a new layer normalization before the adapter bottleneck.
             Defaults to False.
         ln_after (:obj:`bool`, optional): If True, add a new layer normalization after the adapter bottleneck.
@@ -155,37 +155,35 @@ class AdapterConfig(AdapterConfigBase):
             Currently, this can be either "bert" (default) or "mam_adapter".
         is_parallel (:obj:`bool`, optional): If True, apply adapter transformations in parallel.
             By default (False), sequential application is used.
-        scaling:
-            (:obj:`float` or :obj:`str`, optional): Scaling factor to use for scaled addition of adapter outputs as
-            done by He et al. (2021). Can bei either a constant factor (float) or the string "learned", in which case
-            the scaling factor is learned. Defaults to 1.0.
-        residual_before_ln (:
-            obj:`bool`, optional): If True, take the residual connection around the adapter bottleneck before the layer
-            normalization. Only applicable if :obj:`original_ln_before` is True.
-        adapter_residual_before_ln (:
-            obj:`bool`, optional): If True, apply the residual connection around the adapter modules before the new
-            layer normalization within the adapter. Only applicable if :obj:`ln_after` is True and :obj:`is_parallel`
-            is False.
-        inv_adapter:
-            (:obj:`str`, optional): If not None (default), add invertible adapter modules after the model embedding
-            layer. Currently, this can be either "nice" or "glow".
-        inv_adapter_reduction_factor (:
-            obj:`int`, optional): The reduction to use within the invertible adapter modules. Only applicable if
-            :obj:`inv_adapter` is not None.
-        cross_adapter (:
-            obj:`bool`, optional): If True, add adapter modules after the cross attention block of each decoder layer
-            in an encoder-decoder model. Defaults to False.
-        leave_out (:
-            obj:`List[int]`, optional): The IDs of the layers (starting at 0) where NO adapter modules should be added.
+        scaling (:obj:`float` or :obj:`str`, optional):
+            Scaling factor to use for scaled addition of adapter outputs as done by He et al. (2021). Can be either a
+            constant factor (float) or the string "learned", in which case the scaling factor is learned. Defaults to
+            1.0.
+        residual_before_ln (:obj:`bool`, optional):
+            If True, take the residual connection around the adapter bottleneck before the layer normalization. Only
+            applicable if :obj:`original_ln_before` is True.
+        adapter_residual_before_ln (:obj:`bool`, optional):
+            If True, apply the residual connection around the adapter modules before the new layer normalization within
+            the adapter. Only applicable if :obj:`ln_after` is True and :obj:`is_parallel` is False.
+        inv_adapter (:obj:`str`, optional):
+            If not None (default), add invertible adapter modules after the model embedding layer. Currently, this can
+            be either "nice" or "glow".
+        inv_adapter_reduction_factor (:obj:`int`, optional):
+            The reduction to use within the invertible adapter modules. Only applicable if :obj:`inv_adapter` is not
+            None.
+        cross_adapter (:obj:`bool`, optional):
+            If True, add adapter modules after the cross attention block of each decoder layer in an encoder-decoder
+            model. Defaults to False.
+        leave_out (:obj:`List[int]`, optional):
+            The IDs of the layers (starting at 0) where NO adapter modules should be added.
         phm_layer (:obj:`bool`, optional): If True the down and up projection layers are a PHMLayer.
             Defaults to False
         phm_dim (:obj:`int`, optional): The dimension of the phm matrix. Defaults to None.
         shared_phm_rule (:obj:`bool`, optional): Whether the phm matrix is shared across all layers.
             Defaults to True
-        factorized_phm_rule (:
-            obj:`bool`, optional): Whether the phm matrix is factorized into a left and right matrix. Defaults to
-            False.
+        factorized_phm_rule (:obj:`bool`, optional):
+            Whether the phm matrix is factorized into a left and right matrix. Defaults to False.
         learn_phm (:obj:`bool`, optional): Whether the phm matrix should be learned during training.
             Defaults to True
         factorized_phm_W (:
@@ -197,16 +195,15 @@ class AdapterConfig(AdapterConfigBase):
             The possible values are `["normal", "uniform"]`. Defaults to `normal`.
         phm_init_range (:obj:`float`, optional): std for initializing phm weights if `phm_c_init="normal"`.
             Defaults to 0.0001.
-        hypercomplex_nonlinearity (:
-            obj:`str`, optional): This specifies the distribution to draw the weights in the phm layer from, Defaults
-            to `glorot-uniform`.
-        phm_rank (:
-            obj:`int`, optional): If the weight matrix is factorized this specifies the rank of the matrix. E.g. the
-            left matrix of the down projection has the shape (phm_dim, _in_feats_per_axis, phm_rank) and the right
-            matrix (phm_dim, phm_rank, _out_feats_per_axis). Defaults to 1
-        phm_bias (:
-            obj:`bool`, optional): If True the down and up projection PHMLayer has a bias term. If `phm_layer`is False
-            this is ignored. Defaults to True
+        hypercomplex_nonlinearity (:obj:`str`, optional):
+            This specifies the distribution to draw the weights in the phm layer from. Defaults to `glorot-uniform`.
+        phm_rank (:obj:`int`, optional):
+            If the weight matrix is factorized, this specifies the rank of the matrix. E.g. the left matrix of the down
+            projection has the shape (phm_dim, _in_feats_per_axis, phm_rank) and the right matrix (phm_dim, phm_rank,
+            _out_feats_per_axis). Defaults to 1.
+        phm_bias (:obj:`bool`, optional):
+            If True, the down and up projection PHMLayer has a bias term. If `phm_layer` is False, this is ignored.
+            Defaults to True.
     """
 
     # Required options
diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index 210614a7d8..9d18653f08 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -2108,6 +2108,8 @@ def _save_tpu(self, output_dir: Optional[str] = None):
             self.tokenizer.save_pretrained(output_dir)
 
     def _load(self, resume_from_checkpoint):
+        args = self.args
+
         if not os.path.isfile(os.path.join(resume_from_checkpoint, WEIGHTS_NAME)):
             raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
diff --git a/utils/style_doc.py b/utils/style_doc.py
index d65fc04efa..d1fed057be 100644
--- a/utils/style_doc.py
+++ b/utils/style_doc.py
@@ -424,7 +424,7 @@ def style_file_docstrings(code_file, max_len=119, check_only=False):
     diff = clean_code != code
     if not check_only and diff:
         print(f"Overwriting content of {code_file}.")
-        with open(code_file, "w", encoding="utf-8") as f:
+        with open(code_file, "w", encoding="utf-8", newline="\n") as f:
             f.write(clean_code)
 
     return diff, black_errors
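Regarding the final `utils/style_doc.py` hunk: passing `newline="\n"` disables Python's platform-dependent newline translation, so restyled files keep LF line endings even when the script runs on Windows. A standalone illustration of the difference (file names are hypothetical):

```python
# Default mode: each "\n" may be translated to the platform's line
# separator (CRLF on Windows) when the text file is written.
with open("crlf.txt", "w", encoding="utf-8") as f:
    f.write("line1\nline2\n")

# With newline="\n", "\n" is written through unchanged on every platform.
with open("lf.txt", "w", encoding="utf-8", newline="\n") as f:
    f.write("line1\nline2\n")
```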