MoRTy: a simple tool for Zero-shot domain adaptation of embeddings

MoRTy is a simple baseline method for zero-shot domain adaptation of embeddings that works especially well for low-resource applications, such as when little pre-training data is available. It solves ...

Problems

In practice, one has to choose which embedding model (FastText, GloVe, TransformerX) is optimal for a task. While most pre-training methods like BERT are optimized for 'high-pre-training-resource' domains, they cannot be applied directly to 'low-pre-training-resource' settings and incur substantial training costs. In practice, using a multi-GPU model to fine-tune on a sub-10 MB supervision task can seem counterintuitive and incurs preparation and maintenance costs, which limits scalability for future use cases or during deployment.

Acronym, architecture, and model code

Menu of reconstructing transformations yields domain (de-)adapted embeddings

# See parameter settings in the Recipe section below or in the MoRTy.py example (pc = OrderedDict ...).
# `params` (the hyperparameter dict) and the custom L1Penalty autograd function are defined in MoRTy.py.
import torch
import torch.nn as nn

class SparseOverCompleteAutoEncoder(torch.nn.Module):
    """ A sparse L1 autoencoder that retrofits embeddings of the same (complete AE)
        or larger (overcomplete AE) dimension than the original embeddings.
    """
    def __init__(self, emb_dim, hidden_size, bias=True):
        super(SparseOverCompleteAutoEncoder, self).__init__()
        self.lin_encoder = nn.Linear(emb_dim, hidden_size, bias=bias) # no bias works too
        self.lin_decoder = nn.Linear(hidden_size, emb_dim, bias=bias)
        self.feature_size = emb_dim
        self.hidden_size = hidden_size
        self.l1weight = params['l1'] # sparsity weight; had little effect on SUM-of-18-tasks performance
        self.act = params['activation'] # linear was best for the SUM-of-18-tasks score

    def forward(self, input):
        r = self.act(self.lin_encoder(input))
        if self.l1weight is not None: # sparsity penalty
            x_ = self.lin_decoder(L1Penalty.apply(r, self.l1weight))
        else:
            x_ = self.lin_decoder(r) # no sparsity penalty
        # x_ is the reconstruction used for the training loss; r are the new/retrofitted
        # embeddings read out after training (1 epoch)
        return x_, r
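
For orientation, below is a minimal sketch of training this class for one epoch and reading out the codes r as the new embeddings. The params values, the MSE loss, the Adam optimizer, the learning rate, the batch size, and the random placeholder matrix are illustrative assumptions and may differ from the exact defaults in MoRTy.py.

# Minimal 1-epoch training sketch (illustrative assumptions, not the exact MoRTy.py loop)
params = {'l1': None, 'activation': lambda x: x}   # linear activation, no sparsity penalty
model = SparseOverCompleteAutoEncoder(emb_dim=300, hidden_size=300)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
X = torch.randn(10000, 300)                        # placeholder for the pre-trained embedding matrix E_org
for i in range(0, len(X), 128):                    # a single pass (1 epoch) over the vocabulary
    x = X[i:i + 128]
    x_, r = model(x)                               # x_ = reconstruction, r = hidden codes
    loss = torch.nn.functional.mse_loss(x_, x)     # train on the reconstruction x_
    optimizer.zero_grad(); loss.backward(); optimizer.step()
with torch.no_grad():
    _, E_new = model(X)                            # r after 1 epoch = the retrofitted embeddings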

Recipe: 🍲

  1. Pre-train or download embeddings E_org (FastText is recommended for its out-of-vocabulary abilities).
  2. Produce k randomly autoencoded/retrofitted versions E_r1 ... E_rk of the original embeddings E_org.
  3. Choose the optimal E_ro from the k E_ri according to one of the following:
  • Embedding specialization/ Supervised use case: a supervised end-task's development set (E_ri is now essentially a hyperparameter). To save computation, consider selecting the optimal embedding E_ro with a low-cost baseline such as logistic regression or FastText and then using the found E_ro in the more complex model (see the selection sketch after this list). This also works for finding the optimal E_ro in multi-input/channel and multi-task settings.
  • Embedding specialization/ Proxy-supervised use case: use the dev/test set of a related (benchmark) task to find the optimal embedding E_ro ('proxy-shot' setting).
  • Embedding generalization/ Zero-shot use case: when training embeddings E_ri for 1 epoch on E_org versions pre-trained on corpora of different sizes (WikiText-2/-103, Common Crawl), we found MoRTy to consistently produce score improvements (between 1 and 9%) on the sum of 18 word-embedding benchmark tasks. This means that MoRTy generalizes embeddings 'blindly'.
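
To make step 3 concrete, here is a hedged sketch of selecting E_ro among the k candidates with a low-cost logistic-regression baseline on a development split. The helper names (featurize, dev_score), the averaged-word-vector features, and the accuracy metric are illustrative assumptions and not part of MoRTy.py.

# Hedged sketch of recipe step 3: pick E_ro among candidate embeddings E_r1 ... E_rk
# by the dev-set score of a cheap baseline. Helper names and features are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def featurize(token_lists, E_ri, vocab):
    """Average word vectors per example; zero vector if no token is in the vocabulary."""
    feats = []
    for toks in token_lists:
        idx = [vocab[t] for t in toks if t in vocab]
        feats.append(E_ri[idx].mean(axis=0) if idx else np.zeros(E_ri.shape[1]))
    return np.array(feats)

def dev_score(E_ri, vocab, X_tr, y_tr, X_dev, y_dev):
    """Train a cheap logistic-regression baseline on E_ri features and score it on the dev split."""
    clf = LogisticRegression(max_iter=1000).fit(featurize(X_tr, E_ri, vocab), y_tr)
    return accuracy_score(y_dev, clf.predict(featurize(X_dev, E_ri, vocab)))

# candidates = [E_r1, ..., E_rk] from k randomly parameterized 1-epoch autoencoder runs
# E_ro = max(candidates, key=lambda E: dev_score(E, vocab, X_tr, y_tr, X_dev, y_dev))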

Properties/ use cases

  • Zero- to few/proxy-shot domain adaptation
  • train in seconds 🕐
  • low RAM requirements, no GPU needed -- low carbon footprint, MoRTy ♥️ 🌍
  • saves annotation effort
  • usable to train simpler models (lower model extension costs/time)
  • cheaply produce that last 5% performance increase for customers 😏
  • MoRTy is not a Muppet

Usage 🔧

MoRTy.py contains example code that takes a .vec file (e.g. data/wikitext2_FastText_SG0.vec in #word\tembedding_values format) of FastText or GloVe pre-trained embeddings and produces new autoencoded versions of those embeddings. The parameters in the pc dictionary can be adjusted. Though the script supports hyperparameter exploration by extending the value lists in the pc object, this should not be necessary.
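
For reference, a .vec file in this layout can be read into a vocabulary dict and an embedding matrix with a few lines of Python. The separator handling below (a tab between word and values, whitespace between values) and the skipping of short header lines are assumptions about the example file rather than guarantees; the load_vec name is hypothetical.

# Hedged sketch: load a word\tembedding_values .vec file into (vocab, embedding matrix)
import numpy as np

def load_vec(path):
    vocab, vectors = {}, []
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            if len(parts) < 2:                 # skip a possible header or malformed lines
                continue
            vocab[parts[0]] = len(vectors)     # word -> row index
            vectors.append([float(v) for v in ' '.join(parts[1:]).split()])
    return vocab, np.array(vectors, dtype=np.float32)

# vocab, E_org = load_vec('data/wikitext2_FastText_SG0.vec')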
Note: To reproduce the paper's 1-epoch results (below), MoRTy was trained for 1 epoch using the script's default settings. Blue is the FastText embedding baseline performance (= 100%) for 5 FastText baselines per corpus size. On each baseline, 3 MoRTy variants (red, yellow, green) were trained for 1 epoch.

The MT, ST results are best scores over multiple runs of MoRTy, so they indicate an upper bound that can be approached on practical datasets by using a development split for MoRTy selection. In the paper's experiments, the [word embedding benchmark](https://github.com/kudkudak/word-embeddings-benchmarks) by [Stanisław Jastrzebski et al.](https://arxiv.org/abs/1702.02170) serves as the evaluation set -- i.e., no dev set was used. Since for practical applications the method is intended as a post-processing step to get cheap score improvements on (m)any embedding model, only relative (potential) score changes are reported in the paper.

Dependencies

python          3.6 # due to word embedding benchmark
pandas          0.25.0
scikit-learn    0.21.2
pytorch         0.4.1
tqdm            4.32.1

See requirements.txt for the remaining dependencies.

Paper and BibTeX reference 📜

MoRTy: Unsupervised Learning of Task-specialized Word Embeddings by Autoencoding, Nils Rethmeier and Barbara Plank, RepL4NLP@ACL, Florence, Italy, 2019

@inproceedings{rethmeier-plank-2019-morty,
    title = "{M}o{RT}y: Unsupervised Learning of Task-specialized Word Embeddings by Autoencoding",
    author = "Rethmeier, Nils  and  Plank, Barbara",
    booktitle = "Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)",
    month = "august",
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4307",
    pages = "49--54",
}
