
# awesome-sentence-embedding


A curated list of pretrained sentence and word embedding models

## Table of Contents

## About This Repo

- There are already a few awesome lists for word and sentence embeddings, but all of them are outdated and, more importantly, incomplete.
- This repo will be incomplete too, but I'll try my best to find and include every paper that comes with pretrained models.
- This is not a typical awesome list because it uses tables, but I guess that's fine and much better than one huge flat list.
- If you find any mistakes, know of another paper, or have anything else to add, please send a pull request and help me keep this list up to date.
- Enjoy!

## General Framework

- Almost all sentence embeddings work like this:
- Given some word embeddings and an optional encoder (for example an LSTM), they obtain contextualized word embeddings.
- Then they apply some kind of pooling (which can be as simple as last pooling).
- Based on that, they either use the result directly for a supervised classification task (like InferSent) or use it to generate a target sequence (like Skip-Thought).
- So, in a sense, there are many sentence embeddings you have never heard of: mean-pooling over any word embedding already gives you a sentence embedding! (A minimal sketch of this pipeline follows this list.)
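To make the pipeline concrete, here is a minimal sketch in PyTorch: an embedding lookup, an optional LSTM encoder, and mean pooling over the contextualized vectors. All names and dimensions here are hypothetical and only illustrate the shape of the computation, not any particular paper's model.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, purely for illustration.
VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 1000, 50, 64

embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)              # word embeddings
encoder = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True)   # optional encoder

def sentence_embedding(token_ids):
    """token_ids: LongTensor of shape (batch, seq_len)."""
    word_vecs = embedding(token_ids)        # (batch, seq_len, EMBED_DIM)
    contextual, _ = encoder(word_vecs)      # (batch, seq_len, HIDDEN_DIM)
    return contextual.mean(dim=1)           # mean pooling -> (batch, HIDDEN_DIM)

def bag_of_vectors(token_ids):
    """Skip the encoder entirely: mean-pooling the raw word vectors
    already yields a (simple) sentence embedding."""
    return embedding(token_ids).mean(dim=1)  # (batch, EMBED_DIM)

tokens = torch.randint(0, VOCAB_SIZE, (2, 7))  # two dummy 7-token sentences
print(sentence_embedding(tokens).shape)        # torch.Size([2, 64])
```

Swapping the mean pooling for last pooling (`contextual[:, -1]`) or the classification head for a decoder is what distinguishes most of the models listed below.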

## Word Embeddings

- Note: don't worry about the language the training code is written in; except for the subword models, you can almost always just load the pretrained embedding table in the framework of your choice and ignore the training code entirely (see the sketch below).
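For example, here is a minimal sketch of loading such a table from the plain-text format used by GloVe and fastText (one word per line followed by its vector). The file name is just an example; any embedding file in this format works.

```python
import numpy as np

def load_embedding_table(path):
    """Parse a GloVe/fastText-style text file: `word v1 v2 ... vD` per line.
    (fastText .vec files start with a `count dim` header line; skip it.)"""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) == 2:  # header line of a .vec file
                continue
            table[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return table

# Example path; substitute whichever pretrained table you downloaded.
vectors = load_embedding_table("glove.6B.50d.txt")
```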

{{{word-embedding-table}}}

## OOV Handling

## Contextualized Word Embeddings

- Note: all of the unofficial implementations can load the official pretrained models (see the sketch below).
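As one instance of this pattern, here is a minimal sketch using the HuggingFace Transformers library (an unofficial PyTorch implementation) to load Google's official BERT checkpoint. This is only an example; the same load-official-weights pattern applies to the other unofficial implementations in the table.

```python
# pip install transformers
from transformers import AutoModel, AutoTokenizer

# Unofficial implementation loading the official pretrained weights.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("A curated list of pretrained models.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) contextualized vectors
```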

{{{contextualized-table}}}

## Pooling Methods

## Encoders

{{{encoder-table}}}

## Evaluation

## Misc

## Vector Mapping

## Articles