Does spaCy support subword embeddings? #7278
Embeddings like fastText and bpemb don't just offer word embeddings; internally they also have representations for subwords. This is great for handling out-of-vocabulary words in situations like spelling errors or slang. What's also great is that `bpemb` supports 275 languages. I'm wondering how easy it might be to port these embeddings into a spaCy pipeline. It seems like I can either enumerate all the words beforehand on disk, or loop over all the words in a corpus of interest and append them to the `Vocab` (sketched below). But in both these scenarios I'm mainly working with words, not subwords.

One idea I'm having is that I might overwrite the `.vector` property of a `Token` such that it dynamically fetches the appropriate embedding. But this feels overly hacky, and it's likely also going to be fairly slow. Is there perhaps a better way?
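For concreteness, here's roughly what I mean by appending to the `Vocab`: a minimal sketch assuming `bpemb`'s `BPEmb` class and a made-up word list:

```python
# Sketch: enumerate words from a corpus and store an averaged subword
# vector for each one in the Vocab. The word list is hypothetical.
import spacy
from bpemb import BPEmb

nlp = spacy.blank("en")
bpemb_en = BPEmb(lang="en", dim=100)  # downloads pretrained subword vectors

for word in ["yeet", "teh", "spaCy"]:  # stand-in for a real corpus vocabulary
    # embed() returns one vector per byte-pair subword; average them into
    # a single word vector before storing it.
    nlp.vocab.set_vector(word, bpemb_en.embed(word).mean(axis=0))

print(nlp("teh")[0].vector[:3])  # now backed by the stored vector
```

This works, but every word has to be known ahead of time, which defeats part of the point of subword embeddings.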
No, spaCy doesn't really support subword embeddings. You can make an unreasonably huge vector table if you want, by enumerating word forms and precomputing a vector for each, but that only ever covers the words you enumerated. If you add a user hook for `token.vector`, you can instead compute vectors dynamically, which is close to the `.vector` idea above; see the sketch after this paragraph.
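A rough sketch of the user-hook approach, assuming fastText's official Python bindings; the model path and the component name here are placeholders:

```python
# Sketch: supply token vectors dynamically through a user hook, so any
# string gets a subword-composed fastText vector, even if it's OOV.
import fasttext
import spacy
from spacy.language import Language

ft = fasttext.load_model("cc.en.300.bin")  # placeholder model path

@Language.component("fasttext_vector_hook")  # hypothetical component name
def fasttext_vector_hook(doc):
    # get_word_vector() composes character n-gram vectors, so it returns
    # something sensible for misspellings and slang too.
    doc.user_token_hooks["vector"] = lambda token: ft.get_word_vector(token.text)
    doc.user_token_hooks["has_vector"] = lambda token: True
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("fasttext_vector_hook")
doc = nlp("a missspelled word")
print(doc[1].vector.shape)  # (300,), even though "missspelled" is OOV
```

One caveat: user hooks only change what you see from Python. Trained pipeline components read static vectors from the `Vocab`'s vector table, so they won't pick up vectors supplied through a hook.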
Last year I worked a bit on getting fastText to train subword-only vectors using Bloom embeddings in a much smaller hash table than it would normally use for the subwords. By default fastText hashes each subword into one row of a table with 2M entries, but the idea is to do the same thing as spaCy's Bloom embeddings and hash each subword into multiple rows of a much smaller table. The idea is that you'd have a compact vector table that can still produce a distinct vector for pretty much any string.
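To make the hashing idea concrete, here is a toy illustration; the table size, number of hashes, and the seeded-CRC32 hashing are made up for the example and are not what fastText or spaCy actually use:

```python
# Toy Bloom-embedding sketch: hash each character n-gram into several
# rows of one small table and sum those rows, instead of fastText's
# single hash into a 2M-row table.
import zlib

import numpy as np

N_ROWS, DIM, N_HASHES = 20_000, 300, 4
table = np.random.default_rng(0).normal(scale=0.1, size=(N_ROWS, DIM))

def subword_vector(subword: str) -> np.ndarray:
    # Seeded CRC32 stands in for k independent hash functions: a single
    # collision is likely somewhere, but rarely in all k rows at once,
    # so each subword still gets an effectively unique summed vector.
    rows = [zlib.crc32(f"{seed}:{subword}".encode()) % N_ROWS
            for seed in range(N_HASHES)]
    return table[rows].sum(axis=0)

def word_vector(word: str, minn: int = 3, maxn: int = 6) -> np.ndarray:
    # fastText-style boundary markers and character n-grams.
    w = f"<{word}>"
    ngrams = [w[i:i + n] for n in range(minn, maxn + 1)
              for i in range(len(w) - n + 1)]
    return np.mean([subword_vector(ng) for ng in ngrams], axis=0)

print(word_vector("yeet").shape)  # (300,) for any input string
```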