Does spaCy support subword embeddings? #7278
Embeddings like fastText and bpemb don't just offer word embeddings; internally they also have representations for subwords. This is great for handling out-of-vocabulary words in situations like spelling errors or slang. What's also great is that `bpemb` supports 275 languages. I'm wondering how easy it might be to port these embeddings into a spaCy pipeline. It seems like I can either enumerate all the words beforehand on disk, or loop over all the words in a corpus of interest and append them to the `Vocab` (sketched below). But in both these scenarios I'm mainly working with words, not subwords.

One idea I'm having is that I might overwrite the `.vector` property of a `Token` such that it dynamically fetches the appropriate embedding. But this feels overly hacky, and it's likely also going to be fairly slow. Is there perhaps a better way?
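For concreteness, here's roughly what I mean by appending to the `Vocab`: a minimal sketch assuming `bpemb`'s `BPEmb` class and a made-up word list:

```python
# Sketch: enumerate words from a corpus and store an averaged subword
# vector for each one in the Vocab. The word list is hypothetical.
import spacy
from bpemb import BPEmb

nlp = spacy.blank("en")
bpemb_en = BPEmb(lang="en", dim=100)  # downloads pretrained subword vectors

for word in ["yeet", "teh", "spaCy"]:  # stand-in for a real corpus vocabulary
    # embed() returns one vector per byte-pair subword; average them into
    # a single word vector before storing it.
    nlp.vocab.set_vector(word, bpemb_en.embed(word).mean(axis=0))

print(nlp("teh")[0].vector[:3])  # now backed by the stored vector
```

This works, but every word has to be known ahead of time, which defeats part of the point of subword embeddings.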
No, spaCy doesn't really support subword embeddings. You can make an unreasonably huge vector table if you want, by enumerating word forms and precomputing a vector for each, but that only ever covers the words you enumerated. If you add a user hook for `token.vector`, you can instead compute vectors dynamically, which is close to the `.vector` idea above; see the sketch after this paragraph.
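A rough sketch of the user-hook approach, assuming fastText's official Python bindings; the model path and the component name here are placeholders:

```python
# Sketch: supply token vectors dynamically through a user hook, so any
# string gets a subword-composed fastText vector, even if it's OOV.
import fasttext
import spacy
from spacy.language import Language

ft = fasttext.load_model("cc.en.300.bin")  # placeholder model path

@Language.component("fasttext_vector_hook")  # hypothetical component name
def fasttext_vector_hook(doc):
    # get_word_vector() composes character n-gram vectors, so it returns
    # something sensible for misspellings and slang too.
    doc.user_token_hooks["vector"] = lambda token: ft.get_word_vector(token.text)
    doc.user_token_hooks["has_vector"] = lambda token: True
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("fasttext_vector_hook")
doc = nlp("a missspelled word")
print(doc[1].vector.shape)  # (300,), even though "missspelled" is OOV
```

One caveat: user hooks only change what you see from Python. Trained pipeline components read static vectors from the `Vocab`'s vector table, so they won't pick up vectors supplied through a hook.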
Last year I worked a bit on getting fastText to train subword-only vectors using Bloom embeddings in a much smaller hash table than it would normally use for the subwords. By default fastText hashes each subword into one row of a table with 2M entries, but the idea is to do the same thing as spaCy's Bloom embeddings and hash each subword into multiple rows of a much smaller table. The idea is that you'd have a compact vector table that can still produce a distinct vector for pretty much any string.
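To make the hashing idea concrete, here is a toy illustration; the table size, number of hashes, and the seeded-CRC32 hashing are made up for the example and are not what fastText or spaCy actually use:

```python
# Toy Bloom-embedding sketch: hash each character n-gram into several
# rows of one small table and sum those rows, instead of fastText's
# single hash into a 2M-row table.
import zlib

import numpy as np

N_ROWS, DIM, N_HASHES = 20_000, 300, 4
table = np.random.default_rng(0).normal(scale=0.1, size=(N_ROWS, DIM))

def subword_vector(subword: str) -> np.ndarray:
    # Seeded CRC32 stands in for k independent hash functions: a single
    # collision is likely somewhere, but rarely in all k rows at once,
    # so each subword still gets an effectively unique summed vector.
    rows = [zlib.crc32(f"{seed}:{subword}".encode()) % N_ROWS
            for seed in range(N_HASHES)]
    return table[rows].sum(axis=0)

def word_vector(word: str, minn: int = 3, maxn: int = 6) -> np.ndarray:
    # fastText-style boundary markers and character n-grams.
    w = f"<{word}>"
    ngrams = [w[i:i + n] for n in range(minn, maxn + 1)
              for i in range(len(w) - n + 1)]
    return np.mean([subword_vector(ng) for ng in ngrams], axis=0)

print(word_vector("yeet").shape)  # (300,) for any input string
```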