Disambiguating between the possible senses of a word in the context of a sentence is a fundamental problem in NLP. However, this assumes a universal set of "meanings" to disambiguate between. A more natural and more practical task is finding a good substitute for a word in context. For example, in the sentence "She went to the bar last night", bar means pub, but the word has other meanings, such as a chocolate bar or a ban or restriction on something.
This repository uses a Word2Vec embedding based on the Google News corpus, made available here and through the gensim library, to rank candidate word substitutions by their suitability to the context of the sentence.
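To make the idea of ranking candidates by contextual fit concrete, here is a minimal sketch. It is not necessarily the scoring method this repository uses: it assumes a simple score, the mean cosine similarity between a candidate's vector and the vectors of the context words. The function names (`cosine`, `rank_candidates`) and the three-dimensional toy vectors are hypothetical and invented for illustration only.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def rank_candidates(candidates, context_words, vectors):
    """Sort candidates by mean cosine similarity to the context words.

    `vectors` is any mapping from word to vector that supports `in` and
    `[]` lookups. Words missing from the mapping are skipped.
    """
    context = [vectors[w] for w in context_words if w in vectors]
    scored = []
    for cand in candidates:
        if cand not in vectors or not context:
            continue
        score = sum(cosine(vectors[cand], c) for c in context) / len(context)
        scored.append((cand, score))
    # Highest score first.
    return [c for c, _ in sorted(scored, key=lambda p: p[1], reverse=True)]


# Toy vectors, made up for this example (not real Word2Vec values).
toy_vectors = {
    "drink": np.array([1.0, 0.1, 0.0]),
    "night": np.array([0.8, 0.2, 0.1]),
    "pub": np.array([0.9, 0.2, 0.0]),
    "chocolate": np.array([0.1, 1.0, 0.0]),
    "ban": np.array([0.0, 0.1, 1.0]),
}

ranked = rank_candidates(["ban", "chocolate", "pub"], ["drink", "night"], toy_vectors)
print(ranked)  # ['pub', 'chocolate', 'ban']
```

Because the function only relies on `in` and `[]`, a gensim `KeyedVectors` object could be passed in place of the toy dictionary.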
- Download the Google News word vectors from here and make sure you have the gensim package installed.
- Make sure you've installed nltk (Natural Language Toolkit) and have downloaded the Lin thesaurus and WordNet corpora by executing the following in the Python console:
import nltk
nltk.download('lin_thesaurus')
nltk.download('wordnet')
from lexsub import LexSub
from gensim.models import KeyedVectors
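# Path to the downloaded Google News vectors (binary word2vec format)
word2vec_path = "/path/to/GoogleNews-vectors-negative300.bin"
vectors = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)
# 'lin' presumably selects the Lin thesaurus as the source of candidates
ls = LexSub(vectors, candidate_generator='lin')
sentence = "She had a drink at the bar"
# Target word followed by a part-of-speech suffix ('n', presumably noun)
target = "bar.n"
# Rank substitutes for the target in the context of the sentence
result = ls.lex_sub(target, sentence)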
print(result)
# ['bars', 'pub', 'tavern', 'nightclub', 'restaurant']
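The target in the example above is written as `bar.n`, which suggests a `lemma.pos` convention. How `LexSub` parses this string internally is not shown here, so the helper below is only a hypothetical sketch of how such a target could be split; the name `split_target` is an assumption and not part of the `lexsub` module.

```python
def split_target(target):
    """Split a 'lemma.pos' string into (lemma, pos).

    Returns (target, None) when there is no '.' separator.
    Hypothetical helper, not taken from the lexsub module.
    """
    lemma, sep, pos = target.rpartition(".")
    if not sep:
        return target, None
    return lemma, pos


print(split_target("bar.n"))  # ('bar', 'n')
print(split_target("bar"))    # ('bar', None)
```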