Comparison of methods based on pre-trained Word2Vec, GloVe and FastText vectors to measure the semantic similarity between sentence pairs
- data/
    - datasets/
        - get_datasets.bash: script that downloads the datasets used in the evaluation; a modified version of the one provided in the SentEval toolkit.
        - tokenizer.vec
    - embedding/
        - fasttext/get_fasttext_embeddings.bash: script that downloads the FastText word vector set used.
        - glove/
            - 2word2vec.py: converts the GloVe vector set to Word2Vec format.
            - get_glove_embeddings.bash: script that downloads the GloVe word vector set used.
        - word2vec/get_word2vec_embeddings.bash: script that downloads the Word2Vec word vector set used.
    - frequencies.tsv
- evaluation.ipynb: Jupyter Notebook in which the evaluation is carried out.
- load.py: functions to load and preprocess the datasets used. The code is based on what can be found in SentEval.

To run the evaluation code contained in the Jupyter Notebook file evaluation.ipynb, follow these steps:
First, install Python 3.7 and its virtual environment module:
sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.7
sudo apt install python3.7-venv
Second, create a Python 3.7 virtual environment inside this repository:
python3.7 -m venv .venv
and activate it:
source .venv/bin/activate
Once the virtual environment is activated, install the dependencies using the following command:
pip install -r requirements.txt
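Before continuing, it can help to confirm that the commands above took effect. The following is a minimal sketch, not part of this repository; the function names are hypothetical. It relies on the fact that inside a venv `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def is_python_37() -> bool:
    """Return True when the running interpreter is Python 3.7.x."""
    return sys.version_info[:2] == (3, 7)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("Python 3.7 interpreter:", is_python_37())
```

If either line prints False, re-run the activation command before installing the dependencies.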
Note that in order to reproduce the evaluation contained in the evaluation.ipynb file, you must first download the Word2Vec, GloVe and FastText word vector sets. Each of these sets is of considerable size and may take several minutes to download.
To download the Word2Vec vectors, make this repository (semantic_similarity/) the current directory and run the following commands:
cd data/embedding/word2vec
chmod +x get_word2vec_embeddings.bash
./get_word2vec_embeddings.bash
To download the GloVe vectors and convert them to Word2Vec format, make this repository (semantic_similarity/) the current directory and run the following commands:
cd data/embedding/glove
chmod +x get_glove_embeddings.bash
./get_glove_embeddings.bash
python 2word2vec.py
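For readers curious about what such a conversion involves, here is a minimal sketch. It is not the code of 2word2vec.py: the function name and paths are hypothetical, and it assumes the GloVe file is plain text with one token and its vector per line. The Word2Vec text format holds the same lines preceded by a header giving the vocabulary size and the vector dimension.

```python
def glove_to_word2vec(glove_path: str, output_path: str) -> tuple:
    """Copy a GloVe text file, prepending the '<count> <dimension>' header.

    Returns the (count, dimension) pair written to the header.
    """
    count = 0
    dimension = 0
    # First pass: count the entries and read the dimension from the first line.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if count == 0:
                # A line is the token followed by its space-separated components.
                dimension = len(line.rstrip().split(" ")) - 1
            count += 1
    # Second pass: write the header, then stream the original lines unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dimension}\n")
        for line in src:
            dst.write(line)
    return count, dimension
```

The two passes avoid holding the whole file in memory, which matters for vector sets of this size. gensim 3.x also ships a ready-made converter in `gensim.scripts.glove2word2vec`.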
To download the FastText vectors, make this repository (semantic_similarity/) the current directory and run the following commands:
cd data/embedding/fasttext
chmod +x get_fasttext_embeddings.bash
./get_fasttext_embeddings.bash
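Once a vector set is available in the Word2Vec text format (a header line with the vocabulary size and the dimension, then one token per line), it can be inspected with a few lines of Python. The sketch below is illustrative only: the function name is hypothetical, and whether a given downloaded file uses the text format rather than a binary one is an assumption to verify. In practice, gensim's `KeyedVectors.load_word2vec_format` is the usual loader.

```python
def read_word2vec_text(path: str, limit=None) -> dict:
    """Read up to `limit` vectors from a Word2Vec text-format file."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        # Header line: "<vocabulary size> <dimension>".
        _count, dimension = (int(field) for field in handle.readline().split())
        for line in handle:
            if limit is not None and len(vectors) >= limit:
                break
            # rstrip() also drops a trailing space before the newline, if any.
            parts = line.rstrip().split(" ")
            if len(parts) != dimension + 1:
                raise ValueError(
                    f"unexpected field count on line starting {parts[0]!r}"
                )
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors
```

Passing a small `limit` is a quick way to check a file without loading the whole vocabulary.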
The datasets must also be downloaded. To do so, make this repository (semantic_similarity/) the current directory and run the following commands:
cd data/datasets
chmod +x get_datasets.bash
./get_datasets.bash
Then start Jupyter Notebook with the following command and open the evaluation.ipynb file:
jupyter-notebook
Once you have finished using Jupyter Notebook, press Ctrl + C in the terminal where you ran the previous command to stop it. Finally, deactivate the virtual environment with the following command:
deactivate
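As a rough illustration of what measuring the semantic similarity between two sentences with word vectors can look like, the sketch below averages the vectors of each sentence and compares the averages with cosine similarity. This is a common baseline, shown under stated assumptions; it is not necessarily one of the methods evaluated in evaluation.ipynb. The function names and the toy two-dimensional vectors are invented for the example.

```python
import math


def average_vector(tokens, vectors):
    """Mean of the vectors of the tokens present in `vectors`; None if none match."""
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return None
    dimension = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dimension)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sentence_similarity(sentence_a, sentence_b, vectors):
    """Cosine similarity between the averaged word vectors of two sentences."""
    vec_a = average_vector(sentence_a.lower().split(), vectors)
    vec_b = average_vector(sentence_b.lower().split(), vectors)
    if vec_a is None or vec_b is None:
        return 0.0  # no known word in at least one sentence
    return cosine_similarity(vec_a, vec_b)


# Toy vectors, invented for illustration only (real sets have hundreds of dimensions).
TOY_VECTORS = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
}

if __name__ == "__main__":
    print(sentence_similarity("a cat", "a dog", TOY_VECTORS))
    print(sentence_similarity("a cat", "a car", TOY_VECTORS))
```

Words missing from the vector set (here "a") are simply skipped; with real pre-trained vectors, the `TOY_VECTORS` dictionary would be replaced by the loaded embedding set.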
gensim==3.8.2
jupyter==1.0.0
notebook==6.0.3
numpy==1.18.3
Orange3==3.25.0
pandas==1.0.3
sklearn==0.0
spacy==2.2.4