An in-depth commentary is provided on the Commentary page. Overall, I would suggest matching store descriptions with:
- either Term Frequency * Inverse Document Frequency (Tf-Idf), as sketched after this list,
- or a weighted average of GloVe word embeddings, with Tf-Idf reweighting, after removing some components:
- either only sentence components,
- or both sentence and word components (for marginally better results).
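As a minimal sketch of the Tf-Idf option, the snippet below builds Tf-Idf vectors for a handful of descriptions and ranks them by cosine similarity. It uses scikit-learn purely for illustration, and `store_descriptions` is a hypothetical placeholder rather than the project's actual data loading code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical input: one raw store description per game.
store_descriptions = [
    "A fast-paced roguelike dungeon crawler with permadeath.",
    "A story-driven dungeon crawler featuring roguelike elements.",
    "A relaxing farming and life simulation game.",
]

# Tf-Idf vectors for every description.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(store_descriptions)

# Pairwise cosine similarities: row i holds the similarity of game i to every game.
similarities = cosine_similarity(tfidf_matrix)

# Games most similar to the first description, excluding the query itself.
ranking = similarities[0].argsort()[::-1]
print([i for i in ranking if i != 0])
```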
When using an average of word embeddings as sentence embeddings (see the sketch after this list):
- removing only sentence components provided a very large increase in the score (+105%),
- removing only word components provided a large increase in the score (+51%),
- removing both kinds of components provided a very large increase in the score (+108%),
- relying on a weighted average instead of a simple average led to better results,
- Tf-Idf reweighting led to better results than Smooth Inverse Frequency reweighting,
- GloVe word embeddings led to better results than Word2Vec embeddings.
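The best-performing setup could be sketched roughly as follows: a Tf-Idf-weighted average of GloVe word vectors, followed by removal of the top sentence component(s) via a truncated SVD, in the spirit of common-component removal. The `word_vectors` lookup, the `idf` weights, the embedding dimension, and the number of removed components are assumptions for illustration, not the exact settings used in the experiments.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def sentence_embeddings(tokenized_docs, word_vectors, idf, dim=300, n_components=1):
    """Tf-Idf-weighted average of word vectors, then removal of the top
    sentence component(s) from the resulting sentence matrix."""
    embeddings = np.zeros((len(tokenized_docs), dim))
    for i, tokens in enumerate(tokenized_docs):
        total_weight = 0.0
        for token in tokens:
            if token in word_vectors and token in idf:
                embeddings[i] += idf[token] * word_vectors[token]
                total_weight += idf[token]
        if total_weight > 0:
            embeddings[i] /= total_weight

    # Remove the top principal component(s) of the sentence matrix
    # ("sentence components"); word components would instead be removed
    # from the word vectors themselves before averaging.
    svd = TruncatedSVD(n_components=n_components)
    svd.fit(embeddings)
    for component in svd.components_:
        embeddings -= embeddings @ np.outer(component, component)
    return embeddings
```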
A table with scores for each major experiment is available. For each game series, the score is the number of games from that series that are found among the top 10 most similar games (excluding the query). The higher the score, the better the retrieval.
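Assuming the pairwise similarities are stored in a NumPy matrix, the score for one query could be computed along these lines; `similarities` and `series_indices` are hypothetical placeholders for the project's actual data structures.

```python
def series_score(query_index, series_indices, similarities, top_k=10):
    """Count how many games from the query's series appear among the
    top_k most similar games, excluding the query itself."""
    ranked = similarities[query_index].argsort()[::-1]
    top_matches = [i for i in ranked if i != query_index][:top_k]
    return sum(1 for i in top_matches if i in series_indices)
```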
Results can be accessed from the following links:
- Corpus as Bag-of-Words (BoW)
- Corpus as BoW with unit vectors
- Corpus as BoW with soft cosine similarity
- Corpus as BoW with 100 topics
- Corpus as BoW with 200 topics
- Corpus as BoW with 200 topics and with unit vectors
- Corpus as Tf-Idf with 100 topics
- Cosine: remove 1 sentence component
- Cosine: remove 2 sentence components
- Cosine: remove 10 sentence components
- Subtract mean from word vectors
- Remove 1 word component
- Remove 2 word components
- Remove 10 word components