Wok edited this page Apr 4, 2019 · 21 revisions

An in-depth commentary is provided on the Commentary page. Overall, I would suggest matching store descriptions with:

  • either Term Frequency * Inverse Document Frequency (Tf-Idf),
  • or a weighted average of GloVe word embeddings, with Tf-Idf reweighting, after removing some components:
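
As a rough sketch (not the repository's actual code), Tf-Idf matching of store descriptions could look like the following with scikit-learn; the descriptions below are toy placeholders:

```python
# Illustrative Tf-Idf matching of store descriptions (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder store descriptions, not real Steam data
descriptions = [
    "A fast-paced roguelike dungeon crawler.",
    "A turn-based strategy game set in space.",
    "Explore procedurally generated dungeons in this roguelike.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(descriptions)

# Cosine similarity between the first description and every description
similarities = cosine_similarity(tfidf[0], tfidf).ravel()
```

Here the third description, which shares vocabulary with the first, would rank above the unrelated second one.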

When using average of word embeddings as sentence embeddings:

  • removing only sentence components provided a very large increase in the score (+105%),
  • removing only word components provided a large increase in the score (+51%),
  • removing both components provided a very large increase in the score (+108%),
  • relying on a weighted average instead of a simple average led to better results,
  • Tf-Idf reweighting led to better results than Smooth Inverse Frequency reweighting,
  • GloVe word embeddings led to better results than Word2Vec embeddings.
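
The sentence-component removal above can be sketched as follows, in the spirit of the common-component removal of Arora et al.'s SIF; the word vectors and weights here are random placeholders, not actual GloVe embeddings or Tf-Idf weights:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Placeholder word embeddings and per-word weights for one description
word_vectors = rng.standard_normal((20, dim))
weights = rng.random(20)

# Weighted average of word embeddings as the sentence embedding
sentence = (weights[:, None] * word_vectors).sum(axis=0) / weights.sum()

# Stack several sentence embeddings (random placeholders here)
sentence_vectors = rng.standard_normal((5, dim))
sentence_vectors[0] = sentence

# Remove the common sentence component: project out the first
# singular vector of the sentence-embedding matrix
_, _, vt = np.linalg.svd(sentence_vectors, full_matrices=False)
u = vt[0]  # first principal direction, unit norm
cleaned = sentence_vectors - np.outer(sentence_vectors @ u, u)
```

After this step, every sentence embedding is orthogonal to the removed direction.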

A table with scores for each major experiment is available. For each game series, the score is the number of games from this series that are found among the top 10 most similar games (excluding the query). The higher the score, the better the retrieval.

Results can be accessed from the following links:

  • Google's Universal Sentence Encoder
  • Baselines:
      • Term Frequency * Inverse Document Frequency (Tf-Idf)
      • Latent Semantic Indexing (LSI/LSA)
      • Random Projections (RP)
      • Latent Dirichlet Allocation (LDA)
      • Hierarchical Dirichlet Process (HDP)
      • Doc2Vec
      • AppIDs
      • AppIDs and categories
      • AppIDs and genres
      • AppIDs, categories and genres
  • Weighted average of word embeddings:
      • GloVe:
          • Main results
          • Removing sentence components
          • Removing word components
          • Removing both sentence and word components
          • Tweaks
      • Word2Vec:
          • Main results
          • Cosine: removing sentence components
          • Minkowski: removing sentence components