Syma - RecSys 2022 Challenge

Objective

The objective of this project is to improve a recommendation system, applied to the Dressipi challenge. The goal was to obtain the best possible score on the online challenge and to try different approaches to reach a good score. The notebook goes through our analysis of the data, then the processing steps and alterations we applied to produce our final dataset, and finally the machine learning techniques we used to generate our submissions.

Dataset

For this challenge we had access to multiple datasets that gave us a lot of information about customer behaviour.

There were 5 datasets:

candidate_items.csv: contains all the items available
item_features.csv: contains all the features of each item
train_purchases.csv: contains all the purchases that occurred at the end of a session.
train_sessions.csv: contains all the items viewed in a session for each session_id
test_leaderboard_sessions: contains the input sessions for the leaderboard
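
As an illustration of how the session and purchase files relate, here is a minimal sketch that groups the viewed items by session and attaches the item purchased at the end of that session. The column names `session_id` and `item_id`, the helper `load_sessions`, and the tiny in-memory CSV contents are assumptions made for this example; the notebook may load the data differently.

```python
import csv
import io
from collections import defaultdict


def load_sessions(sessions_file, purchases_file):
    """Group viewed items by session and attach the purchased item.

    Both arguments are open text files (or file-like objects) in CSV format.
    The column names "session_id" and "item_id" are assumed, not confirmed.
    """
    views = defaultdict(list)
    for row in csv.DictReader(sessions_file):
        views[row["session_id"]].append(row["item_id"])

    purchases = {
        row["session_id"]: row["item_id"]
        for row in csv.DictReader(purchases_file)
    }

    # session_id -> (list of viewed items, purchased item or None)
    return {sid: (items, purchases.get(sid)) for sid, items in views.items()}


# Tiny in-memory stand-ins for train_sessions.csv and train_purchases.csv
sessions_csv = io.StringIO("session_id,item_id\n1,10\n1,11\n2,12\n")
purchases_csv = io.StringIO("session_id,item_id\n1,15\n2,16\n")

sessions = load_sessions(sessions_csv, purchases_csv)
```

With the real files, the two `io.StringIO` objects would be replaced by `open("train_sessions.csv")` and `open("train_purchases.csv")`.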

Data pre-processing

After this analysis, we had to pre-process the data in order to use it in our recommendation system. The items are described by 73 feature categories. Although 73 categories is not a huge number, it is still large enough that we had to reduce the dimensionality of our data before applying ML algorithms.

We also analyzed how the sessions are represented in the train_sessions dataset. Here is one of the example plots that we produced:

We used a truncated SVD to reduce our sparse item matrix to 12 components. The resulting matrix gave us an embedding for each item and let us find similar items more easily and quickly: we only had to compare the values of their components in the matrix.
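
The reduction step can be sketched as follows. This is a minimal illustration rather than the notebook's code. It uses a dense NumPy SVD on a small random matrix; for a large sparse matrix, scikit-learn's `TruncatedSVD` would be the more usual choice. The matrix shape, the random data, the helper names, and the use of cosine similarity to compare components are all assumptions.

```python
import numpy as np


def truncated_svd(matrix, n_components):
    """Project the rows of `matrix` onto its top `n_components` singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def most_similar(embeddings, item_index, top_k=5):
    """Return the indices of the `top_k` items closest to `item_index` (cosine similarity)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    scores = unit @ unit[item_index]
    scores[item_index] = -np.inf  # exclude the item itself
    return np.argsort(-scores)[:top_k]


rng = np.random.default_rng(0)
# Hypothetical binary item-by-feature matrix: 200 items, 73 feature columns
item_features = (rng.random((200, 73)) < 0.1).astype(float)

embeddings = truncated_svd(item_features, n_components=12)
neighbours = most_similar(embeddings, item_index=0, top_k=5)
```

Each row of `embeddings` is a 12-component vector for one item, so comparing two items only requires comparing two short vectors instead of their full feature rows.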

Machine Learning

For the machine learning part, we used two different models to generate our submission. The first one is a logistic regression and the second one is a simple RNN (Recurrent Neural Network).

When evaluated on a test dataset, the logistic regression gave a decent accuracy of 79.99%, whereas the RNN gave a slightly better accuracy of 80.91%.

To improve our results, we thought that removing some features could help, so we kept only the 15 most used feature categories and reran our experiments. This gave an accuracy of 80.00% with the logistic regression and 80.92% with the RNN. We concluded that this does not really improve our predictions: the difference is small enough that it could come from the randomly generated data sample.
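
Selecting the most used feature categories can be sketched as below. This is an illustration under stated assumptions, not the notebook's implementation: rows are taken to be `(item_id, feature_category_id, feature_value_id)` tuples, the helper names are hypothetical, and "most used" is interpreted as "appearing in the most rows". The sample data is made up and uses `k=2` to keep it short; the experiment described above would use `k=15`.

```python
from collections import Counter


def top_feature_categories(rows, k=15):
    """Return the set of the `k` feature categories that appear in the most rows."""
    counts = Counter(category for _, category, _ in rows)
    return {category for category, _ in counts.most_common(k)}


def filter_rows(rows, keep):
    """Keep only the rows whose feature category is in `keep`."""
    return [row for row in rows if row[1] in keep]


# Made-up (item_id, feature_category_id, feature_value_id) rows
rows = [
    (1, 56, 365), (1, 62, 801),
    (2, 56, 365), (2, 62, 801),
    (3, 56, 12), (3, 7, 394),
]

keep = top_feature_categories(rows, k=2)
filtered = filter_rows(rows, keep)
```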
