- How to find the best recommendations for the user based on their likes and dislikes
- Dataset: Amazon Grocery and Gourmet Food reviews, plus the product metadata
- The Amazon dataset comes with many interesting features; we used ASIN, ReviewerID, Rating, Category, Title, and Description.
- Average number of reviews per product: 27.69
- Average number of reviews per reviewer: ~9
- 5 stars is the most common rating given.
- The user-item matrix has a sparsity of 0.794.
- The matrix has dimensions 127496 × 41320, roughly 5.27 billion cells, so a dense copy would need about 42 GB (assuming 8 bytes per cell) and does not fit in RAM.
- Step 1: Drop columns that have no impact on the rating predictions: image, reviewername, summary, etc.
- Step 2: Keep only the rows for which the reviewer is verified.
- Step 3: Remove reviewers who have reviewed fewer than 20 products.
- Step 4: Group by reviewer so that each unique reviewer maps to the products they reviewed and the ratings they gave.
- Step 5: Use a train-test split to divide the processed data 80-20.
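
The five preprocessing steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the field names `reviewerID`, `asin`, `overall`, and `verified` are assumed to match the raw review records, and the `preprocess` helper, its parameters, and the fixed shuffle seed are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Assumed field names for reviewer, product, and rating.
KEEP = ("reviewerID", "asin", "overall")


def preprocess(reviews, min_reviews=20, test_fraction=0.2, seed=0):
    """Hypothetical helper that applies Steps 1-5 to a list of review dicts."""
    # Steps 1 and 2: keep verified rows and only the columns used for prediction.
    rows = [{k: r[k] for k in KEEP} for r in reviews if r.get("verified")]

    # Step 3: drop reviewers with fewer than `min_reviews` remaining reviews.
    counts = Counter(r["reviewerID"] for r in rows)
    rows = [r for r in rows if counts[r["reviewerID"]] >= min_reviews]

    # Step 4: map each reviewer to their (product, rating) pairs.
    by_reviewer = defaultdict(list)
    for r in rows:
        by_reviewer[r["reviewerID"]].append((r["asin"], r["overall"]))

    # Step 5: shuffle, then split the rows 80-20 into train and test.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return dict(by_reviewer), shuffled[:cut], shuffled[cut:]
```

Filtering before grouping keeps the grouped structure small; with a dataframe library the same steps would typically be a column selection, a boolean mask, a group-by count, and a library-provided split.
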
We explored the following approaches for rating prediction and compared the performance of the algorithms:
- Baseline
- Singular Value Decomposition (SVD)
- k Nearest Neighbors (kNN)
- Slope One
- Matrix Factorization
K Nearest Neighbors: uses feature similarity to predict new data points.

User-based CF:
- Tries to identify users with the most similar 'interaction profile'.
- Suggests items that are the most popular among these neighbors.

Item-based CF:
- Finds items like the ones the user has already interacted with 'positively'.
- Suggests items that most users interact with, e.g. milk and eggs in the grocery dataset.
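
As a rough illustration of the user-based variant, the sketch below predicts a rating as the similarity-weighted average of the ratings given by the k most similar users. The similarity measure (cosine over co-rated items), the weighting scheme, and the names `cosine_similarity` and `predict_rating` are assumptions made for this example; the source does not say which ones the project used.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two users, computed over co-rated items only."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item, k=2):
    """Predict `user`'s rating of `item` from the k most similar users who rated it.

    `ratings` maps user -> {item: rating}. Returns None if no neighbor qualifies.
    """
    candidates = [
        (cosine_similarity(ratings[user], other), other[item])
        for name, other in ratings.items()
        if name != user and item in other
    ]
    # Keep the k most similar neighbors with a positive similarity.
    neighbors = sorted(candidates, key=lambda t: t[0], reverse=True)[:k]
    neighbors = [(s, r) for s, r in neighbors if s > 0]
    if not neighbors:
        return None
    return sum(s * r for s, r in neighbors) / sum(s for s, _ in neighbors)
```

The item-based variant follows the same pattern with the roles of users and items swapped: similarities are computed between item rating vectors instead of user rating vectors.
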
Slope One (Weighted): additional information used in Slope One:
- Ratings by other users who have rated an item in common with the target user
- Ratings of other items by the target user
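
A minimal sketch of Weighted Slope One, assuming ratings are stored as a dict mapping each user to an `{item: rating}` dict. The function names `fit_slope_one` and `predict_slope_one` are hypothetical. For every item pair the fit step accumulates the rating differences and the number of users who rated both items; the prediction then weights each item's average deviation by that co-rating count.

```python
from collections import defaultdict


def fit_slope_one(ratings):
    """Accumulate rating differences and co-rating counts for every item pair."""
    diff = defaultdict(lambda: defaultdict(float))
    count = defaultdict(lambda: defaultdict(int))
    for user_ratings in ratings.values():
        for j, r_j in user_ratings.items():
            for i, r_i in user_ratings.items():
                if i != j:
                    diff[j][i] += r_j - r_i
                    count[j][i] += 1
    return diff, count


def predict_slope_one(ratings, diff, count, user, item):
    """Weighted Slope One prediction; returns None if no co-rated item exists."""
    num = den = 0.0
    for i, r_i in ratings[user].items():
        c = count.get(item, {}).get(i, 0)
        if i != item and c:
            dev = diff[item][i] / c    # average deviation of `item` over `i`
            num += (dev + r_i) * c     # weight by the number of co-raters
            den += c
    return num / den if den else None
```

For example, with ratings `{"john": {"A": 5, "B": 3, "C": 2}, "mark": {"A": 3, "B": 4}, "lucy": {"B": 2, "C": 5}}`, the prediction for lucy on item A combines the deviation A-B (two co-raters) and A-C (one co-rater).
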
Latent Factor Model (SVD):
- Users and items are mapped to a latent factor space.
- q_i: item-concept mapping for item i
- p_u: user-concept mapping for user u
- Funk SVD was used for the minimization, with learning rate = 0.009 and regularization constant = 0.05.
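
The sketch below shows one way Funk SVD style training could look, using the learning rate and regularization constant quoted above. It assumes the standard regularized squared-error objective, in which a rating is predicted as the dot product of q_i and p_u, and it omits bias terms. The number of factors, the number of epochs, the initialization, and the function names `funk_svd` and `rmse` are assumptions for illustration only.

```python
import random


def funk_svd(train, n_factors=2, lr=0.009, reg=0.05, n_epochs=300, seed=0):
    """Learn user factors p[u] and item factors q[i] by stochastic gradient descent.

    `train` is a list of (user, item, rating) triples.
    """
    rng = random.Random(seed)
    p, q = {}, {}
    for u, i, _ in train:
        if u not in p:
            p[u] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
        if i not in q:
            q[i] = [rng.gauss(0, 0.1) for _ in range(n_factors)]

    for _ in range(n_epochs):
        for u, i, r in train:
            pu, qi = p[u], q[i]
            err = r - sum(a * b for a, b in zip(pu, qi))
            for f in range(n_factors):
                pu_f, qi_f = pu[f], qi[f]
                # Gradient step on the squared error plus L2 regularization.
                pu[f] += lr * (err * qi_f - reg * pu_f)
                qi[f] += lr * (err * pu_f - reg * qi_f)
    return p, q


def rmse(data, p, q):
    """Root mean squared error of the factor model on (user, item, rating) triples."""
    sq = [(r - sum(a * b for a, b in zip(p[u], q[i]))) ** 2 for u, i, r in data]
    return (sum(sq) / len(sq)) ** 0.5
```

Only observed ratings are visited during training, so the full user-item matrix never has to be materialized.
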
- Step 1: Create a prediction dataframe from the algorithm's output, consisting of reviewerid, productid, and the predicted rating value.
- Step 2: Create a nested list such that each reviewer has a list of (productid, predicted_rating) tuples.
- Step 3: Sort each reviewer's list by predicted rating in descending order and keep the top 10 values.
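
The three top-N steps above can be sketched as follows, assuming the predictions are available as (reviewerid, productid, predicted_rating) triples. The function name `top_n` and the plain-tuple input format are hypothetical stand-ins for the prediction dataframe.

```python
from collections import defaultdict


def top_n(predictions, n=10):
    """Return, for each reviewer, the n products with the highest predicted rating."""
    per_reviewer = defaultdict(list)
    # Step 2: collect (productid, predicted_rating) tuples per reviewer.
    for reviewer_id, product_id, predicted_rating in predictions:
        per_reviewer[reviewer_id].append((product_id, predicted_rating))
    # Step 3: sort by predicted rating, highest first, and keep the top n.
    return {
        reviewer_id: sorted(items, key=lambda t: t[1], reverse=True)[:n]
        for reviewer_id, items in per_reviewer.items()
    }
```

In practice the predictions would usually be restricted to products the reviewer has not rated yet, so that already purchased items are not recommended again.
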
- Incorporate more features, such as price ranges and seasonality of products.
- Study the effect of NLP and of including implicit feedback.
- Fairness and unbiasedness of recommender systems: provide users with feedback on their recommendations, for example by showing the factors that contributed to a particular recommendation or by allowing users to flag recommendations that they do not think are relevant.
- Privacy protection for recommender systems: collect only the data that is necessary, give users control over their data, and anonymize the data.
The approaches and the results are described in detail in the attached report.