Commit 022f273 (initial commit)

AjNavneet committed Oct 25, 2023
Showing 18 changed files with 3,764 additions and 0 deletions.
Binary file added input/Rec_sys_data.xlsx
Binary file not shown.
2 changes: 2 additions & 0 deletions input/config.ini
@@ -0,0 +1,2 @@
[DATA]
data_dir =..\\input\\Rec_sys_data.xlsx
3,349 changes: 3,349 additions & 0 deletions lib/Rec_Sys_Hybrid.ipynb

Large diffs are not rendered by default.

Binary file added lib/Resources/Rec sys - hybrid - ref.pdf
Binary file not shown.
Binary file added output/final_model.pkl
Binary file not shown.
111 changes: 111 additions & 0 deletions readme.md
@@ -0,0 +1,111 @@
# Hybrid Recommender System using LightFM


### Business Objective

Recommender systems suggest relevant items to users, and there are two main methods for making these suggestions: content-based and collaborative filtering. Collaborative filtering finds similarities between users to make recommendations, while content-based filtering personalizes content for each user based on their previous actions and feedback.

However, both methods struggle when there is not enough interaction data (the cold-start problem). To address this, we'll explore a hybrid recommendation system, which combines both approaches.
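
To make the combination concrete: LightFM represents users and items as sums of their feature embeddings, so a single model can consume both an interaction matrix (collaborative signal) and feature matrices (content signal). A minimal sketch with toy data — shapes and values are illustrative, not the project's actual pipeline:

```python
# Minimal LightFM hybrid sketch (toy data; shapes and values are illustrative)
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

n_users, n_items, n_features = 5, 4, 3
rng = np.random.default_rng(0)
interactions = coo_matrix(rng.integers(0, 2, (n_users, n_items)))  # collaborative signal
item_features = coo_matrix(rng.random((n_items, n_features)))      # content signal

model = LightFM(loss="warp")
model.fit(interactions, item_features=item_features, epochs=5)
scores = model.predict(user_ids=0, item_ids=np.arange(n_items), item_features=item_features)
```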

---

### Data Description

The dataset used in this project contains transactional data for a UK-based online retail company that sells unique gifts for various occasions.

---

### Aim

Our goal is to build a Hybrid Recommendation system using different loss functions with the LightFM library.

---

### Tech Stack

- Language: `Python`
- Libraries: `pandas`, `numpy`, `scipy`, `lightfm`

---

## Approach

1. **Import required libraries**
2. **Read and merge the data**
3. **Prepare the data**
4. **Split the data into training and testing sets** (see the sketch after this list)
5. **Build models**
- Model with WARP loss function
- Model with logistic loss function
- Model with BPR loss function
6. **Combine data for the final model**
7. **Generate recommendations**
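
For step 4, this project splits the ratings dataframe with scikit-learn's `train_test_split` (see `src/engine.py`). LightFM also ships a helper that splits an already-built interactions matrix directly; a small sketch with a toy matrix standing in for the real one:

```python
# Alternative: split the interaction matrix itself with LightFM's helper
import numpy as np
from scipy.sparse import coo_matrix
from lightfm.cross_validation import random_train_test_split

# Toy interaction matrix standing in for the user-to-product interactions
interactions = coo_matrix(np.random.default_rng(0).integers(0, 3, (6, 5)))
train, test = random_train_test_split(interactions,
                                      test_percentage=0.33,
                                      random_state=np.random.RandomState(42))
```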

---

## Modular Code

1. **input**: Contains the data we'll use for analysis (`Rec_sys_data.xlsx`) and the `config.ini` configuration file.
2. **src**: This folder holds all the code for our project, organized in a modular manner. It includes:
   - **ml_pipeline**
   - **engine.py**

   The `ml_pipeline` folder contains functions organized across different Python files, which are called from `engine.py`. The `config.ini` file in the input folder stores the variables used by `engine.py`.

3. **output**: Contains our final models saved in pickle format.
4. **lib**: This is a reference folder that includes the original IPython notebook and the PowerPoint presentation used during the explanation.
5. **requirements.txt**: Lists all the required libraries with their respective versions. Install these libraries using the command `pip install -r requirements.txt`.
6. Instructions for running the code are in the `readme.md` file.

---

## Key Concepts Explored

1. Representations in recommender systems
2. Hybrid recommender systems
3. Evaluation metrics for recommender systems
4. The LightFM framework
5. Bayesian Personalized Ranking (BPR) loss
6. Weighted Approximate-Rank Pairwise (WARP) loss (standard formulations of both losses follow below)
7. Preparing data in the format LightFM expects
8. Building a hybrid recommendation model with different loss functions
9. Building a recommendation system with the LightFM library
10. Generating recommendations from the final model
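
For reference, the two ranking losses are commonly formulated as follows (standard forms from the literature; LightFM's implementation may differ in details). BPR maximizes the log-likelihood that an observed item $i$ is ranked above an unobserved item $j$ for user $u$:

$$\max_{\Theta} \sum_{(u,i,j)} \ln \sigma\big(\hat{x}_{ui} - \hat{x}_{uj}\big) - \lambda_{\Theta} \lVert \Theta \rVert^2$$

WARP instead weights a pairwise hinge loss by how hard it was to find a violating negative item, roughly $\Phi(\mathrm{rank}(i)) \cdot \max\big(0,\, 1 - \hat{x}_{ui} + \hat{x}_{uj}\big)$ with $\Phi(k) = \sum_{s=1}^{k} 1/s$, which pushes positive items toward the very top of the ranking.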


---

## Getting Started

### Install all the requirements

- `pip install -r requirements.txt`

#### Run `engine.py` to execute the code

- Run `python engine.py` from inside the `src` folder (the script reads `config.ini` via a path relative to `src`).

---


### Note

In case you face issues while installing the `lightfm` package, try the following two methods:

1. In your terminal window (e.g., inside VS Code), perform the following:

   - Upgrade pip: `python -m pip install --upgrade pip`
   - Upgrade wheel: `pip install --upgrade wheel`
   - Upgrade setuptools: `pip install --upgrade setuptools`
   - Close the terminal and try installing the package again.

2. If you face the `error: Microsoft Visual C++` build error, download and install the Microsoft C++ Build Tools:

   - https://visualstudio.microsoft.com/visual-cpp-build-tools/

---

5 changes: 5 additions & 0 deletions requirements.txt
@@ -0,0 +1,5 @@
numpy==1.19.5
pandas==1.3.5
scikit_learn==0.24.1
scipy==1.6.2
lightfm==1.16
80 changes: 80 additions & 0 deletions src/engine.py
@@ -0,0 +1,80 @@
# Importing basic libraries
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split
from ml_pipeline.utils import read_data, merge_dataset, interactions
from ml_pipeline.preprocessing import unique_users, unique_items, features_to_add, mapping
from ml_pipeline.model import hybrid_model, evaluate_model
from ml_pipeline.train_test_merge import train_test_merge
from ml_pipeline.recommendations import get_recommendations
import configparser

# Read configuration from config.ini file
config = configparser.RawConfigParser()
config.read('..\\input\\config.ini')
DATA_DIR = config.get('DATA', 'data_dir')

# Reading data
order = read_data(DATA_DIR, 'order')
customer = read_data(DATA_DIR, 'customer')
product = read_data(DATA_DIR, 'product')

# Merge the datasets
full_table = merge_dataset(order, customer, 'CustomerID', 'CustomerID', 'left')
full_table = merge_dataset(full_table, product, 'StockCode', 'StockCode', 'left')

### Transforming data into the required format ###

# Create user, item, feature lists
users = unique_users(order, "CustomerID")
items = unique_items(product, "Product Name")
features = features_to_add(customer, 'Customer Segment', "Age", "Gender")

# Generate mappings for LightFM library
user_to_index_mapping, index_to_user_mapping, \
item_to_index_mapping, index_to_item_mapping, \
feature_to_index_mapping, index_to_feature_mapping = mapping(users, items, features)

user_to_product_rating_train = full_table[['CustomerID', 'Product Name', 'Quantity']]
product_to_feature = full_table[['Product Name', 'Customer Segment', 'Quantity']]
user_to_product_rating_train = user_to_product_rating_train.groupby(['CustomerID', 'Product Name']).agg({'Quantity': 'sum'}).reset_index()

# Train-test split
user_to_product_rating_train, user_to_product_rating_test = train_test_split(user_to_product_rating_train, test_size=0.33, random_state=42)
product_to_feature = product_to_feature.groupby(['Product Name', 'Customer Segment']).agg({'Quantity': 'sum'}).reset_index()

# Generate user_item_interaction_matrix for train data
user_to_product_interaction_train = interactions(user_to_product_rating_train, "CustomerID",
"Product Name", "Quantity", user_to_index_mapping, item_to_index_mapping)

# Generate item_to_feature interaction
product_to_feature_interaction = interactions(product_to_feature, "Product Name", "Customer Segment", "Quantity",
item_to_index_mapping, feature_to_index_mapping)

# Generate user_item_interaction_matrix for test data
user_to_product_interaction_test = interactions(user_to_product_rating_test, "CustomerID",
"Product Name", "Quantity", user_to_index_mapping, item_to_index_mapping)

## To run individual models (with train-test data)
## Select one of the three models and evaluate the results
'''
# Model building
model_with_features = hybrid_model("logistic", user_to_product_interaction_train, product_to_feature_interaction)
## Evaluate the model
evaluate = evaluate_model(model_with_features, user_to_product_interaction_test, user_to_product_interaction_train, product_to_feature_interaction)
'''

# Merge the train and test data for final model building
user_to_product_interaction = train_test_merge(user_to_product_interaction_train,
user_to_product_interaction_test)

## Build the final model ##
final_model = hybrid_model("logistic", user_to_product_interaction, product_to_feature_interaction)

## Save the model ##
pickle.dump(final_model, open('../output/final_model.pkl', 'wb'))

## Get the recommendations ##
# get_recommendations prints the user's known positives and top-10 recommendations, and returns the latter
recommendation_1 = get_recommendations(final_model, 17017, items, user_to_product_interaction, user_to_index_mapping, product_to_feature_interaction)
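
Note that `src/ml_pipeline/utils.py` (providing `read_data`, `merge_dataset`, and `interactions`) appears in this commit only as a compiled `.pyc`, so its source is not shown. Below is a hypothetical sketch of `interactions`, inferred purely from its call sites in `engine.py` — all names and details here are assumptions, not the committed source:

```python
# Hypothetical reconstruction of ml_pipeline.utils.interactions (inferred, not the committed source)
from scipy.sparse import coo_matrix

def interactions(data, row_name, col_name, value_name, row_mapping, col_mapping):
    # Translate raw IDs into matrix indices via the supplied mappings
    rows = data[row_name].map(row_mapping).values
    cols = data[col_name].map(col_mapping).values
    values = data[value_name].values
    # Build a sparse (row x col) matrix; using the full mapping sizes keeps the
    # train and test matrices the same shape, as train_test_merge requires
    return coo_matrix((values, (rows, cols)), shape=(len(row_mapping), len(col_mapping)))
```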
Binary file added src/ml_pipeline/__pycache__/model.cpython-38.pyc
Binary file not shown.
Binary file added src/ml_pipeline/__pycache__/preprocessing.cpython-38.pyc
Binary file not shown.
Binary file added src/ml_pipeline/__pycache__/recommendations.cpython-38.pyc
Binary file not shown.
Binary file added src/ml_pipeline/__pycache__/train_test_merge.cpython-38.pyc
Binary file not shown.
Binary file added src/ml_pipeline/__pycache__/utils.cpython-38.pyc
Binary file not shown.
76 changes: 76 additions & 0 deletions src/ml_pipeline/model.py
@@ -0,0 +1,76 @@
# Import required libraries
import pandas as pd
import numpy as np
from lightfm import LightFM # LightFM for building hybrid recommendation models
from lightfm.evaluation import auc_score
import time

# Function to build a hybrid model with different loss functions
def hybrid_model(loss, interaction_train, product_interaction):
    if loss == 'warp':  # Loss function = WARP (Weighted Approximate-Rank Pairwise)
        model_with_features = LightFM(loss="warp")  # Initialize the LightFM model with WARP loss
        start = time.time()  # Record the start time

        model_with_features.fit_partial(interaction_train,
                                        user_features=None,
                                        item_features=product_interaction,
                                        sample_weight=None,
                                        epochs=1,
                                        num_threads=4,
                                        verbose=False)  # Fit the model on the training interactions

        end = time.time()  # Record the end time
        print("time taken for fitting = {0:.{1}f} seconds".format(end - start, 2))
        return model_with_features

    elif loss == 'logistic':  # Loss function = logistic
        model_with_features = LightFM(loss="logistic", no_components=30)  # Logistic loss with 30 latent components
        start = time.time()

        model_with_features.fit_partial(interaction_train,
                                        user_features=None,
                                        item_features=product_interaction,
                                        sample_weight=None,
                                        epochs=10,
                                        num_threads=20,
                                        verbose=False)  # Fit the model on the training interactions

        end = time.time()
        print("time taken for fitting = {0:.{1}f} seconds".format(end - start, 2))
        return model_with_features

    elif loss == 'bpr':  # Loss function = BPR (Bayesian Personalized Ranking)
        model_with_features = LightFM(loss="bpr")  # Initialize the LightFM model with BPR loss
        start = time.time()

        model_with_features.fit_partial(interaction_train,
                                        user_features=None,
                                        item_features=product_interaction,
                                        sample_weight=None,
                                        epochs=1,
                                        num_threads=4,
                                        verbose=False)  # Fit the model on the training interactions

        end = time.time()
        print("time taken for fitting = {0:.{1}f} seconds".format(end - start, 2))
        return model_with_features

    else:
        raise ValueError("Invalid loss function specified: choose 'warp', 'logistic', or 'bpr'")

# Function to evaluate the model with an AUC score
def evaluate_model(model, interaction_test, interaction_train, product_interaction):
    start = time.time()  # Record the start time

    auc_with_features = auc_score(model=model,
                                  test_interactions=interaction_test,
                                  train_interactions=interaction_train,
                                  item_features=product_interaction,
                                  num_threads=4,
                                  check_intersections=False)  # Calculate the per-user AUC scores

    end = time.time()  # Record the end time

    print("time taken for AUC score = {0:.{1}f} seconds".format(end - start, 2))

    return "average AUC with item-feature interaction = {0:.{1}f}".format(auc_with_features.mean(), 2)
43 changes: 43 additions & 0 deletions src/ml_pipeline/preprocessing.py
@@ -0,0 +1,43 @@
import pandas as pd
import numpy as np

# Function to create a list of unique users from a specified column in the data
def unique_users(data, column):
    return np.sort(data[column].unique())  # Return a sorted array of unique user IDs

# Function to create a list of unique products/items from a specified column in the data
def unique_items(data, column):
    item_list = data[column].unique()  # Get the unique items from the specified column
    return item_list

# Function to create a list of features by concatenating specified columns from the customer data
def features_to_add(customer, column1, column2, column3):
    customer1 = customer[column1]
    customer2 = customer[column2]
    customer3 = customer[column3]
    # Stack the three columns' values into one series and keep the unique entries
    combined_features = pd.concat([customer1, customer3, customer2], ignore_index=True).unique()
    return combined_features  # Return a unique array of feature values

# Function to create ID mappings for users, items, and features
def mapping(users, items, features):
    user_to_index_mapping = {}  # Map user IDs to matrix indices
    index_to_user_mapping = {}  # Map matrix indices back to user IDs
    for user_index, user_id in enumerate(users):
        user_to_index_mapping[user_id] = user_index
        index_to_user_mapping[user_index] = user_id

    item_to_index_mapping = {}  # Map item IDs to matrix indices
    index_to_item_mapping = {}  # Map matrix indices back to item IDs
    for item_index, item_id in enumerate(items):
        item_to_index_mapping[item_id] = item_index
        index_to_item_mapping[item_index] = item_id

    feature_to_index_mapping = {}  # Map feature IDs to matrix indices
    index_to_feature_mapping = {}  # Map matrix indices back to feature IDs
    for feature_index, feature_id in enumerate(features):
        feature_to_index_mapping[feature_id] = feature_index
        index_to_feature_mapping[feature_index] = feature_id

    return user_to_index_mapping, index_to_user_mapping, \
           item_to_index_mapping, index_to_item_mapping, \
           feature_to_index_mapping, index_to_feature_mapping
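
The manual bookkeeping in `mapping` mirrors what LightFM's own `Dataset` helper does internally; for comparison, a sketch of the equivalent using `lightfm.data`, assuming the `users`, `items`, and `features` iterables produced earlier in the pipeline:

```python
# Equivalent ID bookkeeping using LightFM's Dataset helper (sketch)
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(users=users, items=items, item_features=features)
# Forward mappings comparable to the ones built manually above
user_id_map, user_feature_map, item_id_map, item_feature_map = dataset.mapping()
```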
33 changes: 33 additions & 0 deletions src/ml_pipeline/recommendations.py
@@ -0,0 +1,33 @@
import numpy as np

# Function to get recommendations for a user from a trained recommendation model
def get_recommendations(model, user, items, user_to_product_interaction_matrix, user2index_map, product_to_feature_interaction_matrix):
    # Look up the user's matrix index
    user_index = user2index_map.get(user, None)

    # If the user doesn't exist in the mapping, return None
    if user_index is None:
        return None

    users = user_index

    # Get products already bought by the user
    known_positives = items[user_to_product_interaction_matrix.tocsr()[user_index].indices]
    print('User index =', users)

    # Predict scores for every item using the model
    scores = model.predict(user_ids=users, item_ids=np.arange(user_to_product_interaction_matrix.shape[1]), item_features=product_to_feature_interaction_matrix)

    # Rank items by descending predicted score
    top_items = items[np.argsort(-scores)]

    # Print out the results
    print("User %s" % user)
    print("     Known positives:")  # Items the user has already bought
    for x in known_positives[:10]:
        print("          %s" % x)

    print("     Recommended:")  # Items recommended to the user
    for x in top_items[:10]:
        print("          %s" % x)

    # Return the top-10 recommendations so callers can use them programmatically
    return top_items[:10]
34 changes: 34 additions & 0 deletions src/ml_pipeline/train_test_merge.py
@@ -0,0 +1,34 @@
import numpy as np
from scipy.sparse import coo_matrix

# Function to merge training and testing data into a single sparse matrix
def train_test_merge(training_data, testing_data):
    # Dictionary keyed by (row, col) holding the interaction values
    train_dict = {}

    # Populate the dictionary with the training data
    for row, col, data in zip(training_data.row, training_data.col, training_data.data):
        train_dict[(row, col)] = data

    # Merge in the testing data, keeping the larger value where cells overlap
    for row, col, data in zip(testing_data.row, testing_data.col, testing_data.data):
        train_dict[(row, col)] = max(data, train_dict.get((row, col), 0))

    # Unpack the dictionary into row indices, column indices, and data values
    row_list = []
    col_list = []
    data_list = []
    for row, col in train_dict:
        row_list.append(row)
        col_list.append(col)
        data_list.append(train_dict[(row, col)])

    # Convert the lists to numpy arrays
    row_list = np.array(row_list)
    col_list = np.array(col_list)
    data_list = np.array(data_list)

    # Build a coo_matrix (sparse matrix) with the merged data
    return coo_matrix((data_list, (row_list, col_list)), shape=(training_data.shape[0], training_data.shape[1]))
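
A quick illustration of the max-merge behavior with two toy matrices — the overlapping cell keeps the larger of the two values (assumes `train_test_merge` from this module is in scope):

```python
# Toy usage of train_test_merge
import numpy as np
from scipy.sparse import coo_matrix

train = coo_matrix(np.array([[0, 2], [1, 0]]))
test = coo_matrix(np.array([[0, 5], [0, 3]]))
merged = train_test_merge(train, test)
print(merged.toarray())  # -> [[0 5]
                         #     [1 3]]
```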
