Commit 022f273: 18 changed files with 3,764 additions and 0 deletions.
@@ -0,0 +1,2 @@
[DATA]
data_dir = ..\\input\\Rec_sys_data.xlsx
@@ -0,0 +1,111 @@

# Hybrid Recommender System using LightFM

### Business Objective

There are two main methods for generating product recommendations: content-based and collaborative filtering. Collaborative filtering finds similarities between users to make recommendations, while content-based filtering personalizes content for each user based on their previous actions and feedback.

However, both methods struggle when there is not enough data. To address this, we'll explore a Hybrid Recommendation System, which combines both approaches; a minimal sketch follows.
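To make the idea concrete, here is an illustrative sketch using toy matrices (not the project data): in LightFM, supplying an item-feature matrix alongside the user-item interactions is what turns a purely collaborative model into a hybrid one.

```python
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

# Toy data: 3 users x 4 items, plus a 4 x 2 item-feature matrix.
interactions = coo_matrix(np.array([[1, 0, 0, 1],
                                    [0, 1, 0, 0],
                                    [1, 0, 1, 0]]))
item_features = coo_matrix(np.array([[1, 0],
                                     [0, 1],
                                     [1, 0],
                                     [0, 1]]))

model = LightFM(loss="warp")
# With item_features the model is hybrid; drop the argument for pure collaborative filtering.
model.fit(interactions, item_features=item_features, epochs=10, num_threads=2)
```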

---

### Data Description

The dataset used in this project contains transactional data for a UK-based online retail company that sells unique gifts for various occasions.

---

### Aim

Our goal is to build a Hybrid Recommendation system using different loss functions with the LightFM library.

---

### Tech Stack

- Language: `Python`
- Libraries: `pandas`, `numpy`, `scipy`, `lightfm`

---
## Approach

1. **Import required libraries**
2. **Read and merge the data**
3. **Prepare the data**
4. **Split the data into training and testing sets**
5. **Build models** (a minimal loss-comparison sketch follows this list)
   - Model with WARP loss function
   - Model with logistic loss function
   - Model with BPR loss function
6. **Combine data for the final model**
7. **Generate recommendations**
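As a rough illustration of step 5, again with toy matrices rather than the project data, the three loss functions can be compared by fitting one LightFM model per loss and scoring each with AUC:

```python
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM
from lightfm.evaluation import auc_score

# Toy train/test interaction matrices standing in for the real splits.
train = coo_matrix(np.array([[1, 0, 0, 1],
                             [0, 1, 0, 0],
                             [1, 0, 1, 0]]))
test = coo_matrix(np.array([[0, 1, 0, 0],
                            [0, 0, 1, 0],
                            [0, 0, 0, 1]]))

for loss in ("warp", "logistic", "bpr"):
    model = LightFM(loss=loss, random_state=42)
    model.fit(train, epochs=10, num_threads=2)
    auc = auc_score(model, test, train_interactions=train, num_threads=2).mean()
    print(loss, "mean AUC =", round(auc, 3))
```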

---

## Modular Code

1. **input**: Contains the data we'll use for analysis, such as `data.xlsx`.
2. **src**: This folder holds all the code for our project, organized in a modular manner. It includes:
   - **ML_pipeline**
   - **engine.py**

   The `ML_pipeline` folder contains functions organized in different Python files, which are called from the `engine.py` file. There is also a `config.ini` file in the input folder, storing variables used in `engine.py`.

3. **output**: Contains our final models saved in pickle format (see the reload sketch after this list).
4. **lib**: This is a reference folder that includes the original IPython notebook and the PowerPoint presentation used during the explanation.
5. **requirements.txt**: Lists all the required libraries with their respective versions. Install these libraries using the command `pip install -r requirements.txt`.
6. Instructions for running the code are in the `readme.md` file.
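For example, the pickled model that `engine.py` writes to the output folder could later be reloaded along these lines (the file name matches `engine.py`; adjust the relative path to your working directory):

```python
import pickle

# Reload the LightFM model saved by engine.py (path relative to the repo root).
with open("output/final_model.pkl", "rb") as f:
    final_model = pickle.load(f)

print(type(final_model))
```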

---

## Key Concepts Explored

1. Representations
2. Hybrid recommender systems
3. Evaluation metrics for recommender systems
4. The LightFM framework
5. Bayesian Personalized Ranking (BPR) loss
6. Weighted Approximate-Rank Pairwise (WARP) loss
7. Preparing data suitable for LightFM (a small data-preparation sketch follows this list)
8. Hybrid recommendation models with different loss functions
9. Building a recommendation system with the LightFM library
10. Generating recommendations from the final model
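As a small sketch of the data-preparation idea (toy records with hypothetical IDs), LightFM expects sparse user-item interaction matrices built from ID-to-index mappings:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy purchase records: (customer ID, product name, quantity).
records = [("C1", "Mug", 2), ("C1", "Lamp", 1), ("C2", "Mug", 4)]

users = sorted({u for u, _, _ in records})
items = sorted({p for _, p, _ in records})
user_to_index = {u: i for i, u in enumerate(users)}
item_to_index = {p: i for i, p in enumerate(items)}

rows = [user_to_index[u] for u, _, _ in records]
cols = [item_to_index[p] for _, p, _ in records]
data = [q for _, _, q in records]

# Sparse user x item matrix in the COO format LightFM consumes.
interactions = coo_matrix((data, (rows, cols)), shape=(len(users), len(items)))
print(interactions.toarray())
```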

---

## Getting Started

### Install all the requirements

- `pip install -r requirements.txt`

#### Run the engine.py file to execute the code
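For example (assuming you launch it from inside the `src` folder, since `engine.py` reads the config via the relative path `..\\input\\config.ini`): `python engine.py`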

---

### Note

If you face issues while installing the `lightfm` package, try the following two methods:

1. In VS Code, run the following in your terminal window:

   - Upgrade pip: `python -m pip install --upgrade pip`
   - Upgrade wheel: `pip install --upgrade wheel`
   - Upgrade setuptools: `pip install --upgrade setuptools`
   - Close the terminal.
   - Try installing the package again.

2. If you see an error mentioning Microsoft Visual C++, download and install the build tools from:

   - https://visualstudio.microsoft.com/visual-cpp-build-tools/

---
@@ -0,0 +1,5 @@
numpy==1.19.5
pandas==1.3.5
scikit_learn==0.24.1
scipy==1.6.2
lightfm==1.16
@@ -0,0 +1,80 @@
# Importing basic libraries
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split
from ml_pipeline.utils import read_data, merge_dataset, interactions
from ml_pipeline.preprocessing import unique_users, unique_items, features_to_add, mapping
from ml_pipeline.model import hybrid_model, evaluate_model
from ml_pipeline.train_test_merge import train_test_merge
from ml_pipeline.recommendations import get_recommendations
import configparser

# Read configuration from config.ini file
config = configparser.RawConfigParser()
config.read('..\\input\\config.ini')
DATA_DIR = config.get('DATA', 'data_dir')

# Reading data
order = read_data(DATA_DIR, 'order')
customer = read_data(DATA_DIR, 'customer')
product = read_data(DATA_DIR, 'product')

# Merge the datasets
full_table = merge_dataset(order, customer, 'CustomerID', 'CustomerID', 'left')
full_table = merge_dataset(full_table, product, 'StockCode', 'StockCode', 'left')

### Transforming data into the required format ###

# Create user, item, feature lists
users = unique_users(order, "CustomerID")
items = unique_items(product, "Product Name")
features = features_to_add(customer, 'Customer Segment', "Age", "Gender")

# Generate mappings for LightFM library
user_to_index_mapping, index_to_user_mapping, \
    item_to_index_mapping, index_to_item_mapping, \
    feature_to_index_mapping, index_to_feature_mapping = mapping(users, items, features)

user_to_product_rating_train = full_table[['CustomerID', 'Product Name', 'Quantity']]
product_to_feature = full_table[['Product Name', 'Customer Segment', 'Quantity']]
user_to_product_rating_train = user_to_product_rating_train.groupby(['CustomerID', 'Product Name']).agg({'Quantity': 'sum'}).reset_index()

# Train-test split
user_to_product_rating_train, user_to_product_rating_test = train_test_split(user_to_product_rating_train, test_size=0.33, random_state=42)
product_to_feature = product_to_feature.groupby(['Product Name', 'Customer Segment']).agg({'Quantity': 'sum'}).reset_index()

# Generate user_item_interaction_matrix for train data
user_to_product_interaction_train = interactions(user_to_product_rating_train, "CustomerID",
                                                 "Product Name", "Quantity", user_to_index_mapping, item_to_index_mapping)

# Generate item_to_feature interaction
product_to_feature_interaction = interactions(product_to_feature, "Product Name", "Customer Segment", "Quantity",
                                              item_to_index_mapping, feature_to_index_mapping)

# Generate user_item_interaction_matrix for test data
user_to_product_interaction_test = interactions(user_to_product_rating_test, "CustomerID",
                                                "Product Name", "Quantity", user_to_index_mapping, item_to_index_mapping)

## To run individual models (with train-test data)
## Select one of the three models and evaluate the results
'''
# Model building
model_with_features = hybrid_model("logistic", user_to_product_interaction_train, product_to_feature_interaction)
## Evaluate the model
evaluate = evaluate_model(model_with_features, user_to_product_interaction_test, user_to_product_interaction_train, product_to_feature_interaction)
'''

# Merge the train and test data for final model building
user_to_product_interaction = train_test_merge(user_to_product_interaction_train,
                                               user_to_product_interaction_test)

## Build the final model ##
final_model = hybrid_model("logistic", user_to_product_interaction, product_to_feature_interaction)

## Save the model ##
pickle.dump(final_model, open('../output/final_model.pkl', 'wb'))

## Get the recommendations ##
recommendation_1 = get_recommendations(final_model, 17017, items, user_to_product_interaction, user_to_index_mapping, product_to_feature_interaction)
print(recommendation_1)
@@ -0,0 +1,76 @@
# Import required libraries
import pandas as pd
import numpy as np
from lightfm import LightFM  # LightFM for building hybrid recommendation models
from lightfm.evaluation import auc_score
import time

# Function to build a hybrid model with different loss functions
def hybrid_model(loss, interaction_train, product_interaction):
    if loss == 'warp':  # Loss function = WARP (Weighted Approximate-Rank Pairwise)
        model_with_features = LightFM(loss="warp")  # Initialize the LightFM model with WARP loss
        start = time.time()  # Record the start time

        model_with_features.fit_partial(interaction_train,
                                        user_features=None,
                                        item_features=product_interaction,
                                        sample_weight=None,
                                        epochs=1,
                                        num_threads=4,
                                        verbose=False)  # Fit the model with partial data

        end = time.time()  # Record the end time
        print("time taken for fitting = {0:.{1}f} seconds".format(end - start, 2))
        return model_with_features

    elif loss == 'logistic':  # Loss function = logistic
        model_with_features = LightFM(loss="logistic", no_components=30)  # Initialize the LightFM model with logistic loss and 30 components
        start = time.time()

        model_with_features.fit_partial(interaction_train,
                                        user_features=None,
                                        item_features=product_interaction,
                                        sample_weight=None,
                                        epochs=10,
                                        num_threads=20,
                                        verbose=False)  # Fit the model with partial data

        end = time.time()
        print("time taken for fitting = {0:.{1}f} seconds".format(end - start, 2))
        return model_with_features

    elif loss == 'bpr':  # Loss function = BPR (Bayesian Personalized Ranking)
        model_with_features = LightFM(loss="bpr")  # Initialize the LightFM model with BPR loss
        start = time.time()

        model_with_features.fit_partial(interaction_train,
                                        user_features=None,
                                        item_features=product_interaction,
                                        sample_weight=None,
                                        epochs=1,
                                        num_threads=4,
                                        verbose=False)  # Fit the model with partial data

        end = time.time()
        print("time taken for fitting = {0:.{1}f} seconds".format(end - start, 2))
        return model_with_features

    else:
        print("Invalid loss function specified")

# Function to evaluate the model with AUC score
def evaluate_model(model, interaction_test, interaction_train, product_interaction):
    start = time.time()  # Record the start time

    auc_with_features = auc_score(model=model,
                                  test_interactions=interaction_test,
                                  train_interactions=interaction_train,
                                  item_features=product_interaction,
                                  num_threads=4,
                                  check_intersections=False)  # Calculate AUC score

    end = time.time()  # Record the end time

    print("time taken for AUC score = {0:.{1}f} seconds".format(end - start, 2))

    return "average AUC with item-feature interaction = {0:.{1}f}".format(auc_with_features.mean(), 2)
@@ -0,0 +1,43 @@
import pandas as pd
import numpy as np

# Function to create a list of unique users from a specified column in the data
def unique_users(data, column):
    return np.sort(data[column].unique())  # Return a sorted array of unique user IDs

# Function to create a list of unique products/items from a specified column in the data
def unique_items(data, column):
    item_list = data[column].unique()  # Get the unique items from the specified column
    return item_list

# Function to create a list of features by concatenating specified columns from the customer data
def features_to_add(customer, column1, column2, column3):
    customer1 = customer[column1]
    customer2 = customer[column2]
    customer3 = customer[column3]
    combined_features = pd.concat([customer1, customer3, customer2], ignore_index=True).unique()
    return combined_features  # Return a unique list of concatenated features

# Function to create ID mappings for users, items, and features
def mapping(users, items, features):
    user_to_index_mapping = {}  # Initialize an empty dictionary to map user IDs to indices
    index_to_user_mapping = {}  # Initialize an empty dictionary to map indices to user IDs
    for user_index, user_id in enumerate(users):
        user_to_index_mapping[user_id] = user_index
        index_to_user_mapping[user_index] = user_id

    item_to_index_mapping = {}  # Initialize an empty dictionary to map item IDs to indices
    index_to_item_mapping = {}  # Initialize an empty dictionary to map indices to item IDs
    for item_index, item_id in enumerate(items):
        item_to_index_mapping[item_id] = item_index
        index_to_item_mapping[item_index] = item_id

    feature_to_index_mapping = {}  # Initialize an empty dictionary to map feature IDs to indices
    index_to_feature_mapping = {}  # Initialize an empty dictionary to map indices to feature IDs
    for feature_index, feature_id in enumerate(features):
        feature_to_index_mapping[feature_id] = feature_index
        index_to_feature_mapping[feature_index] = feature_id

    return user_to_index_mapping, index_to_user_mapping, \
        item_to_index_mapping, index_to_item_mapping, \
        feature_to_index_mapping, index_to_feature_mapping
@@ -0,0 +1,33 @@
import numpy as np

# Function to get recommendations for a user using a recommendation model
def get_recommendations(model, user, items, user_to_product_interaction_matrix, user2index_map, product_to_feature_interaction_matrix):
    # Get the user's index
    user_index = user2index_map.get(user, None)

    # If the user doesn't exist in the mapping, return None
    if user_index is None:
        return None

    # Retrieve the user's index
    users = user_index

    # Get products already bought by the user
    known_positives = items[user_to_product_interaction_matrix.tocsr()[user_index].indices]
    print('User index =', users)

    # Predict scores using the model
    scores = model.predict(user_ids=users, item_ids=np.arange(user_to_product_interaction_matrix.shape[1]), item_features=product_to_feature_interaction_matrix)

    # Get top recommended items based on scores
    top_items = items[np.argsort(-scores)]

    # Print out the results
    print("User %s" % user)
    print(" Known positives:")  # Already bought items
    for x in known_positives[:10]:
        print("   %s" % x)

    print(" Recommended:")  # Items recommended to the user
    for x in top_items[:10]:
        print("   %s" % x)
@@ -0,0 +1,34 @@
import numpy as np
from scipy.sparse import coo_matrix

# Function to merge training and testing data into a single sparse matrix
def train_test_merge(training_data, testing_data):
    # Initialize a dictionary to store training data
    train_dict = {}

    # Populate the dictionary with training data (row, col) as keys and data as values
    for row, col, data in zip(training_data.row, training_data.col, training_data.data):
        train_dict[(row, col)] = data

    # Replace training data with testing data if it's greater (max of data values)
    for row, col, data in zip(testing_data.row, testing_data.col, testing_data.data):
        train_dict[(row, col)] = max(data, train_dict.get((row, col), 0))

    # Initialize lists to store row indices, column indices, and data values
    row_list = []
    col_list = []
    data_list = []

    # Populate the lists with data from the dictionary
    for row, col in train_dict:
        row_list.append(row)
        col_list.append(col)
        data_list.append(train_dict[(row, col)])

    # Convert lists to numpy arrays
    row_list = np.array(row_list)
    col_list = np.array(col_list)
    data_list = np.array(data_list)

    # Create a coo_matrix (sparse matrix) with the merged data
    return coo_matrix((data_list, (row_list, col_list)), shape=(training_data.shape[0], training_data.shape[1]))