Commit 022f273 (initial commit)

AjNavneet committed Oct 25, 2023
Showing 18 changed files with 3,764 additions and 0 deletions.
Binary file added input/Rec_sys_data.xlsx
Binary file not shown.
2 changes: 2 additions & 0 deletions input/config.ini
@@ -0,0 +1,2 @@
[DATA]
data_dir =..\\input\\Rec_sys_data.xlsx
3,349 changes: 3,349 additions & 0 deletions lib/Rec_Sys_Hybrid.ipynb

Large diffs are not rendered by default.

Binary file added lib/Resources/Rec sys - hybrid - ref.pdf
Binary file not shown.
Binary file added output/final_model.pkl
Binary file not shown.
111 changes: 111 additions & 0 deletions readme.md
@@ -0,0 +1,111 @@
# Hybrid Recommender System using LightFM


### Business Objective

Recommender systems suggest relevant items to users, and there are two main methods for making these suggestions: content-based and collaborative filtering. Collaborative filtering finds similarities between users to make recommendations, while content-based filtering personalizes content for each user based on their previous actions and feedback.

However, both methods struggle when there is not enough interaction data (the cold-start problem). To address this, we'll explore a hybrid recommendation system, which combines both approaches.
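
To make the combination concrete: LightFM represents users and items as sums of their feature embeddings, so a single model can consume both an interaction matrix (collaborative signal) and feature matrices (content signal). A minimal sketch with toy data — shapes and values are illustrative, not the project's actual pipeline:

```python
# Minimal LightFM hybrid sketch (toy data; shapes and values are illustrative)
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

n_users, n_items, n_features = 5, 4, 3
rng = np.random.default_rng(0)
interactions = coo_matrix(rng.integers(0, 2, (n_users, n_items)))  # collaborative signal
item_features = coo_matrix(rng.random((n_items, n_features)))      # content signal

model = LightFM(loss="warp")
model.fit(interactions, item_features=item_features, epochs=5)
scores = model.predict(user_ids=0, item_ids=np.arange(n_items), item_features=item_features)
```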

---

### Data Description

The dataset used in this project contains transactional data for a UK-based online retail company that sells unique gifts for various occasions.

---

### Aim

Our goal is to build a Hybrid Recommendation system using different loss functions with the LightFM library.

---

### Tech Stack

- Language: `Python`
- Libraries: `pandas`, `numpy`, `scipy`, `lightfm`

---

## Approach

1. **Import required libraries**
2. **Read and merge the data**
3. **Prepare the data**
4. **Split the data into training and testing sets** (see the sketch after this list)
5. **Build models**
- Model with WARP loss function
- Model with logistic loss function
- Model with BPR loss function
6. **Combine data for the final model**
7. **Generate recommendations**
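
For step 4, this project splits the ratings dataframe with scikit-learn's `train_test_split` (see `src/engine.py`). LightFM also ships a helper that splits an already-built interactions matrix directly; a small sketch with a toy matrix standing in for the real one:

```python
# Alternative: split the interaction matrix itself with LightFM's helper
import numpy as np
from scipy.sparse import coo_matrix
from lightfm.cross_validation import random_train_test_split

# Toy interaction matrix standing in for the user-to-product interactions
interactions = coo_matrix(np.random.default_rng(0).integers(0, 3, (6, 5)))
train, test = random_train_test_split(interactions,
                                      test_percentage=0.33,
                                      random_state=np.random.RandomState(42))
```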

---

## Modular Code

1. **input**: Contains the data we'll use for analysis (`Rec_sys_data.xlsx`) and the `config.ini` configuration file.
2. **src**: This folder holds all the code for our project, organized in a modular manner. It includes:
   - **ml_pipeline**
   - **engine.py**

   The `ml_pipeline` folder contains functions organized across different Python files, which are called from `engine.py`. The `config.ini` file in the input folder stores the variables used by `engine.py`.

3. **output**: Contains our final models saved in pickle format.
4. **lib**: This is a reference folder that includes the original IPython notebook and the PowerPoint presentation used during the explanation.
5. **requirements.txt**: Lists all the required libraries with their respective versions. Install these libraries using the command `pip install -r requirements.txt`.
6. Instructions for running the code are in the `readme.md` file.

---

## Key Concepts Explored

1. Representations in recommender systems
2. Hybrid recommender systems
3. Evaluation metrics for recommender systems
4. The LightFM framework
5. Bayesian Personalized Ranking (BPR) loss
6. Weighted Approximate-Rank Pairwise (WARP) loss (standard formulations of both losses follow below)
7. Preparing data in the format LightFM expects
8. Building a hybrid recommendation model with different loss functions
9. Building a recommendation system with the LightFM library
10. Generating recommendations from the final model
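
For reference, the two ranking losses are commonly formulated as follows (standard forms from the literature; LightFM's implementation may differ in details). BPR maximizes the log-likelihood that an observed item $i$ is ranked above an unobserved item $j$ for user $u$:

$$\max_{\Theta} \sum_{(u,i,j)} \ln \sigma\big(\hat{x}_{ui} - \hat{x}_{uj}\big) - \lambda_{\Theta} \lVert \Theta \rVert^2$$

WARP instead weights a pairwise hinge loss by how hard it was to find a violating negative item, roughly $\Phi(\mathrm{rank}(i)) \cdot \max\big(0,\, 1 - \hat{x}_{ui} + \hat{x}_{uj}\big)$ with $\Phi(k) = \sum_{s=1}^{k} 1/s$, which pushes positive items toward the very top of the ranking.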


---

## Getting Started

### Install all the requirements

- `pip install -r requirements.txt`

#### Run `engine.py` to execute the code

- Run `python engine.py` from inside the `src` folder (the script reads `config.ini` via a path relative to `src`).

---


### Note

In case you face issues while installing the `lightfm` package, try the following two methods:

1. In your terminal window (e.g., inside VS Code), perform the following:

   - Upgrade pip: `python -m pip install --upgrade pip`
   - Upgrade wheel: `pip install --upgrade wheel`
   - Upgrade setuptools: `pip install --upgrade setuptools`
   - Close the terminal and try installing the package again.

2. If you face the `error: Microsoft Visual C++` build error, download and install the Microsoft C++ Build Tools:

   - https://visualstudio.microsoft.com/visual-cpp-build-tools/

---

5 changes: 5 additions & 0 deletions requirements.txt
@@ -0,0 +1,5 @@
numpy==1.19.5
pandas==1.3.5
scikit_learn==0.24.1
scipy==1.6.2
lightfm==1.16
80 changes: 80 additions & 0 deletions src/engine.py
@@ -0,0 +1,80 @@
# Importing basic libraries
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split
from ml_pipeline.utils import read_data, merge_dataset, interactions
from ml_pipeline.preprocessing import unique_users, unique_items, features_to_add, mapping
from ml_pipeline.model import hybrid_model, evaluate_model
from ml_pipeline.train_test_merge import train_test_merge
from ml_pipeline.recommendations import get_recommendations
import configparser

# Read configuration from config.ini file
config = configparser.RawConfigParser()
config.read('..\\input\\config.ini')
DATA_DIR = config.get('DATA', 'data_dir')

# Reading data
order = read_data(DATA_DIR, 'order')
customer = read_data(DATA_DIR, 'customer')
product = read_data(DATA_DIR, 'product')

# Merge the datasets
full_table = merge_dataset(order, customer, 'CustomerID', 'CustomerID', 'left')
full_table = merge_dataset(full_table, product, 'StockCode', 'StockCode', 'left')

### Transforming data into the required format ###

# Create user, item, feature lists
users = unique_users(order, "CustomerID")
items = unique_items(product, "Product Name")
features = features_to_add(customer, 'Customer Segment', "Age", "Gender")

# Generate mappings for LightFM library
user_to_index_mapping, index_to_user_mapping, \
item_to_index_mapping, index_to_item_mapping, \
feature_to_index_mapping, index_to_feature_mapping = mapping(users, items, features)

user_to_product_rating_train = full_table[['CustomerID', 'Product Name', 'Quantity']]
product_to_feature = full_table[['Product Name', 'Customer Segment', 'Quantity']]
user_to_product_rating_train = user_to_product_rating_train.groupby(['CustomerID', 'Product Name']).agg({'Quantity': 'sum'}).reset_index()

# Train-test split
user_to_product_rating_train, user_to_product_rating_test = train_test_split(user_to_product_rating_train, test_size=0.33, random_state=42)
product_to_feature = product_to_feature.groupby(['Product Name', 'Customer Segment']).agg({'Quantity': 'sum'}).reset_index()

# Generate user_item_interaction_matrix for train data
user_to_product_interaction_train = interactions(user_to_product_rating_train, "CustomerID",
"Product Name", "Quantity", user_to_index_mapping, item_to_index_mapping)

# Generate item_to_feature interaction
product_to_feature_interaction = interactions(product_to_feature, "Product Name", "Customer Segment", "Quantity",
item_to_index_mapping, feature_to_index_mapping)

# Generate user_item_interaction_matrix for test data
user_to_product_interaction_test = interactions(user_to_product_rating_test, "CustomerID",
"Product Name", "Quantity", user_to_index_mapping, item_to_index_mapping)

## To run individual models (with train-test data)
## Select one of the three models and evaluate the results
'''
# Model building
model_with_features = hybrid_model("logistic", user_to_product_interaction_train, product_to_feature_interaction)
## Evaluate the model
evaluate = evaluate_model(model_with_features, user_to_product_interaction_test, user_to_product_interaction_train, product_to_feature_interaction)
'''

# Merge the train and test data for final model building
user_to_product_interaction = train_test_merge(user_to_product_interaction_train,
user_to_product_interaction_test)

## Build the final model ##
final_model = hybrid_model("logistic", user_to_product_interaction, product_to_feature_interaction)

## Save the model ##
pickle.dump(final_model, open('../output/final_model.pkl', 'wb'))

## Get the recommendations ##
# get_recommendations prints the user's known positives and top-10 recommendations, and returns the latter
recommendation_1 = get_recommendations(final_model, 17017, items, user_to_product_interaction, user_to_index_mapping, product_to_feature_interaction)
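
Note that `src/ml_pipeline/utils.py` (providing `read_data`, `merge_dataset`, and `interactions`) appears in this commit only as a compiled `.pyc`, so its source is not shown. Below is a hypothetical sketch of `interactions`, inferred purely from its call sites in `engine.py` — all names and details here are assumptions, not the committed source:

```python
# Hypothetical reconstruction of ml_pipeline.utils.interactions (inferred, not the committed source)
from scipy.sparse import coo_matrix

def interactions(data, row_name, col_name, value_name, row_mapping, col_mapping):
    # Translate raw IDs into matrix indices via the supplied mappings
    rows = data[row_name].map(row_mapping).values
    cols = data[col_name].map(col_mapping).values
    values = data[value_name].values
    # Build a sparse (row x col) matrix; using the full mapping sizes keeps the
    # train and test matrices the same shape, as train_test_merge requires
    return coo_matrix((values, (rows, cols)), shape=(len(row_mapping), len(col_mapping)))
```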
Binary file added src/ml_pipeline/__pycache__/model.cpython-38.pyc
Binary file not shown.
Binary file added src/ml_pipeline/__pycache__/preprocessing.cpython-38.pyc
Binary file not shown.
Binary file added src/ml_pipeline/__pycache__/recommendations.cpython-38.pyc
Binary file not shown.
Binary file added src/ml_pipeline/__pycache__/train_test_merge.cpython-38.pyc
Binary file not shown.
Binary file added src/ml_pipeline/__pycache__/utils.cpython-38.pyc
Binary file not shown.
76 changes: 76 additions & 0 deletions src/ml_pipeline/model.py
@@ -0,0 +1,76 @@
# Import required libraries
import pandas as pd
import numpy as np
from lightfm import LightFM # LightFM for building hybrid recommendation models
from lightfm.evaluation import auc_score
import time

# Function to build a hybrid model with different loss functions
def hybrid_model(loss, interaction_train, product_interaction):
    if loss == 'warp':  # Loss function = WARP (Weighted Approximate-Rank Pairwise)
        model_with_features = LightFM(loss="warp")  # Initialize the LightFM model with WARP loss
        start = time.time()  # Record the start time

        model_with_features.fit_partial(interaction_train,
                                        user_features=None,
                                        item_features=product_interaction,
                                        sample_weight=None,
                                        epochs=1,
                                        num_threads=4,
                                        verbose=False)  # Fit the model on the training interactions

        end = time.time()  # Record the end time
        print("time taken for fitting = {0:.{1}f} seconds".format(end - start, 2))
        return model_with_features

    elif loss == 'logistic':  # Loss function = logistic
        model_with_features = LightFM(loss="logistic", no_components=30)  # Logistic loss with 30 latent components
        start = time.time()

        model_with_features.fit_partial(interaction_train,
                                        user_features=None,
                                        item_features=product_interaction,
                                        sample_weight=None,
                                        epochs=10,
                                        num_threads=20,
                                        verbose=False)  # Fit the model on the training interactions

        end = time.time()
        print("time taken for fitting = {0:.{1}f} seconds".format(end - start, 2))
        return model_with_features

    elif loss == 'bpr':  # Loss function = BPR (Bayesian Personalized Ranking)
        model_with_features = LightFM(loss="bpr")  # Initialize the LightFM model with BPR loss
        start = time.time()

        model_with_features.fit_partial(interaction_train,
                                        user_features=None,
                                        item_features=product_interaction,
                                        sample_weight=None,
                                        epochs=1,
                                        num_threads=4,
                                        verbose=False)  # Fit the model on the training interactions

        end = time.time()
        print("time taken for fitting = {0:.{1}f} seconds".format(end - start, 2))
        return model_with_features

    else:
        raise ValueError("Invalid loss function specified: choose 'warp', 'logistic', or 'bpr'")

# Function to evaluate the model with an AUC score
def evaluate_model(model, interaction_test, interaction_train, product_interaction):
    start = time.time()  # Record the start time

    auc_with_features = auc_score(model=model,
                                  test_interactions=interaction_test,
                                  train_interactions=interaction_train,
                                  item_features=product_interaction,
                                  num_threads=4,
                                  check_intersections=False)  # Calculate the per-user AUC scores

    end = time.time()  # Record the end time

    print("time taken for AUC score = {0:.{1}f} seconds".format(end - start, 2))

    return "average AUC with item-feature interaction = {0:.{1}f}".format(auc_with_features.mean(), 2)
43 changes: 43 additions & 0 deletions src/ml_pipeline/preprocessing.py
@@ -0,0 +1,43 @@
import pandas as pd
import numpy as np

# Function to create a list of unique users from a specified column in the data
def unique_users(data, column):
    return np.sort(data[column].unique())  # Return a sorted array of unique user IDs

# Function to create a list of unique products/items from a specified column in the data
def unique_items(data, column):
    item_list = data[column].unique()  # Get the unique items from the specified column
    return item_list

# Function to create a list of features by concatenating specified columns from the customer data
def features_to_add(customer, column1, column2, column3):
    customer1 = customer[column1]
    customer2 = customer[column2]
    customer3 = customer[column3]
    # Stack the three columns' values into one series and keep the unique entries
    combined_features = pd.concat([customer1, customer3, customer2], ignore_index=True).unique()
    return combined_features  # Return a unique array of feature values

# Function to create ID mappings for users, items, and features
def mapping(users, items, features):
    user_to_index_mapping = {}  # Map user IDs to matrix indices
    index_to_user_mapping = {}  # Map matrix indices back to user IDs
    for user_index, user_id in enumerate(users):
        user_to_index_mapping[user_id] = user_index
        index_to_user_mapping[user_index] = user_id

    item_to_index_mapping = {}  # Map item IDs to matrix indices
    index_to_item_mapping = {}  # Map matrix indices back to item IDs
    for item_index, item_id in enumerate(items):
        item_to_index_mapping[item_id] = item_index
        index_to_item_mapping[item_index] = item_id

    feature_to_index_mapping = {}  # Map feature IDs to matrix indices
    index_to_feature_mapping = {}  # Map matrix indices back to feature IDs
    for feature_index, feature_id in enumerate(features):
        feature_to_index_mapping[feature_id] = feature_index
        index_to_feature_mapping[feature_index] = feature_id

    return user_to_index_mapping, index_to_user_mapping, \
           item_to_index_mapping, index_to_item_mapping, \
           feature_to_index_mapping, index_to_feature_mapping
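
The manual bookkeeping in `mapping` mirrors what LightFM's own `Dataset` helper does internally; for comparison, a sketch of the equivalent using `lightfm.data`, assuming the `users`, `items`, and `features` iterables produced earlier in the pipeline:

```python
# Equivalent ID bookkeeping using LightFM's Dataset helper (sketch)
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(users=users, items=items, item_features=features)
# Forward mappings comparable to the ones built manually above
user_id_map, user_feature_map, item_id_map, item_feature_map = dataset.mapping()
```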
33 changes: 33 additions & 0 deletions src/ml_pipeline/recommendations.py
@@ -0,0 +1,33 @@
import numpy as np

# Function to get recommendations for a user from a trained recommendation model
def get_recommendations(model, user, items, user_to_product_interaction_matrix, user2index_map, product_to_feature_interaction_matrix):
    # Look up the user's matrix index
    user_index = user2index_map.get(user, None)

    # If the user doesn't exist in the mapping, return None
    if user_index is None:
        return None

    users = user_index

    # Get products already bought by the user
    known_positives = items[user_to_product_interaction_matrix.tocsr()[user_index].indices]
    print('User index =', users)

    # Predict scores for every item using the model
    scores = model.predict(user_ids=users, item_ids=np.arange(user_to_product_interaction_matrix.shape[1]), item_features=product_to_feature_interaction_matrix)

    # Rank items by descending predicted score
    top_items = items[np.argsort(-scores)]

    # Print out the results
    print("User %s" % user)
    print("     Known positives:")  # Items the user has already bought
    for x in known_positives[:10]:
        print("          %s" % x)

    print("     Recommended:")  # Items recommended to the user
    for x in top_items[:10]:
        print("          %s" % x)

    # Return the top-10 recommendations so callers can use them programmatically
    return top_items[:10]
34 changes: 34 additions & 0 deletions src/ml_pipeline/train_test_merge.py
@@ -0,0 +1,34 @@
import numpy as np
from scipy.sparse import coo_matrix

# Function to merge training and testing data into a single sparse matrix
def train_test_merge(training_data, testing_data):
    # Dictionary keyed by (row, col) holding the interaction values
    train_dict = {}

    # Populate the dictionary with the training data
    for row, col, data in zip(training_data.row, training_data.col, training_data.data):
        train_dict[(row, col)] = data

    # Merge in the testing data, keeping the larger value where cells overlap
    for row, col, data in zip(testing_data.row, testing_data.col, testing_data.data):
        train_dict[(row, col)] = max(data, train_dict.get((row, col), 0))

    # Unpack the dictionary into row indices, column indices, and data values
    row_list = []
    col_list = []
    data_list = []
    for row, col in train_dict:
        row_list.append(row)
        col_list.append(col)
        data_list.append(train_dict[(row, col)])

    # Convert the lists to numpy arrays
    row_list = np.array(row_list)
    col_list = np.array(col_list)
    data_list = np.array(data_list)

    # Build a coo_matrix (sparse matrix) with the merged data
    return coo_matrix((data_list, (row_list, col_list)), shape=(training_data.shape[0], training_data.shape[1]))
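
A quick illustration of the max-merge behavior with two toy matrices — the overlapping cell keeps the larger of the two values (assumes `train_test_merge` from this module is in scope):

```python
# Toy usage of train_test_merge
import numpy as np
from scipy.sparse import coo_matrix

train = coo_matrix(np.array([[0, 2], [1, 0]]))
test = coo_matrix(np.array([[0, 5], [0, 3]]))
merged = train_test_merge(train, test)
print(merged.toarray())  # -> [[0 5]
                         #     [1 3]]
```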
