This repository provides a solution to the Customer Segmentation problem using the Birch Clustering Algorithm (original reference, simplified reference). The project includes data preprocessing, clustering analysis, and evaluation of the segmentation results on a customer dataset. The Birch algorithm is particularly useful for large datasets with a natural hierarchical structure, making it suitable for customer segmentation tasks.
- Introduction
- Dataset
- Requirements
- Project Structure
- Setup
- Usage
- Clustering Results
- Evaluation Metrics
- Visualization
- Notes and Customization
- Contributing
- License
Customer segmentation is a core part of targeted marketing, allowing businesses to group customers based on shared characteristics. This project applies the Birch (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm to a dataset of mall customers. The Birch algorithm clusters customers based on their annual income and spending score, helping to identify different customer personas.
The project uses two customer datasets taken from Kaggle. The first dataset has the following features:
- CustomerID: Unique identifier for each customer
- Gender: Customer gender (categorical)
- Age: Customer age
- Annual Income (k$): Customer's annual income in thousands of dollars
- Spending Score (1-100): Score based on customer spending habits (1–100 scale)

The second dataset has the following features:
- CustomerID: Unique identifier for each customer
- Gender: Customer gender (categorical)
- Age: Customer age
- Annual Income ($): Customer's annual income
- Spending Score (1-100): Score based on customer spending habits (1–100 scale)
- Profession: Customer's profession
- Work Experience: Customer's work experience
- Family Size: Customer's family size
Note: Place the dataset files in the repository's root directory, or specify the correct path when loading the CSV files.
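Loading a dataset and checking its columns can be sketched as follows. This is a minimal sketch, not the notebook's own code: the `load_customers` helper and the in-memory sample rows are assumptions for illustration, and in practice the file path (for example `Mall_Customers.csv`) would be passed instead of the sample.

```python
import io

import pandas as pd

# Illustrative in-memory sample that mirrors the first dataset's columns.
# With the real file, pass its path instead, e.g. "Mall_Customers.csv".
SAMPLE_CSV = """CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
"""


def load_customers(source):
    """Hypothetical helper: load a customer CSV from a path or file-like object."""
    df = pd.read_csv(source)
    # Fail early if a column the clustering relies on is missing.
    required = {"Annual Income (k$)", "Spending Score (1-100)"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return df


customers = load_customers(io.StringIO(SAMPLE_CSV))
print(customers.shape)
```

Checking the required columns up front gives a clearer error than a `KeyError` later in the notebook if the wrong file or path is used.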
The project requires the following Python libraries:
- pandas
- numpy
- scikit-learn
- matplotlib
- plotly
- Customer_Segmentation.ipynb: Main Jupyter Notebook containing all steps for data preprocessing, clustering, and visualization.
- Mall_Customers.csv: Sample dataset used for clustering.
- Customers.csv: Another sample dataset used for clustering.
- requirements.txt: Python libraries required to run the notebook.
- Clone the repository:
git clone https://github.com/PatrickDiallo23/Customer-Segmentation-BirchClustering.git
cd Customer-Segmentation-BirchClustering
- Install required packages:
pip install -r requirements.txt
- Open the Customer_Segmentation_Birch_Algorithm.ipynb file in Jupyter Notebook or JupyterLab:
jupyter notebook Customer_Segmentation_Birch_Algorithm.ipynb
Note: If Jupyter is not installed yet, install JupyterLab or Jupyter Notebook by running:
pip install jupyterlab
or
pip install notebook
- Run the cells sequentially to:
- Load and preprocess the dataset
- Encode categorical variables
- Normalize the numerical features
- Run the Birch clustering algorithm to find optimal clusters
- Visualize the results
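The steps above can be sketched end to end as follows. This is a rough sketch under stated assumptions rather than the notebook's actual code: the data is synthetic, `LabelEncoder` and `MinMaxScaler` are one possible choice for encoding and normalization, and the Birch parameter values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import Birch
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Synthetic stand-in for the mall customer data (not the real dataset):
# 30 low-income/low-spending and 30 high-income/high-spending customers.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(18, 70, size=n),
    "Annual Income (k$)": np.concatenate(
        [rng.normal(25, 4, n // 2), rng.normal(85, 4, n // 2)]
    ),
    "Spending Score (1-100)": np.concatenate(
        [rng.normal(20, 5, n // 2), rng.normal(80, 5, n // 2)]
    ),
})

# Encode the categorical column as integers.
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])

# Normalize the two clustering features to the [0, 1] range.
features = ["Annual Income (k$)", "Spending Score (1-100)"]
X = MinMaxScaler().fit_transform(df[features])

# Cluster the scaled features; parameter values here are assumptions.
model = Birch(threshold=0.1, branching_factor=50, n_clusters=2)
df["Cluster"] = model.fit_predict(X)

print(df["Cluster"].value_counts())
```

Only the two clustering features are scaled here; the encoded `Gender` column is kept for later inspection, for example in plot hover text.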
The project uses a range of 2 to 10 clusters and evaluates them based on Silhouette Score to determine the optimal number of clusters.
- Silhouette Analysis: Silhouette scores are calculated for each number of clusters, helping identify the best cluster configuration based on the compactness and separation of clusters.
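A search over 2 to 10 clusters scored by silhouette could look like the sketch below. It assumes scikit-learn's `Birch` and `silhouette_score`, uses synthetic blobs in place of the customer features, and picks a small `threshold` so that enough subclusters exist for every candidate cluster count; none of these values come from the notebook.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic 2D data with four groups, standing in for scaled
# annual income and spending score.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
    cluster_std=0.5,
    random_state=42,
)
X = MinMaxScaler().fit_transform(X)

# Fit Birch for each candidate cluster count and record the silhouette score.
scores = {}
for k in range(2, 11):
    labels = Birch(threshold=0.02, n_clusters=k).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest silhouette score is taken as the best.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

If `threshold` is too large, Birch may produce fewer subclusters than the requested number of clusters and emit a warning, so the threshold and the range of cluster counts should be chosen together.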
The clustering performance is evaluated using three metrics:
- Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters. Values range from -1 to 1; higher is better.
- Calinski-Harabasz Score: Assesses the ratio of the between-cluster dispersion to the within-cluster dispersion; higher is better. (TBD)
- Davies-Bouldin Score: Measures the average similarity of each cluster with its most similar cluster; lower is better. (TBD)
These metrics are useful for assessing the compactness and separation of clusters, guiding the choice of the number of clusters.
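All three metrics are available in scikit-learn's `sklearn.metrics` module. The following sketch computes them for one Birch clustering of synthetic data; the data and parameter values are assumptions for illustration, not results from this project.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic, well-separated groups standing in for the customer features.
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.5,
    random_state=0,
)
labels = Birch(threshold=0.5, n_clusters=3).fit_predict(X)

metrics = {
    # Range [-1, 1]; higher means tighter, better separated clusters.
    "silhouette": silhouette_score(X, labels),
    # Between-cluster over within-cluster dispersion; higher is better.
    "calinski_harabasz": calinski_harabasz_score(X, labels),
    # Average similarity to the most similar cluster; lower is better.
    "davies_bouldin": davies_bouldin_score(X, labels),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the metrics point in different directions (two are maximized, one is minimized), comparing cluster counts is easier when each metric is tabulated separately rather than combined.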
The notebook uses Plotly for interactive 2D visualization of clusters:
- Scatter plots of Annual Income vs. Spending Score, colored by clusters, with hover functionality displaying other customer details (e.g., age, gender).
An example scatter plot generated:
- Normalization: The project includes two options for feature normalization. Uncomment the desired normalization method in the notebook to use it:
- Min-Max Scaling: Scales features to a [0, 1] range.
- Standard Scaling: Scales features to have a mean of 0 and standard deviation of 1.
- Birch Parameters:
- Threshold: Sets the maximum radius of a subcluster; higher values produce fewer, more spread-out subclusters.
- Branching Factor: Limits the maximum number of subclusters in each node.
- No. of clusters: The number of clusters returned by the BIRCH algorithm, i.e., the number of clusters after the final step of the algorithm.
- Custom Cluster Names: A placeholder function can assign names to clusters based on customer attributes like annual income and spending score.
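The Birch parameters and a cluster-naming function can be sketched together as below. This assumes scikit-learn's `Birch` (with its `threshold`, `branching_factor`, and `n_clusters` arguments) and synthetic, already scaled data; `name_cluster` and its cut-off values are hypothetical and are not the notebook's placeholder function.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import Birch


def name_cluster(mean_income, mean_score, income_cut=0.5, score_cut=0.5):
    """Hypothetical naming rule based on scaled [0, 1] cluster means."""
    income = "High income" if mean_income >= income_cut else "Low income"
    spend = "high spending" if mean_score >= score_cut else "low spending"
    return f"{income}, {spend}"


# Synthetic scaled data: four groups near the corners of the unit square.
rng = np.random.default_rng(1)
corners = [(0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8)]
X = np.vstack([rng.normal(c, 0.04, size=(40, 2)) for c in corners])
df = pd.DataFrame(X, columns=["income", "score"])

# threshold: maximum subcluster radius; branching_factor: maximum number of
# subclusters per node; n_clusters: clusters after the final step.
model = Birch(threshold=0.05, branching_factor=50, n_clusters=4)
df["cluster"] = model.fit_predict(X)

# Name each cluster from its mean income and spending score.
means = df.groupby("cluster")[["income", "score"]].mean()
names = {c: name_cluster(row["income"], row["score"]) for c, row in means.iterrows()}
df["segment"] = df["cluster"].map(names)

print(df["segment"].value_counts())
```

Naming from cluster means keeps the labels stable even though the integer cluster ids themselves are arbitrary and can change between runs or parameter settings.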
Contributions are welcome. To contribute to this project, report any issues or follow these steps:
1. Fork the repository.
2. Create a new branch (git checkout -b new-branch).
3. Make your changes.
4. Commit your changes (git commit -m 'Add some changes').
5. Push to the branch (git push origin new-branch).
6. Open a Pull Request.
This project is licensed under the MIT License. See the LICENSE file for details.
This project was developed as a hands-on learning experience in clustering algorithms for customer segmentation.