Merge pull request recodehive#324 from NIKITA320495/main
End to End Bank Customer churn prediction model
sanjay-kv authored Jun 17, 2024
2 parents 2f71e3a + 3ca9ccf commit 453d6a4
Showing 21 changed files with 105,690 additions and 0 deletions.
100,002 changes: 100,002 additions & 0 deletions Bank customer churn prediction/Dataset/Customer.csv

Large diffs are not rendered by default.

47 changes: 47 additions & 0 deletions Bank customer churn prediction/Dataset/README.md
@@ -0,0 +1,47 @@
# Dataset: [Customer.csv](Customer.csv)

## Overview

This dataset contains information on bank customers, with 100001 rows and 14 columns: 13 features and the target variable `Exited`. Each row represents a unique customer record, and the feature columns cover customer demographics, financial status, and banking behavior. This range of features makes the dataset suitable for predictive modeling, particularly customer churn prediction, which can inform targeted strategies for customer retention and satisfaction.

### Features

- **id**: A unique identifier for each record.
- **CustomerId**: A unique identifier for each customer.
- **Surname**: The surname of the customer.
- **CreditScore**: The credit score of the customer, indicating their creditworthiness.
- **Geography**: The geographical location of the customer.
- **Gender**: The gender of the customer.
- **Age**: The age of the customer.
- **Tenure**: The number of years the customer has been with the bank.
- **Balance**: The account balance of the customer.
- **NumOfProducts**: The number of bank products the customer uses.
- **HasCrCard**: Whether the customer has a credit card (1 if yes, 0 if no).
- **IsActiveMember**: Whether the customer is an active member of the bank (1 if yes, 0 if no).
- **EstimatedSalary**: The estimated salary of the customer.


### Dataset Details

- **Total Instances**: 100001
- **Features**: 13
- **Target Variable**: Exited (Whether the customer has churned: 1 if yes, 0 if no)
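
As a minimal sketch of how the columns above might be split into model inputs and the target, the snippet below reads a two-row stand-in for `Customer.csv` with pandas. The sample rows are invented for illustration, and dropping `id`, `CustomerId`, and `Surname` as identifier-like columns is an assumption, not something this README prescribes.

```python
import io

import pandas as pd

# Two invented rows that only mimic the documented column layout;
# in practice, pass the path to Customer.csv to pd.read_csv instead.
SAMPLE_CSV = io.StringIO(
    "id,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,"
    "Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited\n"
    "0,15600001,Smith,650,France,Male,35,5,0.0,2,1,1,100000.0,0\n"
    "1,15600002,Jones,580,Spain,Female,48,2,120000.0,1,1,0,45000.0,1\n"
)

df = pd.read_csv(SAMPLE_CSV)

# Assumption: identifier-like columns carry no predictive signal.
ID_COLUMNS = ["id", "CustomerId", "Surname"]
TARGET = "Exited"

X = df.drop(columns=ID_COLUMNS + [TARGET])  # model inputs
y = df[TARGET]                              # churn label (1 = churned)

print(X.shape, y.tolist())
```

Keeping the target in its own variable from the start makes it harder to leak `Exited` into the feature matrix by accident.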

## Preprocessing

The dataset has undergone preprocessing to prepare it for machine learning tasks. The following steps were applied:

- **Feature Scaling**: Two scaling techniques were used:
  - **MinMax Scaling**: Applied to columns 'Tenure' and 'NumOfProducts' using `MinMaxScaler`.
  - **Standard Scaling**: Applied to columns 'CreditScore', 'Age', and 'Balance' using `StandardScaler`.
- **One-Hot Encoding**: The categorical column 'Gender' was one-hot encoded to convert it into numerical format, enabling machine learning algorithms to process it effectively.

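These preprocessing steps bring the listed numeric columns onto comparable scales (MinMax scaling maps values to the range [0, 1]; standard scaling gives zero mean and unit variance) and put 'Gender' in a numeric format suitable for training machine learning models. The dataset is now ready for further analysis and model building.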
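
The steps above can be sketched with scikit-learn's `ColumnTransformer`. This is an illustration under stated assumptions rather than the exact code used to prepare `Customer.csv`: the toy rows are invented, and combining the three transformers in one `ColumnTransformer` is one possible arrangement.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Invented toy rows using the documented column names.
df = pd.DataFrame(
    {
        "CreditScore": [650, 580, 720, 610],
        "Age": [35, 48, 29, 41],
        "Balance": [0.0, 120000.0, 85000.0, 40000.0],
        "Tenure": [5, 2, 8, 0],
        "NumOfProducts": [2, 1, 1, 4],
        "Gender": ["Male", "Female", "Female", "Male"],
    }
)

preprocessor = ColumnTransformer(
    transformers=[
        ("minmax", MinMaxScaler(), ["Tenure", "NumOfProducts"]),
        ("standard", StandardScaler(), ["CreditScore", "Age", "Balance"]),
        ("gender", OneHotEncoder(), ["Gender"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns follow the transformer order:
# 2 min-max scaled, 3 standardized, 2 one-hot columns.
features = preprocessor.fit_transform(df)
print(features.shape)
```

When a train/test split is involved, fitting the transformer on the training portion only and then calling `transform` on the test portion avoids leaking test statistics into the scaling.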

## Usage

This dataset can be used for various purposes, including:

- **Training Machine Learning Models**: Use the dataset to train machine learning models for predicting the target variable (e.g., customer churn). By leveraging the provided features, you can build predictive models to identify patterns and trends in customer behavior.
- **Exploratory Data Analysis (EDA)**: Perform EDA to gain insights into specific aspects of the data, such as customer demographics, financial behavior, and churn patterns. Visualization techniques can help uncover relationships and correlations between different features, providing valuable insights for decision-making.
- **Research Projects**: The dataset can serve as a valuable resource for research projects related to banking, customer behavior analysis, and predictive modeling. Researchers can explore various hypotheses and conduct experiments to validate findings in the context of customer churn prediction and proactive retention strategies.
71 changes: 71 additions & 0 deletions Bank customer churn prediction/Model/README.md
@@ -0,0 +1,71 @@
# Machine Learning Model for Bank Customer Churn Prediction

## Overview

This machine learning model predicts customer churn in a bank based on customer demographics, financial status, and banking behavior. It is trained on a dataset containing customer information and churn labels. By analyzing these features, the model identifies patterns indicating whether a customer is likely to churn, enabling proactive retention strategies.

## Preprocessing

- **Feature Scaling**: MinMax scaling for 'Tenure' and 'NumOfProducts'; standard scaling for 'CreditScore', 'Age', and 'Balance'.
- **One-Hot Encoding**: The categorical feature 'Gender' is encoded to numerical format.

## Libraries Used

- scikit-learn: Machine learning library for model training and evaluation.
- pandas: Data manipulation library for preprocessing and data handling.
- numpy: Numerical computing library for mathematical operations.
- matplotlib, seaborn, plotly: Visualization libraries for data exploration and analysis.


## Model Details
### Hyperparameter Tuning Results

- **Logistic Regression**:
  - Best Parameters: `{'solver': 'liblinear', 'penalty': 'l1', 'C': 10}`
  - Best Score: 0.757
  - Train Accuracy: 0.757
  - Test Accuracy: 0.754

- **Random Forest**:
  - Best Parameters: `{'n_estimators': 200, 'min_samples_split': 2, 'max_depth': None}`
  - Best Score: 0.892
  - Train Accuracy: 0.999
  - Test Accuracy: 0.906

- **Decision Tree**:
  - Best Parameters: `{'min_samples_split': 5, 'max_depth': 20}`
  - Best Score: 0.853
  - Train Accuracy: 0.949
  - Test Accuracy: 0.866

- **Gradient Boosting**:
  - Best Parameters: `{'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.2}`
  - Best Score: 0.908
  - Train Accuracy: 0.918
  - Test Accuracy: 0.908

- **AdaBoost**:
  - Best Parameters: `{'n_estimators': 200, 'learning_rate': 1}`
  - Best Score: 0.879
  - Train Accuracy: 0.880
  - Test Accuracy: 0.879
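
This README does not state how the search was run. The sketch below shows one way figures like these could be produced, assuming a scikit-learn `RandomizedSearchCV` in which "Best Score" is the mean cross-validated accuracy. The synthetic data and the small search space are hypothetical stand-ins chosen to keep the run fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data; the real features come from Customer.csv.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hypothetical, deliberately small search space.
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.2],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,            # number of sampled parameter combinations
    cv=3,                # 3-fold cross-validation on the training split
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

# search.score() evaluates the refitted best estimator with the same scorer.
print("Best Parameters:", search.best_params_)
print("Best Score:", round(search.best_score_, 3))
print("Train Accuracy:", round(search.score(X_train, y_train), 3))
print("Test Accuracy:", round(search.score(X_test, y_test), 3))
```

Reporting the cross-validated score next to train and test accuracy, as the list above does, makes it easier to spot a model whose training accuracy is far higher than its held-out performance.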
## Model Evaluation

### Accuracy Score

The accuracy score is a metric used to measure the overall performance of a classification model. It represents the proportion of correctly predicted instances out of the total instances. In the context of this bank customer churn prediction model, the accuracy score indicates the percentage of correctly predicted churned and non-churned customers.
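
As a small worked example of the definition above, the following sketch computes accuracy by hand on invented labels. In practice `sklearn.metrics.accuracy_score` does the same job.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Invented labels: 1 = churned, 0 = stayed.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

print(accuracy(y_true, y_pred))  # 6 of 8 correct -> 0.75
```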

### Area Under the ROC Curve (AUC-ROC)

The AUC-ROC (Area Under the Receiver Operating Characteristic Curve) is a metric used to evaluate the performance of binary classification models. It represents the area under the curve plotted by the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) for different threshold values. A higher AUC-ROC value indicates better discrimination capability of the model in distinguishing between positive and negative classes.
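
The area under the ROC curve equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise view on invented scores. It is quadratic in the sample size and meant only as an illustration; `sklearn.metrics.roc_auc_score` is the practical choice.

```python
def roc_auc(y_true, scores):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correctly ranked pair.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one sample of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = churned, scores are predicted churn probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_true, scores))  # 3 of 4 pairs ranked correctly -> 0.75
```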

### Determining the Best Model

To determine the best model among the trained classifiers, we consider both the accuracy score and the AUC-ROC value.

1. **Accuracy Score**: We look for the model with the highest accuracy score on the test dataset. A higher accuracy score means a larger share of test customers is assigned the correct class label.

2. **AUC-ROC**: We also consider the AUC-ROC value. A higher AUC-ROC value indicates better overall performance in terms of both sensitivity and specificity. It helps to assess how well the model is able to distinguish between positive and negative instances.

By comparing the accuracy scores and AUC-ROC values of different models, we can determine which model performs best for the task of predicting bank customer churn. Typically, we choose the model with the highest accuracy score and AUC-ROC value as the best-performing model for deployment.
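
As a sketch of the accuracy half of this comparison, the snippet below ranks the models by the train and test accuracies listed under "Hyperparameter Tuning Results". AUC-ROC values are not reported in this README, so they are left out here. The train/test gap is an extra diagnostic added for illustration.

```python
# (train accuracy, test accuracy) as reported under
# "Hyperparameter Tuning Results".
reported = {
    "Logistic Regression": (0.757, 0.754),
    "Random Forest": (0.999, 0.906),
    "Decision Tree": (0.949, 0.866),
    "Gradient Boosting": (0.918, 0.908),
    "AdaBoost": (0.880, 0.879),
}

# Rank by test accuracy, highest first.
ranking = sorted(reported.items(), key=lambda item: item[1][1], reverse=True)

for name, (train_acc, test_acc) in ranking:
    gap = train_acc - test_acc  # a large gap hints at overfitting
    print(f"{name:<20} test={test_acc:.3f} gap={gap:.3f}")

best_model = ranking[0][0]
print("Highest test accuracy:", best_model)
```

A full comparison would add each model's AUC-ROC on the test set as a second column before settling on a model for deployment.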
