forked from recodehive/machine-learning-repos
Merge pull request recodehive#324 from NIKITA320495/main
End to End Bank Customer churn prediction model
Showing 21 changed files with 105,690 additions and 0 deletions.
Bank customer churn prediction/Dataset/Customer.csv: 100,002 additions, 0 deletions
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
# Dataset: [Customer.csv](Dataset/Customer.csv) | ||
|
||
## Overview | ||
|
||
This dataset describes bank customers and has 100001 rows and 14 columns. Each row is one customer record; the columns cover customer demographics, financial status, and banking behavior. The dataset is suited to predictive modeling, in particular predicting customer churn, and can inform targeted strategies for customer retention. | |
|
||
### Features | ||
|
||
- **id**: A unique identifier for each record. | ||
- **CustomerId**: A unique identifier for each customer. | ||
- **Surname**: The surname of the customer. | ||
- **CreditScore**: The credit score of the customer, indicating their creditworthiness. | ||
- **Geography**: The geographical location of the customer. | ||
- **Gender**: The gender of the customer. | ||
- **Age**: The age of the customer. | ||
- **Tenure**: The number of years the customer has been with the bank. | ||
- **Balance**: The account balance of the customer. | ||
- **NumOfProducts**: The number of bank products the customer uses. | ||
- **HasCrCard**: Whether the customer has a credit card (1 if yes, 0 if no). | ||
- **IsActiveMember**: Whether the customer is an active member of the bank (1 if yes, 0 if no). | ||
- **EstimatedSalary**: The estimated salary of the customer. | ||
|
||
|
||
### Dataset Details | ||
|
||
- **Total Instances**: 100001 | ||
- **Features**: 13 | ||
- **Target Variable**: Exited (Whether the customer has churned: 1 if yes, 0 if no) | ||
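
As a minimal sketch of how the 13 features and the target could be separated, assuming the CSV is loaded into a pandas DataFrame with the column names listed above. The helper `split_features_target` and the two sample rows are hypothetical; in practice the frame would come from something like `pd.read_csv("Dataset/Customer.csv")`.

```python
import pandas as pd

TARGET = "Exited"  # target column named in the README


def split_features_target(df: pd.DataFrame):
    """Return (X, y): every column except the target, and the target itself."""
    return df.drop(columns=[TARGET]), df[TARGET]


# Two made-up rows using the README's column names; values are illustrative only.
sample = pd.DataFrame({
    "id": [0, 1],
    "CustomerId": [1001, 1002],
    "Surname": ["Doe", "Roe"],
    "CreditScore": [650, 720],
    "Geography": ["France", "Spain"],
    "Gender": ["Female", "Male"],
    "Age": [34, 45],
    "Tenure": [3, 7],
    "Balance": [0.0, 82000.5],
    "NumOfProducts": [2, 1],
    "HasCrCard": [1, 0],
    "IsActiveMember": [1, 0],
    "EstimatedSalary": [52000.0, 91000.0],
    "Exited": [0, 1],
})

X, y = split_features_target(sample)
print(X.shape)     # 13 feature columns remain after removing the target
print(y.tolist())
```

Identifier-like columns such as id, CustomerId, and Surname are kept here; whether to drop them before training is a modeling decision this README does not describe.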
|
||
## Preprocessing | ||
|
||
The dataset has undergone preprocessing to prepare it for machine learning tasks. The following steps were applied: | ||
|
||
- **Feature Scaling**: Two scaling techniques were used: | ||
  - **MinMax Scaling**: Applied to columns 'Tenure' and 'NumOfProducts' using `MinMaxScaler`. | |
  - **Standard Scaling**: Applied to columns 'CreditScore', 'Age', and 'Balance' using `StandardScaler`. | |
- **One-Hot Encoding**: The categorical column 'Gender' was one-hot encoded to convert it into numerical format, enabling machine learning algorithms to process it effectively. | ||
|
||
These steps put the scaled numeric columns on comparable ranges and convert 'Gender' to a numeric format suitable for training machine learning models. Columns not listed above, such as 'Geography' and 'EstimatedSalary', are not described as transformed, so check them before model building. | |
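
The steps listed above can be sketched with scikit-learn roughly as follows. This is an illustration under assumptions, not the code used to prepare the dataset: the `ColumnTransformer` wiring, the `sparse_threshold=0` setting, and the four sample rows are choices made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Tiny made-up frame with the README's column names; values are illustrative only.
df = pd.DataFrame({
    "CreditScore": [600, 700, 800, 650],
    "Age": [30, 40, 50, 35],
    "Balance": [0.0, 50000.0, 120000.0, 80000.0],
    "Tenure": [1, 5, 10, 3],
    "NumOfProducts": [1, 2, 4, 1],
    "Gender": ["Male", "Female", "Female", "Male"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Rescale to the [0, 1] range.
        ("minmax", MinMaxScaler(), ["Tenure", "NumOfProducts"]),
        # Rescale to zero mean and unit variance.
        ("standard", StandardScaler(), ["CreditScore", "Age", "Balance"]),
        # One indicator column per gender category.
        ("gender", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
    ],
    remainder="drop",    # assumption: columns not listed are left out of this sketch
    sparse_threshold=0,  # always return a dense NumPy array
)

X = preprocessor.fit_transform(df)
print(X.shape)  # 2 min-max columns + 3 standardized columns + 2 one-hot columns
```

Output columns follow the order of the `transformers` list. When training a model, fit the preprocessor on the training split only and reuse it on the test split so that test data does not influence the scaling statistics.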
|
||
## Usage | ||
|
||
This dataset can be used for various purposes, including: | ||
|
||
- **Training Machine Learning Models**: Train models to predict the target variable Exited (customer churn) from the provided features. | |
- **Exploratory Data Analysis (EDA)**: Explore customer demographics, financial behavior, and churn patterns. Visualizations can reveal relationships and correlations between features. | |
- **Research Projects**: Use the dataset for research on banking, customer behavior analysis, and predictive modeling, for example to test hypotheses about churn prediction and proactive retention strategies. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
# Machine Learning Model for Bank Customer Churn Prediction | ||
|
||
## Overview | ||
|
||
This machine learning model predicts customer churn in a bank based on customer demographics, financial status, and banking behavior. It is trained on a dataset containing customer information and churn labels. By analyzing these features, the model identifies patterns indicating whether a customer is likely to churn, enabling proactive retention strategies. | ||
|
||
## Preprocessing | ||
|
||
- **Feature Scaling**: MinMax scaling for 'Tenure' and 'NumOfProducts', Standard scaling for 'CreditScore', 'Age', and 'Balance'. | ||
- **One-Hot Encoding**: Encode categorical feature 'Gender' to numerical format. | ||
|
||
## Libraries Used | ||
|
||
- scikit-learn: Machine learning library for model training and evaluation. | ||
- pandas: Data manipulation library for preprocessing and data handling. | ||
- numpy: Numerical computing library for mathematical operations. | ||
- matplotlib, seaborn, plotly: Visualization libraries for data exploration and analysis. | ||
|
||
|
||
## Model Details | ||
### Hyperparameter Tuning Results | ||
|
||
- **Logistic Regression**: | |
  - Best Parameters: {'solver': 'liblinear', 'penalty': 'l1', 'C': 10} | |
  - Best Score: 0.757 | |
  - Train Accuracy: 0.757 | |
  - Test Accuracy: 0.754 | |
|
||
- **Random Forest**: | |
  - Best Parameters: {'n_estimators': 200, 'min_samples_split': 2, 'max_depth': None} | |
  - Best Score: 0.892 | |
  - Train Accuracy: 0.999 | |
  - Test Accuracy: 0.906 | |
|
||
- **Decision Tree**: | |
  - Best Parameters: {'min_samples_split': 5, 'max_depth': 20} | |
  - Best Score: 0.853 | |
  - Train Accuracy: 0.949 | |
  - Test Accuracy: 0.866 | |
|
||
- **Gradient Boosting**: | |
  - Best Parameters: {'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.2} | |
  - Best Score: 0.908 | |
  - Train Accuracy: 0.918 | |
  - Test Accuracy: 0.908 | |
|
||
- **AdaBoost**: | |
  - Best Parameters: {'n_estimators': 200, 'learning_rate': 1} | |
  - Best Score: 0.879 | |
  - Train Accuracy: 0.880 | |
  - Test Accuracy: 0.879 | |
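
The README does not say which search procedure produced these results. As a hedged sketch, the four reported values per model could be obtained with a cross-validated search such as scikit-learn's `GridSearchCV`, where `best_score_` is the mean cross-validated score of the best parameter combination. The synthetic data from `make_classification`, the parameter grid, and the 80/20 split are assumptions made only so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed churn data (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Hypothetical grid; the grid actually searched is not given in the README.
param_grid = {
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_            # "Best Parameters"
best_score = search.best_score_              # mean cross-validated accuracy
train_acc = search.score(X_train, y_train)   # accuracy of the refit best model
test_acc = search.score(X_test, y_test)

print(best_params)
print(round(best_score, 3), round(train_acc, 3), round(test_acc, 3))
```

A large gap between train and test accuracy, as in the Random Forest and Decision Tree results above, is worth checking for overfitting before choosing a model.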
## Model Evaluation | ||
|
||
### Accuracy Score | ||
|
||
The accuracy score is a metric used to measure the overall performance of a classification model. It represents the proportion of correctly predicted instances out of the total instances. In the context of this bank customer churn prediction model, the accuracy score indicates the percentage of correctly predicted churned and non-churned customers. | ||
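
A small sketch of the computation, using made-up labels (1 = churned, 0 = stayed). The function name `accuracy` and the sample values are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: 2 churned customers out of 8.
y_true = [1, 0, 0, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0]

model_acc = accuracy(y_true, y_pred)                  # 6 of 8 correct
baseline_acc = accuracy(y_true, [0] * len(y_true))    # always predict "stayed"

print(model_acc, baseline_acc)
```

In this toy example a baseline that never predicts churn reaches the same accuracy as the model, which is one reason accuracy is usually read alongside a second metric when the classes are imbalanced.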
|
||
### Area Under the ROC Curve (AUC-ROC) | ||
|
||
The AUC-ROC (Area Under the Receiver Operating Characteristic Curve) evaluates binary classification models. The ROC curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) as the decision threshold varies, and the AUC is the area under that curve. It ranges from 0 to 1, where 0.5 corresponds to random ranking; a higher value means the model separates the positive and negative classes better. | |
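
To make the definition concrete, here is a minimal pure-Python sketch that sweeps the threshold over the predicted scores, collects the (False Positive Rate, True Positive Rate) points, and integrates them with the trapezoidal rule. The function names and the four sample scores are illustrative assumptions; in practice `sklearn.metrics.roc_auc_score` would normally be used.

```python
def roc_points(y_true, scores):
    """Return ROC points (FPR, TPR), starting at (0, 0), for each threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a sample is predicted positive
    # when its score is at or above the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels and predicted churn probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points)
print(auc)
```

The sketch assumes both classes are present in `y_true`; with only one class the rates are undefined.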
|
||
### Determining the Best Model | ||
|
||
To determine the best model among the trained classifiers, we consider both the accuracy score and the AUC-ROC value. | ||
|
||
1. **Accuracy Score**: We look for the model with the highest accuracy on the test dataset, that is, the largest share of test customers whose class label is predicted correctly. | |
|
||
2. **AUC-ROC**: We also consider the AUC-ROC value, which summarizes the trade-off between sensitivity and specificity across all thresholds. A higher value means the model is better at ranking positive instances above negative ones. | |
|
||
Comparing test accuracy and AUC-ROC across the models identifies the best performer for predicting bank customer churn. A model that leads on both metrics is the natural choice for deployment; when the two metrics disagree, the choice depends on which one matters more for the application. | |
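
As an illustration of this comparison, the sketch below ranks the five models by the test accuracies reported in the tuning results. AUC-ROC values are not listed in this section, so the helper takes them as an optional input and uses them only to break ties. The function `rank_models` and that tie-breaking rule are assumptions for the example, not the selection procedure used in the project.

```python
# Test accuracies as reported in the hyperparameter tuning results.
test_accuracy = {
    "Logistic Regression": 0.754,
    "Random Forest": 0.906,
    "Decision Tree": 0.866,
    "Gradient Boosting": 0.908,
    "AdaBoost": 0.879,
}


def rank_models(accuracy, auc=None):
    """Sort model names best-first by accuracy, then by AUC-ROC if provided."""
    auc = auc or {}
    return sorted(
        accuracy,
        key=lambda name: (accuracy[name], auc.get(name, 0.0)),
        reverse=True,
    )


ranking = rank_models(test_accuracy)
print(ranking[0])  # model with the highest reported test accuracy
print(ranking)
```

By test accuracy alone, Gradient Boosting (0.908) is marginally ahead of Random Forest (0.906); measured AUC-ROC values can be passed as the second argument to refine the ranking.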
|