Customer churn prediction by using Apache Spark and Gradient Boosting Classifier

gustavomccoelho/Customer-Churn-Prediction-Spark

Problem definition

The goal of this project is to predict customer churn* by using the given features.

*Churn definition: the number of customers who stopped using a company's product or service during a given time frame.

Data description

train.csv - the training set

test.csv - the test set

Both the training and test datasets contain 20 features, including the target feature ("churn").

Script strategy

The scripts are built entirely on the Apache Spark framework.

Exploratory-Analysis.ipynb:

After the basic exploratory checks (missing values, class balance, and so on), the features give the following correlation matrix:

corr-1

The matrix shows that some features are very highly correlated with each other, such as total_day_minutes and total_day_charge (probably because customers are charged per minute of use).

The first step is to remove these redundant features, resulting in the following correlation matrix:

corr-2
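The pruning of highly correlated features can be sketched as below. This is a minimal illustration, not the notebook's actual code: the function names, the 0.95 threshold, and the rule of dropping the first feature of each offending pair are all assumptions. The selection logic is plain Python, and `pairwise_correlations` shows how the input could be obtained from a Spark DataFrame with `df.stat.corr`.

```python
from itertools import combinations


def redundant_features(corr, threshold=0.95):
    """Pick features to drop so that no kept pair is too strongly correlated.

    corr maps a (feature_a, feature_b) tuple to its Pearson correlation.
    For each pair at or above the threshold (hypothetical value), the
    first feature of the pair is dropped. This rule is an assumption.
    """
    to_drop = set()
    for (a, b), value in sorted(corr.items()):
        if a in to_drop or b in to_drop:
            continue  # one side is already removed, the pair is resolved
        if abs(value) >= threshold:
            to_drop.add(a)
    return sorted(to_drop)


def pairwise_correlations(df, columns):
    """Pearson correlation for every pair of numeric columns in a Spark DataFrame."""
    return {(a, b): df.stat.corr(a, b) for a, b in combinations(columns, 2)}


# Usage with a Spark DataFrame `df` (not run here):
#   corr = pairwise_correlations(df, numeric_columns)
#   df = df.drop(*redundant_features(corr))
```

Keeping the decision rule separate from the Spark call makes it easy to check on a small hand-written dictionary before running it on the full dataset.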

Next, two features are analysed in more depth, and a new level feature is derived from each of them based on its clear relationship to the target:

number_customer_service_calls -> number_customer_service_calls_level

total_day_charge -> total_day_charge_level
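One way to derive such a level feature is to bucket the raw value by a list of cut points. The sketch below is an assumption about the approach: the README does not state the cut points or how the levels were built, so `level_of`, `add_level_column`, and the example cut values are hypothetical. The Spark version imports PySpark lazily and applies the same rule with `when`/`otherwise`.

```python
def level_of(value, cut_points):
    """Return how many cut points the value has reached.

    Level 0 means the value is below every cut point; level k means it
    is at or above k of them. Cut points are assumed to be ascending.
    """
    return sum(value >= cut for cut in cut_points)


def add_level_column(df, column, cut_points):
    """Append '<column>_level' to a Spark DataFrame using the same rule."""
    from pyspark.sql import functions as F  # requires PySpark

    level = F.lit(0)
    for cut in cut_points:
        # Each cut point reached adds 1 to the level.
        level = level + F.when(F.col(column) >= cut, 1).otherwise(0)
    return df.withColumn(f"{column}_level", level)


# Usage with a Spark DataFrame `df` (hypothetical cut point, not run here):
#   df = add_level_column(df, "number_customer_service_calls", [4])
```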

Finally, the features least correlated with the target (correlation less than 0.1) are removed.
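A rough sketch of this filter follows. It assumes the comparison is made on the absolute value of the correlation and that the target column is numeric (0/1); neither detail is stated in the README, and the function names are illustrative.

```python
def weak_features(target_corr, min_abs_corr=0.1):
    """Names of features whose correlation with the target is below the cutoff.

    target_corr maps a feature name to its correlation with the target.
    Using the absolute value is an assumption.
    """
    return sorted(
        name for name, value in target_corr.items() if abs(value) < min_abs_corr
    )


def drop_weak_features(df, feature_columns, target="churn", min_abs_corr=0.1):
    """Drop weakly correlated columns from a Spark DataFrame.

    Both the features and the target must be numeric for df.stat.corr.
    """
    target_corr = {c: df.stat.corr(c, target) for c in feature_columns}
    weak = weak_features(target_corr, min_abs_corr)
    return df.drop(*weak) if weak else df
```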

Predictive-Model.ipynb:

The data produced by the previous notebook is loaded and used to train the model and make predictions.

A Gradient Boosting Classifier was used, giving the following result on the test dataset:

Gradient Boosting Classifier accuracy: 0.91
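For orientation, a training and scoring step of this kind could look like the sketch below. It is not the notebook's code: the function names, the default hyperparameters, and the seed are assumptions, and the label column is assumed to be numeric (0/1). The small `accuracy` helper only spells out what the reported metric means, the fraction of test rows whose predicted label matches the true label.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    hits = sum(1 for y, p in zip(labels, predictions) if y == p)
    return hits / len(labels)


def train_and_score(train_df, test_df, feature_columns, label="churn"):
    """Fit a GBT classifier on train_df and return its accuracy on test_df."""
    # Imported lazily; these calls require PySpark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler

    # Spark ML estimators expect a single vector column of features.
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    gbt = GBTClassifier(featuresCol="features", labelCol=label, seed=42)
    model = Pipeline(stages=[assembler, gbt]).fit(train_df)

    evaluator = MulticlassClassificationEvaluator(
        labelCol=label, predictionCol="prediction", metricName="accuracy"
    )
    return evaluator.evaluate(model.transform(test_df))
```

If the classes are imbalanced, accuracy alone can look better than the model really is, so it may be worth also checking a metric such as F1 through the same evaluator (`metricName="f1"`).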
