This case study aims to build, train, and test a machine learning model that predicts insurance costs from customer features such as age, gender, Body Mass Index (BMI), number of children, smoking habits, and geographic location.
- Perform data cleaning, feature engineering and visualization
- Build, train and test an artificial neural network model in Keras and TensorFlow
- Understand the theory and intuition behind artificial neural networks
- Inputs:
  - age: Customer's age
  - sex: Insurance contractor's gender
  - bmi: Body Mass Index (the ideal range is 18.5 to 24.9)
  - children: Number of children covered by health insurance (number of dependents)
  - smoker: Whether the customer smokes
  - region: The beneficiary's residential area in the US (Northeast, Southeast, Southwest, or Northwest)
- Target (output):
  - charges: Individual medical costs billed by health insurance
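As a rough illustration of the feature-engineering step, the sketch below turns one customer record into a numeric feature vector. The case study does not say how the categorical inputs were encoded, so everything here is an assumption made for illustration: the helper name `encode_customer`, the raw values `"female"` and `"yes"`, the binary flags for sex and smoker, and the one-hot region columns with the first region dropped as the baseline.

```python
# Hypothetical helper: this encoding scheme is an assumption,
# not the case study's documented method.
REGIONS = ["northeast", "northwest", "southeast", "southwest"]

def encode_customer(record):
    """Convert a raw customer record (dict) into a list of 8 numeric features."""
    region = record["region"].strip().lower()
    if region not in REGIONS:
        raise ValueError(f"unknown region: {record['region']!r}")
    features = [
        float(record["age"]),
        1.0 if record["sex"].strip().lower() == "female" else 0.0,
        float(record["bmi"]),
        float(record["children"]),
        1.0 if record["smoker"].strip().lower() == "yes" else 0.0,
    ]
    # One-hot encode region, dropping the first category so that
    # the remaining three columns are not redundant.
    features += [1.0 if region == name else 0.0 for name in REGIONS[1:]]
    return features

# Made-up example record (not taken from the data set).
example = {"age": 40, "sex": "male", "bmi": 27.5,
           "children": 2, "smoker": "no", "region": "Southwest"}
encoded = encode_customer(example)
print(encoded)
```

Under these assumptions each customer becomes a vector of eight numbers: five from age, sex, bmi, children, and smoker, plus three region indicators.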
The linear regression model achieved a score (R2) of about 69%.
- RMSE = 6536.847
- MSE = 42730370.0
- MAE = 4555.098
- R2 = 0.6953286415758744
- Adjusted R2 = 0.6859179432461717
- The coefficient of determination (R2) was about 69.5%. This means that roughly 69.5% of the variation in the output (charges) is explained by the input features.
- This is a reasonable score, but there is still room to improve it and bring it closer to 100%.
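For reference, the metrics listed above can be computed from true and predicted values as in the minimal sketch below. It uses only the standard library, and the function names are illustrative rather than the project's own code. The sample size of 268 test rows and 8 input features used in the last line are assumptions, not stated in the text, although they do reproduce the reported adjusted R2 from the reported R2.

```python
import math

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2: penalizes R2 for the number of input features."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MSE, MAE, R2 and adjusted R2 for paired value lists."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    return {
        "RMSE": math.sqrt(mse),
        "MSE": mse,
        "MAE": sum(abs(r) for r in residuals) / n,
        "R2": r2,
        "Adjusted R2": adjusted_r2(r2, n, n_features),
    }

# Toy values for illustration only (not from the case study).
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0], n_features=1)
print(metrics)

# Assumption: 268 test rows and 8 features (neither is stated in the text).
reported_check = adjusted_r2(0.6953286415758744, n_samples=268, n_features=8)
print(reported_check)
```

Because RMSE is the square root of MSE, the two always move together; MAE is less sensitive to a few large errors, which is why it is worth reporting alongside them.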
- ANN_model = keras.Sequential()
- The network has 38,351 trainable parameters to optimize.
- ANN_model.compile(optimizer='Adam', loss='mean_squared_error')
- epochs_hist = ANN_model.fit(X_train, y_train, epochs=100, batch_size=20, validation_split=0.2)
- Accuracy: 0.7409380674362183
- This corresponds to an accuracy score of about 74%.
- The validation error tends to increase slightly, which suggests the model is overfitting the training data; even so, the model still performed quite well.
- In essence, performance on the training data was good, but performance on the test data was not as strong.
- For values beyond 20,000, the model predictions did not closely match the true values (test data set).
- RMSE = 6161.465
- MSE = 37963650.0
- MAE = 3815.8079
- R2 = 0.7293158320104738
- Adjusted R2 = 0.7209549310687123
- The coefficient of determination (R2) is about 72.9%, slightly better than that of the scikit-learn linear regression model (about 69.5%).
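The figure of 38,351 trainable parameters can be sanity-checked by counting weights and biases layer by layer. The layer widths below (8 inputs, hidden layers of 50, 150, 150, and 50 units, and 1 output) are an assumption: the text does not list the architecture, and this is simply one fully connected stack whose count matches the reported total. The function name is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of fully connected (dense) layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # One weight per input-output pair, plus one bias per output unit.
        total += n_in * n_out + n_out
    return total

# Assumed layer widths (not stated in the text):
# 8 inputs -> 50 -> 150 -> 150 -> 50 -> 1 output.
assumed_layers = [8, 50, 150, 150, 50, 1]
total_params = dense_param_count(assumed_layers)
print(total_params)  # 38351
```

With tens of thousands of parameters, the network has far more capacity than the linear model, which helps explain the overfitting noted above.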
ANN_model.add(Dropout(0.5))
- After Dropout layers were added, the accuracy score was 80.4% and there was less overfitting.
- RMSE = 5362.887
- MSE = 28760556.0
- MAE = 3271.4172
- R2 = 0.7949346600648356
- Adjusted R2 = 0.7886005955108537
After dropout was introduced, the coefficient of determination (R2) rose to 79.5%, compared with 72.9% before dropout.
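To give some intuition for why a dropout layer reduces overfitting, the sketch below shows what a dropout rate of 0.5 does to a layer's activations. It is a plain-Python illustration of the general technique (inverted dropout), not the Keras implementation; the function name, the example activations, and the random seed are all made up for illustration.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero each activation with probability `rate`
    during training and scale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At inference time dropout is a no-op.
        return list(activations)
    scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * scale for a in activations]

rng = random.Random(0)          # fixed seed so the run is repeatable
hidden = [0.5, 1.0, 1.5, 2.0]   # made-up activations

train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
infer_out = dropout(hidden, rate=0.5, training=False, rng=rng)
print(train_out)
print(infer_out)
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit, which tends to narrow the gap between training and validation error.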