In this post I will share my work on the Titanic survivor prediction competition from Kaggle. This is a beginner competition: the goal is to use machine learning to build a model that predicts which passengers survived the Titanic shipwreck (source: https://www.kaggle.com/c/titanic).
I experimented with several machine learning algorithms, such as Naive Bayes, Logistic Regression, XGBoost, and K-Nearest Neighbors, and used GridSearchCV to find the best hyperparameters and accuracy for each algorithm. The best hyperparameters I got were leaf_size = 1, metric = 'minkowski', n_neighbors = 12, p = 1, weights = 'distance'.
After that, I applied the model to the test data and submitted the resulting predictions to Kaggle.
Based on the results obtained, I achieved an accuracy of 0.79425 (top 7%) with the KNN algorithm.
The following is a summary of what was done in this project:
First, I performed exploratory data analysis. In this step, I looked at basic information about the data (column types, null values, etc.). After that, I split the columns into numerical and categorical data and visualized them to make the data easier to understand.
A sample of the data and information on each column of the training data
A sample of the data and information on each column of the test data
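A minimal sketch of this first inspection, assuming the standard Kaggle competition file names train.csv and test.csv (adjust the paths to wherever your copies live):

```python
import pandas as pd

# Load the Kaggle data.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# First look: sample rows, column types, and null counts per column.
print(train.head())
train.info()
print(train.isnull().sum())
```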
After that, I split the columns into numerical and categorical data. But first, I grouped the 'Cabin' column by the first letter of its values. The following are the distribution plots and heatmap of the numerical data.
Distribution plots and heatmap of the numerical data
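In code, the cabin grouping and these plots might look roughly like the sketch below; seaborn's histplot and heatmap stand in here for whatever plotting calls the notebook actually uses:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Reduce 'Cabin' to its first letter (the deck), e.g. 'C85' -> 'C',
# so it behaves like a small categorical variable.
train["Cabin"] = train["Cabin"].str[0]

# Distribution plot for each numerical column.
numeric_cols = train.select_dtypes(include="number").columns
for col in numeric_cols:
    sns.histplot(train[col].dropna(), kde=True)
    plt.title(col)
    plt.show()

# Correlation heatmap of the numerical columns.
sns.heatmap(train[numeric_cols].corr(), annot=True, cmap="coolwarm")
plt.show()
```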
Then, I created a bar plot for each categorical column. Here are the visualizations.
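Continuing with the seaborn setup from the previous sketch, the bar plots can be produced like this; I restrict the loop to the low-cardinality columns, since 'Name' and 'Ticket' are nearly unique per passenger and would not plot readably:

```python
# Count plot for each low-cardinality categorical column.
for col in ["Sex", "Cabin", "Embarked"]:
    sns.countplot(x=col, data=train)
    plt.title(col)
    plt.show()
```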
Before feeding the data into a prediction model, we first need to preprocess it to get better data. Preprocessing matters because no matter how good your model is, if the data is dirty or bad the results will be less than optimal ("garbage in, garbage out"). First, I handled the missing data in the train and test sets.
Missing data in the train and test sets
As the picture above shows, I handled the 'Age', 'Fare', 'Cabin', and 'Embarked' columns in both datasets. For 'Age' and 'Fare', I imputed the null values with the median of each column computed on the training data. For 'Embarked', I dropped the rows containing null values, since there are only 2 such rows in the training data; if there were null values in the test data, we couldn't drop those rows and would instead impute them with the mode from the training data (the test set must keep its 418 rows, it's the rules :)). Finally, I dropped the 'Cabin' column from both the train and test data because it contains too many null values (more than half of the data).
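Put together, the missing-value handling might look like this sketch:

```python
# Impute 'Age' and 'Fare' in both datasets with the training-set median.
for col in ["Age", "Fare"]:
    median = train[col].median()
    train[col] = train[col].fillna(median)
    test[col] = test[col].fillna(median)

# Drop the 2 training rows with a null 'Embarked'; the test set must keep
# its 418 rows, so there we fall back to the training-set mode instead.
train = train.dropna(subset=["Embarked"])
test["Embarked"] = test["Embarked"].fillna(train["Embarked"].mode()[0])

# 'Cabin' is more than half null, so drop it from both datasets.
train = train.drop(columns=["Cabin"])
test = test.drop(columns=["Cabin"])
```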
Missing data in the train and test sets after preprocessing
After that, I created the 'FamilySurvived' and 'FamilyDied' columns from the last names extracted from the 'Name' column.
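The post does not show the exact construction of these features, but a plausible sketch is to count, per last name, how many family members in the training data are known to have survived or died:

```python
# Use the last name as a rough family identifier:
# 'Braund, Mr. Owen Harris' -> 'Braund'.
for df in (train, test):
    df["LastName"] = df["Name"].str.split(",").str[0]

# Per family, count known survivors and known victims in the training data.
stats = train.groupby("LastName")["Survived"].agg(["sum", "count"])
survived = stats["sum"]
died = stats["count"] - stats["sum"]

# Map the counts onto both datasets; families unseen in training get 0.
for df in (train, test):
    df["FamilySurvived"] = df["LastName"].map(survived).fillna(0)
    df["FamilyDied"] = df["LastName"].map(died).fillna(0)
```

Note that on the training rows this count includes the passenger's own outcome; a stricter version would exclude the current row to avoid leaking the label into its own features.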
Adds columns ‘FamilySurvived’ & ‘FamilyDied’ .png
Next, I identified and removed outlier values, then applied a log transform to the 'Fare' column to bring its distribution closer to normal.
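The post does not name the outlier rule, but an IQR filter plus np.log1p is one common way to do both steps:

```python
import numpy as np

# Drop training rows whose 'Fare' falls outside 1.5 * IQR of the quartiles
# (one common outlier rule; the notebook may use a different criterion).
q1, q3 = train["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
train = train[train["Fare"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Log transform 'Fare'; log1p tolerates the zero fares that np.log cannot.
train["Fare"] = np.log1p(train["Fare"])
test["Fare"] = np.log1p(test["Fare"])
```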
Distribution plots of 'Age' and 'Fare' after preprocessing
After that, I encoded the categorical data into numerical form with LabelEncoder and OneHotEncoder, and then standardized the numerical data so that every column/feature has the same scale. Finally, I analyzed and selected the columns/features to use in the prediction model, and split the data into the predictor/input variables (X) and the target/output variable (y). Here are the features and a sample of the training data I used for prediction.
Features and a sample of the training data used for prediction
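A sketch of the encoding, scaling, and X/y split; pd.get_dummies stands in for OneHotEncoder here, and the column lists are illustrative (the actually selected features are the ones shown above):

```python
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Label-encode the binary 'Sex' column, one-hot encode 'Embarked'.
le = LabelEncoder()
train["Sex"] = le.fit_transform(train["Sex"])
test["Sex"] = le.transform(test["Sex"])
train = pd.get_dummies(train, columns=["Embarked"])
test = pd.get_dummies(test, columns=["Embarked"])

# Keep the test PassengerId for the submission file, then drop the
# identifier/text columns that are not used as features.
test_ids = test["PassengerId"]
drop_cols = ["PassengerId", "Name", "Ticket", "LastName"]
train = train.drop(columns=drop_cols)
test = test.drop(columns=drop_cols)

# Standardize the numerical features so they share the same scale.
num_cols = ["Age", "Fare", "FamilySurvived", "FamilyDied"]
scaler = StandardScaler()
train[num_cols] = scaler.fit_transform(train[num_cols])
test[num_cols] = scaler.transform(test[num_cols])

# Split the training data into predictors (X) and target (y).
X = train.drop(columns=["Survived"])
y = train["Survived"]
```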
In this section, I experimented with several machine learning algorithms, such as Naive Bayes, Logistic Regression, XGBoost, and K-Nearest Neighbors. I used GridSearchCV to find the best hyperparameters and cross-validation accuracy for each algorithm, then applied the best model to the test data and submitted the predictions to Kaggle. The best score I obtained is 0.79425 (top 7%), with the K-Nearest Neighbors algorithm (parameters: leaf_size = 1, metric = 'minkowski', n_neighbors = 12, p = 1, weights = 'distance').
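A sketch of the KNN grid search and submission step; the grid below is an assumption, chosen only so that it includes the winning combination reported above:

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search a small grid around the reported best KNN hyperparameters.
param_grid = {
    "leaf_size": [1, 10, 30],
    "metric": ["minkowski"],
    "n_neighbors": range(3, 21),
    "p": [1, 2],
    "weights": ["uniform", "distance"],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid,
                    cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Predict on the test data and write the Kaggle submission file.
submission = pd.DataFrame({"PassengerId": test_ids,
                           "Survived": grid.predict(test)})
submission.to_csv("submission.csv", index=False)
```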
For details, you can check my code: Titanic Survivor Prediction (Top 7%).ipynb