
Titanic Survivor Prediction (Top 7%)

Introduction

      In this project, I share my work on predicting Titanic survivors from Kaggle. This is a beginner competition from Kaggle: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck (source: https://www.kaggle.com/c/titanic).

Results

      I experimented with several machine learning algorithms, such as Naive Bayes, Logistic Regression, XGBoost, and K-Nearest Neighbors. I used GridSearchCV to find the best parameters and accuracy for each algorithm. The best hyperparameters I got were leaf_size = 1, metric = 'minkowski', n_neighbors = 12, p = 1, weights = 'distance'.



GridSearchCV results

      After that, I applied the model to the test data and submitted the predicted results to Kaggle.



Results

      Based on the results obtained, I got an accuracy of 0.79425 (top 7%) with the KNN algorithm.


Summary

The following is a summary of what was done in this project:

- Exploratory Data Analysis

      First, I do exploratory data analysis. In this section, I look at information from the data (column types, null values, etc.). After that, I split the data into numerical and categorical columns and visualize them to make the data easier to understand.



Pieces of data and information on each column of the training data



Pieces of data and information on each column of the test data
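
      A minimal sketch of this step (assuming the standard Kaggle train.csv / test.csv files):

```python
import pandas as pd

# Load the competition files (standard Kaggle layout is assumed)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Column types, non-null counts, and missing values
train.info()
print(train.isnull().sum())

# Split the columns into numerical and categorical groups
numeric_cols = train.select_dtypes(include='number').columns.tolist()
categorical_cols = train.select_dtypes(include='object').columns.tolist()
print('numeric:', numeric_cols)
print('categorical:', categorical_cols)
```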

      After that, I split the data into numerical and categorical columns. But first, I categorize the 'Cabin' column by its first-letter values. The following are the distribution plots and heatmap of the numerical data.



Distribution plots and heatmap of the numerical data
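
      A sketch of the 'Cabin' first-letter grouping and the numerical visualizations (seaborn/matplotlib are assumptions; the notebook's plots may be produced differently):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Reduce 'Cabin' to its first letter (the deck); missing values stay NaN
for df in (train, test):
    df['Cabin'] = df['Cabin'].str[0]

# Distribution plot for each numerical column of the training data
numeric = train.select_dtypes(include='number')
for col in numeric.columns:
    sns.histplot(numeric[col].dropna(), kde=True)
    plt.title(col)
    plt.show()

# Correlation heatmap of the numerical columns
sns.heatmap(numeric.corr(), annot=True, cmap='coolwarm')
plt.show()
```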

      Then, I create a bar plot of each categorical column. Here are the visualizations.



Bar plots of the categorical data
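
      Continuing from the frames above, the bar plots can be drawn like this (a sketch; the exact plots in the notebook may differ):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Survival counts per category for each categorical feature
for col in ['Sex', 'Cabin', 'Embarked']:
    sns.countplot(data=train, x=col, hue='Survived')
    plt.title(f'Survival by {col}')
    plt.show()
```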

- Preprocessing Data

      Before feeding the data into a prediction model, I first preprocess it to get better data. It is important to preprocess the data before building a prediction model, because no matter how good your model is, if the data is dirty or bad the results will be suboptimal (as the saying goes, "garbage in, garbage out"). First, I handle the missing data in the train and test sets.



Missing data in the train and test sets

      From the picture above, I handle the 'Age', 'Fare', 'Cabin', and 'Embarked' columns in both datasets. In the 'Age' and 'Fare' columns, I impute null values with the median of each column computed on the training data. In the 'Embarked' column, I drop the rows that contain null values, because there are only 2 such rows and they are in the training data (if the null values were in the test data, we would have to impute them with the mode from the training data instead, since the test set must keep all 418 rows; those are the rules). Finally, I drop the 'Cabin' column from both train and test because it contains too many null values (more than half of the data).



Missing data in the train and test sets after preprocessing
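
      A sketch of the imputation and dropping described above (medians are computed on the training data only):

```python
# Impute 'Age' and 'Fare' with medians computed on the training data
for col in ['Age', 'Fare']:
    median = train[col].median()
    train[col] = train[col].fillna(median)
    test[col] = test[col].fillna(median)

# Only 2 training rows have a missing 'Embarked', so drop them
# (test rows can never be dropped: the submission must keep all 418 rows)
train = train.dropna(subset=['Embarked'])

# 'Cabin' is missing for more than half the passengers, so drop it entirely
train = train.drop(columns=['Cabin'])
test = test.drop(columns=['Cabin'])
```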

      After that, I created the 'FamilySurvived' and 'FamilyDied' columns from the last names extracted from the 'Name' column.



Added columns 'FamilySurvived' & 'FamilyDied'
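
      The notebook's exact logic is not shown here, but one way to derive such family features is to group training passengers by last name and count surviving and non-surviving relatives (a sketch; the project's actual computation may differ, e.g. it may exclude the passenger's own outcome):

```python
# Last name is the part of 'Name' before the comma, e.g. "Braund, Mr. Owen Harris"
for df in (train, test):
    df['LastName'] = df['Name'].str.split(',').str[0]

# Count survivors and casualties per family, using training labels only
family_stats = train.groupby('LastName')['Survived'].agg(
    FamilySurvived='sum',
    FamilyDied=lambda s: (s == 0).sum(),
)

# Attach the counts to both datasets; families unseen in training get 0
for df in (train, test):
    df['FamilySurvived'] = df['LastName'].map(family_stats['FamilySurvived']).fillna(0)
    df['FamilyDied'] = df['LastName'].map(family_stats['FamilyDied']).fillna(0)
```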

      Then, I identify and remove outlier values, and apply a log transform to the 'Fare' column to bring its distribution closer to normal.



Distribution plots of 'Age' and 'Fare' after preprocessing
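
      A sketch of these two steps, using the common 1.5 * IQR rule for outliers (an assumption; the notebook's criterion may differ) and numpy's log1p for the skewed 'Fare' column:

```python
import numpy as np

# Remove training rows whose 'Fare' lies outside 1.5 * IQR (an assumed rule)
q1, q3 = train['Fare'].quantile([0.25, 0.75])
iqr = q3 - q1
train = train[train['Fare'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# log1p handles Fare == 0 and pulls the distribution closer to normal
train['Fare'] = np.log1p(train['Fare'])
test['Fare'] = np.log1p(test['Fare'])
```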

      After that, I encode the categorical data into numeric categories with LabelEncoder and OneHotEncoder, and standardize the numerical data so that all numerical columns/features share the same scale. Finally, I analyze and select the columns/features used in the prediction model, then split the data into the predictor/input variables (X) and the target/output variable (y). Here are the features and a piece of the training data used for prediction.



Features and a piece of the training data used for prediction
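
      A sketch of the encoding, scaling, and X/y split (the column names below are illustrative; the actual selected feature list comes from the notebook's analysis):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Binary category -> LabelEncoder
le = LabelEncoder()
train['Sex'] = le.fit_transform(train['Sex'])
test['Sex'] = le.transform(test['Sex'])

# Multi-class category -> one-hot columns
# (pd.get_dummies produces the same columns as OneHotEncoder here)
train = pd.get_dummies(train, columns=['Embarked'])
test = pd.get_dummies(test, columns=['Embarked'])

# Standardize the numerical features so they share the same scale
num_cols = ['Age', 'Fare', 'FamilySurvived', 'FamilyDied']
scaler = StandardScaler()
train[num_cols] = scaler.fit_transform(train[num_cols])
test[num_cols] = scaler.transform(test[num_cols])

# Split into predictors (X) and target (y)
X = train.drop(columns=['Survived', 'PassengerId', 'Name', 'Ticket', 'LastName'])
y = train['Survived']
```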

- Model Prediction and Results

      In this section, I experimented with several machine learning algorithms, such as Naive Bayes, Logistic Regression, XGBoost, and K-Nearest Neighbors. I used GridSearchCV to find the best parameters and accuracy for each algorithm, then applied the best model to the test data and submitted the predicted results to Kaggle. The best score I got is 0.79425 (top 7%) with the K-Nearest Neighbors algorithm (parameters: leaf_size = 1, metric = 'minkowski', n_neighbors = 12, p = 1, weights = 'distance').



Model prediction code
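
      A sketch of a grid search that could produce the KNN hyperparameters quoted above (the search grid is illustrative; only the winning values are known from the results):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative grid around the reported best parameters
param_grid = {
    'leaf_size': [1, 10, 30],
    'metric': ['minkowski'],
    'n_neighbors': range(3, 21),
    'p': [1, 2],
    'weights': ['uniform', 'distance'],
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)  # best run: leaf_size=1, metric='minkowski',
print(grid.best_score_)   #           n_neighbors=12, p=1, weights='distance'
```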



GridSearchCV results

      After that, I applied the model to the test data and submitted the predicted results to Kaggle.
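
      Producing the submission file in the format the competition expects (PassengerId and Survived columns):

```python
import pandas as pd

# Predict with the best estimator found by the grid search
X_test = test[X.columns]   # same feature columns as in training
predictions = grid.best_estimator_.predict(X_test)

# Kaggle expects exactly two columns: PassengerId and Survived
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': predictions,
})
submission.to_csv('submission.csv', index=False)
```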



Results

      Based on the results obtained, I got an accuracy of 0.79425 (top 7%) with the KNN algorithm.


For details, you can check my code: Titanic Survivor Prediction (Top 7%).ipynb
