In this post I will share my work on the Titanic survivor prediction competition from Kaggle. This is a beginner competition: the goal is to use machine learning to build a model that predicts which passengers survived the Titanic shipwreck (source: https://www.kaggle.com/c/titanic).
I experimented with several machine learning algorithms, such as Naive Bayes, Logistic Regression, XGBoost, and K-Nearest Neighbors, and used GridSearchCV to find the best hyperparameters and accuracy for each algorithm. The best hyperparameters I got were leaf_size = 1, metric = 'minkowski', n_neighbors = 12, p = 1, weights = 'distance'.
After that, I applied the model to the test data and submitted the resulting predictions to Kaggle.
Based on the results obtained, I achieved an accuracy of 0.79425 (top 7%) with the KNN algorithm.
The following is a summary of what was done in this project:
First, I performed exploratory data analysis. In this step, I looked at basic information about the data (column types, null values, etc.). After that, I split the columns into numerical and categorical data and visualized them to make the data easier to understand.
A sample of the data and information on each column of the training data
A sample of the data and information on each column of the test data
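A minimal sketch of this first inspection, assuming the standard Kaggle competition file names train.csv and test.csv (adjust the paths to wherever your copies live):

```python
import pandas as pd

# Load the Kaggle data.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# First look: sample rows, column types, and null counts per column.
print(train.head())
train.info()
print(train.isnull().sum())
```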
After that, I split the columns into numerical and categorical data. But first, I grouped the 'Cabin' column by the first letter of its values. The following are the distribution plots and heatmap of the numerical data.
Distribution plots and heatmap of the numerical data
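In code, the cabin grouping and these plots might look roughly like the sketch below; seaborn's histplot and heatmap stand in here for whatever plotting calls the notebook actually uses:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Reduce 'Cabin' to its first letter (the deck), e.g. 'C85' -> 'C',
# so it behaves like a small categorical variable.
train["Cabin"] = train["Cabin"].str[0]

# Distribution plot for each numerical column.
numeric_cols = train.select_dtypes(include="number").columns
for col in numeric_cols:
    sns.histplot(train[col].dropna(), kde=True)
    plt.title(col)
    plt.show()

# Correlation heatmap of the numerical columns.
sns.heatmap(train[numeric_cols].corr(), annot=True, cmap="coolwarm")
plt.show()
```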
Then, I created a bar plot for each categorical column. Here are the visualizations.
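Continuing with the seaborn setup from the previous sketch, the bar plots can be produced like this; I restrict the loop to the low-cardinality columns, since 'Name' and 'Ticket' are nearly unique per passenger and would not plot readably:

```python
# Count plot for each low-cardinality categorical column.
for col in ["Sex", "Cabin", "Embarked"]:
    sns.countplot(x=col, data=train)
    plt.title(col)
    plt.show()
```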
Before feeding the data into a prediction model, we first need to preprocess it to get better data. Preprocessing matters because no matter how good your model is, if the data is dirty or bad the results will be less than optimal ("garbage in, garbage out"). First, I handled the missing data in the train and test sets.
Missing data in the train and test sets
As the picture above shows, I handled the 'Age', 'Fare', 'Cabin', and 'Embarked' columns in both datasets. For 'Age' and 'Fare', I imputed the null values with the median of each column computed on the training data. For 'Embarked', I dropped the rows containing null values, since there are only 2 such rows in the training data; if there were null values in the test data, we couldn't drop those rows and would instead impute them with the mode from the training data (the test set must keep its 418 rows, it's the rules :)). Finally, I dropped the 'Cabin' column from both the train and test data because it contains too many null values (more than half of the data).
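Put together, the missing-value handling might look like this sketch:

```python
# Impute 'Age' and 'Fare' in both datasets with the training-set median.
for col in ["Age", "Fare"]:
    median = train[col].median()
    train[col] = train[col].fillna(median)
    test[col] = test[col].fillna(median)

# Drop the 2 training rows with a null 'Embarked'; the test set must keep
# its 418 rows, so there we fall back to the training-set mode instead.
train = train.dropna(subset=["Embarked"])
test["Embarked"] = test["Embarked"].fillna(train["Embarked"].mode()[0])

# 'Cabin' is more than half null, so drop it from both datasets.
train = train.drop(columns=["Cabin"])
test = test.drop(columns=["Cabin"])
```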
Missing data in the train and test sets after preprocessing
After that, I created the 'FamilySurvived' and 'FamilyDied' columns from the last names extracted from the 'Name' column.
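The post does not show the exact construction of these features, but a plausible sketch is to count, per last name, how many family members in the training data are known to have survived or died:

```python
# Use the last name as a rough family identifier:
# 'Braund, Mr. Owen Harris' -> 'Braund'.
for df in (train, test):
    df["LastName"] = df["Name"].str.split(",").str[0]

# Per family, count known survivors and known victims in the training data.
stats = train.groupby("LastName")["Survived"].agg(["sum", "count"])
survived = stats["sum"]
died = stats["count"] - stats["sum"]

# Map the counts onto both datasets; families unseen in training get 0.
for df in (train, test):
    df["FamilySurvived"] = df["LastName"].map(survived).fillna(0)
    df["FamilyDied"] = df["LastName"].map(died).fillna(0)
```

Note that on the training rows this count includes the passenger's own outcome; a stricter version would exclude the current row to avoid leaking the label into its own features.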
Adds columns ‘FamilySurvived’ & ‘FamilyDied’ .png
Next, I identified and removed outlier values, then applied a log transform to the 'Fare' column to bring its distribution closer to normal.
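The post does not name the outlier rule, but an IQR filter plus np.log1p is one common way to do both steps:

```python
import numpy as np

# Drop training rows whose 'Fare' falls outside 1.5 * IQR of the quartiles
# (one common outlier rule; the notebook may use a different criterion).
q1, q3 = train["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
train = train[train["Fare"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Log transform 'Fare'; log1p tolerates the zero fares that np.log cannot.
train["Fare"] = np.log1p(train["Fare"])
test["Fare"] = np.log1p(test["Fare"])
```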
Distribution plots of 'Age' and 'Fare' after preprocessing
After that, I encoded the categorical data into numerical form with LabelEncoder and OneHotEncoder, and then standardized the numerical data so that every column/feature has the same scale. Finally, I analyzed and selected the columns/features to use in the prediction model, and split the data into the predictor/input variables (X) and the target/output variable (y). Here are the features and a sample of the training data I used for prediction.
Features and a sample of the training data used for prediction
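A sketch of the encoding, scaling, and X/y split; pd.get_dummies stands in for OneHotEncoder here, and the column lists are illustrative (the actually selected features are the ones shown above):

```python
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Label-encode the binary 'Sex' column, one-hot encode 'Embarked'.
le = LabelEncoder()
train["Sex"] = le.fit_transform(train["Sex"])
test["Sex"] = le.transform(test["Sex"])
train = pd.get_dummies(train, columns=["Embarked"])
test = pd.get_dummies(test, columns=["Embarked"])

# Keep the test PassengerId for the submission file, then drop the
# identifier/text columns that are not used as features.
test_ids = test["PassengerId"]
drop_cols = ["PassengerId", "Name", "Ticket", "LastName"]
train = train.drop(columns=drop_cols)
test = test.drop(columns=drop_cols)

# Standardize the numerical features so they share the same scale.
num_cols = ["Age", "Fare", "FamilySurvived", "FamilyDied"]
scaler = StandardScaler()
train[num_cols] = scaler.fit_transform(train[num_cols])
test[num_cols] = scaler.transform(test[num_cols])

# Split the training data into predictors (X) and target (y).
X = train.drop(columns=["Survived"])
y = train["Survived"]
```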
In this section, I experimented with several machine learning algorithms, such as Naive Bayes, Logistic Regression, XGBoost, and K-Nearest Neighbors. I used GridSearchCV to find the best hyperparameters and cross-validation accuracy for each algorithm, then applied the best model to the test data and submitted the predictions to Kaggle. The best score I obtained is 0.79425 (top 7%), with the K-Nearest Neighbors algorithm (parameters: leaf_size = 1, metric = 'minkowski', n_neighbors = 12, p = 1, weights = 'distance').
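A sketch of the KNN grid search and submission step; the grid below is an assumption, chosen only so that it includes the winning combination reported above:

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search a small grid around the reported best KNN hyperparameters.
param_grid = {
    "leaf_size": [1, 10, 30],
    "metric": ["minkowski"],
    "n_neighbors": range(3, 21),
    "p": [1, 2],
    "weights": ["uniform", "distance"],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid,
                    cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Predict on the test data and write the Kaggle submission file.
submission = pd.DataFrame({"PassengerId": test_ids,
                           "Survived": grid.predict(test)})
submission.to_csv("submission.csv", index=False)
```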
For details, you can check my code: Titanic Survivor Prediction (Top 7%).ipynb