This project uses machine learning to predict the survival outcome of individual passengers on the Titanic, based on data from a Kaggle competition.
- End-to-end Python-based predictive modeling
- Logistic Regression, K-nearest Neighbor, Decision Tree, Random Forest, Support Vector Classification (SVC), XGBoost
- Cross-validation
- Grid and random search for model tuning
- A voting ensemble for creating the final prediction (inspired by Ken Jee!)
- The model reached 85% accuracy on the training data and 77% accuracy on the test data.
- All preprocessing and model parameters were fitted using only the training data (no leakage from the test set), which is critical for industry applications.
Files | Notes |
---|---|
/module/helpers.py | Tools built to facilitate EDA and preprocessing |
titanic_EDA.ipynb | Exploratory data analysis (EDA) |
titanic_preprocessing_feature_Engineering.ipynb | Data preprocessing and feature engineering |
titanic_model.ipynb | Machine learning model building and tuning |
The Titanic still occupies our minds more than 100 years after the disaster. Of the more than 2,000 people aboard, about 1,500 lost their lives. It is worth investigating whether survival was related to factors such as:
- sex
- fare
- age
- cabin
- class
- ticket
- number of siblings and spouses aboard (SibSp)
- number of parents and children aboard (Parch)
- embarked location
In this project, the training data contains information on 891 passengers and the test data on 418; the task is to predict whether each passenger in the test set survived.
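A minimal loading sketch, assuming the standard Kaggle file names `train.csv` and `test.csv`:

```python
import pandas as pd

# Load the Kaggle Titanic data (file names assumed: train.csv / test.csv)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print(train.shape)  # (891, 12): includes the Survived label
print(test.shape)   # (418, 11): Survived must be predicted
```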
Observations:
- Age is roughly normally distributed; the other numeric variables are skewed and need normalization
- Parch (number of parents/children aboard) is positively correlated with SibSp (number of siblings/spouses aboard)
- Age is negatively correlated with SibSp (checked numerically in the sketch below)
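These correlations can be checked numerically; a quick sketch, reusing the `train` frame from the loading sketch above:

```python
# Pairwise correlations among the numeric variables
num_cols = ['Age', 'SibSp', 'Parch', 'Fare']
print(train[num_cols].corr())
# Parch vs. SibSp comes out positive, Age vs. SibSp negative
```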
Observations:
- more people died than survived
- most passengers were in 3rd class (Pclass 3)
- there were more males than females
- more people embarked from S (Southampton) than from C (Cherbourg) or Q (Queenstown)
2.1.4 Compare the survival rate across the numeric variables (Age, SibSp, Parch, Fare) and the categorical variables (Sex, Pclass, Embarked)
Observations:
- higher Fare is associated with a higher survival rate
- higher Parch is associated with a higher survival rate
- lower SibSp and lower Age are associated with higher survival rates
- Survival rate by Sex: female > male
- Survival rate by Pclass: 1 > 2 > 3
- Survival rate by Embarked: C > Q > S (see the groupby sketch below)
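One way to produce these comparisons with pandas (column names follow the original Kaggle data; the notebook's actual plots are not reproduced here):

```python
# Survival rate per level of each categorical variable
for col in ['Sex', 'Pclass', 'Embarked']:
    print(train.groupby(col)['Survived'].mean(), '\n')

# Mean of each numeric variable among survivors vs. non-survivors
print(train.groupby('Survived')[['Age', 'SibSp', 'Parch', 'Fare']].mean())
```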
Simplify Cabin by counting the number of cabins; NaN is counted as 0 cabins.
Observation:
- passengers with 1, 2, or 4 cabins were more likely to survive than not
Simplify Cabin by the first letter of the cabin
Observation: - more people in the following categories survived: B, D, E, F (both Cabin features are sketched below)
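A sketch of both Cabin simplifications; the actual helper in `/module/helpers.py` may differ, so treat this as one plausible implementation:

```python
# cabin_total: number of cabins listed (e.g. 'C23 C25 C27' -> 3); NaN -> 0
train['cabin_total'] = train['Cabin'].apply(
    lambda x: 0 if pd.isna(x) else len(str(x).split()))

# cabin_firstletter: deck letter of the first cabin; 'n' marks a missing cabin
train['cabin_firstletter'] = train['Cabin'].apply(
    lambda x: 'n' if pd.isna(x) else str(x)[0])
```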
Simplify Tickets by the first letter of the ticket
Observation: - more survival with the following ticket_firstletter: F, P
- very little survival with the following ticket_firstletter: A, W
- moderate survival rate with the following ticket_firstletter: C, None (a sketch of the Ticket feature follows)
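A plausible sketch of the Ticket simplification (again, the actual helper implementation may differ):

```python
# ticket_firstletter: leading letter of the ticket string, or 'None' for
# purely numeric tickets (no letter prefix)
train['ticket_firstletter'] = train['Ticket'].apply(
    lambda x: str(x)[0] if str(x)[0].isalpha() else 'None')
```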
Simplify Name by extracting the title
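The title sits between the comma and the period in names such as 'Braund, Mr. Owen Harris'. A minimal extraction sketch; the regex and any further grouping into `name_title_adv` are assumptions:

```python
# Pull the title ('Mr', 'Mrs', 'Miss', 'Master', ...) out of the Name column
train['name_title_adv'] = (
    train['Name'].str.extract(r',\s*([^.]*)\.', expand=False).str.strip())
```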
Observations:
- many males aged 20-40 did not survive
- females had a high survival rate across all ages
Observation:
- males aged 20-40 in Pclass 2 and 3 mostly did not survive
Observations:
- Pclass 3 had a much lower survival rate than Pclass 1 and 2 across Sex and Embarked
- males who embarked from Q had a notably lower survival rate than those who embarked from S or C
Observation:
- higher Fare corresponded to a higher survival rate across most of the age spectrum
- younger passengers (age 0-10) had a higher survival rate
- older passengers (60+) had a lower survival rate
Observation:
- most passengers fall into category n, which stands for no recorded cabin
- within the n category, the survival rate is lower than in the other categories
Observations:
- most passengers fall into the None category, i.e. tickets with no letter prefix (purely numeric tickets)
- survival rates are lower in the following categories: None, A, S, C, W
Observations:
- most people fall into the Mr category, which has a low survival rate
- the Mrs, Miss, and Master categories have higher survival rates
- based on the EDA, the following variables should be included as features:
- Pclass, name_title_adv, Sex, Age, SibSp, Parch, Fare, Embarked, cabin_total, cabin_firstletter, ticket_firstletter
Organized and prepared a helper module for feature engineering (following the EDA) so it can be readily applied to both the training and test sets.
- Convert Pclass to categorical
- Fill in the empty cells of 'Embarked'
- Normalize then fill in the empty cells for 'Fare'
- Simplify Name by creating 'name_title_adv'
- Simplify Cabin by creating 'cabin_firstletter' and 'cabin_total'
- Simplify Ticket by creating 'ticket_firstletter'
- Replace values of 'name_title_adv' in the test set that are absent from the training set with the training-set mode
- Fill the empty cells of 'Age' with values aggregated by 'name_title_adv'
- Remove extra columns
- Merge training and test together to create a consistent dummy-variable set across both, then separate the datasets again
- Scale the numeric columns for both datasets, fitting the scaler on the training data only (see the sketch after this list)
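A sketch of the last two steps, with `train_features`/`test_features` standing in for the engineered frames (these names are illustrative, not the notebook's actual variables):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# One pd.get_dummies call over the concatenated data guarantees both sets
# end up with identical dummy columns, then split back by row count
all_data = pd.concat([train_features, test_features], axis=0)
all_dummies = pd.get_dummies(all_data)
X_train = all_dummies.iloc[:len(train_features)].copy()
X_test = all_dummies.iloc[len(train_features):].copy()

# Fit the scaler on the training rows only (no test-set leakage),
# then apply the same transform to both sets
num_cols = ['Age', 'SibSp', 'Parch', 'Fare', 'cabin_total']
scaler = StandardScaler().fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```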
- Built with sklearn (plus the xgboost package for XGBoost)
- Tested multiple ML models: Naive Bayes, Logistic Regression, Decision Tree, K Nearest Neighbors, Random Forest, SVC, XGBoost
- Used sklearn.ensemble to create a voting system
- Used the average accuracy from 5-fold cross-validation
- Tuned each model with either grid search or random search to improve accuracy (as sketched below)
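A sketch of the cross-validation and tuning loop for one model; the parameter grid below is hypothetical, not the grid actually used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Baseline: average accuracy over 5-fold cross-validation
rf = RandomForestClassifier(random_state=42)
print(cross_val_score(rf, X_train, y_train, cv=5).mean())

# Tuning: exhaustive grid search over a (hypothetical) parameter grid
param_grid = {'n_estimators': [100, 300, 500],
              'max_depth': [None, 5, 10],
              'min_samples_split': [2, 5, 10]}
grid = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_score_, grid.best_params_)
```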
Accuracy improvement after tuning:
Algorithm | Accuracy with Default Parameters | Accuracy after Tuning |
---|---|---|
Naive Bayes | 0.4668 | N/A |
Logistic Regression | 0.8193 | 0.8215 |
Decision Tree | 0.7924 | N/A |
K-Nearest Neighbors | 0.8149 | 0.8249 |
Random Forest | 0.8100 | 0.8372 |
SVC | 0.8306 | 0.8350 |
XGBoost | 0.8305 | 0.8451 |
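The final prediction came from a voting ensemble over the tuned models. A minimal `VotingClassifier` sketch (the estimator choices and parameters here are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Soft voting averages the predicted probabilities of the member models;
# SVC needs probability=True for that to work
voting = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('knn', KNeighborsClassifier()),
                ('rf', RandomForestClassifier(random_state=42)),
                ('svc', SVC(probability=True)),
                ('xgb', XGBClassifier(eval_metric='logloss'))],
    voting='soft')
voting.fit(X_train, y_train)
predictions = voting.predict(X_test)
```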
- Sex being male is the most important feature
- the next most important feature is whether the passenger holds the title Master
- followed by whether the passenger is in Pclass 3 (a sketch for inspecting importances follows)
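One way to inspect these importances, assuming `grid` is the fitted random-forest search from the tuning sketch above:

```python
import pandas as pd

best_rf = grid.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
# After one-hot encoding, Sex_male, name_title_adv_Master and Pclass_3
# appear as individual dummy columns
```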
Takeaway: the final model accuracy was 85% on the training data and 77% on the test data. Further feature engineering could be investigated to increase accuracy.