The analysis of air traffic passenger data provides valuable insights into trends, behaviors, and patterns in aviation, which can help airlines optimize operations, improve passenger experiences, and predict future demand. This project aims to develop a data-driven approach for predicting passenger counts and activity types using machine learning algorithms.
This project leverages R for data preprocessing, exploratory data analysis (EDA), and predictive modeling using machine learning. It involves the following steps:
- Data Preprocessing: The dataset is cleaned, missing values are handled, and categorical variables are encoded.
- Exploratory Data Analysis (EDA): Visualizations and statistical analysis are performed to understand the data and detect trends.
- Predictive Modeling: A Naïve Bayes model is trained on the data to predict passenger activity types, such as "Enplaned", "Deplaned", or "Transit".
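
The preprocessing step above can be sketched as follows. This is a minimal illustration in base R, not the project's actual script: the toy data, the column names (`activity_period`, `geo_region`, `activity_type`, `passenger_count`), and the choice to drop incomplete rows rather than impute them are all assumptions.

```r
# Toy stand-in for the raw data; all column names and values are hypothetical.
raw <- data.frame(
  activity_period = c(200507, 200507, 200508, 200508, 200508, 200507),
  geo_region      = c("Asia", "Europe", "Asia", NA, "Asia", "Asia"),
  activity_type   = c("Enplaned", "Deplaned", "Transit", "Enplaned", "Enplaned", "Enplaned"),
  passenger_count = c(120, 95, NA, 40, 120, 120),
  stringsAsFactors = FALSE
)

# Cleaning: drop exact duplicate rows (the last row repeats the first).
clean <- unique(raw)

# Missing values: keep only complete rows (assumed strategy; imputation is another option).
clean <- clean[complete.cases(clean), ]

# Encoding: store the categorical columns as factors.
clean$geo_region    <- factor(clean$geo_region)
clean$activity_type <- factor(clean$activity_type,
                              levels = c("Deplaned", "Enplaned", "Transit"))

str(clean)
```

Fixing the factor levels explicitly keeps all three activity types available to later steps even when one of them is absent from a subset of the data.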
The purpose of the project is to provide insights into air traffic data and create a model that can predict the type of activity for a given passenger, based on various features. The results can help airlines, airport authorities, and transportation planners optimize operations and improve efficiency.
- Data cleaning and preprocessing techniques.
- Visualizations for understanding passenger counts across various regions.
- Machine learning model built using Naïve Bayes to predict passenger activity types.
- Correlation analysis and insights to understand the relationships between various regions and activity types.
- Exploratory data analysis using plots like bar charts, boxplots, and correlation matrices.
- R - Programming language for statistical computing and graphics.
- dplyr - Data manipulation package.
- ggplot2 - Visualization library for creating static plots.
- caret - Package for training and evaluating machine learning models.
- e1071 - Library for Naïve Bayes implementation.
- Hmisc & corrplot - Used for correlation and visualization.
- Original Dataset:
- Dataframe That Has Undergone Preprocessing:
Figure 2: Dataframe that has undergone preprocessing
Figure 1 shows the data read from the CSV file and stored in the dataframe df, which contains 15007 entries and 16 columns. Figure 2 shows df after preprocessing; it now has 367 entries and 8 columns.
- Locations and Total Numbers of Missing Values:
Figure 3: Locations and total numbers of missing values
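
As a rough sketch of how the locations and totals of missing values can be obtained in base R, the snippet below uses a small made-up dataframe; the column names are hypothetical and do not come from the project's data.

```r
# Toy dataframe with a few missing cells; names and values are hypothetical.
df <- data.frame(
  geo_region      = c("Asia", NA, "Europe", "Asia"),
  passenger_count = c(120, 95, NA, NA),
  stringsAsFactors = FALSE
)

# Total number of missing values in each column.
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Row and column position of every missing cell.
na_locations <- which(is.na(df), arr.ind = TRUE)
print(na_locations)

# Overall number of missing cells.
total_na <- sum(is.na(df))
print(total_na)
```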
- Structure of the Dataframe Before Preprocessing:
- Structure of the Dataframe After Preprocessing:
- First Few Rows of the Dataframe for df3, Training_Set and Test_Set:
Figure 6: First few rows of the dataframe for df3, training_set and test_set
- Summary of the Dataframe for df3, Training_Set and Test_Set:
Figure 7: Summary of the dataframe for df3, training_set and test_set
- Training Set for Air_Traffic_Passenger_Data After Preprocessing:
Figure 8: Training set for air_traffic_passenger_data after preprocessing
- Test Set for Air_Traffic_Passenger_Data After Preprocessing:
Figure 9: Test set for air_traffic_passenger_data after preprocessing
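
One simple way to derive `training_set` and `test_set` from a preprocessed dataframe is a random split, sketched below. The 80/20 ratio, the seed, and the toy contents of `df3` are assumptions for illustration only; the source does not state how the split was made.

```r
set.seed(123)  # arbitrary seed, used only to make this sketch reproducible

# Toy stand-in for the preprocessed dataframe; columns and values are hypothetical.
df3 <- data.frame(
  passenger_count = c(120, 95, 40, 310, 75, 220, 60, 150, 90, 180),
  activity_type   = factor(rep(c("Deplaned", "Enplaned"), times = 5))
)

# Randomly pick 80% of the row indices for training (assumed ratio).
train_idx    <- sample(seq_len(nrow(df3)), size = floor(0.8 * nrow(df3)))
training_set <- df3[train_idx, ]
test_set     <- df3[-train_idx, ]

# Inspect the results, as in the figures above.
head(training_set)
summary(test_set)
```

A stratified split, for example with `createDataPartition()` from caret, is a common alternative when the classes are unbalanced.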
- Barplot for the Passenger Count of All Activities for Asia:
Figure 10: Barplot for the passenger count of all activities for Asia
- Boxplot for the Passenger Count by Activity Period for Deplaned, Enplaned and Transit:
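
Plots of this kind can be built with ggplot2 roughly as shown below. The data, column names, and plot labels are made up for illustration, and the boxplot groups by activity type, which is only one reading of the plot described above.

```r
library(ggplot2)

# Toy stand-in data; column names and values are hypothetical.
traffic <- data.frame(
  geo_region      = c("Asia", "Asia", "Asia", "Asia", "Europe", "Europe"),
  activity_type   = c("Deplaned", "Enplaned", "Transit", "Enplaned", "Deplaned", "Enplaned"),
  passenger_count = c(120, 95, 10, 60, 80, 70)
)

# Total passengers per activity type, restricted to one region.
asia        <- subset(traffic, geo_region == "Asia")
asia_totals <- aggregate(passenger_count ~ activity_type, data = asia, FUN = sum)

# Bar chart of the regional totals.
bar_plot <- ggplot(asia_totals, aes(x = activity_type, y = passenger_count)) +
  geom_col() +
  labs(title = "Passenger count by activity type (Asia)",
       x = "Activity type", y = "Passenger count")

# Boxplot of the passenger count distribution for each activity type.
box_plot <- ggplot(traffic, aes(x = activity_type, y = passenger_count)) +
  geom_boxplot() +
  labs(title = "Passenger count by activity type",
       x = "Activity type", y = "Passenger count")
```

Calling `print(bar_plot)` or `print(box_plot)` renders the corresponding figure on the active graphics device.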
- Correlation Value and the p-value of All Activity Types, Deplaned, Enplaned and Transit:
- Correlation Plot for the Passenger Count of All Activity Types:
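
The correlation values and p-values can be illustrated with base R alone, as in the sketch below. It assumes a wide table with one column of passenger totals per activity type; the numbers are invented, and Pearson correlation is an assumed choice.

```r
# Toy totals per period, one column per activity type (hypothetical values).
monthly <- data.frame(
  deplaned = c(100, 120, 90, 150, 130, 170),
  enplaned = c(95, 125, 85, 155, 128, 165),
  transit  = c(12, 9, 14, 8, 11, 7)
)

# Pairwise Pearson correlation coefficients.
cor_matrix <- cor(monthly, method = "pearson")
print(round(cor_matrix, 3))

# Matching matrix of p-values, one cor.test() per pair of columns.
vars     <- names(monthly)
p_matrix <- matrix(NA_real_, length(vars), length(vars),
                   dimnames = list(vars, vars))
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i != j) {
      p_matrix[i, j] <- cor.test(monthly[[i]], monthly[[j]])$p.value
    }
  }
}
print(signif(p_matrix, 3))
```

With the packages listed earlier, `Hmisc::rcorr(as.matrix(monthly))` returns the coefficients and p-values in one call, and `corrplot::corrplot(cor_matrix)` draws the correlation plot.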
Naïve Bayes Classification Result:
Figure 22: Naïve Bayes classification result
To predict whether the activity is enplaned, deplaned, or thru-transit, we use a Naïve Bayes classifier because it is simple and runs efficiently without requiring prior knowledge of the data. Its performance can be evaluated using accuracy and a confusion matrix. As the figure above shows, the model achieved 65.38% accuracy with a p-value of 0.000007354. We conclude that our Naïve Bayes classifier still needs to be improved.
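
The training and evaluation workflow described above can be sketched with `naiveBayes()` from e1071. Everything in the snippet is illustrative: the simulated data, the feature names, and the 80/20 split are assumptions, and the output is not expected to match the reported 65.38%. The p-value here compares accuracy against the no-information rate (the share of the most frequent class) with a one-sided binomial test, which is what caret's `confusionMatrix()` reports as "P-Value [Acc > NIR]"; whether the reported p-value is of that kind is also an assumption.

```r
library(e1071)
set.seed(42)  # arbitrary seed for a reproducible sketch

# Simulated stand-in data; feature names and distributions are hypothetical.
n <- 60
traffic <- data.frame(
  activity_type   = factor(rep(c("Deplaned", "Enplaned", "Transit"), each = n)),
  passenger_count = c(rnorm(n, 900, 150), rnorm(n, 1000, 150), rnorm(n, 300, 60)),
  geo_region      = factor(sample(c("Asia", "Europe", "US"), 3 * n, replace = TRUE))
)

# Assumed 80/20 random split into training and test sets.
train_idx    <- sample(seq_len(nrow(traffic)), size = floor(0.8 * nrow(traffic)))
training_set <- traffic[train_idx, ]
test_set     <- traffic[-train_idx, ]

# Fit the classifier and predict the activity type of the held-out rows.
model     <- naiveBayes(activity_type ~ ., data = training_set)
predicted <- predict(model, newdata = test_set)

# Confusion matrix (rows: predicted, columns: actual) and accuracy.
conf_mat <- table(Predicted = predicted, Actual = test_set$activity_type)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# One-sided binomial test of accuracy against the no-information rate.
nir     <- max(colSums(conf_mat)) / sum(conf_mat)
p_value <- binom.test(sum(diag(conf_mat)), sum(conf_mat),
                      p = nir, alternative = "greater")$p.value

print(conf_mat)
cat("Accuracy:", round(accuracy, 4), " p-value:", signif(p_value, 4), "\n")
```

The off-diagonal cells of the confusion matrix show which activity types are confused with each other, which is a useful starting point when looking for ways to improve the model.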