Air traffic passenger analysis using Naïve Bayes Classification

derekgan08/air-traffic-passenger-analysis

Principles of Data Analytics Project: Air Traffic Passenger Data Analysis

Problem Statement

The analysis of air traffic passenger data provides valuable insights into trends, behaviors, and patterns in aviation, which can help airlines optimize operations, improve passenger experiences, and predict future demand. This project aims to develop a data-driven approach for predicting passenger counts and activity types using machine learning algorithms.

Project Overview

This project leverages R for data preprocessing, exploratory data analysis (EDA), and predictive modeling using machine learning. It involves the following steps:

  • Data Preprocessing: The dataset is cleaned, missing values are handled, and categorical variables are encoded.
  • Exploratory Data Analysis (EDA): Visualizations and statistical analysis are performed to understand the data and detect trends.
  • Predictive Modeling: A Naïve Bayes model is trained on the data to predict passenger activity types, such as "Enplaned", "Deplaned", or "Transit".

The purpose of the project is to provide insights into air traffic data and create a model that can predict the type of activity for a given passenger, based on various features. The results can help airlines, airport authorities, and transportation planners optimize operations and improve efficiency.
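
The preprocessing step listed above could look roughly like the following. This is only a minimal sketch on toy data, not the project's actual script: the data frame `raw`, its column names, and the choice to drop rows with missing values are all assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for the raw data; every column name and value is hypothetical.
raw <- data.frame(
  activity_period = c(200507, 200507, 200508, 200508, 200509),
  geo_region      = c("Asia", "Europe", "Asia", NA, "Asia"),
  activity_type   = c("Enplaned", "Deplaned", "Transit", "Enplaned", "Deplaned"),
  terminal        = c("T1", "T1", "T2", "T2", "T1"),
  passenger_count = c(100, 250, NA, 80, 130),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  # Cleaning: drop a column that is assumed not to be needed for modeling.
  select(-terminal) %>%
  # Missing values: here rows containing NA are simply removed.
  filter(!is.na(geo_region), !is.na(passenger_count)) %>%
  # Encoding: store categorical variables as factors.
  mutate(
    geo_region    = factor(geo_region),
    activity_type = factor(activity_type)
  )

str(clean)
```

Dropping incomplete rows is the simplest option; imputing values is an alternative when too many rows would otherwise be lost.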

Key Features

  • Data cleaning and preprocessing techniques.
  • Visualizations for understanding passenger counts across various regions.
  • Machine learning model built using Naïve Bayes to predict passenger activity types.
  • Correlation analysis and insights to understand the relationships between various regions and activity types.
  • Exploratory data analysis using plots like bar charts, boxplots, and correlation matrices.

Technologies Used

  • R - Programming language for statistical computing and graphics.
  • dplyr - Data manipulation package.
  • ggplot2 - Visualization library for creating static plots.
  • caret - Package for training and evaluating machine learning models.
  • e1071 - Library for Naïve Bayes implementation.
  • Hmisc & corrplot - Used for correlation and visualization.

Data Preparation

Data Preparation and Preprocessing

  1. Original Dataset:

    Figure 1: Original dataset

  2. Dataframe After Preprocessing:

    Figure 2: Dataframe after preprocessing

    Figure 1 shows the data read from the CSV file and stored in the dataframe df. It contains 15007 entries with 16 columns. Figure 2 shows df after preprocessing; it now has 367 entries with 8 columns.

  3. Locations and Total Numbers of Missing Values:

    Figure 3: Locations and total numbers of missing values

  4. Structure of the Dataframe Before Preprocessing:

    Figure 4: Structure of the dataframe before preprocessing

  5. Structure of the Dataframe After Preprocessing:

    Figure 5: Structure of the dataframe after preprocessing

  6. First Few Rows of the Dataframe for df3, Training_Set and Test_Set:

    Figure 6: First few rows of the dataframe for df3, training_set and test_set

  7. Summary of the Dataframe for df3, Training_Set and Test_Set:

    Figure 7: Summary of the dataframe for df3, training_set and test_set

  8. Training Set for Air_Traffic_Passenger_Data After Preprocessing:

    Figure 8: Training set for air_traffic_passenger_data after preprocessing

    Figure 8 shows the training set. It has 309 entries with 8 columns.

  9. Test Set for Air_Traffic_Passenger_Data After Preprocessing:

    Figure 9: Test set for air_traffic_passenger_data after preprocessing

    Figure 9 shows the test set. It has 78 entries with 8 columns. The preprocessed dataframe is split into training and test sets at a ratio of 0.8 to 0.2.
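
Two of the steps above, locating missing values and splitting the data 0.8 to 0.2, can be sketched in base R as follows. The toy data frames and their column names are hypothetical, and the random-index split shown here is an assumption; the project may have used a different splitting function.

```r
# Toy data frame with a few missing values; column names are hypothetical.
df <- data.frame(
  geo_region      = c("Asia", NA, "Europe", "Asia"),
  passenger_count = c(120, NA, 340, 95),
  stringsAsFactors = FALSE
)

# Total number of missing values in each column.
na_per_column <- colSums(is.na(df))

# Row and column index of every missing value.
na_locations <- which(is.na(df), arr.ind = TRUE)

print(na_per_column)
print(na_locations)

# Toy stand-in for the preprocessed dataframe df3 (values are simulated).
set.seed(42)
df3 <- data.frame(
  passenger_count = rpois(100, lambda = 200),
  activity_type   = factor(sample(c("Deplaned", "Enplaned", "Transit"),
                                  100, replace = TRUE))
)

# Draw 80% of the row indices at random for training; the rest form the test set.
train_idx    <- sample(seq_len(nrow(df3)), size = floor(0.8 * nrow(df3)))
training_set <- df3[train_idx, ]
test_set     <- df3[-train_idx, ]

print(c(train = nrow(training_set), test = nrow(test_set)))
```

Setting a seed before sampling keeps the split reproducible between runs.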

Exploratory Data Analysis (EDA)

  1. Barplot for the Passengers Count of All Activities for Asia:

    Figure 10: Barplot for the passengers count of all activities for Asia

  2. Boxplot for the Passengers Count by Activity Period for Deplaned, Enplaned and Transit:

    Figure 11: Boxplot for the passengers count by activity period for deplaned
    Figure 12: Boxplot for the passengers count by activity period for enplaned
    Figure 13: Boxplot for the passengers count by activity period for transit

  3. Correlation Values and p-values for All Activity Types, Deplaned, Enplaned and Transit:

    Figure 14: Correlation values and p-values for all activity types
    Figure 15: Correlation values and p-values for deplaned
    Figure 16: Correlation values and p-values for enplaned
    Figure 17: Correlation values and p-values for transit

  4. Correlation Plots for the Passengers Count of All Activity Types, Deplaned, Enplaned and Transit:

    Figure 18: Correlation plot for the passengers count of all activity types
    Figure 19: Correlation plot for the passengers count of deplaned
    Figure 20: Correlation plot for the passengers count of enplaned
    Figure 21: Correlation plot for the passengers count of transit
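
As a rough illustration of how plots and correlation tables like these can be produced with the packages listed under Technologies Used, the sketch below builds a boxplot with ggplot2, computes correlations and p-values with `Hmisc::rcorr`, and draws a correlation plot with `corrplot`. The data is simulated and the region columns (`Asia`, `Europe`, `Canada`) and activity periods are assumptions chosen for the example, not values from the project.

```r
library(ggplot2)
library(Hmisc)
library(corrplot)

set.seed(1)

# Simulated long-format data: passenger counts for three hypothetical periods.
long <- data.frame(
  activity_period = factor(rep(c("200507", "200508", "200509"), each = 8)),
  passenger_count = round(rnorm(24, mean = 5000, sd = 400))
)

# Boxplot of passenger count by activity period.
p <- ggplot(long, aes(x = activity_period, y = passenger_count)) +
  geom_boxplot() +
  labs(x = "Activity period", y = "Passenger count")

# Simulated wide-format data: one column of passenger counts per region.
asia <- round(rnorm(24, mean = 5000, sd = 400))
counts <- data.frame(
  Asia   = asia,
  Europe = round(0.8 * asia + rnorm(24, sd = 150)),  # built to correlate with Asia
  Canada = round(rnorm(24, mean = 3000, sd = 300))
)

# rcorr() returns the correlation matrix (r) and the matching p-values (P).
res <- rcorr(as.matrix(counts), type = "pearson")
print(round(res$r, 2))
print(res$P)

# Visualize the correlation matrix.
corrplot(res$r, method = "circle", type = "upper")
```

Printing `p` renders the boxplot. The diagonal of `res$P` is `NA` because a variable's correlation with itself is not tested.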

Prediction of Passengers Enplaned, Deplaned or Thru-Transit using Naïve Bayes Classification

Naïve Bayes Classification Result:

Figure 22: Naïve Bayes classification result

To predict whether passengers are enplaned, deplaned or thru-transit, we use a Naïve Bayes classifier because it is simple to implement and runs efficiently without requiring prior knowledge of the data. The performance of the classifier can be evaluated using its accuracy and confusion matrix. From the result in the figure above, the model achieved 65.38% accuracy with a p-value of 0.000007354. We can conclude that our Naïve Bayes classifier still needs to be improved.
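
A minimal sketch of this kind of workflow, training with `e1071::naiveBayes` and evaluating with `caret::confusionMatrix`, is shown below. It runs on simulated data, so its accuracy will not match the figure; the predictor columns and class means are assumptions made purely for illustration.

```r
library(e1071)
library(caret)

set.seed(123)
n <- 300

# Simulated data: each activity type gets a different mean passenger count
# so that the classifier has a signal to learn. All values are hypothetical.
activity   <- factor(sample(c("Deplaned", "Enplaned", "Transit"), n, replace = TRUE))
class_mean <- c(Deplaned = 5000, Enplaned = 5200, Transit = 800)
toy <- data.frame(
  activity_type   = activity,
  passenger_count = rnorm(n, mean = class_mean[as.character(activity)], sd = 600),
  geo_region      = factor(sample(c("Asia", "Europe", "Canada"), n, replace = TRUE))
)

# 80/20 split into training and test sets.
train_idx    <- sample(seq_len(n), size = floor(0.8 * n))
training_set <- toy[train_idx, ]
test_set     <- toy[-train_idx, ]

# Fit Naive Bayes with activity_type as the target and all other columns as predictors.
model <- naiveBayes(activity_type ~ ., data = training_set)

# Predict class labels for the held-out rows.
pred <- predict(model, newdata = test_set)

# Confusion matrix plus summary statistics such as accuracy.
cm <- confusionMatrix(data = pred, reference = test_set$activity_type)
print(cm$table)
print(cm$overall[c("Accuracy", "AccuracyPValue")])
```

In caret's output, `AccuracyPValue` tests whether the accuracy exceeds the no-information rate, that is, the accuracy obtained by always predicting the most frequent class. If the figure above comes from the same function, a small p-value means the model beats that baseline, even though 65.38% (which is consistent with 51 correct predictions out of 78 test rows) leaves room for improvement.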