- Built a model to predict flight prices from various user inputs and deployed it as a Flask web app
- Trained 10 different models, selected the best-performing one, and tuned it for even better results
- The final trained model achieves an R2 score of 80.97%
- Code and Resources Used
- Directory Tree
- Data Preprocessing
- EDA
- Feature Engineering
- Data Cleaning
- Feature Selection
- Model Building
- Hyper-parameter Tuning
- Deployment
To install the required packages and libraries for this project, run this command in the project directory after cloning the repository:
`pip install -r requirements.txt`
Dataset: Download the entire dataset from
├── Dataset and Images
│ ├── Data_Train.xlsx
│ ├── model.png
│ ├── plane.jpeg
├── Templates
│ ├── final_trained_model.pkl
│ ├── home.html
│ ├── plane.jpeg
├── Flight Price Prediction.ipynb
├── README.md
├── app.py
├── final_trained_model.pkl
├── requirements.txt
The following changes were made to the data to make it usable for a model:
- The column containing null values was removed.
- The values 'Delhi' and 'New Delhi' were combined into a single category.
- Date and duration fields, originally stored as strings, were converted into timestamp format (see the sketch below).
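Below is a minimal sketch of this conversion step. The column names (`Date_of_Journey`, `Dep_Time`, `Duration`) and string formats are assumptions based on the public flight fare dataset, not taken from the notebook itself:

```python
import pandas as pd

# Load the training data (path and column names are assumptions)
df = pd.read_excel("Dataset and Images/Data_Train.xlsx")

# Journey date: "24/03/2019" -> numeric day and month features
df["Date_of_Journey"] = pd.to_datetime(df["Date_of_Journey"], format="%d/%m/%Y")
df["Journey_Day"] = df["Date_of_Journey"].dt.day
df["Journey_Month"] = df["Date_of_Journey"].dt.month

# Departure time: "22:20" -> hour and minute
dep = pd.to_datetime(df["Dep_Time"], format="%H:%M")
df["Dep_Hour"], df["Dep_Minute"] = dep.dt.hour, dep.dt.minute

# Duration: "2h 50m" -> total minutes
def duration_to_minutes(text):
    hours = minutes = 0
    for part in text.split():
        if part.endswith("h"):
            hours = int(part[:-1])
        elif part.endswith("m"):
            minutes = int(part[:-1])
    return hours * 60 + minutes

df["Duration_Minutes"] = df["Duration"].apply(duration_to_minutes)
```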
The following analyses were performed on the dataset:
- What time of day most flights take off
- Whether the duration of a flight affects its price
- Whether the total number of stops affects the flight price
- Ticket Fare Distribution by Airline
- Median ticket fare by Airline
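A quick sketch of how two of these plots could be produced with seaborn, assuming the `Airline` and `Price` columns from the dataset and the `Dep_Hour` feature created during preprocessing (the notebook's actual plotting code may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Ticket fare distribution by airline
plt.figure(figsize=(12, 6))
sns.boxplot(x="Airline", y="Price", data=df)
plt.xticks(rotation=45)
plt.title("Ticket Fare Distribution by Airline")
plt.tight_layout()
plt.show()

# Number of flights by departure hour
sns.countplot(x="Dep_Hour", data=df)
plt.title("Number of Flights by Departure Hour")
plt.show()
```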
- One-hot encoding and label encoding were used to convert categorical features into numeric vectors
- Target-guided encoding was applied to avoid the curse of dimensionality during feature encoding
- A library one-hot encoder was used, while the label encoder was written manually (see the sketch below)
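A minimal sketch of this encoding step, assuming the nominal columns `Airline`, `Source`, `Destination` and the ordinal `Total_Stops` column; the notebook may use sklearn's `OneHotEncoder` rather than pandas:

```python
# One-hot encode nominal categories
df = pd.get_dummies(df, columns=["Airline", "Source", "Destination"], drop_first=True)

# Manual label encoding for the ordinal 'Total_Stops' column
stop_mapping = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}
df["Total_Stops"] = df["Total_Stops"].map(stop_mapping)
```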
- Columns not needed by the model were removed
- The outlier range and the outliers were detected using the IQR method
- The outliers were replaced with the median of the remaining data values
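A sketch of the IQR rule and median replacement described above, applied here to the `Price` target column (the column chosen in the notebook may differ):

```python
# IQR-based outlier bounds
q1 = df["Price"].quantile(0.25)
q3 = df["Price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Replace outliers with the median of the in-range values
is_outlier = (df["Price"] < lower) | (df["Price"] > upper)
median_price = df.loc[~is_outlier, "Price"].median()
df.loc[is_outlier, "Price"] = median_price
```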
- Mutual information regression was used to measure the dependency between each feature and the target in order to select the best features for the model
- Since all features showed a meaningful dependency with the target variable, none were dropped
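A short sketch of this check using scikit-learn, assuming `Price` is the target and all remaining columns are encoded features:

```python
from sklearn.feature_selection import mutual_info_regression

X = df.drop(columns=["Price"])
y = df["Price"]

# Higher scores indicate stronger dependency with the target
mi_scores = mutual_info_regression(X, y)
mi_series = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)
print(mi_series)
```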
The data was split into a 75% training set and a 25% test set. A single evaluation loop was written so that multiple models could be trained and compared in one pass (a sketch appears after the results below).
Ten different models were trained and evaluated on the test set:
- Random Forest Regression : R2 score = 79.69%
- Decision Tree Regressor : R2 score = 64.87%
- Linear Regression : R2 score = 60.77%
- Ridge Method : R2 score = 60.77%
- Lasso Method : R2 score = 60.77%
- ElasticNet Method : R2 score = 57.27%
- Support Vector Regression : R2 score = 2.62%
- K-NN : R2 score = 64.77%
- MLP Regressor : R2 score = 56.71%
- Huber Regressor : R2 score = 59.4%
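A sketch of the evaluation loop mentioned above; the exact estimators and settings used in the notebook are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, HuberRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

# 75% / 25% train-test split, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "Random Forest": RandomForestRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "SVR": SVR(),
    "K-NN": KNeighborsRegressor(),
    "MLP": MLPRegressor(max_iter=500),
    "Huber": HuberRegressor(),
}

# Fit each model and report its R2 score on the held-out test set
for name, model in models.items():
    model.fit(X_train, y_train)
    score = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R2 = {score:.4f}")
```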
Random Forest clearly outperforms the other methods, but its performance can still be improved. RandomizedSearchCV was used to tune the hyper-parameters, raising the R2 score to 80.97% (see the sketch below).
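A sketch of the tuning step; the search space below is illustrative and not the exact grid used in the notebook:

```python
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space for the Random Forest hyper-parameters
param_distributions = {
    "n_estimators": [100, 300, 500, 700, 900],
    "max_depth": [5, 10, 15, 20, 25, None],
    "min_samples_split": [2, 5, 10, 15],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring="r2",
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
print("Best R2 on test set:", r2_score(y_test, best_model.predict(X_test)))
```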
A final model was trained with Random Forest regression and deployed as a Flask web app. The final model can be downloaded from final_trained_model.pkl.
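A minimal sketch of the shape of such a Flask app, loading the pickled model shipped with the repository. The route, form field names, and feature order are assumptions and would need to match the actual home.html form and training features:

```python
import pickle
from flask import Flask, render_template, request

# template_folder matches the 'Templates' directory in this repository
app = Flask(__name__, template_folder="Templates")

# Load the serialized Random Forest model
with open("final_trained_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/", methods=["GET", "POST"])
def home():
    prediction = None
    if request.method == "POST":
        # Feature names and order here are illustrative only
        features = [[
            float(request.form["journey_day"]),
            float(request.form["journey_month"]),
            float(request.form["total_stops"]),
            float(request.form["duration_minutes"]),
        ]]
        prediction = round(model.predict(features)[0], 2)
    return render_template("home.html", prediction=prediction)

if __name__ == "__main__":
    app.run(debug=True)
```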