I took on the role of data engineer at a fictional aeronautics consulting company that prides itself on efficiently designing airfoils for planes and sports cars. The data scientists in the office needed to work with different algorithms and with data in different formats. While they were good at machine learning, they counted on me to handle the ETL jobs and build the ML pipelines. In this project I used a modified version of the NASA Airfoil Self-Noise dataset. I cleaned the dataset by dropping duplicate rows and removing rows with null values. I then built an ML pipeline to train a model that predicts SoundLevel from all the other columns. Finally, I evaluated the model and persisted it for future use. The steps were:
- Part 1: Perform ETL activity
  - Load a CSV dataset
  - Remove duplicates, if any
  - Drop rows with null values, if any
  - Make transformations
  - Store the cleaned data in Parquet format
- Part 2: Create a Machine Learning Pipeline
  - Create a machine learning pipeline for prediction
- Part 3: Evaluate the Model
  - Evaluate the model using relevant metrics
- Part 4: Persist the Model
  - Save the model for future production use
  - Load and verify the stored model
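
To make the cleaning and evaluation steps concrete, here is a minimal plain-Python sketch of the logic behind Part 1 (remove duplicates, drop rows with nulls) and Part 3 (regression metrics). It is not the project's code: it uses only the standard library so it runs without a Spark installation. The sample rows, the column names other than SoundLevel, the predictions, and every function name are illustrative assumptions. In PySpark the same ideas are usually expressed with `DataFrame.dropDuplicates()`, `DataFrame.dropna()`, and `RegressionEvaluator`, though the write-up does not say which calls were actually used.

```python
import csv
import io

# Hypothetical sample standing in for the airfoil CSV. Column names other
# than SoundLevel are assumptions, not the real schema.
RAW_CSV = """Frequency,AngleOfAttack,SoundLevel
800,0.0,126.2
800,0.0,126.2
1000,0.0,125.2
1250,,125.9
1600,1.5,127.6
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_nulls(rows):
    """Discard rows in which any field is missing or empty."""
    return [r for r in rows if all(v not in (None, "") for v in r.values())]


def regression_metrics(actual, predicted):
    """Return MSE, MAE and R^2 for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"mse": mse, "mae": mae, "r2": r2}


rows = load_rows(RAW_CSV)
cleaned = drop_nulls(drop_duplicates(rows))
print(len(rows), "rows loaded,", len(cleaned), "rows after cleaning")

# Made-up predictions, used only to exercise the metric function.
actual = [float(r["SoundLevel"]) for r in cleaned]
predicted = [126.0, 125.5, 127.0]
print(regression_metrics(actual, predicted))
```

On this sample, five rows are loaded and three survive cleaning: one exact duplicate and one row with an empty field are removed. Deduplicating before dropping nulls gives the same result as the reverse order here, since both filters look at each row independently of the other filter's outcome.
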
Programming, Python, Statistics, Linear Algebra, NumPy, Pandas, ETL/ELT & Data Pipelines, Apache Spark, Automation, APIs, Data Modeling, Data Summarization, Regression, Supervised ML